From 1b6c6495eb320ffe491ff462c465c42fd18cf911 Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" <41898282+github-actions[bot]@users.noreply.github.com> Date: Mon, 24 Feb 2025 09:08:07 +0800 Subject: [PATCH] Automated deployment @ 2025-02-24 09:08:07 Asia/Taipei --- README.md | 17796 ++++++++--------- __pycache__/config.cpython-310.pyc | Bin 1079 -> 1079 bytes __pycache__/util4translation.cpython-310.pyc | Bin 9439 -> 9439 bytes database/logs/runtime.log | 4 + database/storage/storage_2025-02-24.md | 10150 ++++++++++ docs/index.md | 17796 ++++++++--------- 6 files changed, 27950 insertions(+), 17796 deletions(-) create mode 100644 database/storage/storage_2025-02-24.md diff --git a/README.md b/README.md index 3d4c3c9b19..daca996fb7 100644 --- a/README.md +++ b/README.md @@ -1,4474 +1,3751 @@ # arxiv-daily - Automated deployment @ 2025-02-23 20:25:02 Asia/Taipei + Automated deployment @ 2025-02-24 09:08:05 Asia/Taipei > Welcome to contribute! Add your topics and keywords in [`topic.yml`](https://github.com/jawatech/arxiv-daily-in-place/blob/main/database/topic.yml). > You can also view historical data through the [storage](https://github.com/jawatech/arxiv-daily-in-place/blob/main/database/storage). 
## AI -### Medical explainable AI +### LLM |Publish Date|Title|Authors|Homepage|Code| | :---: | :---: | :---: | :---: | :---: | -|**2025-02-19**|**Towards a perturbation-based explanation for medical AI as differentiable programs**|Takeshi Abe et.al.|[2502.14001v1](http://arxiv.org/abs/2502.14001v1)|null| -|**2025-02-14**|**3D ReX: Causal Explanations in 3D Neuroimaging Classification**|Melane Navaratnarajah et.al.|[2502.12181v1](http://arxiv.org/abs/2502.12181v1)|null| -|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null| -|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null| -|**2025-01-27**|**An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases**|Shaheer Ahmad Khan et.al.|[2501.15969v1](http://arxiv.org/abs/2501.15969v1)|null| -|**2025-01-23**|**Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data**|Frederik Pahde et.al.|[2501.13818v1](http://arxiv.org/abs/2501.13818v1)|[link](https://github.com/frederikpahde/medical-ai-safety)| -|**2025-01-19**|**Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model**|Mohaiminul Islam Bhuiyan et.al.|[2501.11094v1](http://arxiv.org/abs/2501.11094v1)|null| -|**2025-01-17**|**SEANN: A Domain-Informed Neural Network for Epidemiological Insights**|Jean-Baptiste Guimbaud et.al.|[2501.10273v1](http://arxiv.org/abs/2501.10273v1)|null| -|**2025-01-16**|**Artificial Intelligence-Driven Clinical Decision Support Systems**|Muhammet Alkan et.al.|[2501.09628v2](http://arxiv.org/abs/2501.09628v2)|null| -|**2025-01-12**|**MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis**|Sadia Kamal 
et.al.|[2501.06887v1](http://arxiv.org/abs/2501.06887v1)|null| -|**2025-01-06**|**Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis**|Mary Ogbuka Kenneth et.al.|[2501.02891v1](http://arxiv.org/abs/2501.02891v1)|null| -|**2024-12-28**|**The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support**|Alessandro De Grandi et.al.|[2412.20068v1](http://arxiv.org/abs/2412.20068v1)|null| -|**2024-12-27**|**A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation**|Jana Zakall et.al.|[2412.19688v1](http://arxiv.org/abs/2412.19688v1)|null| -|**2024-12-23**|**Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models**|Badaru I. Olumuyiwa et.al.|[2412.17527v1](http://arxiv.org/abs/2412.17527v1)|null| -|**2024-12-20**|**Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG**|Hasan Md Tusfiqur Alam et.al.|[2412.16086v2](http://arxiv.org/abs/2412.16086v2)|[link](https://github.com/tifat58/irr-with-cbm-rag)| -|**2024-12-20**|**Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models**|Shamus Sim et.al.|[2412.15748v1](http://arxiv.org/abs/2412.15748v1)|null| -|**2024-12-18**|**Cognition Chain for Explainable Psychological Stress Detection on Social Media**|Xin Wang et.al.|[2412.14009v1](http://arxiv.org/abs/2412.14009v1)|null| -|**2024-11-30**|**2-Factor Retrieval for Improved Human-AI Decision Making in Radiology**|Jim Solomon et.al.|[2412.00372v1](http://arxiv.org/abs/2412.00372v1)|null| -|**2024-11-28**|**Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance**|Philipp Brauner et.al.|[2411.19356v1](http://arxiv.org/abs/2411.19356v1)|null| -|**2024-11-26**|**Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab 
Dataset**|Yujie Dai et.al.|[2411.17645v2](http://arxiv.org/abs/2411.17645v2)|null| -|**2024-11-18**|**Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care**|Jeffrey N. Clark et.al.|[2411.11774v1](http://arxiv.org/abs/2411.11774v1)|null| -|**2024-11-15**|**Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning**|Mohammed Yaseen Jabarulla et.al.|[2411.10255v1](http://arxiv.org/abs/2411.10255v1)|null| -|**2024-11-01**|**Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering**|Mehdi Hosseini Chagahi et.al.|[2411.00916v2](http://arxiv.org/abs/2411.00916v2)|null| -|**2024-10-25**|**A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection**|Muath Alsuhaibani et.al.|[2410.19898v1](http://arxiv.org/abs/2410.19898v1)|null| -|**2024-10-23**|**An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems**|Shruthi Chari et.al.|[2410.17504v1](http://arxiv.org/abs/2410.17504v1)|[link](https://github.com/tetherless-world/metaexplainer)| -|**2024-10-22**|**Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study**|Lukas Hughes-Noehrer et.al.|[2410.16879v1](http://arxiv.org/abs/2410.16879v1)|null| -|**2024-10-19**|**Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer**|Gesa Mittmann et.al.|[2410.15012v1](http://arxiv.org/abs/2410.15012v1)|null| -|**2024-10-15**|**Explainable AI Methods for Multi-Omics Analysis: A Survey**|Ahmad Hussein et.al.|[2410.11910v1](http://arxiv.org/abs/2410.11910v1)|null| -|**2024-10-14**|**Study on the Helpfulness of Explainable Artificial Intelligence**|Tobias Labarta 
et.al.|[2410.11896v1](http://arxiv.org/abs/2410.11896v1)|[link](https://github.com/tlabarta/helpfulnessofxai)| -|**2024-10-12**|**Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health**|Abdullah Mamun et.al.|[2410.09635v1](http://arxiv.org/abs/2410.09635v1)|[link](https://github.com/ab9mamun/aimen)| -|**2024-10-10**|**Artificial intelligence techniques in inherited retinal diseases: A review**|Han Trinh et.al.|[2410.09105v1](http://arxiv.org/abs/2410.09105v1)|null| -|**2024-10-07**|**CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures**|Ekaterina Sviridova et.al.|[2410.05235v2](http://arxiv.org/abs/2410.05235v2)|[link](https://github.com/ixa-ehu/antidote-casimedicos)| -|**2024-10-01**|**Explainable Diagnosis Prediction through Neuro-Symbolic Integration**|Qiuhao Lu et.al.|[2410.01855v2](http://arxiv.org/abs/2410.01855v2)|null| -|**2024-10-01**|**Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare**|Prasenjit Maji et.al.|[2410.00366v1](http://arxiv.org/abs/2410.00366v1)|null| -|**2024-09-20**|**Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study**|Tirtha Chanda et.al.|[2409.13476v1](http://arxiv.org/abs/2409.13476v1)|null| -|**2024-09-19**|**Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data**|Suryansh Vidya et.al.|[2409.15374v1](http://arxiv.org/abs/2409.15374v1)|null| -|**2024-09-19**|**Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition**|Daniel Flores-Araiza et.al.|[2409.12883v1](http://arxiv.org/abs/2409.12883v1)|null| -|**2024-09-18**|**Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques**|Yubo Li et.al.|[2409.12087v3](http://arxiv.org/abs/2409.12087v3)|null| -|**2024-09-13**|**Contextual 
Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases**|Mercy Asiedu et.al.|[2409.09201v3](http://arxiv.org/abs/2409.09201v3)|null| -|**2024-09-09**|**Explainable AI: Definition and attributes of a good explanation for health AI**|Evangelia Kyrimi et.al.|[2409.15338v1](http://arxiv.org/abs/2409.15338v1)|null| -|**2024-08-30**|**Exploring the Effect of Explanation Content and Format on User Comprehension and Trust in Healthcare**|Antonio Rago et.al.|[2408.17401v2](http://arxiv.org/abs/2408.17401v2)|null| -|**2024-08-29**|**A Survey for Large Language Models in Biomedicine**|Chong Wang et.al.|[2409.00133v1](http://arxiv.org/abs/2409.00133v1)|null| -|**2024-08-27**|**Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis**|Francesco Sovrano et.al.|[2408.15121v1](http://arxiv.org/abs/2408.15121v1)|null| -|**2024-08-24**|**Towards Case-based Interpretability for Medical Federated Learning**|Laura Latorre et.al.|[2408.13626v1](http://arxiv.org/abs/2408.13626v1)|null| -|**2024-08-22**|**AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines**|Douwe J. Spaanderman et.al.|[2408.12491v1](http://arxiv.org/abs/2408.12491v1)|null| -|**2024-08-14**|**Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy**|Kimji N. 
Pellano et.al.|[2409.00001v1](http://arxiv.org/abs/2409.00001v1)|null| -|**2024-08-06**|**MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy**|Hanchen David Wang et.al.|[2408.11837v1](http://arxiv.org/abs/2408.11837v1)|null| -|**2024-08-05**|**The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development**|Joshua Morriss et.al.|[2408.05239v1](http://arxiv.org/abs/2408.05239v1)|null| -|**2024-08-05**|**Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns**|Chi Him Ng et.al.|[2408.02709v1](http://arxiv.org/abs/2408.02709v1)|null| -|**2024-08-05**|**Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability**|Masoud Muhammed Hassan et.al.|[2408.02706v1](http://arxiv.org/abs/2408.02706v1)|null| -|**2024-07-26**|**MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI**|Shyam Dongre et.al.|[2407.20284v1](http://arxiv.org/abs/2407.20284v1)|null| -|**2024-07-25**|**Introducing δ-XAI: a novel sensitivity-based method for local AI explanations**|Alessandro De Carlo et.al.|[2407.18343v2](http://arxiv.org/abs/2407.18343v2)|null| -|**2024-07-24**|**Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population**|Nikolaos Ntampakis et.al.|[2407.17324v2](http://arxiv.org/abs/2407.17324v2)|null| -|**2024-07-24**|**Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition**|Michele Fiori et.al.|[2408.06352v1](http://arxiv.org/abs/2408.06352v1)|null| -|**2024-07-21**|**Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions**|Naseem Khan 
et.al.|[2408.03335v1](http://arxiv.org/abs/2408.03335v1)|null| -|**2024-07-18**|**A Comparative Study on Automatic Coding of Medical Letters with Explainability**|Jamie Glen et.al.|[2407.13638v1](http://arxiv.org/abs/2407.13638v1)|[link](https://github.com/Glenj01/Medical-Coding)| -|**2024-07-09**|**Explainable AI for Enhancing Efficiency of DL-based Channel Estimation**|Abdul Karim Gizzini et.al.|[2407.07009v1](http://arxiv.org/abs/2407.07009v1)|null| -|**2024-07-07**|**Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification**|P. N. Karthikayan et.al.|[2407.05440v2](http://arxiv.org/abs/2407.05440v2)|null| -|**2024-07-03**|**A Survey on Trustworthiness in Foundation Models for Medical Image Analysis**|Congzhen Shi et.al.|[2407.15851v2](http://arxiv.org/abs/2407.15851v2)|null| -|**2024-07-01**|**The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data**|Ximing Wen et.al.|[2407.06206v1](http://arxiv.org/abs/2407.06206v1)|null| -|**2024-06-28**|**Can GPT-4 Help Detect Quit Vaping Intentions? 
An Exploration of Automatic Data Annotation Approach**|Sai Krishna Revanth Vuruma et.al.|[2407.00167v1](http://arxiv.org/abs/2407.00167v1)|null| -|**2024-06-25**|**Towards Compositional Interpretability for XAI**|Sean Tull et.al.|[2406.17583v1](http://arxiv.org/abs/2406.17583v1)|null| -|**2024-06-17**|**Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods**|Vincent Olesen et.al.|[2406.12142v2](http://arxiv.org/abs/2406.12142v2)|[link](https://github.com/volesen/slicing-through-bias)| -|**2024-06-11**|**Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health**|Fatemeh Ebrahimzadeh et.al.|[2406.07114v2](http://arxiv.org/abs/2406.07114v2)|null| -|**2024-06-10**|**AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI**|K M Tawsik Jawad et.al.|[2406.06728v2](http://arxiv.org/abs/2406.06728v2)|null| -|**2024-06-10**|**Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook**|Yusif Ibrahimov et.al.|[2406.05984v1](http://arxiv.org/abs/2406.05984v1)|null| -|**2024-06-09**|**Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance**|Zhan Zhang et.al.|[2406.05746v1](http://arxiv.org/abs/2406.05746v1)|null| -|**2024-06-07**|**Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability**|Faseela Abdullakutty et.al.|[2406.12897v1](http://arxiv.org/abs/2406.12897v1)|null| -|**2024-06-04**|**Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection**|Dinuka Sandun Udayantha et.al.|[2406.16908v3](http://arxiv.org/abs/2406.16908v3)|[link](https://github.com/dinuka-1999/braineocare)| -|**2024-06-01**|**Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques**|Samita Bai 
et.al.|[2406.00532v1](http://arxiv.org/abs/2406.00532v1)|null| -|**2024-06-01**|**Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition**|Alaa Nfissi et.al.|[2406.01624v2](http://arxiv.org/abs/2406.01624v2)|[link](https://github.com/alaanfissi/unveiling-hidden-factors-explainable-ai-for-feature-boosting-in-speech-emotion-recognition)| -|**2024-05-31**|**The Explanation Necessity for Healthcare AI**|Michail Mamalakis et.al.|[2406.00216v1](http://arxiv.org/abs/2406.00216v1)|null| -|**2024-05-29**|**Interdisciplinary Expertise to Advance Equitable Explainable AI**|Chloe R. Bennett et.al.|[2406.18563v1](http://arxiv.org/abs/2406.18563v1)|null| -|**2024-05-27**|**"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts**|Hubert D. Zając et.al.|[2407.11978v1](http://arxiv.org/abs/2407.11978v1)|null| -|**2024-05-26**|**Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making**|Min Hun Lee et.al.|[2405.16424v1](http://arxiv.org/abs/2405.16424v1)|null| -|**2024-05-26**|**Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach**|Ziming Liu et.al.|[2405.17502v1](http://arxiv.org/abs/2405.17502v1)|null| -|**2024-05-24**|**Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone**|Catalina Gomez et.al.|[2407.11974v1](http://arxiv.org/abs/2407.11974v1)|null| -|**2024-05-23**|**Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery**|Yingying Fang et.al.|[2406.18552v1](http://arxiv.org/abs/2406.18552v1)|null| -|**2024-05-21**|**The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach**|Mohsen Jozani et.al.|[2405.13099v1](http://arxiv.org/abs/2405.13099v1)|null| -|**2024-05-17**|**ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education**|Harris Bin Munawar 
et.al.|[2405.10645v1](http://arxiv.org/abs/2405.10645v1)|null| -|**2024-05-13**|**Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data**|Camelia Oprea et.al.|[2405.07590v1](http://arxiv.org/abs/2405.07590v1)|null| -|**2024-05-10**|**XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare**|Fatemeh Nazary et.al.|[2405.06270v3](http://arxiv.org/abs/2405.06270v3)|null| -|**2024-05-09**|**To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems**|Miquel Miró-Nicolau et.al.|[2405.05766v1](http://arxiv.org/abs/2405.05766v1)|null| -|**2024-05-05**|**Region-specific Risk Quantification for Interpretable Prognosis of COVID-19**|Zhusi Zhong et.al.|[2405.02815v1](http://arxiv.org/abs/2405.02815v1)|[link](https://github.com/zzs95/RSP_COVID)| -|**2024-04-26**|**Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics**|Francesco Prinzi et.al.|[2405.02334v2](http://arxiv.org/abs/2405.02334v2)|null| -|**2024-04-25**|**Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability**|Yunfei Ge et.al.|[2404.16957v1](http://arxiv.org/abs/2404.16957v1)|null| -|**2024-04-19**|**Explainable AI for Fair Sepsis Mortality Predictive Model**|Chia-Hsuan Chang et.al.|[2404.13139v1](http://arxiv.org/abs/2404.13139v1)|null| -|**2024-04-19**|**Multi Class Depression Detection Through Tweets using Artificial Intelligence**|Muhammad Osama Nusrat et.al.|[2404.13104v1](http://arxiv.org/abs/2404.13104v1)|[link](https://github.com/mnusrat786/masters-thesis)| -|**2024-04-19**|**COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images**|Dmytro Shvetsov et.al.|[2404.12832v2](http://arxiv.org/abs/2404.12832v2)|[link](https://github.com/dmytro-shvetsov/counterfactual-search)| -|**2024-04-15**|**Hybrid Intelligence for Digital 
Humanities**|Victor de Boer et.al.|[2406.15374v1](http://arxiv.org/abs/2406.15374v1)|null| -|**2024-04-14**|**Ethical Framework for Responsible Foundational Models in Medical Imaging**|Abhijit Das et.al.|[2406.11868v1](http://arxiv.org/abs/2406.11868v1)|null| -|**2024-04-09**|**Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis**|Milad Yousefi et.al.|[2404.07239v1](http://arxiv.org/abs/2404.07239v1)|null| -|**2024-04-06**|**Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI**|Taminul Islam et.al.|[2404.04686v1](http://arxiv.org/abs/2404.04686v1)|null| -|**2024-04-05**|**Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI**|Maryam Ahmed et.al.|[2404.03892v3](http://arxiv.org/abs/2404.03892v3)|null| -|**2024-03-30**|**Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives**|Xingrui Gu et.al.|[2404.00320v2](http://arxiv.org/abs/2404.00320v2)|null| -|**2024-03-26**|**Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach**|Andrea Ferrario et.al.|[2403.17873v1](http://arxiv.org/abs/2403.17873v1)|null| -|**2024-03-26**|**Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification**|Han Yuan et.al.|[2403.18871v1](http://arxiv.org/abs/2403.18871v1)|[link](https://github.com/han-yuan-med/template-explanation)| -|**2024-03-03**|**Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures**|Séamus Lankford et.al.|[2403.01580v1](http://arxiv.org/abs/2403.01580v1)|null| -|**2024-02-28**|**Cause and Effect: Can Large Language Models Truly Understand Causality?**|Swagata Ashwani et.al.|[2402.18139v3](http://arxiv.org/abs/2402.18139v3)|null| 
-|**2024-02-28**|**Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina**|Yasin Sadeghi Bazargani et.al.|[2402.18600v1](http://arxiv.org/abs/2402.18600v1)|null| - -#### Abstracts -##### **Towards a perturbation-based explanation for medical AI as differentiable programs** -2502.14001v1 by Takeshi Abe, Yoshiyuki Asai - -Recent advances in machine learning algorithms have reached a point where -medical devices can be equipped with artificial intelligence (AI) models for -diagnostic support and routine automation in clinical settings. In medicine and -healthcare, there is a particular demand for sufficient and objective -explainability of the outcome generated by AI models. However, AI models are -generally considered as black boxes due to their complexity, and the -computational process leading to their response is often opaque. Although -several methods have been proposed to explain the behavior of models by -evaluating the importance of each feature in discrimination and prediction, -they may suffer from biases and opacities arising from the scale and sampling -protocol of the dataset used for training or testing. To overcome the -shortcomings of existing methods, we explore an alternative approach to provide -an objective explanation of AI models that can be defined independently of the -learning process and does not require additional data. As a preliminary study -for this direction of research, this work examines the numerical availability of -the Jacobian matrix of deep learning models, which measures how stably a model -responds to small perturbations added to the input. The indicator, if -available, is calculated from a trained AI model for a given target input. -This is a first step towards a perturbation-based explanation, which will -assist medical practitioners in understanding and interpreting the response of -the AI model in its clinical application. 
- -摘要:機器學習演算法的最新進展已達到一個階段,醫療裝置可以配備人工智慧 (AI) 模型,以在臨床環境中提供診斷支援和例行自動化。在醫學和保健領域,對於 AI 模型產生的結果有足夠且客觀的可解釋性有特別的需求。然而,由於 AI 模型的複雜性,它們通常被視為黑盒子,而導致其反應的運算過程通常是不透明的。儘管已經提出多種方法來解釋模型的行為,方法是評估每個特徵在判別和預測中的重要性,但它們可能會受到訓練或測試所用資料集的規模和抽樣協定的偏差和不透明性的影響。為了克服現有方法的缺點,我們探索一種替代方法,以提供 AI 模型的客觀解釋,這種方法可以獨立於學習過程定義,而且不需要額外的資料。作為這個研究方向的初步研究,這項工作探討了深度學習模型的雅可比矩陣的數值可用性,它衡量了模型對輸入中新增的小擾動的穩定反應程度。如果可用,指標會從訓練好的 AI 模型計算得出,以取得給定的目標輸入。這是基於擾動的解釋的第一步,它將協助醫療從業人員了解和詮釋 AI 模型在其臨床應用中的反應。 - -##### **3D ReX: Causal Explanations in 3D Neuroimaging Classification** -2502.12181v1 by Melane Navaratnarajah, Sophie A. Martin, David A. Kelly, Nathan Blake, Hana Chocker - -Explainability remains a significant problem for AI models in medical -imaging, making it challenging for clinicians to trust AI-driven predictions. -We introduce 3D ReX, the first causality-based post-hoc explainability tool for -3D models. 3D ReX uses the theory of actual causality to generate -responsibility maps which highlight the regions most crucial to the model's -decision. We test 3D ReX on a stroke detection model, providing insight into -the spatial distribution of features relevant to stroke. - -摘要:解釋性仍然是醫療影像中 AI 模型的一大問題,這使得臨床醫生難以信任 AI 驅動的預測。 -我們引入了 3D ReX,這是第一個用於 3D 模型的基於因果關係的事後解釋性工具。3D ReX 使用實際因果關係理論來生成責任圖,該圖突出了對模型決策至關重要的區域。我們在中風檢測模型上測試了 3D ReX,提供了與中風相關特徵的空間分佈的見解。 - -##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration** -2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano - -This paper presents a complete explainable system that interprets a set of -data, abstracts the underlying features and describes them in a natural -language of choice. The system relies on two crucial stages: (i) identifying -emerging properties from data and transforming them into abstract concepts, and -(ii) converting these concepts into natural language. 
Despite the impressive -natural language generation capabilities demonstrated by Large Language Models, -their statistical nature and the intricacy of their internal mechanism still -force us to employ these techniques as black boxes, forgoing trustworthiness. -Developing an explainable pipeline for data interpretation would allow -facilitating its use in safety-critical environments like processing medical -information and allowing non-experts and visually impaired people to access -narrated information. To this end, we believe that the fields of knowledge -representation and automated reasoning research could present a valid -alternative. Expanding on prior research that tackled the first stage (i), we -focus on the second stage, named Concept2Text. Being explainable, data -translation is easily modeled through logic-based rules, once again emphasizing -the role of declarative programming in achieving AI explainability. This paper -explores a Prolog/CLP-based rewriting system to interpret concepts-articulated -in terms of classes and relations, plus common knowledge-derived from a generic -ontology, generating natural language text. Its main features include -hierarchical tree rewritings, modular multilingual generation, support for -equivalent variants across semantic, grammar, and lexical levels, and a -transparent rule-based system. We outline the architecture and demonstrate its -flexibility through some examples capable of generating numerous diverse and -equivalent rewritings based on the input concept. 
- -摘要:這篇論文提出了一個完整的可解釋系統,它可以解釋一組資料,抽象出基礎特徵,並以選擇的自然語言描述它們。系統依賴兩個關鍵階段:(i) 從資料中識別新興屬性,並將它們轉換為抽象概念,以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力,但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子,放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它,例如處理醫療資訊,並允許非專家和視障人士存取敘述資訊。為此,我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上,我們專注於第二階段,稱為 Concept2Text。由於具有可解釋性,資料翻譯很容易透過基於邏輯的規則建模,再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統,以解釋概念,這些概念以類別和關係的形式表達,再加上從通用本体衍生的常識,產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體,以及一個透明的基於規則的系統。我們概述了架構,並透過一些範例展示了它的靈活性,這些範例能夠根據輸入概念生成許多不同的等效重寫。 - -##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation** -2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek - -We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS), -an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS -predicts future PHTs using transformer-based architectures. The Adaptive Risk -Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk -probabilities for clinician-defined critical events. ARES incorporates a -personalized explainability module that identifies key clinical factors -influencing risk estimates for individual patients. ARES was evaluated on the -MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its -performance against traditional early warning systems and machine learning -models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs, -with 60% including hospital admissions. The dataset contained over 357 million -tokens. ETHOS outperformed benchmark models in predicting hospital admissions, -ICU admissions, and prolonged hospital stays, achieving superior AUC scores. 
-ETHOS-based risk estimates demonstrated robustness across demographic subgroups -with strong model reliability, confirmed via calibration curves. The -personalized explainability module provides insights into patient-specific -factors contributing to risk. ARES, powered by ETHOS, advances predictive -healthcare AI by providing dynamic, real-time, and personalized risk estimation -with patient-specific explainability to enhance clinician trust. Its -adaptability and superior accuracy position it as a transformative tool for -clinical decision-making, potentially improving patient outcomes and resource -allocation in emergency and inpatient settings. We release the full code at -github.com/ipolharvard/ethos-ares to facilitate future research. - -摘要:我們開發了增強型健康結果模擬轉換器 (ETHOS), -一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS -使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組,可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估,並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT,其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型,並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性,並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估,以及患者特定的可解釋性來增強臨床醫生的信任,從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具,有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼,以利未來的研究。 - -##### **An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases** -2501.15969v1 by Shaheer Ahmad Khan, Muhammad Usamah Shahid, Ahmad Abdullah, Ibrahim Hashmat, Muddassar Farooq - -This study addresses a critical gap in the healthcare system by developing a -clinically meaningful, practical, and explainable disease surveillance system -for multiple chronic diseases, utilizing routine EHR data from multiple U.S. -practices integrated with CureMD's EMR/EHR system. 
Unlike traditional -systems--using AI models that rely on features from patients' labs--our -approach focuses on routinely available data, such as medical history, vitals, -diagnoses, and medications, to preemptively assess the risks of chronic -diseases in the next year. We trained three distinct models for each chronic -disease: prediction models that forecast the risk of a disease 3, 6, and 12 -months before a potential diagnosis. We developed Random Forest models, which -were internally validated using F1 scores and AUROC as performance metrics and -further evaluated by a panel of expert physicians for clinical relevance based -on inferences grounded in medical knowledge. Additionally, we discuss our -implementation of integrating these models into a practical EMR system. Beyond -using Shapley attributes and surrogate models for explainability, we also -introduce a new rule-engineering framework to enhance the intrinsic -explainability of Random Forests. - -摘要:本研究透過開發一個臨床有意義、實用且可解釋的多重慢性疾病疾病監測系統,來解決醫療保健系統中的重大缺口,利用整合 CureMD 的 EMR/EHR 系統,來自多個美國實務的例行 EHR 資料。與傳統系統不同的是,我們的做法著重在例行可得的資料,例如病歷、生命徵象、診斷和藥物,以預先評估未來一年慢性疾病的風險,而非仰賴病患實驗室特徵的 AI 模型。我們針對每種慢性疾病訓練了三個不同的模型:預測模型,用以預測在潛在診斷前 3、6 和 12 個月的疾病風險。我們開發了隨機森林模型,並使用 F1 分數和 AUROC 作為效能指標,進行內部驗證,並進一步由專家醫師小組根據植基於醫學知識的推論,評估其臨床相關性。此外,我們討論了將這些模型整合到實用 EMR 系統中的實作方式。除了使用 Shapley 屬性和代理模型來解釋外,我們還引進了一個新的規則工程架構,以增強隨機森林的內在可解釋性。 - -##### **Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data** -2501.13818v1 by Frederik Pahde, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek - -Deep neural networks are increasingly employed in high-stakes medical -applications, despite their tendency for shortcut learning in the presence of -spurious correlations, which can have potentially fatal consequences in -practice. Detecting and mitigating shortcut behavior is a challenging task that -often requires significant labeling efforts from domain experts. 
To alleviate -this problem, we introduce a semi-automated framework for the identification of -spurious behavior from both data and model perspective by leveraging insights -from eXplainable Artificial Intelligence (XAI). This allows the retrieval of -spurious data points and the detection of model circuits that encode the -associated prediction rules. Moreover, we demonstrate how these shortcut -encodings can be used for XAI-based sample- and pixel-level data annotation, -providing valuable information for bias mitigation methods to unlearn the -undesired shortcut behavior. We show the applicability of our framework using -four medical datasets across two modalities, featuring controlled and -real-world spurious correlations caused by data artifacts. We successfully -identify and mitigate these biases in VGG16, ResNet50, and contemporary Vision -Transformer models, ultimately increasing their robustness and applicability -for real-world medical tasks. - -摘要:深度神经网络越来越多地用于高风险医疗应用中,尽管它们在存在虚假相关性的情况下倾向于捷径学习,这在实践中可能产生致命的后果。检测和缓解捷径行为是一项艰巨的任务,通常需要领域专家的大量标记工作。为了缓解这个问题,我们引入了一个半自动框架,用于从数据和模型的角度识别虚假行为,方法是利用可解释人工智能 (XAI) 的见解。这允许检索虚假数据点并检测对关联预测规则进行编码的模型电路。此外,我们演示了如何使用这些捷径编码进行基于 XAI 的样本和像素级数据注释,为偏差缓解方法提供有价值的信息,以消除不需要的捷径行为。我们使用跨越两种方式的四个医学数据集展示了我们框架的适用性,这些数据集具有由数据伪像引起的受控和真实世界虚假相关性。我们成功地识别并减轻了 VGG16、ResNet50 和当代 Vision Transformer 模型中的这些偏差,最终提高了它们的鲁棒性和在真实世界医疗任务中的适用性。 - -##### **Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model** -2501.11094v1 by Mohaiminul Islam Bhuiyan, Nur Shazwani Kamarudin, Nur Hafieza Ismail - -Suicidal ideation detection is crucial for preventing suicides, a leading -cause of death worldwide. Many individuals express suicidal thoughts on social -media, offering a vital opportunity for early detection through advanced -machine learning techniques. 
The identification of suicidal ideation in social -media text is improved by utilising a hybrid framework that integrates -Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory -(BiLSTM), enhanced with an attention mechanism. To enhance the interpretability -of the model's predictions, Explainable AI (XAI) methods are applied, with a -particular focus on SHapley Additive exPlanations (SHAP), are incorporated. At -first, the model managed to reach an accuracy of 92.81%. By applying -fine-tuning and early stopping techniques, the accuracy improved to 94.29%. The -SHAP analysis revealed key features influencing the model's predictions, such -as terms related to mental health struggles. This level of transparency boosts -the model's credibility while helping mental health professionals understand -and trust the predictions. This work highlights the potential for improving the -accuracy and interpretability of detecting suicidal tendencies, making a -valuable contribution to the progress of mental health monitoring systems. It -emphasizes the significance of blending powerful machine learning methods with -explainability to develop reliable and impactful mental health solutions. 
- -摘要:自殺意念偵測對於預防自殺至關重要,而自殺是全球主要的死亡原因。許多人在社群媒體上表達自殺念頭,這提供了透過進階機器學習技術進行早期偵測的重要機會。透過整合卷積神經網路 (CNN) 和雙向長短期記憶 (BiLSTM) 的混合架構,並加入注意力機制,可以提升在社群媒體文字中辨識自殺意念的能力。為了加強模型預測的可解釋性,我們採用可解釋人工智慧 (XAI) 方法,特別著重於 SHapley 加法解釋 (SHAP)。一開始,模型成功達到 92.81% 的準確度。透過套用微調和早期停止技術,準確度提升至 94.29%。SHAP 分析揭露了影響模型預測的關鍵特徵,例如與心理健康困境相關的詞彙。這種透明度提升了模型的可信度,同時協助心理健康專業人員理解和信賴預測結果。這項工作突顯了提升偵測自殺傾向的準確度和可解釋性的潛力,為心理健康監控系統的進展做出寶貴的貢獻。它強調了將強大的機器學習方法與可解釋性相結合以開發可靠且有影響力的心理健康解決方案的重要性。 - -##### **SEANN: A Domain-Informed Neural Network for Epidemiological Insights** -2501.10273v1 by Jean-Baptiste Guimbaud, Marc Plantevit, Léa Maître, Rémy Cazabet - -In epidemiology, traditional statistical methods such as logistic regression, -linear regression, and other parametric models are commonly employed to -investigate associations between predictors and health outcomes. However, -non-parametric machine learning techniques, such as deep neural networks -(DNNs), coupled with explainable AI (XAI) tools, offer new opportunities for -this task. Despite their potential, these methods face challenges due to the -limited availability of high-quality, high-quantity data in this field. To -address these challenges, we introduce SEANN, a novel approach for informed -DNNs that leverages a prevalent form of domain-specific knowledge: Pooled -Effect Sizes (PES). PESs are commonly found in published Meta-Analysis studies, -in different forms, and represent a quantitative form of a scientific -consensus. By direct integration within the learning procedure using a custom -loss, we experimentally demonstrate significant improvements in the -generalizability of predictive performances and the scientific plausibility of -extracted relationships compared to a domain-knowledge agnostic neural network -in a scarce and noisy data setting. 
- -摘要:在流行病學中,傳統的統計方法,例如邏輯迴歸、線性迴歸和其他參數模型通常用於調查預測因子與健康結果之間的關聯。然而,非參數機器學習技術,例如深度神經網路 (DNN),結合可解釋的 AI (XAI) 工具,為這項任務提供了新的機會。儘管這些方法具有潛力,但由於該領域缺乏高品質、高數量資料,因此這些方法面臨挑戰。為了應對這些挑戰,我們引入了 SEANN,這是一種新穎的方法,用於獲取知識的 DNN,它利用了一種流行的領域特定知識形式:彙總效應量 (PES)。PES 通常以不同的形式出現在已發表的 Meta 分析研究中,並代表科學共識的量化形式。通過使用自訂損失函數直接整合在學習程序中,我們以實驗方式證明了預測效能的概括性以及與從缺乏領域知識的神經網路中提取的關係相比,科學合理性的顯著提升,且是在稀少且有雜訊的資料設定中。 - -##### **Artificial Intelligence-Driven Clinical Decision Support Systems** -2501.09628v2 by Muhammet Alkan, Idris Zakariyya, Samuel Leighton, Kaushik Bhargav Sivangi, Christos Anagnostopoulos, Fani Deligianni - -As artificial intelligence (AI) becomes increasingly embedded in healthcare -delivery, this chapter explores the critical aspects of developing reliable and -ethical Clinical Decision Support Systems (CDSS). Beginning with the -fundamental transition from traditional statistical models to sophisticated -machine learning approaches, this work examines rigorous validation strategies -and performance assessment methods, including the crucial role of model -calibration and decision curve analysis. The chapter emphasizes that creating -trustworthy AI systems in healthcare requires more than just technical -accuracy; it demands careful consideration of fairness, explainability, and -privacy. The challenge of ensuring equitable healthcare delivery through AI is -stressed, discussing methods to identify and mitigate bias in clinical -predictive models. The chapter then delves into explainability as a cornerstone -of human-centered CDSS. This focus reflects the understanding that healthcare -professionals must not only trust AI recommendations but also comprehend their -underlying reasoning. The discussion advances in an analysis of privacy -vulnerabilities in medical AI systems, from data leakage in deep learning -models to sophisticated attacks against model explanations. 
The text explores -privacy-preservation strategies such as differential privacy and federated -learning, while acknowledging the inherent trade-offs between privacy -protection and model performance. This progression, from technical validation -to ethical considerations, reflects the multifaceted challenges of developing -AI systems that can be seamlessly and reliably integrated into daily clinical -practice while maintaining the highest standards of patient care and data -protection. - -摘要:隨著人工智慧(AI)在醫療保健服務中日益普及,本章探討了開發可靠且符合道德的臨床決策支援系統 (CDSS) 的關鍵面向。從傳統統計模型轉變到複雜機器學習方法的基本原理開始,這項工作探討了嚴謹的驗證策略和效能評估方法,包括模型校準和決策曲線分析的關鍵角色。本章強調,在醫療保健中建立值得信賴的 AI 系統不僅需要技術準確性;它需要仔細考量公平性、可解釋性和隱私。本章強調了透過 AI 確保公平醫療保健服務的挑戰,並討論了識別和減輕臨床預測模型中偏差的方法。接著,本章深入探討可解釋性作為以人為中心的 CDSS 的基石。這種關注反映了對醫療保健專業人員不僅必須信任 AI 建議,還必須理解其背後推理的理解。討論進展到對醫療 AI 系統中隱私漏洞的分析,從深度學習模型中的資料外洩到針對模型解釋的複雜攻擊。本文探討了隱私保護策略,例如差分隱私和聯合學習,同時承認隱私保護和模型效能之間的固有權衡。從技術驗證到道德考量,這種進展反映了開發 AI 系統的多方面挑戰,這些系統可以無縫且可靠地整合到日常臨床實務中,同時維持最高標準的患者照護和資料保護。 - -##### **MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis** -2501.06887v1 by Sadia Kamal, Tim Oates - -As deep learning models gain attraction in medical data, ensuring transparent -and trustworthy decision-making is essential. In skin cancer diagnosis, while -advancements in lesion detection and classification have improved accuracy, the -black-box nature of these methods poses challenges in understanding their -decision processes, leading to trust issues among physicians. This study -leverages the CLIP (Contrastive Language-Image Pretraining) model, trained on -different skin lesion datasets, to capture meaningful relationships between -visual features and diagnostic criteria terms. To further enhance transparency, -we propose a method called MedGrad E-CLIP, which builds on gradient-based -E-CLIP by incorporating a weighted entropy mechanism designed for complex -medical imaging like skin lesions. 
This approach highlights critical image -regions linked to specific diagnostic descriptions. The developed integrated -pipeline not only classifies skin lesions by matching corresponding -descriptions but also adds an essential layer of explainability developed -especially for medical data. By visually explaining how different features in -an image relates to diagnostic criteria, this approach demonstrates the -potential of advanced vision-language models in medical image analysis, -ultimately improving transparency, robustness, and trust in AI-driven -diagnostic systems. - -摘要:随着深度学习模型在医学数据中获得关注,确保透明且值得信赖的决策至关重要。在皮肤癌诊断中,虽然病灶检测和分类的进步提高了准确性,但这些方法的黑盒性质对理解其决策过程构成了挑战,导致医生之间的信任问题。本研究利用在不同皮肤病变数据集上训练的 CLIP(对比语言图像预训练)模型,以捕捉视觉特征和诊断标准术语之间的有意义关系。为了进一步提高透明度,我们提出了一种名为 MedGrad E-CLIP 的方法,该方法通过结合专为皮肤病变等复杂医学影像设计的加权熵机制,建立在基于梯度的 E-CLIP 之上。此方法突出了与特定诊断描述相关联的关键图像区域。开发的集成管道不仅通过匹配相应的描述对皮肤病变进行分类,还添加了一层专门为医学数据开发的基本可解释性。通过直观地解释图像中不同特征与诊断标准的关系,这种方法展示了高级视觉语言模型在医学图像分析中的潜力,最终提高了透明度、稳健性和对人工智能驱动的诊断系统的信任。 - -##### **Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis** -2501.02891v1 by Mary Ogbuka Kenneth, Foaad Khosmood, Abbas Edalat - -Humour styles can have either a negative or a positive impact on well-being. -Given the importance of these styles to mental health, significant research has -been conducted on their automatic identification. However, the automated -machine learning models used for this purpose are black boxes, making their -prediction decisions opaque. Clarity and transparency are vital in the field of -mental health. This paper presents an explainable AI (XAI) framework for -understanding humour style classification, building upon previous work in -computational humour analysis. Using the best-performing single model -(ALI+XGBoost) from prior research, we apply comprehensive XAI techniques to -analyse how linguistic, emotional, and semantic features contribute to humour -style classification decisions. 
Our analysis reveals distinct patterns in how -different humour styles are characterised and misclassified, with particular -emphasis on the challenges in distinguishing affiliative humour from other -styles. Through detailed examination of feature importance, error patterns, and -misclassification cases, we identify key factors influencing model decisions, -including emotional ambiguity, context misinterpretation, and target -identification. The framework demonstrates significant utility in understanding -model behaviour, achieving interpretable insights into the complex interplay of -features that define different humour styles. Our findings contribute to both -the theoretical understanding of computational humour analysis and practical -applications in mental health, content moderation, and digital humanities -research. - -摘要:幽默風格對幸福感可能產生負面或正面的影響。 -鑑於這些風格對心理健康的重要性,已經對其自動識別進行了大量研究。然而,用於此目的的自動機器學習模型是黑盒子,使得其預測決策不透明。清晰度和透明度在心理健康領域至關重要。本文提出了一個可解釋的 AI (XAI) 框架,用於理解幽默風格分類,建立在計算幽默分析的先前工作之上。使用先前研究中表現最好的單一模型 (ALI+XGBoost),我們應用全面的 XAI 技術來分析語言、情緒和語義特徵如何影響幽默風格分類決策。我們的分析揭示了不同幽默風格如何被表徵和錯誤分類的不同模式,特別強調了區分聯屬幽默與其他風格的挑戰。通過仔細檢查特徵重要性、錯誤模式和錯誤分類案例,我們確定了影響模型決策的關鍵因素,包括情緒模糊、情境誤解和目標識別。該框架展示了在理解模型行為方面的顯著效用,實現了對定義不同幽默風格的特徵之間複雜相互作用的可解釋見解。我們的發現有助於計算幽默分析的理論理解和心理健康、內容審核和數字人文研究中的實際應用。 - -##### **The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support** -2412.20068v1 by Alessandro De Grandi, Federico Ravenda, Andrea Raballo, Fabio Crestani - -The increasing demand for mental health services has highlighted the need for -innovative solutions, particularly in the realm of psychological conversational -AI, where the availability of sensitive data is scarce. 
In this work, we -explored the development of a system tailored for mental health support with a -novel approach to psychological assessment based on explainable emotional -profiles in combination with empathetic conversational models, offering a -promising tool for augmenting traditional care, particularly where immediate -expertise is unavailable. Our work can be divided into two main parts, -intrinsecaly connected to each other. First, we present RACLETTE, a -conversational system that demonstrates superior emotional accuracy compared to -state-of-the-art benchmarks in both understanding users' emotional states and -generating empathetic responses during conversations, while progressively -building an emotional profile of the user through their interactions. Second, -we show how the emotional profiles of a user can be used as interpretable -markers for mental health assessment. These profiles can be compared with -characteristic emotional patterns associated with different mental disorders, -providing a novel approach to preliminary screening and support. - -摘要:隨著對心理健康服務需求的增加,凸顯了創新解決方案的需求,特別是在心理對話式人工智慧領域,那裡缺乏敏感資料。在這項工作中,我們探索了開發一個針對心理健康支持的系統,採用一種基於可解釋的情緒特徵的新方法進行心理評估,結合同理心對話模式,提供了一個有前途的工具,用於擴充傳統照護,特別是在無法立即獲得專業知識的情況下。我們的工作可以分為兩個主要部分,彼此內在相關。首先,我們展示了 RACLETTE,一個對話系統,與最先進的基準相比,在理解使用者情緒狀態和在對話中產生同理心回應方面表現出優越的情緒準確性,同時透過他們的互動逐漸建立使用者的情緒特徵。其次,我們展示了使用者的情緒特徵如何可用作心理健康評估的可解釋標記。這些特徵可以與與不同心理疾病相關的典型情緒模式進行比較,提供了一種初步篩選和支持的新方法。 - -##### **A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation** -2412.19688v1 by Jana Zakall, Birgit Pohn, Antonia Graf, Daniel Kovatchki, Arezoo Borji, Ragib Shahriar Islam, Hossam Haick, Heinz Strohmer, Sepideh Hatamikia - -Artificial intelligence (AI) has emerged as a powerful tool to enhance -decision-making and optimize treatment protocols in in vitro fertilization -(IVF). In particular, AI shows significant promise in supporting -decision-making during the ovarian stimulation phase of the IVF process. 
This -review evaluates studies focused on the applications of AI combined with -medical imaging in ovarian stimulation, examining methodologies, outcomes, and -current limitations. Our analysis of 13 studies on this topic reveals that, -reveal that while AI algorithms demonstrated notable potential in predicting -optimal hormonal dosages, trigger timing, and oocyte retrieval outcomes, the -medical imaging data utilized predominantly came from two-dimensional (2D) -ultrasound which mainly involved basic quantifications, such as follicle size -and number, with limited use of direct feature extraction or advanced image -analysis techniques. This points to an underexplored opportunity where advanced -image analysis approaches, such as deep learning, and more diverse imaging -modalities, like three-dimensional (3D) ultrasound, could unlock deeper -insights. Additionally, the lack of explainable AI (XAI) in most studies raises -concerns about the transparency and traceability of AI-driven decisions - key -factors for clinical adoption and trust. Furthermore, many studies relied on -single-center designs and small datasets, which limit the generalizability of -their findings. This review highlights the need for integrating advanced -imaging analysis techniques with explainable AI methodologies, as well as the -importance of leveraging multicenter collaborations and larger datasets. -Addressing these gaps has the potential to enhance ovarian stimulation -management, paving the way for efficient, personalized, and data-driven -treatment pathways that improve IVF outcomes. 
- -摘要:人工智慧(AI)已成為增強體外受精(IVF)決策制定和優化治療方案的強大工具。特別是,AI 在支持 IVF 過程中卵巢刺激階段的決策制定方面顯示出顯著的前景。本綜述評估了專注於 AI 結合卵巢刺激中的醫學影像應用、檢驗方法、結果和當前限制的研究。我們對 13 項關於此主題的研究分析顯示,雖然 AI 演算法在預測最佳荷爾蒙劑量、觸發時機和卵子取出結果方面表現出顯著的潛力,但所利用的醫學影像數據主要來自於二次元(2D)超音波,而二次元超音波主要涉及基本量化,例如濾泡大小和數量,且有限使用直接特徵提取或進階影像分析技術。這指向一個尚未探索的機會,例如深度學習等進階影像分析方法,以及更多元的影像模式,例如三維(3D)超音波,可以解鎖更深入的見解。此外,大多數研究缺乏可解釋 AI(XAI),這引起了人們對 AI 驅動決策的透明度和可追溯性的擔憂,而透明度和可追溯性是臨床採用和信任的關鍵因素。此外,許多研究依賴於單中心設計和小型數據集,這限制了其發現的普遍性。本綜述強調了將進階影像分析技術與可解釋 AI 方法整合起來的必要性,以及利用多中心合作和大型數據集的重要性。解決這些差距有可能增強卵巢刺激管理,為有效、個人化和數據驅動的治療途徑鋪平道路,進而改善 IVF 結果。 - -##### **Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models** -2412.17527v1 by Badaru I. Olumuyiwa, The Anh Han, Zia U. Shamszaman - -This research presents an innovative approach to cancer diagnosis and -prediction using explainable Artificial Intelligence (XAI) and deep learning -techniques. With cancer causing nearly 10 million deaths globally in 2020, -early and accurate diagnosis is crucial. Traditional methods often face -challenges in cost, accuracy, and efficiency. Our study develops an AI model -that provides precise outcomes and clear insights into its decision-making -process, addressing the "black box" problem of deep learning models. By -employing XAI techniques, we enhance interpretability and transparency, -building trust among healthcare professionals and patients. Our approach -leverages neural networks to analyse extensive datasets, identifying patterns -for cancer detection. This model has the potential to revolutionise diagnosis -by improving accuracy, accessibility, and clarity in medical decision-making, -possibly leading to earlier detection and more personalised treatment -strategies. Furthermore, it could democratise access to high-quality -diagnostics, particularly in resource-limited settings, contributing to global -health equity. 
The model's applications extend beyond cancer diagnosis, -potentially transforming various aspects of medical decision-making and saving -millions of lives worldwide. - -摘要:本研究提出了一個創新的癌症診斷和預測方法,使用可解釋的人工智慧 (XAI) 和深度學習技術。由於癌症在 2020 年造成全球近 1,000 萬人死亡,因此早期準確的診斷至關重要。傳統方法通常面臨成本、準確性和效率方面的挑戰。我們的研究開發了一個 AI 模型,它提供精確的結果並清楚地了解其決策過程,解決了深度學習模型的「黑箱」問題。通過採用 XAI 技術,我們增強了解釋性和透明度,在醫療專業人員和患者之間建立信任。我們的做法利用神經網路分析廣泛的數據集,識別癌症檢測模式。這個模型有可能通過提高醫療決策的準確性、可及性和清晰度來革新診斷,可能導致更早的檢測和更個性化的治療策略。此外,它可以使更多人獲得高品質的診斷,特別是在資源有限的環境中,有助於全球健康公平。該模型的應用範圍不僅限於癌症診斷,還可能轉變醫療決策的各個方面,並拯救全球數百萬人的生命。 - -##### **Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG** -2412.16086v2 by Hasan Md Tusfiqur Alam, Devansh Srivastav, Md Abdul Kadir, Daniel Sonntag - -Deep learning has advanced medical image classification, but interpretability -challenges hinder its clinical adoption. This study enhances interpretability -in Chest X-ray (CXR) classification by using concept bottleneck models (CBMs) -and a multi-agent Retrieval-Augmented Generation (RAG) system for report -generation. By modeling relationships between visual features and clinical -concepts, we create interpretable concept vectors that guide a multi-agent RAG -system to generate radiology reports, enhancing clinical relevance, -explainability, and transparency. Evaluation of the generated reports using an -LLM-as-a-judge confirmed the interpretability and clinical utility of our -model's outputs. On the COVID-QU dataset, our model achieved 81% classification -accuracy and demonstrated robust report generation performance, with five key -metrics ranging between 84% and 90%. This interpretable multi-agent framework -bridges the gap between high-performance AI and the explainability required for -reliable AI-driven CXR analysis in clinical settings. Our code is available at -https://github.com/tifat58/IRR-with-CBM-RAG.git. 
- -摘要:深度學習已提升醫學影像分類,但可解釋性挑戰阻礙其臨床應用。本研究透過使用概念瓶頸模型 (CBM) 和多代理檢索增強生成 (RAG) 系統進行報告生成,來增強胸部 X 光 (CXR) 分類的可解釋性。透過建模視覺特徵與臨床概念之間的關係,我們建立可解釋的概念向量,引導多代理 RAG 系統生成放射報告,增強臨床相關性、可解釋性和透明度。使用 LLM 作為評審員對生成報告進行評估,確認了我們模型輸出的可解釋性和臨床效用。在 COVID-QU 資料集上,我們的模型達到了 81% 的分類準確率,並展示了穩健的報告生成效能,五項關鍵指標介於 84% 至 90% 之間。這個可解釋的多代理架構彌合了高性能 AI 與臨床環境中可靠的 AI 驅動 CXR 分析所需的解釋性之間的差距。我們的程式碼可於 https://github.com/tifat58/IRR-with-CBM-RAG.git 取得。 - -##### **Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models** -2412.15748v1 by Shamus Sim, Tyrone Chen - -Background: Despite the current ubiquity of Large Language Models (LLMs) -across the medical domain, there is a surprising lack of studies which address -their reasoning behaviour. We emphasise the importance of understanding -reasoning behaviour as opposed to high-level prediction accuracies, since it is -equivalent to explainable AI (XAI) in this context. In particular, achieving -XAI in medical LLMs used in the clinical domain will have a significant impact -across the healthcare sector. Results: Therefore, we define the concept of -reasoning behaviour in the specific context of medical LLMs. We then categorise -and discuss the current state of the art of methods which evaluate reasoning -behaviour in medical LLMs. Finally, we propose theoretical frameworks which can -empower medical professionals or machine learning engineers to gain insight -into the low-level reasoning operations of these previously obscure models. 
-Conclusion: The subsequent increased transparency and trust in medical machine -learning models by clinicians as well as patients will accelerate the -integration, application as well as further development of medical AI for the -healthcare system as a whole - -摘要:背景:儘管大型語言模型 (LLM) 目前在醫療領域無所不在,但令人驚訝的是,探討其推理行為的研究卻相當缺乏。我們強調了解推理行為而非高層級的預測準確度非常重要,因為在這種情況下,這等同於可解釋 AI (XAI)。尤其是在臨床領域中使用的醫療 LLM 中實現 XAI,將對整個醫療保健產業產生重大影響。結果:因此,我們在醫療 LLM 的特定背景下定義了推理行為的概念。接著我們分類並探討當前評估醫療 LLM 中推理行為的方法的最新技術。最後,我們提出理論架構,讓醫療專業人員或機器學習工程師得以深入了解這些先前模糊模型的低層級推理運算。結論:臨床醫生和患者對醫療機器學習模型的透明度和信任度隨之提升,將加速醫療 AI 在整個醫療保健系統中的整合、應用和進一步發展。 - -##### **Cognition Chain for Explainable Psychological Stress Detection on Social Media** -2412.14009v1 by Xin Wang, Boyan Gao, Yi Dai, Lei Cao, Liang Zhao, Yibo Yang, David Clifton - -Stress is a pervasive global health issue that can lead to severe mental -health problems. Early detection offers timely intervention and prevention of -stress-related disorders. The current early detection models perform "black -box" inference suffering from limited explainability and trust which blocks the -real-world clinical application. Thanks to the generative properties introduced -by the Large Language Models (LLMs), the decision and the prediction from such -models are semi-interpretable through the corresponding description. However, -the existing LLMs are mostly trained for general purposes without the guidance -of psychological cognitive theory. To this end, we first highlight the -importance of prior theory with the observation of performance boosted by the -chain-of-thoughts tailored for stress detection. This method termed Cognition -Chain explicates the generation of stress through a step-by-step cognitive -perspective based on cognitive appraisal theory with a progress pipeline: -Stimulus $\rightarrow$ Evaluation $\rightarrow$ Reaction $\rightarrow$ Stress -State, guiding LLMs to provide comprehensive reasoning explanations. 
We further -study the benefits brought by the proposed Cognition Chain format by utilising -it as a synthetic dataset generation template for LLMs instruction-tuning and -introduce CogInstruct, an instruction-tuning dataset for stress detection. This -dataset is developed using a three-stage self-reflective annotation pipeline -that enables LLMs to autonomously generate and refine instructional data. By -instruction-tuning Llama3 with CogInstruct, we develop CogLLM, an explainable -stress detection model. Evaluations demonstrate that CogLLM achieves -outstanding performance while enhancing explainability. Our work contributes a -novel approach by integrating cognitive theories into LLM reasoning processes, -offering a promising direction for future explainable AI research. - -摘要:壓力是一個普遍的全球性健康問題,可能會導致嚴重的精神 -健康問題。早期發現提供及時的干預和預防 -壓力相關疾病。目前的早期發現模型執行「黑 -盒子」推論,存在可解釋性和信任度有限的問題,阻礙了 -現實世界的臨床應用。多虧了大型語言模型 (LLM) 引入的生成屬性,此類 -模型的決策和預測通過對應描述具有半可解釋性。然而, -現有的 LLM 主要針對一般用途進行訓練,沒有心理認知理論的指導。為此,我們首先強調 -先驗理論的重要性,並觀察到針對壓力檢測量身定制的思想鏈提升了性能。這種方法稱為認知 -鏈通過基於認知評估理論的循序漸進的認知視角闡明了壓力的產生,並具有進度管道: -刺激 $\rightarrow$ 評估 $\rightarrow$ 反應 $\rightarrow$ 壓力 -狀態,指導 LLM 提供全面的推理解釋。我們進一步 -通過將其用作 LLM 指令調整的合成數據集生成模板來研究所提出的認知鏈格式帶來的優點,並介紹 CogInstruct,這是一個針對壓力檢測的指令調整數據集。這個 -數據集是使用一個三階段的自省標註管道開發的,使 LLM 能夠自主生成和優化指令數據。通過 -使用 CogInstruct 對 Llama3 進行指令調整,我們開發了 CogLLM,這是一個可解釋的 -壓力檢測模型。評估表明,CogLLM 在提高可解釋性的同時實現了出色的性能。我們的研究通過將認知理論整合到 LLM 推理過程中,提出了一種新穎的方法, -為未來的可解釋人工智能研究提供了一個有希望的方向。 - -##### **2-Factor Retrieval for Improved Human-AI Decision Making in Radiology** -2412.00372v1 by Jim Solomon, Laleh Jalilian, Alexander Vilesov, Meryl Mathew, Tristan Grogan, Arash Bedayat, Achuta Kadambi - -Human-machine teaming in medical AI requires us to understand to what degree -a trained clinician should weigh AI predictions. 
While previous work has shown -the potential of AI assistance at improving clinical predictions, existing -clinical decision support systems either provide no explainability of their -predictions or use techniques like saliency and Shapley values, which do not -allow for physician-based verification. To address this gap, this study -compares previously used explainable AI techniques with a newly proposed -technique termed '2-factor retrieval (2FR)', which is a combination of -interface design and search retrieval that returns similarly labeled data -without processing this data. This results in a 2-factor security blanket -where: (a) correct images need to be retrieved by the AI; and (b) humans should -associate the retrieved images with the current pathology under test. We find -that when tested on chest X-ray diagnoses, 2FR leads to increases in clinician -accuracy, with particular improvements when clinicians are radiologists and -have low confidence in their decision. Our results highlight the importance of -understanding how different modes of human-AI decision making may impact -clinician accuracy in clinical decision support systems. 
- -摘要:人機協作在醫療 AI 中,需要我們理解受過訓練的臨床醫生在多大程度上應重視 AI 預測。雖然先前的研究顯示 AI 輔助在改善臨床預測方面的潛力,但現有的臨床決策支援系統,要不就沒有提供預測的可解釋性,要不就是使用像顯著性和 Shapley 值之類的技術,這些技術不允許基於醫生的驗證。為了解決這個差距,本研究將先前使用的可解釋 AI 技術與一種新提出的稱為「2 因子檢索 (2FR)」的技術進行比較,後者是一種介面設計和搜尋檢索的組合,它會傳回標籤相似的資料,而不會處理這些資料。這會產生一個 2 因子安全機制,其中:(a) 正確的影像需要由 AI 檢索;(b) 人類應將檢索的影像與正在測試中的病理聯想起來。我們發現,當在胸部 X 光診斷上進行測試時,2FR 會提高臨床醫生的準確度,特別是在臨床醫生是放射科醫生且對其決策信心不足時,會有顯著的改善。我們的結果強調了理解人機決策的不同模式如何影響臨床醫生在臨床決策支援系統中的準確性的重要性。 - -##### **Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance** -2411.19356v1 by Philipp Brauner, Felix Glawe, Gian Luca Liehner, Luisa Vervier, Martina Ziefle - -Understanding public perception of artificial intelligence (AI) and the -tradeoffs between potential risks and benefits is crucial, as these perceptions -might shape policy decisions, influence innovation trajectories for successful -market strategies, and determine individual and societal acceptance of AI -technologies. Using a representative sample of 1100 participants from Germany, -this study examines mental models of AI. Participants quantitatively evaluated -71 statements about AI's future capabilities (e.g., autonomous driving, medical -care, art, politics, warfare, and societal divides), assessing the expected -likelihood of occurrence, perceived risks, benefits, and overall value. We -present rankings of these projections alongside visual mappings illustrating -public risk-benefit tradeoffs. While many scenarios were deemed likely, -participants often associated them with high risks, limited benefits, and low -overall value. Across all scenarios, 96.4% ($r^2=96.4\%$) of the variance in -value assessment can be explained by perceived risks ($\beta=-.504$) and -perceived benefits ($\beta=+.710$), with no significant relation to expected -likelihood. 
Demographics and personality traits influenced perceptions of -risks, benefits, and overall evaluations, underscoring the importance of -increasing AI literacy and tailoring public information to diverse user needs. -These findings provide actionable insights for researchers, developers, and -policymakers by highlighting critical public concerns and individual factors -essential to align AI development with individual values. - -摘要:了解公眾對人工智慧 (AI) 的認知以及潛在風險與好處之間的權衡至關重要,因為這些認知可能會影響政策決策、影響成功市場策略的創新軌跡,並決定個人和社會對 AI 技術的接受度。本研究使用來自德國的 1100 名參與者的代表性樣本,探討了 AI 的心智模型。參與者對 71 項關於 AI 未來能力的陳述(例如,自動駕駛、醫療保健、藝術、政治、戰爭和社會分歧)進行了定量評估,評估預期的發生可能性、感知風險、好處和整體價值。我們展示了這些預測的排名,並附上視覺化映射,說明了公眾的風險收益權衡。儘管許多場景被認為是可能的,但參與者通常將它們與高風險、有限的好處和低整體價值聯繫起來。在所有場景中,96.4% ($r^2=96.4\%$) 的價值評估差異可以用感知風險 ($\beta=-.504$) 和感知好處 ($\beta=+.710$) 來解釋,與預期的可能性沒有顯著關係。人口統計和人格特質影響了對風險、好處和整體評估的看法,這凸顯了提高 AI 素養和根據不同的使用者需求調整公共資訊的重要性。這些發現通過強調關鍵的公共關注和與個人價值觀一致的 AI 開發必不可少的個人因素,為研究人員、開發人員和政策制定者提供了可行的見解。 - -##### **Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset** -2411.17645v2 by Yujie Dai, Brian Sullivan, Axel Montout, Amy Dillon, Chris Waller, Peter Acs, Rachel Denholm, Philip Williams, Alastair D Hay, Raul Santos-Rodriguez, Andrew Dowsey - -The use of machine learning and AI on electronic health records (EHRs) holds -substantial potential for clinical insight. However, this approach faces -challenges due to data heterogeneity, sparsity, temporal misalignment, and -limited labeled outcomes. In this context, we leverage a linked EHR dataset of -approximately one million de-identified individuals from Bristol, North -Somerset, and South Gloucestershire, UK, to characterize urinary tract -infections (UTIs). We implemented a data pre-processing and curation pipeline -that transforms the raw EHR data into a structured format suitable for -developing predictive models focused on data fairness, accountability and -transparency. 
Given the limited availability and biases of ground truth UTI -outcomes, we introduce a UTI risk estimation framework informed by clinical -expertise to estimate UTI risk across individual patient timelines. Pairwise -XGBoost models are trained using this framework to differentiate UTI risk -categories with explainable AI techniques applied to identify key predictors -and support interpretability. Our findings reveal differences in clinical and -demographic predictors across risk groups. While this study highlights the -potential of AI-driven insights to support UTI clinical decision-making, -further investigation of patient sub-strata and extensive validation are needed -to ensure robustness and applicability in clinical practice. - -摘要:電子健康紀錄 (EHR) 中機器學習和 AI 的使用對於臨床見解具有相當大的潛力。然而,由於資料異質性、稀疏性、時間錯位和標籤結果有限,此方法面臨挑戰。在此背景下,我們利用來自英國布里斯托、北薩默塞特和南格洛斯特郡約一百萬名去識別個人連結的 EHR 資料集,來描述尿路感染 (UTI)。我們實施了將原始 EHR 資料轉換為結構化格式的資料前處理和整理管線,適合開發專注於資料公平性、問責制和透明度的預測模型。鑑於 UTI 真實結果的可用性有限和偏差,我們引入了由臨床專業知識告知的 UTI 風險評估架構,以估計個別患者時間軸上的 UTI 風險。成對的 XGBoost 模型使用此架構進行訓練,以區分 UTI 風險類別,並應用可解釋的 AI 技術來識別關鍵預測因子並支持可解釋性。我們的研究結果揭示了不同風險群組在臨床和人口統計預測因子上的差異。雖然這項研究強調了 AI 驅動見解在支援 UTI 臨床決策制定方面的潛力,但仍需要進一步調查患者子群體和廣泛驗證,以確保在臨床實務中的穩健性和適用性。 - -##### **Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care** -2411.11774v1 by Jeffrey N. Clark, Matthew Wragg, Emily Nielsen, Miquel Perello-Nieto, Nawid Keshtmand, Michael Ambler, Shiv Sharma, Christopher P. Bourdeaux, Amberly Brigden, Raul Santos-Rodriguez - -There is a growing need to understand how digital systems can support -clinical decision-making, particularly as artificial intelligence (AI) models -become increasingly complex and less human-interpretable. This complexity -raises concerns about trustworthiness, impacting safe and effective adoption of -such technologies. 
Improved understanding of decision-making processes and -requirements for explanations coming from decision support tools is a vital -component in providing effective explainable solutions. This is particularly -relevant in the data-intensive, fast-paced environments of intensive care units -(ICUs). To explore these issues, group interviews were conducted with seven ICU -clinicians, representing various roles and experience levels. Thematic analysis -revealed three core themes: (T1) ICU decision-making relies on a wide range of -factors, (T2) the complexity of patient state is challenging for shared -decision-making, and (T3) requirements and capabilities of AI decision support -systems. We include design recommendations from clinical input, providing -insights to inform future AI systems for intensive care. - -摘要:隨著人工智慧 (AI) 模型變得越來越複雜,且越來越難以被人理解,了解數位系統如何支援臨床決策的需求也日益增加。這種複雜性引發了對可信度的疑慮,影響了此類技術的安全且有效採用。改善對決策制定流程的理解,以及對決策支援工具所提供說明的要求,是提供有效可解釋解決方案的重要組成部分。這在資料密集、快節奏的加護病房 (ICU) 環境中特別相關。為了探討這些問題,對七位 ICU 臨床醫師進行了小組訪談,這些醫師代表了不同的角色和經驗層級。主題分析揭露了三個核心主題:(T1) ICU 決策制定依賴於廣泛的因素,(T2) 病患狀態的複雜性對共同決策制定構成挑戰,以及 (T3) AI 決策支援系統的要求和能力。我們納入了臨床輸入的設計建議,提供見解以提供資訊給未來用於加護的 AI 系統。 - -##### **Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning** -2411.10255v1 by Mohammed Yaseen Jabarulla, Theodor Uden, Thomas Jack, Philipp Beerbaum, Steffen Oeltze-Jafra - -Pediatric heart diseases present a broad spectrum of congenital and acquired -diseases. More complex congenital malformations require a differentiated and -multimodal decision-making process, usually including echocardiography as a -central imaging method. Artificial intelligence (AI) offers considerable -promise for clinicians by facilitating automated interpretation of pediatric -echocardiography data. 
However, adapting AI technologies for pediatric -echocardiography analysis has challenges such as limited public data -availability, data privacy, and AI model transparency. Recently, researchers -have focused on disruptive technologies, such as federated learning (FL) and -explainable AI (XAI), to improve automatic diagnostic and decision support -workflows. This study offers a comprehensive overview of the limitations and -opportunities of AI in pediatric echocardiography, emphasizing the synergistic -workflow and role of XAI and FL, identifying research gaps, and exploring -potential future developments. Additionally, three relevant clinical use cases -demonstrate the functionality of XAI and FL with a focus on (i) view -recognition, (ii) disease classification, (iii) segmentation of cardiac -structures, and (iv) quantitative assessment of cardiac function. - -摘要:小兒心臟疾病呈現先天性與後天性疾病的廣泛光譜。較複雜的先天性畸形需要一個差異化且多模式的決策過程,通常包括超音波檢查作為主要的影像方法。人工智慧 (AI) 為臨床醫生提供了相當大的希望,因為它可以促進小兒超音波檢查資料的自動化解讀。然而,將人工智慧技術應用於小兒超音波檢查分析有許多挑戰,例如有限的公開資料可用性、資料隱私和人工智慧模型透明度。最近,研究人員專注於破壞性技術,例如聯合學習 (FL) 和可解釋人工智慧 (XAI),以改善自動診斷和決策支援工作流程。本研究提供了人工智慧在小兒超音波檢查中的限制和機會的全面概述,強調了 XAI 和 FL 的協同工作流程和角色,找出研究差距並探討潛在的未來發展。此外,三個相關的臨床使用案例展示了 XAI 和 FL 的功能,重點在於 (i) 檢視辨識、(ii) 疾病分類、(iii) 心臟結構分割和 (iv) 心臟功能的量化評估。 - -##### **Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering** -2411.00916v2 by Mehdi Hosseini Chagahi, Saeed Mohammadi Dashtaki, Niloufar Delfan, Nadia Mohammadi, Alireza Samari, Behzad Moshiri, Md. Jalil Piran, Oliver Faust - -Osteoporosis is a common condition that increases fracture risk, especially -in older adults. Early diagnosis is vital for preventing fractures, reducing -treatment costs, and preserving mobility. However, healthcare providers face -challenges like limited labeled data and difficulties in processing medical -images. 
This study presents a novel multi-modal learning framework that -integrates clinical and imaging data to improve diagnostic accuracy and model -interpretability. The model utilizes three pre-trained networks-VGG19, -InceptionV3, and ResNet50-to extract deep features from X-ray images. These -features are transformed using PCA to reduce dimensionality and focus on the -most relevant components. A clustering-based selection process identifies the -most representative components, which are then combined with preprocessed -clinical data and processed through a fully connected network (FCN) for final -classification. A feature importance plot highlights key variables, showing -that Medical History, BMI, and Height were the main contributors, emphasizing -the significance of patient-specific data. While imaging features were -valuable, they had lower importance, indicating that clinical data are crucial -for accurate predictions. This framework promotes precise and interpretable -predictions, enhancing transparency and building trust in AI-driven diagnoses -for clinical integration. - -摘要:骨質疏鬆症是一種常見的疾病,會增加骨折的風險,特別是老年人。早期診斷對於預防骨折、降低治療成本和維持行動能力至關重要。然而,醫療保健提供者面臨著標記數據有限和處理醫學影像困難等挑戰。本研究提出了一個新穎的多模式學習框架,該框架整合了臨床和影像數據,以提高診斷準確性和模型可解釋性。該模型利用三個預訓練的網路,VGG19、InceptionV3 和 ResNet50,從 X 射線影像中提取深度特徵。這些特徵使用 PCA 轉換以降低維度並專注於最相關的組成部分。基於聚類的選擇過程識別出最具代表性的組成部分,然後將這些組成部分與預處理的臨床數據結合,並通過全連接網路 (FCN) 進行最終分類。特徵重要性圖突出了關鍵變數,表明病史、BMI 和身高是主要貢獻因素,強調了患者特定數據的重要性。雖然影像特徵很有價值,但它們的重要性較低,這表明臨床數據對於準確預測至關重要。此框架促进了準確且可解釋的預測,提高了透明度,並建立了對 AI 驅動診斷在臨床整合中的信任。 - -##### **A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection** -2410.19898v1 by Muath Alsuhaibani, Ali Pourramezan Fard, Jian Sun, Farida Far Poor, Peter S. Pressman, Mohammad H. Mahoor - -This review paper explores recent advances in deep learning approaches for -non-invasive cognitive impairment detection. 
We examine various non-invasive -indicators of cognitive decline, including speech and language, facial, and -motoric mobility. The paper provides an overview of relevant datasets, -feature-extracting techniques, and deep-learning architectures applied to this -domain. We have analyzed the performance of different methods across modalities -and observed that speech and language-based methods generally achieved the -highest detection performance. Studies combining acoustic and linguistic -features tended to outperform those using a single modality. Facial analysis -methods showed promise for visual modalities but were less extensively studied. -Most papers focused on binary classification (impaired vs. non-impaired), with -fewer addressing multi-class or regression tasks. Transfer learning and -pre-trained language models emerged as popular and effective techniques, -especially for linguistic analysis. Despite significant progress, several -challenges remain, including data standardization and accessibility, model -explainability, longitudinal analysis limitations, and clinical adaptation. -Lastly, we propose future research directions, such as investigating -language-agnostic speech analysis methods, developing multi-modal diagnostic -systems, and addressing ethical considerations in AI-assisted healthcare. By -synthesizing current trends and identifying key obstacles, this review aims to -guide further development of deep learning-based cognitive impairment detection -systems to improve early diagnosis and ultimately patient outcomes. 
- -摘要:本篇評論探討了深度學習方法在非侵入式認知功能障礙檢測上的最新進展。我們檢視了各種非侵入式的認知衰退指標,包括語言和語言、面部和運動機能。本文概述了與此領域相關的資料集、特徵提取技術和深度學習架構。我們分析了不同方法在不同方式上的表現,並觀察到基於語言和語言的方法通常能達到最高的檢測表現。結合聲學和語言特徵的研究往往優於使用單一方式的研究。面部分析方法顯示出視覺方式的潛力,但研究較少。大多數論文專注於二元分類(受損與未受損),較少探討多類或回歸任務。遷移學習和預訓練語言模型已成為流行且有效的技術,特別是對於語言分析。儘管取得了重大進展,但仍存在一些挑戰,包括資料標準化和可及性、模型可解釋性、縱向分析限制和臨床適應性。最後,我們提出了未來的研究方向,例如調查與語言無關的語音分析方法、開發多模式診斷系統,以及解決人工智慧輔助醫療保健中的倫理考量。透過綜合目前的趨勢和找出關鍵障礙,本篇評論旨在引導深度學習為基礎的認知功能障礙檢測系統的進一步發展,以改善早期診斷,並最終改善患者的治療結果。 - -##### **An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems** -2410.17504v1 by Shruthi Chari - -Explainable Artificial Intelligence (AI) focuses on helping humans understand -the working of AI systems or their decisions and has been a cornerstone of AI -for decades. Recent research in explainability has focused on explaining the -workings of AI models or model explainability. There have also been several -position statements and review papers detailing the needs of end-users for -user-centered explainability but fewer implementations. Hence, this thesis -seeks to bridge some gaps between model and user-centered explainability. We -create an explanation ontology (EO) to represent literature-derived explanation -types via their supporting components. We implement a knowledge-augmented -question-answering (QA) pipeline to support contextual explanations in a -clinical setting. Finally, we are implementing a system to combine explanations -from different AI methods and data modalities. Within the EO, we can represent -fifteen different explanation types, and we have tested these representations -in six exemplar use cases. We find that knowledge augmentations improve the -performance of base large language models in the contextualized QA, and the -performance is variable across disease groups. In the same setting, clinicians -also indicated that they prefer to see actionability as one of the main foci in -explanations. 
In our explanations combination method, we plan to use similarity -metrics to determine the similarity of explanations in a chronic disease -detection setting. Overall, through this thesis, we design methods that can -support knowledge-enabled explanations across different use cases, accounting -for the methods in today's AI era that can generate the supporting components -of these explanations and domain knowledge sources that can enhance them. - -摘要:可解釋人工智慧(AI)專注於協助人類了解 AI 系統運作或其決策,數十年來一直是 AI 的基石。最近的可解釋性研究專注於解釋 AI 模型或模型可解釋性的運作。也有幾份立場聲明和評論論文詳細說明了最終使用者對以使用者為中心的可解釋性的需求,但實作較少。因此,本論文旨在彌補模型和以使用者為中心的可解釋性之間的一些差距。我們建立一個解釋本體(EO)以透過其支援元件來表示從文獻中衍生的解釋類型。我們實作一個知識增強的問答(QA)管線,以在臨床環境中支援情境解釋。最後,我們正在實作一個系統,以結合來自不同 AI 方法和資料模式的解釋。在 EO 中,我們可以表示 15 種不同的解釋類型,並且我們已在六個範例使用案例中測試這些表示。我們發現,知識增強改善了基礎大型語言模型在情境化 QA 中的效能,並且效能因疾病群組而異。在相同的環境中,臨床醫生也表示他們希望將可操作性視為解釋中的主要焦點之一。在我們的解釋組合方法中,我們計畫使用相似性指標來確定慢性病偵測環境中解釋的相似性。總體而言,透過本論文,我們設計了可以在不同使用案例中支援知識啟用解釋的方法,考量到當今 AI 時代中可以產生這些解釋的支援元件和可以增強這些解釋的領域知識來源的方法。 - -##### **Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study** -2410.16879v1 by Lukas Hughes-Noehrer, Leda Channer, Gabriel Strain, Gregory Yates, Richard Body, Caroline Jay - -Objectives: To investigate clinicians' attitudes towards current automated -interpretation of ECG and novel AI technologies and their perception of -computer-assisted interpretation. Materials and Methods: We conducted a series -of interviews with clinicians in the UK. Our study: (i) explores the potential -for AI, specifically future 'human-like' computing approaches, to facilitate -ECG interpretation and support clinical decision making, and (ii) elicits their -opinions about the importance of explainability and trustworthiness of AI -algorithms. 
Results: We performed inductive thematic analysis on interview -transcriptions from 23 clinicians and identified the following themes: (i) a -lack of trust in current systems, (ii) positive attitudes towards future AI -applications and requirements for these, (iii) the relationship between the -accuracy and explainability of algorithms, and (iv) opinions on education, -possible deskilling, and the impact of AI on clinical competencies. Discussion: -Clinicians do not trust current computerised methods, but welcome future 'AI' -technologies. Where clinicians trust future AI interpretation to be accurate, -they are less concerned that it is explainable. They also preferred ECG -interpretation that demonstrated the results of the algorithm visually. Whilst -clinicians do not fear job losses, they are concerned about deskilling and the -need to educate the workforce to use AI responsibly. Conclusion: Clinicians are -positive about the future application of AI in clinical decision-making. -Accuracy is a key factor of uptake and visualisations are preferred over -current computerised methods. This is viewed as a potential means of training -and upskilling, in contrast to the deskilling that automation might be -perceived to bring. - -摘要:目的:調查臨床醫生對目前自動化心電圖解讀和新的人工智慧技術的態度,以及他們對電腦輔助解讀的看法。材料和方法:我們對英國的臨床醫生進行了一系列訪談。我們的研究:(i) 探討人工智慧的潛力,特別是未來的「類人類」運算方法,以促進心電圖解讀並支持臨床決策制定,以及 (ii) 徵求他們對人工智慧演算法的可解釋性和可信度的看法。結果:我們對 23 位臨床醫生的訪談記錄進行了歸納主題分析,並找出以下主題:(i) 對目前系統缺乏信任,(ii) 對未來人工智慧應用和對這些應用的要求持正面態度,(iii) 演算法的準確性和可解釋性之間的關係,以及 (iv) 對教育、可能的技能退化,以及人工智慧對臨床能力的影響的看法。討論:臨床醫生不信任目前的電腦化方法,但歡迎未來的「人工智慧」技術。在臨床醫生相信未來的 AI 解讀準確的情況下,他們不太擔心它是否可解釋。他們也比較喜歡能以視覺方式呈現演算法結果的心電圖解讀。雖然臨床醫生不害怕失業,但他們擔心技能退化,以及需要教育員工負責任地使用人工智慧。結論:臨床醫生對人工智慧在臨床決策制定中的未來應用持正面態度。準確性是採用人工智慧的一個關鍵因素,而視覺化比目前的電腦化方法更受青睞。這被視為一種潛在的培訓和提升技能的方法,與自動化可能帶來的技能退化形成對比。 - -##### **Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer** -2410.15012v1 by Gesa Mittmann, Sara Laiouar-Pedari, Hendrik A. 
Mehrtens, Sarah Haggenmüller, Tabea-Clara Bucher, Tirtha Chanda, Nadine T. Gaisa, Mathias Wagner, Gilbert Georg Klamminger, Tilman T. Rau, Christina Neppl, Eva Maria Compérat, Andreas Gocht, Monika Hämmerle, Niels J. Rupp, Jula Westhoff, Irene Krücken, Maximillian Seidl, Christian M. Schürch, Marcus Bauer, Wiebke Solass, Yu Chun Tam, Florian Weber, Rainer Grobholz, Jaroslaw Augustyniak, Thomas Kalinski, Christian Hörner, Kirsten D. Mertz, Constanze Döring, Andreas Erbersdobler, Gabriele Deubler, Felix Bremmer, Ulrich Sommer, Michael Brodhun, Jon Griffin, Maria Sarah L. Lenon, Kiril Trpkov, Liang Cheng, Fei Chen, Angelique Levi, Guoping Cai, Tri Q. Nguyen, Ali Amin, Alessia Cimadamore, Ahmed Shabaik, Varsha Manucha, Nazeel Ahmad, Nidia Messias, Francesca Sanguedolce, Diana Taheri, Ezra Baraban, Liwei Jia, Rajal B. Shah, Farshid Siadat, Nicole Swarbrick, Kyung Park, Oudai Hassan, Siamak Sakhaie, Michelle R. Downes, Hiroshi Miyamoto, Sean R. Williamson, Tim Holland-Letz, Carolin V. Schneider, Jakob Nikolas Kather, Yuri Tolkach, Titus J. Brinker - -The aggressiveness of prostate cancer, the most common cancer in men -worldwide, is primarily assessed based on histopathological data using the -Gleason scoring system. While artificial intelligence (AI) has shown promise in -accurately predicting Gleason scores, these predictions often lack inherent -explainability, potentially leading to distrust in human-machine interactions. -To address this issue, we introduce a novel dataset of 1,015 tissue microarray -core images, annotated by an international group of 54 pathologists. The -annotations provide detailed localized pattern descriptions for Gleason grading -in line with international guidelines. Utilizing this dataset, we develop an -inherently explainable AI system based on a U-Net architecture that provides -predictions leveraging pathologists' terminology. 
This approach circumvents -post-hoc explainability methods while maintaining or exceeding the performance -of methods trained directly for Gleason pattern segmentation (Dice score: 0.713 -$\pm$ 0.003 trained on explanations vs. 0.691 $\pm$ 0.010 trained on Gleason -patterns). By employing soft labels during training, we capture the intrinsic -uncertainty in the data, yielding strong results in Gleason pattern -segmentation even in the context of high interobserver variability. With the -release of this dataset, we aim to encourage further research into segmentation -in medical tasks with high levels of subjectivity and to advance the -understanding of pathologists' reasoning processes. +|**2025-02-20**|**LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention**|Shang Yang et.al.|[2502.14866v1](http://arxiv.org/abs/2502.14866v1)|null| +|**2025-02-20**|**Interpretable Text Embeddings and Text Similarity Explanation: A Primer**|Juri Opitz et.al.|[2502.14862v1](http://arxiv.org/abs/2502.14862v1)|null| +|**2025-02-20**|**Aligning LLMs to Ask Good Questions A Case Study in Clinical Reasoning**|Shuyue Stella Li et.al.|[2502.14860v1](http://arxiv.org/abs/2502.14860v1)|null| +|**2025-02-20**|**FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling**|Weilin Zhao et.al.|[2502.14856v1](http://arxiv.org/abs/2502.14856v1)|null| +|**2025-02-20**|**Prompt-to-Leaderboard**|Evan Frick et.al.|[2502.14855v1](http://arxiv.org/abs/2502.14855v1)|null| +|**2025-02-20**|**CLIPPER: Compression enables long-context synthetic data generation**|Chau Minh Pham et.al.|[2502.14854v1](http://arxiv.org/abs/2502.14854v1)|null| +|**2025-02-20**|**GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks**|Jianwen Luo et.al.|[2502.14848v1](http://arxiv.org/abs/2502.14848v1)|null| +|**2025-02-20**|**Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation**|Yue Yang 
et.al.|[2502.14846v1](http://arxiv.org/abs/2502.14846v1)|null| +|**2025-02-20**|**Revealing and Mitigating Over-Attention in Knowledge Editing**|Pinzheng Wang et.al.|[2502.14838v1](http://arxiv.org/abs/2502.14838v1)|null| +|**2025-02-20**|**Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs**|Tao Ji et.al.|[2502.14837v1](http://arxiv.org/abs/2502.14837v1)|null| +|**2025-02-20**|**LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models**|Shangqing Tu et.al.|[2502.14834v1](http://arxiv.org/abs/2502.14834v1)|null| +|**2025-02-20**|**Improving the Diffusability of Autoencoders**|Ivan Skorokhodov et.al.|[2502.14831v1](http://arxiv.org/abs/2502.14831v1)|null| +|**2025-02-20**|**Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs**|Danni Liu et.al.|[2502.14830v1](http://arxiv.org/abs/2502.14830v1)|null| +|**2025-02-20**|**Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps**|Martin Tutek et.al.|[2502.14829v1](http://arxiv.org/abs/2502.14829v1)|null| +|**2025-02-20**|**Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison**|Aiswarya Baby et.al.|[2502.14827v1](http://arxiv.org/abs/2502.14827v1)|null| +|**2025-02-20**|**eC-Tab2Text: Aspect-Based Text Generation from e-Commerce Product Tables**|Luis Antonio Gutiérrez Guanilo et.al.|[2502.14820v1](http://arxiv.org/abs/2502.14820v1)|null| +|**2025-02-20**|**Optimizing Model Selection for Compound AI Systems**|Lingjiao Chen et.al.|[2502.14815v1](http://arxiv.org/abs/2502.14815v1)|null| +|**2025-02-20**|**FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis**|Fadillah Maani et.al.|[2502.14807v1](http://arxiv.org/abs/2502.14807v1)|null| +|**2025-02-20**|**From RAG to Memory: Non-Parametric Continual Learning for Large Language Models**|Bernal Jiménez Gutiérrez 
et.al.|[2502.14802v1](http://arxiv.org/abs/2502.14802v1)|[link](https://github.com/osu-nlp-group/hipporag)| +|**2025-02-20**|**A Survey on Text-Driven 360-Degree Panorama Generation**|Hai Wang et.al.|[2502.14799v1](http://arxiv.org/abs/2502.14799v1)|null| +|**2025-02-20**|**Rapid Word Learning Through Meta In-Context Learning**|Wentao Wang et.al.|[2502.14791v1](http://arxiv.org/abs/2502.14791v1)|null| +|**2025-02-20**|**SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features**|Michael Tschannen et.al.|[2502.14786v1](http://arxiv.org/abs/2502.14786v1)|[link](https://github.com/google-research/big_vision)| +|**2025-02-20**|**ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting**|Abhijit Mishra et.al.|[2502.14780v1](http://arxiv.org/abs/2502.14780v1)|null| +|**2025-02-20**|**Harnessing PDF Data for Improving Japanese Large Multimodal Models**|Jeonghun Baek et.al.|[2502.14778v1](http://arxiv.org/abs/2502.14778v1)|null| +|**2025-02-20**|**Making Universal Policies Universal**|Niklas Höpner et.al.|[2502.14777v1](http://arxiv.org/abs/2502.14777v1)|null| +|**2025-02-20**|**SurveyX: Academic Survey Automation via Large Language Models**|Xun Liang et.al.|[2502.14776v1](http://arxiv.org/abs/2502.14776v1)|null| +|**2025-02-20**|**Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning**|Tian Xie et.al.|[2502.14768v1](http://arxiv.org/abs/2502.14768v1)|null| +|**2025-02-20**|**Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis**|Priyanka Kargupta et.al.|[2502.14767v1](http://arxiv.org/abs/2502.14767v1)|null| +|**2025-02-20**|**Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning**|Juraj Vladika et.al.|[2502.14765v1](http://arxiv.org/abs/2502.14765v1)|null| +|**2025-02-20**|**EquivaMap: Leveraging LLMs for Automatic Equivalence Checking of Optimization 
Formulations**|Haotian Zhai et.al.|[2502.14760v1](http://arxiv.org/abs/2502.14760v1)|null| +|**2025-02-20**|**On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation Systems**|Juraj Vladika et.al.|[2502.14759v1](http://arxiv.org/abs/2502.14759v1)|null| +|**2025-02-20**|**MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders**|Maya Varma et.al.|[2502.14753v1](http://arxiv.org/abs/2502.14753v1)|null| +|**2025-02-20**|**TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators**|Jianling Li et.al.|[2502.14752v1](http://arxiv.org/abs/2502.14752v1)|null| +|**2025-02-20**|**Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of LLMs**|Zongxia Li et.al.|[2502.14748v1](http://arxiv.org/abs/2502.14748v1)|null| +|**2025-02-20**|**HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States**|Yilei Jiang et.al.|[2502.14744v1](http://arxiv.org/abs/2502.14744v1)|null| +|**2025-02-20**|**Multi-Agent Coordination across Diverse Applications: A Survey**|Lijun Sun et.al.|[2502.14743v1](http://arxiv.org/abs/2502.14743v1)|null| +|**2025-02-20**|**YOLOv12: A Breakdown of the Key Architectural Features**|Mujadded Al Rabbani Alif et.al.|[2502.14740v1](http://arxiv.org/abs/2502.14740v1)|null| +|**2025-02-20**|**SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines**|M-A-P Team et.al.|[2502.14739v1](http://arxiv.org/abs/2502.14739v1)|null| +|**2025-02-20**|**EAGER-LLM: Enhancing Large Language Models as Recommenders through Exogenous Behavior-Semantic Integration**|Minjie Hong et.al.|[2502.14735v1](http://arxiv.org/abs/2502.14735v1)|null| +|**2025-02-20**|**Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models**|Hongji Li et.al.|[2502.14734v1](http://arxiv.org/abs/2502.14734v1)|null| 
+|**2025-02-20**|**WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models**|Yifu Chen et.al.|[2502.14727v1](http://arxiv.org/abs/2502.14727v1)|null| +|**2025-02-20**|**Entity Framing and Role Portrayal in the News**|Tarek Mahmoud et.al.|[2502.14718v1](http://arxiv.org/abs/2502.14718v1)|null| +|**2025-02-20**|**From Knowledge Generation to Knowledge Verification: Examining the BioMedical Generative Capabilities of ChatGPT**|Ahmed Abdeen Hamed et.al.|[2502.14714v1](http://arxiv.org/abs/2502.14714v1)|null| +|**2025-02-20**|**Data-Efficient Pretraining with Group-Level Data Influence Modeling**|Zichun Yu et.al.|[2502.14709v1](http://arxiv.org/abs/2502.14709v1)|null| +|**2025-02-20**|**Human Misperception of Generative-AI Alignment: A Laboratory Experiment**|Kevin He et.al.|[2502.14708v1](http://arxiv.org/abs/2502.14708v1)|null| +|**2025-02-20**|**Not All Data are Good Labels: On the Self-supervised Labeling for Time Series Forecasting**|Yuxuan Yang et.al.|[2502.14704v1](http://arxiv.org/abs/2502.14704v1)|null| +|**2025-02-20**|**I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search**|Zujie Liang et.al.|[2502.14693v1](http://arxiv.org/abs/2502.14693v1)|null| +|**2025-02-20**|**Bridging the Gap: Transforming Natural Language Questions into SQL Queries via Abstract Query Pattern and Contextual Schema Markup**|Yonghui Kong et.al.|[2502.14682v1](http://arxiv.org/abs/2502.14682v1)|null| +|**2025-02-20**|**How to Get Your LLM to Generate Challenging Problems for Evaluation**|Arkil Patel et.al.|[2502.14678v1](http://arxiv.org/abs/2502.14678v1)|null| +|**2025-02-20**|**Data-Constrained Synthesis of Training Data for De-Identification**|Thomas Vakili et.al.|[2502.14677v1](http://arxiv.org/abs/2502.14677v1)|null| +|**2025-02-20**|**BP-SGCN: Behavioral Pseudo-Label Informed Sparse Graph Convolution Network for Pedestrian and Heterogeneous Trajectory Prediction**|Ruochen Li 
et.al.|[2502.14676v1](http://arxiv.org/abs/2502.14676v1)|null| +|**2025-02-20**|**Explanations of Deep Language Models Explain Language Representations in the Brain**|Maryam Rahimi et.al.|[2502.14671v1](http://arxiv.org/abs/2502.14671v1)|null| +|**2025-02-20**|**AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO**|Alan Dao et.al.|[2502.14669v1](http://arxiv.org/abs/2502.14669v1)|null| +|**2025-02-20**|**InstructAgent: Building User Controllable Recommender via LLM Agent**|Wujiang Xu et.al.|[2502.14662v1](http://arxiv.org/abs/2502.14662v1)|null| +|**2025-02-20**|**Edit Once, Update Everywhere: A Simple Framework for Cross-Lingual Knowledge Synchronization in LLMs**|Yuchen Wu et.al.|[2502.14645v1](http://arxiv.org/abs/2502.14645v1)|null| +|**2025-02-20**|**LIFT: Improving Long Context Understanding of Large Language Models through Long Input Fine-Tuning**|Yansheng Mao et.al.|[2502.14644v1](http://arxiv.org/abs/2502.14644v1)|null| +|**2025-02-20**|**Length-Controlled Margin-Based Preference Optimization without Reference Model**|Gengxu Li et.al.|[2502.14643v1](http://arxiv.org/abs/2502.14643v1)|null| +|**2025-02-20**|**How Far are LLMs from Being Our Digital Twins? 
A Benchmark for Persona-Based Behavior Chain Simulation**|Rui Li et.al.|[2502.14642v1](http://arxiv.org/abs/2502.14642v1)|null| +|**2025-02-20**|**NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization**|Zheyuan Zhang et.al.|[2502.14638v1](http://arxiv.org/abs/2502.14638v1)|null| +|**2025-02-20**|**ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation**|Angxiao Yue et.al.|[2502.14637v1](http://arxiv.org/abs/2502.14637v1)|[link](https://github.com/AngxiaoYue/ReQFlow)| +|**2025-02-20**|**PEARL: Towards Permutation-Resilient LLMs**|Liang Chen et.al.|[2502.14628v1](http://arxiv.org/abs/2502.14628v1)|null| +|**2025-02-20**|**ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors**|Yuguo Yin et.al.|[2502.14627v1](http://arxiv.org/abs/2502.14627v1)|null| +|**2025-02-20**|**Multi-Record Web Page Information Extraction From News Websites**|Alexander Kustenkov et.al.|[2502.14625v1](http://arxiv.org/abs/2502.14625v1)|null| +|**2025-02-20**|**Exploring RWKV for Sentence Embeddings: Layer-wise Analysis and Baseline Comparison for Semantic Similarity**|Xinghan Pan et.al.|[2502.14620v1](http://arxiv.org/abs/2502.14620v1)|[link](https://github.com/PStarH/RWKV-embedding)| +|**2025-02-20**|**Reward Models Identify Consistency, Not Causality**|Yuhui Xu et.al.|[2502.14619v1](http://arxiv.org/abs/2502.14619v1)|null| +|**2025-02-20**|**FIND: Fine-grained Information Density Guided Adaptive Retrieval-Augmented Generation for Disease Diagnosis**|Mingyi Jia et.al.|[2502.14614v1](http://arxiv.org/abs/2502.14614v1)|null| +|**2025-02-20**|**Behavioral Analysis of Information Salience in Large Language Models**|Jan Trienes et.al.|[2502.14613v1](http://arxiv.org/abs/2502.14613v1)|null| +|**2025-02-20**|**A Theory for Conditional Generative Modeling on Multiple Data Sources**|Rongzhen Wang et.al.|[2502.14583v1](http://arxiv.org/abs/2502.14583v1)|null| 
+|**2025-02-20**|**A Statistical Case Against Empirical Human-AI Alignment**|Julian Rodemann et.al.|[2502.14581v1](http://arxiv.org/abs/2502.14581v1)|null| +|**2025-02-20**|**ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification**|Hyunseok Lee et.al.|[2502.14565v1](http://arxiv.org/abs/2502.14565v1)|null| +|**2025-02-20**|**Plan-over-Graph: Towards Parallelable LLM Agent Schedule**|Shiqi Zhang et.al.|[2502.14563v1](http://arxiv.org/abs/2502.14563v1)|null| +|**2025-02-20**|**Can LLMs Predict Citation Intent? An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs**|Paris Koloveas et.al.|[2502.14561v1](http://arxiv.org/abs/2502.14561v1)|null| +|**2025-02-20**|**Less is More: Improving LLM Alignment via Preference Data Selection**|Xun Deng et.al.|[2502.14560v1](http://arxiv.org/abs/2502.14560v1)|null| +|**2025-02-20**|**FUIA: Model Inversion Attack against Federated Unlearning**|Lei Zhou et.al.|[2502.14558v1](http://arxiv.org/abs/2502.14558v1)|null| +|**2025-02-20**|**Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling**|Eric Egli et.al.|[2502.14553v1](http://arxiv.org/abs/2502.14553v1)|null| +|**2025-02-20**|**Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks**|Maya Bechler-Speicher et.al.|[2502.14546v1](http://arxiv.org/abs/2502.14546v1)|null| +|**2025-02-20**|**LLM-based User Profile Management for Recommender System**|Seunghwan Bang et.al.|[2502.14541v1](http://arxiv.org/abs/2502.14541v1)|null| +|**2025-02-20**|**LoRA-GGPO: Mitigating Double Descent in LoRA Fine-Tuning via Gradient-Guided Perturbation Optimization**|Yupeng Chang et.al.|[2502.14538v1](http://arxiv.org/abs/2502.14538v1)|null| +|**2025-02-20**|**CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models**|Zhenhong Zhou et.al.|[2502.14529v1](http://arxiv.org/abs/2502.14529v1)|null| +|**2025-02-20**|**Small Graph Is All You Need: DeepStateGNN for 
Scalable Traffic Forecasting**|Yannick Wölker et.al.|[2502.14525v1](http://arxiv.org/abs/2502.14525v1)|null| +|**2025-02-20**|**Generative adversarial networks vs large language models: a comparative study on synthetic tabular data generation**|Austin A. Barr et.al.|[2502.14523v1](http://arxiv.org/abs/2502.14523v1)|null| +|**2025-02-20**|**MultiSlav: Using Cross-Lingual Knowledge Transfer to Combat the Curse of Multilinguality**|Artur Kot et.al.|[2502.14509v1](http://arxiv.org/abs/2502.14509v1)|null| +|**2025-02-20**|**Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases**|Rena Gao et.al.|[2502.14507v1](http://arxiv.org/abs/2502.14507v1)|null| +|**2025-02-20**|**PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models**|Yu Meng et.al.|[2502.14504v1](http://arxiv.org/abs/2502.14504v1)|null| +|**2025-02-20**|**How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?**|Sergey Pletenev et.al.|[2502.14502v1](http://arxiv.org/abs/2502.14502v1)|null| +|**2025-02-20**|**Towards a Perspectivist Turn in Argument Quality Assessment**|Julia Romberg et.al.|[2502.14501v1](http://arxiv.org/abs/2502.14501v1)|null| +|**2025-02-20**|**MLGym: A New Framework and Benchmark for Advancing AI Research Agents**|Deepak Nathani et.al.|[2502.14499v1](http://arxiv.org/abs/2502.14499v1)|null| +|**2025-02-20**|**Stories that (are) Move(d by) Markets: A Causal Exploration of Market Shocks and Semantic Shifts across Different Partisan Groups**|Felix Drinkall et.al.|[2502.14497v1](http://arxiv.org/abs/2502.14497v1)|null| +|**2025-02-20**|**Enhancing Language Multi-Agent Learning with Multi-Agent Credit Re-Assignment for Interactive Environment Generalization**|Zhitao He et.al.|[2502.14496v1](http://arxiv.org/abs/2502.14496v1)|null| +|**2025-02-20**|**StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following**|Jinnan Li et.al.|[2502.14494v1](http://arxiv.org/abs/2502.14494v1)|null| 
+|**2025-02-20**|**Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk**|Elija Perrier et.al.|[2502.14491v1](http://arxiv.org/abs/2502.14491v1)|null| +|**2025-02-20**|**Temporal Misalignment and Probabilistic Neurons**|Velibor Bojković et.al.|[2502.14487v1](http://arxiv.org/abs/2502.14487v1)|null| +|**2025-02-20**|**How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation**|Zhuohang Long et.al.|[2502.14486v1](http://arxiv.org/abs/2502.14486v1)|null| +|**2025-02-20**|**NLoRA: Nyström-Initiated Low-Rank Adaptation for Large Language Models**|Chenlu Guo et.al.|[2502.14482v1](http://arxiv.org/abs/2502.14482v1)|null| +|**2025-02-20**|**Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression**|Haoyu Wang et.al.|[2502.14477v1](http://arxiv.org/abs/2502.14477v1)|null| +|**2025-02-20**|**Argument-Based Comparative Question Answering Evaluation Benchmark**|Irina Nikishina et.al.|[2502.14476v1](http://arxiv.org/abs/2502.14476v1)|null| +|**2025-02-20**|**Enhancing Smart Environments with Context-Aware Chatbots using Large Language Models**|Aurora Polo-Rodríguez et.al.|[2502.14469v1](http://arxiv.org/abs/2502.14469v1)|null| +|**2025-02-20**|**Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing**|Aviv Bick et.al.|[2502.14458v1](http://arxiv.org/abs/2502.14458v1)|null| +|**2025-02-20**|**Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization**|Ran Ding et.al.|[2502.14456v1](http://arxiv.org/abs/2502.14456v1)|null| +|**2025-02-20**|**Optimal word order for non-causal text generation with Large Language Models: the Spanish case**|Andrea Busto-Castiñeira et.al.|[2502.14451v1](http://arxiv.org/abs/2502.14451v1)|null| -摘要:前列腺癌是全球男性最常見的癌症,其惡性程度主要根據 Gleason 評分系統使用組織病理學數據進行評估。雖然人工智慧 (AI) 在準確預測 Gleason 評分方面已展現潛力,但這些預測通常缺乏內在的可解釋性,可能會導致對人機互動的不信任。為了解決這個問題,我們引進了一個由 54 位病理學家組成的國際團隊註解的 1,015 
個組織微陣列核心影像的新穎資料集。這些註解提供了詳細的局部模式描述,用於符合國際準則的 Gleason 分級。利用這個資料集,我們開發了一個基於 U-Net 架構的內在可解釋 AI 系統,該系統提供了利用病理學家術語進行預測。這種方法規避了事後可解釋性方法,同時維持或超越了直接訓練用於 Gleason 模式分割的方法的效能(Dice 分數:0.713 ± 0.003,訓練於解釋,相對於 0.691 ± 0.010,訓練於 Gleason 模式)。透過在訓練期間採用軟標籤,我們捕捉了資料中的內在不確定性,即使在觀察者間變異性高的情況下,也能在 Gleason 模式分割中產生強大的結果。透過釋出這個資料集,我們旨在鼓勵進一步研究主觀性高的醫療任務中的分割,並增進對病理學家推理過程的理解。 +#### Abstracts +##### **LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention** +2502.14866v1 by Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han -##### **Explainable AI Methods for Multi-Omics Analysis: A Survey** -2410.11910v1 by Ahmad Hussein, Mukesh Prasad, Ali Braytee +Large language models (LLMs) have shown remarkable potential in processing +long sequences, yet efficiently serving these long-context models remains +challenging due to the quadratic computational complexity of attention in the +prefilling stage and the large memory footprint of the KV cache in the decoding +stage. To address these issues, we introduce LServe, an efficient system that +accelerates long-sequence LLM serving via hybrid sparse attention. This method +unifies different hardware-friendly, structured sparsity patterns for both +prefilling and decoding attention into a single framework, where computations +on less important tokens are skipped block-wise. LServe demonstrates the +compatibility of static and dynamic sparsity in long-context LLM attention. +This design enables multiplicative speedups by combining these optimizations. +Specifically, we convert half of the attention heads to nearly free streaming +heads in both the prefilling and decoding stages. Additionally, we find that +only a constant number of KV pages is required to preserve long-context +capabilities, irrespective of context length. We then design a hierarchical KV +page selection policy that dynamically prunes KV pages based on query-centric +similarity. 
On average, LServe accelerates LLM prefilling by up to 2.9x and +decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy. Code is +released at https://github.com/mit-han-lab/omniserve. -Advancements in high-throughput technologies have led to a shift from -traditional hypothesis-driven methodologies to data-driven approaches. -Multi-omics refers to the integrative analysis of data derived from multiple -'omes', such as genomics, proteomics, transcriptomics, metabolomics, and -microbiomics. This approach enables a comprehensive understanding of biological -systems by capturing different layers of biological information. Deep learning -methods are increasingly utilized to integrate multi-omics data, offering -insights into molecular interactions and enhancing research into complex -diseases. However, these models, with their numerous interconnected layers and -nonlinear relationships, often function as black boxes, lacking transparency in -decision-making processes. To overcome this challenge, explainable artificial -intelligence (xAI) methods are crucial for creating transparent models that -allow clinicians to interpret and work with complex data more effectively. This -review explores how xAI can improve the interpretability of deep learning -models in multi-omics research, highlighting its potential to provide -clinicians with clear insights, thereby facilitating the effective application -of such models in clinical settings. 
+摘要:大型語言模型 (LLM) 在處理長序列方面展現出驚人的潛力,但由於預填充階段注意力的二次計算複雜度和解碼階段 KV 快取的大量記憶體使用量,有效提供這些長語境模型服務仍然具有挑戰性。為了解決這些問題,我們引入了 LServe,一個透過混合稀疏注意力加速長序列 LLM 服務的高效系統。此方法將不同的硬體友善的結構化稀疏模式統一到一個單一的架構中,用於預填充和解碼注意力,其中對較不重要的符號的運算會以區塊方式略過。LServe 證明了靜態和動態稀疏性在長語境 LLM 注意力中的相容性。此設計透過結合這些最佳化來實現倍增加速。具體來說,我們將一半的注意力頭轉換為預填充和解碼階段中幾乎免費的串流頭。此外,我們發現僅需要恆定的 KV 頁數來保留長語境功能,而與語境長度無關。然後,我們設計了一個分層式 KV 頁面選擇策略,根據以查詢為中心的相似性動態刪除 KV 頁面。平均而言,LServe 將 LLM 預填充加速了 2.9 倍,將解碼加速了 1.3-2.1 倍,同時維持長語境的準確性。程式碼已發布在 https://github.com/mit-han-lab/omniserve。 -摘要:高通量技術的進步導致從傳統的假設驅動方法轉變為資料驅動的方法。多組學是指整合分析來自多個「組學」的資料,例如基因組學、蛋白質組學、轉錄組學、代謝組學和微生物組學。此方法透過擷取生物資訊的不同層面,能全面了解生物系統。深度學習方法愈來愈常被用於整合多組學資料,提供分子交互作用的洞察力,並加強對複雜疾病的研究。然而,這些模型具有許多相互連接的層級和非線性關係,通常會像黑盒子一樣運作,缺乏決策過程的透明度。為了克服此挑戰,可解釋人工智慧 (xAI) 方法對於建立透明模型至關重要,讓臨床醫生可以更有效地解釋和處理複雜資料。此評論探討 xAI 如何能改善多組學研究中深度學習模型的可解釋性,強調其提供臨床醫生明確見解的潛力,進而促進此類模型在臨床環境中的有效應用。 +##### **Interpretable Text Embeddings and Text Similarity Explanation: A Primer** +2502.14862v1 by Juri Opitz, Lucas Möller, Andrianos Michail, Simon Clematide -##### **Study on the Helpfulness of Explainable Artificial Intelligence** -2410.11896v1 by Tobias Labarta, Elizaveta Kulicheva, Ronja Froelian, Christian Geißler, Xenia Melman, Julian von Klitzing +Text embeddings and text embedding models are a backbone of many AI and NLP +systems, particularly those involving search. However, interpretability +challenges persist, especially in explaining obtained similarity scores, which +is crucial for applications requiring transparency. In this paper, we give a +structured overview of interpretability methods specializing in explaining +those similarity scores, an emerging research area. We study the methods' +individual ideas and techniques, evaluating their potential for improving +interpretability of text embeddings and explaining predicted similarities. 
-Explainable Artificial Intelligence (XAI) is essential for building advanced -machine learning-powered applications, especially in critical domains such as -medical diagnostics or autonomous driving. Legal, business, and ethical -requirements motivate using effective XAI, but the increasing number of -different methods makes it challenging to pick the right ones. Further, as -explanations are highly context-dependent, measuring the effectiveness of XAI -methods without users can only reveal a limited amount of information, -excluding human factors such as the ability to understand it. We propose to -evaluate XAI methods via the user's ability to successfully perform a proxy -task, designed such that a good performance is an indicator for the explanation -to provide helpful information. In other words, we address the helpfulness of -XAI for human decision-making. Further, a user study on state-of-the-art -methods was conducted, showing differences in their ability to generate trust -and skepticism and the ability to judge the rightfulness of an AI decision -correctly. Based on the results, we highly recommend using and extending this -approach for more objective-based human-centered user studies to measure XAI -performance in an end-to-end fashion. +摘要:文字嵌入和文字嵌入模型是許多 AI 和 NLP 系統的骨幹,特別是那些涉及搜尋的系統。然而,可解釋性的挑戰依然存在,特別是在解釋獲得的相似度分數時,這對於需要透明度的應用程式至關重要。在本文中,我們對專門用於解釋這些相似度分數的可解釋性方法給予結構化的概述,這是一個新興的研究領域。我們研究了這些方法的個別想法和技術,評估它們改善文字嵌入的可解釋性和解釋預測相似度的潛力。 -摘要:可解釋人工智慧 (XAI) 對於建構先進的機器學習驅動應用程式至關重要,特別是在醫療診斷或自動駕駛等關鍵領域。法律、商業和倫理要求促使使用有效的 XAI,但數量日益增加的不同方法使得挑選正確的方法具有挑戰性。此外,由於解釋高度依賴於背景,在沒有使用者的情況下衡量 XAI 方法的有效性只能揭示有限的資訊,排除人類因素,例如理解它的能力。我們建議透過使用者成功執行代理任務的能力來評估 XAI 方法,設計使得良好的執行表現是解釋提供有用資訊的指標。換句話說,我們探討 XAI 對人類決策制定的幫助。此外,對最先進的方法進行使用者研究,顯示出它們在產生信任和懷疑的能力以及正確判斷 AI 決策是否正確的能力方面存在差異。根據結果,我們強烈建議使用和擴充這種方法,以進行更多以目標為基礎的人為中心使用者研究,以終端到終端的方式衡量 XAI 效能。 +##### **Aligning LLMs to Ask Good Questions A Case Study in Clinical Reasoning** +2502.14860v1 by Shuyue Stella Li, Jimin Mun, Faeze Brahman, Jonathan S. 
Ilgen, Yulia Tsvetkov, Maarten Sap -##### **Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health** -2410.09635v1 by Abdullah Mamun, Lawrence D. Devoe, Mark I. Evans, David W. Britt, Judith Klein-Seetharaman, Hassan Ghasemzadeh +Large language models (LLMs) often fail to ask effective questions under +uncertainty, making them unreliable in domains where proactive +information-gathering is essential for decisionmaking. We present ALFA, a +framework that improves LLM question-asking by (i) decomposing the notion of a +"good" question into a set of theory-grounded attributes (e.g., clarity, +relevance), (ii) controllably synthesizing attribute-specific question +variations, and (iii) aligning models via preference-based optimization to +explicitly learn to ask better questions along these fine-grained attributes. +Focusing on clinical reasoning as a case study, we introduce the MediQ-AskDocs +dataset, composed of 17k real-world clinical interactions augmented with 80k +attribute-specific preference pairs of follow-up questions, as well as a novel +expert-annotated interactive healthcare QA task to evaluate question-asking +abilities. Models aligned with ALFA reduce diagnostic errors by 56.6% on +MediQ-AskDocs compared to SOTA instruction-tuned LLMs, with a question-level +win-rate of 64.4% and strong generalizability. Our findings suggest that +explicitly guiding question-asking with structured, fine-grained attributes +offers a scalable path to improve LLMs, especially in expert application +domains. -Early detection of intrapartum risk enables interventions to potentially -prevent or mitigate adverse labor outcomes such as cerebral palsy. Currently, -there is no accurate automated system to predict such events to assist with -clinical decision-making. 
To fill this gap, we propose "Artificial Intelligence -(AI) for Modeling and Explaining Neonatal Health" (AIMEN), a deep learning -framework that not only predicts adverse labor outcomes from maternal, fetal, -obstetrical, and intrapartum risk factors but also provides the model's -reasoning behind the predictions made. The latter can provide insights into -what modifications in the input variables of the model could have changed the -predicted outcome. We address the challenges of imbalance and small datasets by -synthesizing additional training data using Adaptive Synthetic Sampling -(ADASYN) and Conditional Tabular Generative Adversarial Networks (CTGAN). AIMEN -uses an ensemble of fully-connected neural networks as the backbone for its -classification with the data augmentation supported by either ADASYN or CTGAN. -AIMEN, supported by CTGAN, outperforms AIMEN supported by ADASYN in -classification. AIMEN can predict a high risk for adverse labor outcomes with -an average F1 score of 0.784. It also provides counterfactual explanations that -can be achieved by changing 2 to 3 attributes on average. Resources available: -https://github.com/ab9mamun/AIMEN. 
+摘要:大型語言模型 (LLM) 經常在不確定性下無法提出有效問題,這使得它們在主動收集資訊對於決策制定至關重要的領域中不可靠。我們提出 ALFA,一個透過 (i) 將「良好」問題的概念分解成一組以理論為基礎的屬性(例如,清晰度、相關性),(ii) 可控地合成屬性特定的問題變體,以及 (iii) 透過基於偏好的最佳化調整模型,明確學習沿著這些細緻屬性提出更好的問題,來改善 LLM 提問的架構。專注於臨床推理作為案例研究,我們引入了 MediQ-AskDocs 資料集,由 17k 個真實世界的臨床互動組成,並增加了 80k 個屬性特定的後續問題偏好配對,以及一個由專家註解的互動式醫療保健問答任務來評估提問能力。與 SOTA 指令調整的 LLM 相比,與 ALFA 對齊的模型將 MediQ-AskDocs 上的診斷錯誤減少了 56.6%,問題層級的勝率為 64.4%,並且具有很強的普遍性。我們的研究結果表明,明確地以結構化、細緻的屬性來引導提問,提供了一條可擴充的途徑來改善 LLM,特別是在專家應用領域。 -摘要:產程中風險的早期偵測有助於進行干預措施,以預防或減輕不利的生產結果,例如腦性麻痺。目前,沒有準確的自動化系統可以預測此類事件,以協助臨床決策。為了填補這一空白,我們提出「用於建模和解釋新生兒健康的人工智慧」(AIMEN),這是一個深度學習架構,它不僅可以根據孕產婦、胎兒、產科和產程風險因素預測不利的生產結果,還能提供模型做出預測背後的原因。後者可以提供見解,說明模型輸入變數中的哪些修改可能會改變預測結果。我們透過使用適應性合成抽樣 (ADASYN) 和條件表格生成對抗網路 (CTGAN) 來合成額外的訓練資料,以解決不平衡和小型資料集的挑戰。AIMEN 使用全連接神經網路的集合作為其分類的骨幹,並透過 ADASYN 或 CTGAN 支援資料擴充。由 CTGAN 支援的 AIMEN 在分類方面優於由 ADASYN 支援的 AIMEN。AIMEN 可以預測不利的生產結果的高風險,平均 F1 分數為 0.784。它還提供反事實解釋,可透過平均變更 2 至 3 個屬性來達成。可用資源:https://github.com/ab9mamun/AIMEN。 +##### **FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling** +2502.14856v1 by Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, Yuxiang Huang, Kaihuo Zhang, Weilun Zhao, Yuxuan Li, Jianyong Wang, Zhiyuan Liu, Maosong Sun -##### **Artificial intelligence techniques in inherited retinal diseases: A review** -2410.09105v1 by Han Trinh, Jordan Vice, Jason Charng, Zahra Tajbakhsh, Khyber Alam, Fred K. Chen, Ajmal Mian +Speculative sampling has emerged as an important technique for accelerating +the auto-regressive generation process of large language models (LLMs) by +utilizing a draft-then-verify mechanism to produce multiple tokens per forward +pass. While state-of-the-art speculative sampling methods use only a single +layer and a language modeling (LM) head as the draft model to achieve +impressive layer compression, their efficiency gains are substantially reduced +for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. 
+To address this, we present FR-Spec, a frequency-ranked speculative sampling +framework that optimizes draft candidate selection through vocabulary space +compression. By constraining the draft search to a frequency-prioritized token +subset, our method reduces LM Head computation overhead by 75% while ensuring +the equivalence of the final output distribution. Experiments across multiple +datasets demonstrate an average of 1.12$\times$ speedup over the +state-of-the-art speculative sampling method EAGLE-2. -Inherited retinal diseases (IRDs) are a diverse group of genetic disorders -that lead to progressive vision loss and are a major cause of blindness in -working-age adults. The complexity and heterogeneity of IRDs pose significant -challenges in diagnosis, prognosis, and management. Recent advancements in -artificial intelligence (AI) offer promising solutions to these challenges. -However, the rapid development of AI techniques and their varied applications -have led to fragmented knowledge in this field. This review consolidates -existing studies, identifies gaps, and provides an overview of AI's potential -in diagnosing and managing IRDs. It aims to structure pathways for advancing -clinical applications by exploring AI techniques like machine learning and deep -learning, particularly in disease detection, progression prediction, and -personalized treatment planning. Special focus is placed on the effectiveness -of convolutional neural networks in these areas. Additionally, the integration -of explainable AI is discussed, emphasizing its importance in clinical settings -to improve transparency and trust in AI-based systems. The review addresses the -need to bridge existing gaps in focused studies on AI's role in IRDs, offering -a structured analysis of current AI techniques and outlining future research -directions. 
It concludes with an overview of the challenges and opportunities -in deploying AI for IRDs, highlighting the need for interdisciplinary -collaboration and the continuous development of robust, interpretable AI models -to advance clinical applications. +摘要:推測取樣已成為一種重要的技術,可用於透過利用先起草後驗證的機制來加速大型語言模型 (LLM) 的自迴歸生成過程,並在每次前向傳遞中產生多個代幣。儘管最先進的推測取樣方法只使用單一層和語言建模 (LM) 頭作為起草模型,以達成令人印象深刻的層壓縮,但對於大型詞彙表 LLM(例如詞彙表包含 128k 個代幣的 Llama-3-8B),其效率提升會大幅降低。為了解決這個問題,我們提出了 FR-Spec,這是一種頻率排序推測取樣架構,它透過詞彙空間壓縮來最佳化起草候選選取。我們的這個方法透過將起草搜尋限制在優先於頻率的代幣子集中,將 LM 頭部運算開銷減少了 75%,同時確保最終輸出分佈的等效性。透過多個資料集的實驗證明,與最先進的推測取樣方法 EAGLE-2 相比,平均提速了 1.12 倍。 -摘要:遺傳性視網膜疾病 (IRD) 是一組多樣化的遺傳疾病, -會導致視力逐漸喪失,是工作年齡成人失明的主要原因。IRD 的複雜性和異質性對診斷、預後和管理提出了重大挑戰。最近人工智能 (AI) 的進步為這些挑戰提供了有希望的解決方案。 -然而,AI 技術的快速發展及其多種應用導致了該領域的知識分散。本綜述整合了現有研究,找出差距,並概述了 AI 在診斷和管理 IRD 中的潛力。它旨在通過探索機器學習和深度學習等 AI 技術,特別是在疾病檢測、進程預測和個性化治療計劃中,為推進臨床應用構建途徑。特別關注這些領域中卷積神經網路的有效性。此外,討論了可解釋 AI 的整合,強調了其在臨床環境中提高透明度和對基於 AI 的系統的信任的重要性。該綜述解決了彌合 AI 在 IRD 中作用的重點研究中現有差距的必要性,提供了對當前 AI 技術的結構化分析,並概述了未來的研究方向。最後概述了在 IRD 中部署 AI 的挑戰和機遇,強調了跨學科合作和持續開發強大、可解釋的 AI 模型以推進臨床應用的必要性。 +##### **Prompt-to-Leaderboard** +2502.14855v1 by Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anastasios N. Angelopoulos, Ion Stoica -##### **CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures** -2410.05235v2 by Ekaterina Sviridova, Anar Yeginbergen, Ainara Estarrona, Elena Cabrio, Serena Villata, Rodrigo Agerri +Large language model (LLM) evaluations typically rely on aggregated metrics +like accuracy or human preference, averaging across users and prompts. This +averaging obscures user- and prompt-specific variations in model performance. +To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces +leaderboards specific to a prompt. The core idea is to train an LLM taking +natural language prompts as input to output a vector of Bradley-Terry +coefficients which are then used to predict the human preference vote. 
The +resulting prompt-dependent leaderboards allow for unsupervised task-specific +evaluation, optimal routing of queries to models, personalization, and +automated evaluation of model strengths and weaknesses. Data from Chatbot Arena +suggest that P2L better captures the nuanced landscape of language model +performance than the averaged leaderboard. Furthermore, our findings suggest +that P2L's ability to produce prompt-specific evaluations follows a power law +scaling similar to that observed in LLMs themselves. In January 2025, the +router we trained based on this methodology achieved the \#1 spot in the +Chatbot Arena leaderboard. Our code is available at this GitHub link: +https://github.com/lmarena/p2l. -Explaining Artificial Intelligence (AI) decisions is a major challenge -nowadays in AI, in particular when applied to sensitive scenarios like medicine -and law. However, the need to explain the rationale behind decisions is a main -issue also for human-based deliberation as it is important to justify -\textit{why} a certain decision has been taken. Resident medical doctors for -instance are required not only to provide a (possibly correct) diagnosis, but -also to explain how they reached a certain conclusion. Developing new tools to -aid residents to train their explanation skills is therefore a central -objective of AI in education. In this paper, we follow this direction, and we -present, to the best of our knowledge, the first multilingual dataset for -Medical Question Answering where correct and incorrect diagnoses for a clinical -case are enriched with a natural language explanation written by doctors. 
These -explanations have been manually annotated with argument components (i.e., -premise, claim) and argument relations (i.e., attack, support), resulting in -the Multilingual CasiMedicos-Arg dataset which consists of 558 clinical cases -in four languages (English, Spanish, French, Italian) with explanations, where -we annotated 5021 claims, 2313 premises, 2431 support relations, and 1106 -attack relations. We conclude by showing how competitive baselines perform over -this challenging dataset for the argument mining task. +摘要:大型語言模型 (LLM) 評估通常依賴於彙總的指標,例如準確性或人類偏好,平均值跨使用者和提示。此平均值模糊了使用者和提示特定的模型效能變異。為了解決此問題,我們提出提示到排行榜 (P2L),一種產生特定於提示的排行榜的方法。核心概念是訓練 LLM,將自然語言提示作為輸入,以輸出 Bradley-Terry 係數向量,然後用於預測人類偏好投票。產生的提示相關排行榜允許無監督任務特定評估、最佳查詢路由至模型、個人化以及模型優缺點的自動化評估。來自 Chatbot Arena 的資料表明,P2L 比平均排行榜更能捕捉語言模型效能的細微變化。此外,我們的研究結果表明,P2L 產生提示特定評估的能力遵循類似於 LLM 本身觀察到的冪律縮放。2025 年 1 月,我們根據此方法訓練的路由器在 Chatbot Arena 排行榜中獲得了第一名。我們的程式碼可在 GitHub 連結取得:https://github.com/lmarena/p2l。 -摘要:解釋人工智慧 (AI) 的決策是現在 AI 的一項重大挑戰,特別是應用於像醫學和法律等敏感情境時。然而,解釋決策背後理由的需求也是基於人類的考量的一個主要問題,因為有必要證明為什麼做出某個決策。例如,住院醫師不僅需要提供(可能是正確的)診斷,還需要解釋他們如何達成某個結論。因此,開發新的工具來幫助住院醫師訓練他們的解釋技巧是教育中 AI 的一項核心目標。在本文中,我們遵循這個方向,並且根據我們的了解,提出第一個多語言醫學問答資料集,其中臨床病例的正確和不正確診斷都附有由醫生撰寫的自然語言解釋。這些解釋已使用論證組成(即前提、主張)和論證關係(即攻擊、支持)進行手動註解,產生多語言 CasiMedicos-Arg 資料集,其中包含 558 個具有解釋的四種語言(英語、西班牙語、法語、義大利語)的臨床病例,我們註解了 5021 個主張、2313 個前提、2431 個支持關係和 1106 個攻擊關係。我們最後展示了競爭基準如何針對論證探勘任務執行此具挑戰性的資料集。 +##### **CLIPPER: Compression enables long-context synthetic data generation** +2502.14854v1 by Chau Minh Pham, Yapei Chang, Mohit Iyyer -##### **Explainable Diagnosis Prediction through Neuro-Symbolic Integration** -2410.01855v2 by Qiuhao Lu, Rui Li, Elham Sagheb, Andrew Wen, Jinlian Wang, Liwei Wang, Jungwei W. Fan, Hongfang Liu +LLM developers are increasingly reliant on synthetic data, but generating +high-quality data for complex long-context reasoning tasks remains challenging. 
+We introduce CLIPPER, a compression-based approach for generating synthetic +data tailored to narrative claim verification - a task that requires reasoning +over a book to verify a given claim. Instead of generating claims directly from +the raw text of the book, which results in artifact-riddled claims, CLIPPER +first compresses the book into chapter outlines and book summaries and then +uses these intermediate representations to generate complex claims and +corresponding chain-of-thoughts. Compared to naive approaches, CLIPPER produces +claims that are more valid, grounded, and complex. Using CLIPPER, we construct +a dataset of 19K synthetic book claims paired with their source texts and +chain-of-thought reasoning, and use it to fine-tune three open-weight models. +Our best model achieves breakthrough results on narrative claim verification +(from 28% to 76% accuracy on our test set) and sets a new state-of-the-art for +sub-10B models on the NoCha leaderboard. Further analysis shows that our models +generate more detailed and grounded chain-of-thought reasoning while also +improving performance on other narrative understanding tasks (e.g., +NarrativeQA). -Diagnosis prediction is a critical task in healthcare, where timely and -accurate identification of medical conditions can significantly impact patient -outcomes. Traditional machine learning and deep learning models have achieved -notable success in this domain but often lack interpretability which is a -crucial requirement in clinical settings. In this study, we explore the use of -neuro-symbolic methods, specifically Logical Neural Networks (LNNs), to develop -explainable models for diagnosis prediction. Essentially, we design and -implement LNN-based models that integrate domain-specific knowledge through -logical rules with learnable thresholds. 
Our models, particularly -$M_{\text{multi-pathway}}$ and $M_{\text{comprehensive}}$, demonstrate superior -performance over traditional models such as Logistic Regression, SVM, and -Random Forest, achieving higher accuracy (up to 80.52\%) and AUROC scores (up -to 0.8457) in the case study of diabetes prediction. The learned weights and -thresholds within the LNN models provide direct insights into feature -contributions, enhancing interpretability without compromising predictive -power. These findings highlight the potential of neuro-symbolic approaches in -bridging the gap between accuracy and explainability in healthcare AI -applications. By offering transparent and adaptable diagnostic models, our work -contributes to the advancement of precision medicine and supports the -development of equitable healthcare solutions. Future research will focus on -extending these methods to larger and more diverse datasets to further validate -their applicability across different medical conditions and populations. 
+摘要:LLM 開發人員越來越依賴合成資料,但為複雜的長語境推理任務生成高品質資料仍然具有挑戰性。我們引入了 CLIPPER,一種基於壓縮的方法,用於生成針對敘事性聲明驗證量身打造的合成資料,這項任務需要對一本書進行推理才能驗證給定的聲明。CLIPPER 沒有直接從書籍的原始文字生成聲明,這會產生充滿人工製品的聲明,而是先將書籍壓縮成章節大綱和書籍摘要,然後使用這些中間表示來生成複雜的聲明和對應的思維鏈。與天真的方法相比,CLIPPER 產生的聲明更有效、更有根據且更複雜。使用 CLIPPER,我們構建了一個包含 19K 個合成書籍聲明及其原始文字和思維鏈推理的資料集,並用於微調三個開放權重模型。我們最好的模型在敘事性聲明驗證方面取得了突破性的結果(在我們的測試集中準確率從 28% 提升到 76%),並在 NoCha 排行榜上為低於 10B 的模型設定了新的技術水準。進一步的分析表明,我們的模型生成了更詳細且有根據的思維鏈推理,同時也提高了其他敘事理解任務(例如 NarrativeQA)的效能。 -摘要:診斷預測是醫療保健中的關鍵任務,及時且準確地識別醫療狀況會顯著影響患者的結果。傳統的機器學習和深度學習模型已在這個領域取得顯著成功,但通常缺乏可解釋性,這在臨床環境中是一項關鍵要求。在本研究中,我們探討了神經符號方法的應用,特別是邏輯神經網路 (LNN),以開發用於診斷預測的可解釋模型。基本上,我們設計並實作了基於 LNN 的模型,這些模型透過具有可學習閾值的邏輯規則整合領域特定知識。我們的模型,特別是 $M_{\text{multi-pathway}}$ 和 $M_{\text{comprehensive}}$,表現出優於傳統模型(例如邏輯迴歸、SVM 和隨機森林)的優異效能,在糖尿病預測的案例研究中達到了更高的準確度(高達 80.52%)和 AUROC 分數(高達 0.8457)。LNN 模型中學習到的權重和閾值提供了對特徵貢獻的直接見解,增強了可解釋性,同時不影響預測能力。這些發現突顯了神經符號方法在彌合醫療保健 AI 應用中準確性和可解釋性差距方面的潛力。透過提供透明且適應性強的診斷模型,我們的研究有助於推進精準醫療,並支援公平醫療保健解決方案的開發。未來的研究將專注於將這些方法擴展到更大且更多樣化的資料集,以進一步驗證其在不同醫療狀況和人群中的適用性。 +##### **GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks** +2502.14848v1 by Jianwen Luo, Yiming Huang, Jinxiang Meng, Fangyu Lei, Shizhu He, Xiao Liu, Shanshan Jiang, Bin Dong, Jun Zhao, Kang Liu -##### **Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare** -2410.00366v1 by Prasenjit Maji, Amit Kumar Mondal, Hemanta Kumar Mondal, Saraju P. Mohanty +Large Language Models (LLMs) have shown great promise in tool-making, yet +existing frameworks often struggle to efficiently construct reliable toolsets +and are limited to single-task settings. To address these challenges, we +propose GATE (Graph-based Adaptive Tool Evolution), an adaptive framework that +dynamically constructs and evolves a hierarchical graph of reusable tools +across multiple scenarios. We evaluate GATE on open-ended tasks (Minecraft), +agent-based tasks (TextCraft, DABench), and code generation tasks (MATH, Date, +TabMWP). 
Our results show that GATE achieves up to 4.3x faster milestone +completion in Minecraft compared to the previous SOTA, and provides an average +improvement of 9.23% over existing tool-making methods in code generation tasks +and 10.03% in agent tasks. GATE demonstrates the power of adaptive evolution, +balancing tool quantity, complexity, and functionality while maintaining high +efficiency. Code and data are available at +https://github.com/ayanami2003/GATE. -The rapid advancements in artificial intelligence (AI) have revolutionized -smart healthcare, driving innovations in wearable technologies, continuous -monitoring devices, and intelligent diagnostic systems. However, security, -explainability, robustness, and performance optimization challenges remain -critical barriers to widespread adoption in clinical environments. This -research presents an innovative algorithmic method using the Adaptive Feature -Evaluator (AFE) algorithm to improve feature selection in healthcare datasets -and overcome problems. AFE integrating Genetic Algorithms (GA), Explainable -Artificial Intelligence (XAI), and Permutation Combination Techniques (PCT), -the algorithm optimizes Clinical Decision Support Systems (CDSS), thereby -enhancing predictive accuracy and interpretability. The proposed method is -validated across three diverse healthcare datasets using six distinct machine -learning algorithms, demonstrating its robustness and superiority over -conventional feature selection techniques. The results underscore the -transformative potential of AFE in smart healthcare, enabling personalized and -transparent patient care. Notably, the AFE algorithm, when combined with a -Multi-layer Perceptron (MLP), achieved an accuracy of up to 98.5%, highlighting -its capability to improve clinical decision-making processes in real-world -healthcare applications.
+摘要:大型語言模型 (LLM) 在工具製作方面展現出極大的潛力,然而現有的框架經常難以有效地建構可靠的工具組,並且僅限於單一任務設定。為了應對這些挑戰,我們提出了 GATE(基於圖形的自適應工具演化),這是一個自適應框架,可跨多個場景動態建構和演化可重複使用的工具階層圖。我們在開放式任務(Minecraft)、基於代理的任務(TextCraft、DABench)和程式碼生成任務(MATH、Date、TabMWP)上評估了 GATE。我們的結果顯示,與先前的 SOTA 相比,GATE 在 Minecraft 中實現了高達 4.3 倍的里程碑完成速度,並且在程式碼生成任務中提供了比現有工具製作方法平均提升 9.23%,在代理任務中提升了 10.03%。GATE 展示了自適應演化的力量,在保持高效率的同時,平衡了工具數量、複雜性和功能性。程式碼和資料可在 \url{https://github.com/ayanami2003/GATE} 取得。 -摘要:人工智慧 (AI) 的快速進展徹底改變了智慧醫療保健,推動了可穿戴技術、持續監控裝置和智慧診斷系統的創新。然而,安全性、可解釋性、穩健性和效能最佳化挑戰仍然是臨床環境中廣泛採用的關鍵障礙。本研究提出一個創新的演算法方法,使用自適應特徵評估器 (AFE) 演算法來改善醫療保健資料集中的特徵選取並克服問題。AFE 整合了遺傳演算法 (GA)、可解釋人工智慧 (XAI) 和排列組合技術 (PCT),該演算法最佳化了臨床決策支援系統 (CDSS),從而提高了預測準確性和可解釋性。所提出的方法使用六種不同的機器學習演算法驗證了三個不同的醫療保健資料集,證明了其穩健性和優於傳統特徵選取技術。結果強調了 AFE 在智慧醫療保健中的轉變潛力,實現了個人化和透明的患者照護。值得注意的是,AFE 演算法與多層感知器 (MLP) 結合使用時,準確度高達 98.5%,突顯了其改善實際醫療保健應用中臨床決策制定流程的能力。 +##### **Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation** +2502.14846v1 by Yue Yang, Ajay Patel, Matt Deitke, Tanmay Gupta, Luca Weihs, Andrew Head, Mark Yatskar, Chris Callison-Burch, Ranjay Krishna, Aniruddha Kembhavi, Christopher Clark -##### **Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study** -2409.13476v1 by Tirtha Chanda, Sarah Haggenmueller, Tabea-Clara Bucher, Tim Holland-Letz, Harald Kittler, Philipp Tschandl, Markus V. Heppt, Carola Berking, Jochen S. Utikal, Bastian Schilling, Claudia Buerger, Cristian Navarrete-Dechent, Matthias Goebeler, Jakob Nikolas Kather, Carolin V. Schneider, Benjamin Durani, Hendrike Durani, Martin Jansen, Juliane Wacker, Joerg Wacker, Reader Study Consortium, Titus J. Brinker +Reasoning about images with rich text, such as charts and documents, is a +critical application of vision-language models (VLMs). However, VLMs often +struggle in these domains due to the scarcity of diverse text-rich +vision-language data. 
To address this challenge, we present CoSyn, a framework +that leverages the coding capabilities of text-only large language models +(LLMs) to automatically create synthetic text-rich multimodal data. Given input +text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts +an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic +images. With the underlying code as textual representations of the synthetic +images, CoSyn can generate high-quality instruction-tuning data, again relying +on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K +images and 2.7M rows of vision-language instruction-tuning data. Comprehensive +experiments on seven benchmarks demonstrate that models trained on our +synthetic data achieve state-of-the-art performance among competitive +open-source models, including Llama 3.2, and surpass proprietary models such as +GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing +data, enabling VLMs to ground information within input images, showcasing its +potential for developing multimodal agents capable of acting in real-world +environments. -Artificial intelligence (AI) systems have substantially improved -dermatologists' diagnostic accuracy for melanoma, with explainable AI (XAI) -systems further enhancing clinicians' confidence and trust in AI-driven -decisions. Despite these advancements, there remains a critical need for -objective evaluation of how dermatologists engage with both AI and XAI tools. -In this study, 76 dermatologists participated in a reader study, diagnosing 16 -dermoscopic images of melanomas and nevi using an XAI system that provides -detailed, domain-specific explanations. Eye-tracking technology was employed to -assess their interactions. Diagnostic performance was compared with that of a -standard AI system lacking explanatory features. 
Our findings reveal that XAI -systems improved balanced diagnostic accuracy by 2.8 percentage points relative -to standard AI. Moreover, diagnostic disagreements with AI/XAI systems and -complex lesions were associated with elevated cognitive load, as evidenced by -increased ocular fixations. These insights have significant implications for -clinical practice, the design of AI tools for visual tasks, and the broader -development of XAI in medical diagnostics. +摘要:透過豐富文字(例如圖表和文件)對影像進行推理,是視覺語言模型 (VLM) 的重要應用。然而,由於多元化文字豐富的視覺語言資料稀少,VLM 在這些領域中經常會遇到困難。為了應對這個挑戰,我們提出了 CoSyn,一個利用純文字大型語言模型 (LLM) 的編碼能力來自動建立合成文字豐富多模態資料的架構。給定描述目標網域的輸入文字(例如「營養成分標籤」),CoSyn 會提示 LLM 產生用於合成影像渲染的程式碼(Python、HTML、LaTeX 等)。透過將底層程式碼作為合成影像的文字表示,CoSyn 可以產生高品質的指令調整資料,再次依賴純文字 LLM。使用 CoSyn,我們建構了一個包含 40 萬張影像和 270 萬列視覺語言指令調整資料的資料集。在七個基準上的全面實驗證明,在我們的合成資料上訓練的模型在競爭對手的開源模型(包括 Llama 3.2)中達到了最先進的效能,並超越了 GPT-4V 和 Gemini 1.5 Flash 等專有模型。此外,CoSyn 可以產生合成指向資料,讓 VLM 能在輸入影像中建立資訊基礎,展示其在開發能夠在真實世界環境中運作的多模態代理方面的潛力。 -摘要:人工智慧 (AI) 系統已大幅改善皮膚科醫師對黑色素瘤的診斷準確度,而可解釋 AI (XAI) 系統進一步提升臨床醫師對 AI 驅動決策的信心與信賴。儘管有這些進展,對於皮膚科醫師如何使用 AI 和 XAI 工具,仍有客觀評估的迫切需求。在這項研究中,76 位皮膚科醫師參與了一項讀者研究,使用 XAI 系統診斷 16 張黑色素瘤和痣的皮膚鏡影像,該系統提供詳細的領域特定說明。採用眼球追蹤技術來評估他們的互動。將診斷表現與缺乏說明功能的標準 AI 系統進行比較。我們的研究結果顯示,XAI 系統相較於標準 AI,將平衡診斷準確度提升了 2.8 個百分點。此外,與 AI/XAI 系統的診斷分歧和複雜的病灶與認知負擔升高有關,這由增加的眼睛注視次數所證實。這些見解對臨床實務、視覺任務 AI 工具的設計和醫學診斷中 XAI 的廣泛發展具有重大意義。 +##### **Revealing and Mitigating Over-Attention in Knowledge Editing** +2502.14838v1 by Pinzheng Wang, Zecheng Tang, Keyan Zhou, Juntao Li, Qiaoming Zhu, Min Zhang -##### **Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data** -2409.15374v1 by Suryansh Vidya, Kush Gupta, Amir Aly, Andy Wills, Emmanuel Ifeachor, Rohit Shankar +Large Language Models have demonstrated superior performance across a wide +range of tasks, but they still exhibit undesirable errors due to incorrect +knowledge learned from the training data. 
To avoid this, knowledge editing +methods emerged to precisely edit the specific model knowledge via efficiently +modifying a very small percentage of parameters. However, those methods can +lead to the problem of Specificity Failure, where existing knowledge +and capabilities are severely degraded due to editing: when content +related to the edited knowledge occurs in the context, it can +inadvertently corrupt other pre-existing knowledge. Our preliminary +study indicates that Specificity Failure +primarily stems from the model's attention heads assigning excessive attention +scores to entities related to the edited knowledge, thereby unduly focusing on +specific snippets within the context, which we denote as the Attention Drift +phenomenon. To mitigate this Attention Drift issue, we introduce a simple yet +effective method, Selective Attention Drift Restriction (SADR), which introduces +an additional regularization term during the knowledge editing process to +restrict changes in the attention weight distribution, thereby preventing undue +focus on the edited entity. Experiments on five frequently used strong LLMs +demonstrate the effectiveness of our method, where SADR can significantly +mitigate Specificity Failure in the predominant knowledge editing tasks. -Early diagnosis and intervention for Autism Spectrum Disorder (ASD) has been -shown to significantly improve the quality of life of autistic individuals. -However, diagnostics methods for ASD rely on assessments based on clinical -presentation that are prone to bias and can be challenging to arrive at an -early diagnosis. There is a need for objective biomarkers of ASD which can help -improve diagnostic accuracy. Deep learning (DL) has achieved outstanding -performance in diagnosing diseases and conditions from medical imaging data.
-Extensive research has been conducted on creating models that classify ASD -using resting-state functional Magnetic Resonance Imaging (fMRI) data. However, -existing models lack interpretability. This research aims to improve the -accuracy and interpretability of ASD diagnosis by creating a DL model that can -not only accurately classify ASD but also provide explainable insights into its -working. The dataset used is a preprocessed version of the Autism Brain Imaging -Data Exchange (ABIDE) with 884 samples. Our findings show a model that can -accurately classify ASD and highlight critical brain regions differing between -ASD and typical controls, with potential implications for early diagnosis and -understanding of the neural basis of ASD. These findings are validated by -studies in the literature that use different datasets and modalities, -confirming that the model actually learned characteristics of ASD and not just -the dataset. This study advances the field of explainable AI in medical imaging -by providing a robust and interpretable model, thereby contributing to a future -with objective and reliable ASD diagnostics. 
+摘要:大型語言模型已在廣泛任務中展現出卓越的效能,但由於從訓練資料中學習到不正確的知識,它們仍會出現令人不滿意的錯誤。為避免此情況,知識編輯方法應運而生,透過有效修改極少數參數來精準編輯特定模型知識。% 然而,這些方法可能會導致特異性失敗問題:當與已編輯知識相關的內容出現在文中時,可能會無意間損害其他既有知識。然而,這些方法可能會導致特異性失敗問題,因為現有知識和能力會因編輯而嚴重降低。我們的初步研究表明,特異性失敗主要源於模型的注意力權重將過度注意力分數分配給與已編輯知識相關的實體,從而過度關注文中特定的片段,我們將此現象稱為注意力偏移。為減輕這種注意力偏移問題,我們引入了一個簡單但有效的方法選擇性注意力偏移限制}(SADR),在知識編輯過程中引入一個額外的正則化項來限制注意力權重分配的變動,從而防止過度關注已編輯實體。在五個經常使用的強大 LLM 上進行的實驗證明了我們方法的有效性,其中 SADR 可以顯著減輕主要知識編輯任務中的特異性失敗。 -摘要:自閉症譜系障礙 (ASD) 的早期診斷和介入已被證實能顯著改善自閉症患者的生活品質。然而,ASD 的診斷方法依賴於基於臨床表現的評估,容易產生偏見,且可能難以做出早期診斷。有必要找出 ASD 的客觀生物標記,以幫助提高診斷準確性。深度學習 (DL) 在從醫學影像資料診斷疾病和病症方面取得傑出的表現。已經針對建立使用靜態功能性磁振造影 (fMRI) 資料對 ASD 進行分類的模型進行廣泛的研究。然而,現有的模型缺乏可解釋性。本研究旨在透過建立一個不僅能準確分類 ASD,還能提供可解釋見解說明其運作原理的 DL 模型,來改善 ASD 診斷的準確性和可解釋性。所使用的資料集是自閉症大腦影像資料交換 (ABIDE) 的預處理版本,包含 884 個樣本。我們的研究結果顯示,該模型能準確分類 ASD,並強調 ASD 與典型對照組之間存在差異的關鍵腦區,對於 ASD 的早期診斷和神經基礎的理解具有潛在的意義。這些研究結果已由使用不同資料集和方式的文獻研究驗證,證實該模型實際上學習了 ASD 的特徵,而不僅僅是資料集。本研究透過提供一個強健且可解釋的模型,推動了醫學影像中可解釋 AI 的領域,從而為未來提供客觀且可靠的 ASD 診斷做出貢獻。 +##### **Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs** +2502.14837v1 by Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui -##### **Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition** -2409.12883v1 by Daniel Flores-Araiza, Francisco Lopez-Tiro, Clément Larose, Salvador Hinojosa, Andres Mendez-Vazquez, Miguel Gonzalez-Mendoza, Gilberto Ochoa-Ruiz, Christian Daul +Multi-head Latent Attention (MLA) is an innovative architecture proposed by +DeepSeek, designed to ensure efficient and economical inference by +significantly compressing the Key-Value (KV) cache into a latent vector. +Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its +variants such as Grouped-Query Attention (GQA) exhibit significant cost +disadvantages. 
Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA +without pre-training from scratch is both meaningful and challenging. This +paper proposes the first data-efficient fine-tuning method for transitioning +from MHA to MLA (MHA2MLA), which includes two key components: for partial-RoPE, +we remove RoPE from dimensions of queries and keys that contribute less to the +attention scores; for low-rank approximation, we introduce joint SVD +approximations based on the pre-trained parameters of keys and values. These +carefully designed strategies enable MHA2MLA to recover performance using only +a small fraction (0.3% to 0.6%) of the data, significantly reducing inference +costs while seamlessly integrating with compression techniques such as KV cache +quantization. For example, the KV cache size of Llama2-7B is reduced by 92.19%, +with only a 0.5% drop in LongBench performance. -The in-vivo identification of the kidney stone types during an ureteroscopy -would be a major medical advance in urology, as it could reduce the time of the -tedious renal calculi extraction process, while diminishing infection risks. -Furthermore, such an automated procedure would make possible to prescribe -anti-recurrence treatments immediately. Nowadays, only few experienced -urologists are able to recognize the kidney stone types in the images of the -videos displayed on a screen during the endoscopy. Thus, several deep learning -(DL) models have recently been proposed to automatically recognize the kidney -stone types using ureteroscopic images. However, these DL models are of black -box nature whicl limits their applicability in clinical settings. This -contribution proposes a case-based reasoning DL model which uses prototypical -parts (PPs) and generates local and global descriptors. The PPs encode for each -class (i.e., kidney stone type) visual feature information (hue, saturation, -intensity and textures) similar to that used by biologists. 
The PPs are -optimally generated due a new loss function used during the model training. -Moreover, the local and global descriptors of PPs allow to explain the -decisions ("what" information, "where in the images") in an understandable way -for biologists and urologists. The proposed DL model has been tested on a -database including images of the six most widespread kidney stone types. The -overall average classification accuracy was 90.37. When comparing this results -with that of the eight other DL models of the kidney stone state-of-the-art, it -can be seen that the valuable gain in explanability was not reached at the -expense of accuracy which was even slightly increased with respect to that -(88.2) of the best method of the literature. These promising and interpretable -results also encourage urologists to put their trust in AI-based solutions. +摘要:多頭潛在注意力 (MLA) 是 DeepSeek 提出的一種創新架構,旨在通過將鍵值 (KV) 快取大幅壓縮成潛在向量,確保有效率且經濟的推論。與 MLA 相比,採用多頭注意力 (MHA) 及其變體(例如分組查詢注意力 (GQA))的標準 LLM 會出現顯著的成本劣勢。讓訓練完善的 LLM(例如 Llama)能夠快速適應 MLA,而無需從頭開始預訓練,這既有意義又具有挑戰性。本文提出了第一個資料有效微調方法,用於從 MHA 轉換到 MLA (MHA2MLA),其中包含兩個關鍵組成部分:對於部分 RoPE,我們從查詢和鍵的維度中移除對注意力分數貢獻較小的 RoPE,對於低秩近似,我們基於鍵和值的預訓練參數引入聯合 SVD 近似。這些經過仔細設計的策略讓 MHA2MLA 能夠僅使用一小部分資料 (0.3% 至 0.6%) 來恢復效能,大幅降低推論成本,同時與壓縮技術(例如 KV 快取量化)無縫整合。例如,Llama2-7B 的 KV 快取大小減少了 92.19%,而 LongBench 效能僅下降了 0.5%。 -摘要:尿路鏡檢查中腎結石類型的體內識別將是泌尿科的一項重大進展,因為它可以減少繁瑣的腎結石取出過程的時間,同時降低感染風險。此外,這種自動化程序將使立即開立抗復發治療成為可能。如今,只有少數經驗豐富的泌尿科醫生能夠在內視鏡檢查期間屏幕上顯示的視頻圖像中識別腎結石類型。因此,最近已提出多種深度學習 (DL) 模型,以使用輸尿管鏡圖像自動識別腎結石類型。然而,這些 DL 模型本質上是黑盒子,這限制了它們在臨床環境中的應用性。本文提出了一個基於案例推理的 DL 模型,它使用原型部分 (PP) 並生成局部和全局描述符。PP 為每種類型(即腎結石類型)編碼視覺特徵信息(色調、飽和度、強度和紋理),類似於生物學家使用的信息。由於在模型訓練期間使用的新損失函數,PP 得到了最佳生成。此外,PP 的局部和全局描述符允許以生物學家和泌尿科醫生可以理解的方式解釋決策(“什麼”信息,“圖像中的什麼位置”)。所提出的 DL 模型已在一個包含六種最廣泛的腎結石類型圖像的數據庫上進行了測試。總體平均分類準確率為 90.37。將此結果與腎結石最先進的八個其他 DL 模型的結果進行比較時,可以看出,可解釋性的寶貴增益並未以準確性為代價,甚至略有增加與文獻中最好的方法 (88.2) 相比。這些有希望且可解釋的結果也鼓勵泌尿科醫生相信基於人工智能的解決方案。 +##### **LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language 
Models** +2502.14834v1 by Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li -##### **Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques** -2409.12087v3 by Yubo Li, Saba Al-Sayouri, Rema Padman +Existing Large Vision-Language Models (LVLMs) can process inputs with context +lengths up to 128k visual and text tokens, yet they struggle to generate +coherent outputs beyond 1,000 words. We find that the primary limitation is the +absence of long output examples during supervised fine-tuning (SFT). To tackle +this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158 +examples, each with multiple input images, an instruction, and corresponding +outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that +maintain high fidelity to the input images, we apply Direct Preference +Optimization (DPO) to the SFT model. Given the high cost of collecting human +feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which +breaks long outputs into segments and uses iterative corrections to form +preference pairs with the original outputs. Additionally, we develop +MMLongBench-Write, a benchmark featuring six tasks to evaluate the +long-generation capabilities of VLMs. Our 7B parameter model, trained with +LongWriter-V-22k and IterDPO, achieves impressive performance on this +benchmark, outperforming larger proprietary models like GPT-4o. Code and data: +https://github.com/THU-KEG/LongWriter-V -This study explores the potential of utilizing administrative claims data, -combined with advanced machine learning and deep learning techniques, to -predict the progression of Chronic Kidney Disease (CKD) to End-Stage Renal -Disease (ESRD). 
We analyze a comprehensive, 10-year dataset provided by a major -health insurance organization to develop prediction models for multiple -observation windows using traditional machine learning methods such as Random -Forest and XGBoost as well as deep learning approaches such as Long Short-Term -Memory (LSTM) networks. Our findings demonstrate that the LSTM model, -particularly with a 24-month observation window, exhibits superior performance -in predicting ESRD progression, outperforming existing models in the -literature. We further apply SHapley Additive exPlanations (SHAP) analysis to -enhance interpretability, providing insights into the impact of individual -features on predictions at the individual patient level. This study underscores -the value of leveraging administrative claims data for CKD management and -predicting ESRD progression. +摘要:現有的大型視覺語言模型 (LVLMs) 能處理長度達 128k 視覺和文字符號的輸入內容,但卻難以產生超過 1,000 字的連貫輸出。我們發現,主要限制在於監督微調 (SFT) 期間缺少長輸出範例。為了解決此問題,我們引入了 LongWriter-V-22k,這是一個 SFT 資料集,包含 22,158 個範例,每個範例都有多個輸入影像、一個說明和對應的輸出,範圍從 0 到 10,000 字。此外,為了產生與輸入影像高度保真的長輸出,我們對 SFT 模型採用直接偏好最佳化 (DPO)。考量到收集人類回饋的成本很高(例如 3,000 字),我們提出 IterDPO,它會將長輸出區分成幾個區塊,並使用反覆修正來形成與原始輸出的偏好配對。此外,我們開發了 MMLongBench-Write,這是一個基準,包含六項任務,用於評估 VLM 的長生成能力。我們的 7B 參數模型使用 LongWriter-V-22k 和 IterDPO 進行訓練,在這個基準上取得令人印象深刻的效能,超越了 GPT-4o 等大型專有模型。程式碼和資料:https://github.com/THU-KEG/LongWriter-V -摘要:本研究探討利用行政申報資料,結合先進機器學習與深度學習技術,預測慢性腎臟病 (CKD) 進展至末期腎臟疾病 (ESRD) 的可能性。我們分析一家大型健康保險組織提供的 10 年綜合資料集,使用傳統機器學習方法(例如隨機森林和 XGBoost)以及深度學習方法(例如長期短期記憶 (LSTM) 網路)開發多個觀察視窗的預測模型。我們的研究結果顯示,LSTM 模型(尤其是 24 個月觀察視窗)在預測 ESRD 進展方面表現優異,優於文獻中的現有模型。我們進一步應用 SHapley 可加性解釋 (SHAP) 分析以增強可解釋性,深入了解個別特徵對個別患者層級預測的影響。本研究強調了利用行政申報資料進行 CKD 管理和預測 ESRD 進展的價值。 +##### **Improving the Diffusability of Autoencoders** +2502.14831v1 by Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, Aliaksandr Siarohin -##### **Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases** 
-2409.09201v3 by Mercy Asiedu, Nenad Tomasev, Chintan Ghate, Tiya Tiyasirichokchai, Awa Dieng, Oluwatosin Akande, Geoffrey Siwo, Steve Adudans, Sylvanus Aitkins, Odianosen Ehiakhamen, Eric Ndombi, Katherine Heller +Latent diffusion models have emerged as the leading approach for generating +high-quality images and videos, utilizing compressed latent representations to +reduce the computational burden of the diffusion process. While recent +advancements have primarily focused on scaling diffusion backbones and +improving autoencoder reconstruction quality, the interaction between these +components has received comparatively less attention. In this work, we perform +a spectral analysis of modern autoencoders and identify inordinate +high-frequency components in their latent spaces, which are especially +pronounced in the autoencoders with a large bottleneck channel size. We +hypothesize that this high-frequency component interferes with the +coarse-to-fine nature of the diffusion synthesis process and hinders the +generation quality. To mitigate the issue, we propose scale equivariance: a +simple regularization strategy that aligns latent and RGB spaces across +frequencies by enforcing scale equivariance in the decoder. It requires minimal +code changes and only up to 20K autoencoder fine-tuning steps, yet +significantly improves generation quality, reducing FID by 19% for image +generation on ImageNet-1K 256x256 and FVD by at least 44% for video generation +on Kinetics-700 17x256x256. -While large language models (LLMs) have shown promise for medical question -answering, there is limited work focused on tropical and infectious -disease-specific exploration. We build on an opensource tropical and infectious -diseases (TRINDs) dataset, expanding it to include demographic and semantic -clinical and consumer augmentations yielding 11000+ prompts. We evaluate LLM -performance on these, comparing generalist and medical LLMs, as well as LLM -outcomes to human experts. 
We demonstrate through systematic experimentation, -the benefit of contextual information such as demographics, location, gender, -risk factors for optimal LLM response. Finally we develop a prototype of -TRINDs-LM, a research tool that provides a playground to navigate how context -impacts LLM outputs for health. +摘要:潛在擴散模型已成為生成高品質影像和影片的主流方法,利用壓縮潛在表示來降低擴散過程的計算負擔。雖然近期的進展主要集中在擴充擴散主幹並提升自編碼器重建品質,但這些組成之間的交互作用卻鮮少受到關注。在這項研究中,我們對現代自編碼器進行頻譜分析,並在它們的潛在空間中找出不適當的高頻率組成,這在瓶頸通道尺寸較大的自編碼器中特別明顯。我們假設這種高頻率組成會干擾擴散合成過程由粗到細的性質,並阻礙生成品質。為了緩解這個問題,我們提出規模等變性:一種簡單的正則化策略,透過在解碼器中強制執行規模等變性,使潛在空間和 RGB 空間在各個頻率中保持一致。它只需要最小的程式碼變更,且僅需最多 20K 個自編碼器微調步驟,就能顯著提升生成品質,將 ImageNet-1K 256x256 上的影像生成的 FID 降低 19%,並將 Kinetics-700 17x256x256 上的影片生成的 FVD 降低至少 44%。 -摘要:儘管大型語言模型 (LLM) 在醫療問題解答方面展現出前景,但專注於熱帶和傳染病特定探索的研究有限。我們建立在一個開放原始碼熱帶和傳染病 (TRINDs) 資料集上,並將其擴展為納入人口統計和語義臨床和消費者擴充,產生超過 11000 個提示。我們評估了 LLM 在這些方面的效能,比較了通才和醫療 LLM,以及 LLM 結果與人類專家的比較。我們透過系統性實驗證明了背景資訊(例如人口統計、位置、性別、最佳 LLM 回應的風險因素)的好處。最後,我們開發了 TRINDs-LM 的原型,這是一個研究工具,提供一個探索背景如何影響 LLM 健康輸出的平台。 +##### **Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs** +2502.14830v1 by Danni Liu, Jan Niehues -##### **Explainable AI: Definition and attributes of a good explanation for health AI** -2409.15338v1 by Evangelia Kyrimi, Scott McLachlan, Jared M Wohlgemut, Zane B Perkins, David A. Lagnado, William Marsh, the ExAIDSS Expert Group +While large language models demonstrate remarkable capabilities at +task-specific applications through fine-tuning, extending these benefits across +diverse languages is essential for broad accessibility. However, effective +cross-lingual transfer is hindered by LLM performance gaps across languages and +the scarcity of fine-tuning data in many languages. Through analysis of LLM +internal representations from over 1,000 language pairs, we discover that +middle layers exhibit the strongest potential for cross-lingual alignment. 
+Building on this finding, we propose a middle-layer alignment objective +integrated into task-specific training. Our experiments on slot filling, +machine translation, and structured text generation show consistent +improvements in cross-lingual transfer, especially to lower-resource languages. +The method is robust to the choice of alignment languages and generalizes to +languages unseen during alignment. Furthermore, we show that separately trained +alignment modules can be merged with existing task-specific modules, improving +cross-lingual capabilities without full re-training. Our code is publicly +available (https://github.com/dannigt/mid-align). -Proposals of artificial intelligence (AI) solutions based on increasingly -complex and accurate predictive models are becoming ubiquitous across many -disciplines. As the complexity of these models grows, transparency and users' -understanding often diminish. This suggests that accurate prediction alone is -insufficient for making an AI-based solution truly useful. In the development -of healthcare systems, this introduces new issues related to accountability and -safety. Understanding how and why an AI system makes a recommendation may -require complex explanations of its inner workings and reasoning processes. -Although research on explainable AI (XAI) has significantly increased in recent -years and there is high demand for XAI in medicine, defining what constitutes a -good explanation remains ad hoc, and providing adequate explanations continues -to be challenging. To fully realize the potential of AI, it is critical to -address two fundamental questions about explanations for safety-critical AI -applications, such as health-AI: (1) What is an explanation in health-AI? and -(2) What are the attributes of a good explanation in health-AI? In this study, -we examined published literature and gathered expert opinions through a -two-round Delphi study. 
The research outputs include (1) a definition of what -constitutes an explanation in health-AI and (2) a comprehensive list of -attributes that characterize a good explanation in health-AI. +摘要:儘管大型語言模型在特定任務應用中透過微調展現出卓越的能力,但要讓這些好處擴及各種語言,對於廣泛的可及性來說至關重要。然而,有效的跨語言轉移受到跨語言 LLM 效能差距以及許多語言中微調資料的稀少性所阻礙。透過分析來自 1,000 多種語言對的 LLM 內部表示,我們發現中間層展現出最強的跨語言對齊潛力。根據這個發現,我們提出一個整合到特定任務訓練中的中間層對齊目標。我們在插槽填補、機器翻譯和結構化文字生成方面的實驗顯示,跨語言轉移持續改善,特別是對於低資源語言。此方法對於對齊語言的選擇具有穩健性,並推廣到對齊期間未曾見過的語言。此外,我們展示了單獨訓練的對齊模組可以與現有的特定任務模組合併,在不重新訓練的情況下改善跨語言能力。我們的程式碼已公開(https://github.com/dannigt/mid-align)。 -摘要:隨著越來越複雜且準確的預測模型,基於人工智慧 (AI) 解決方案的提案在許多領域中變得無處不在。隨著這些模型複雜性的增加,透明度和使用者的理解力往往會降低。這表示僅有準確的預測並不足以讓 AI 解決方案真正有用。在醫療保健系統的開發中,這引入了與問責制和安全性相關的新問題。瞭解 AI 系統如何以及為何提出建議可能需要對其內部運作和推理過程進行複雜的說明。儘管近年來對可解釋 AI (XAI) 的研究已大幅增加,且醫學領域對 XAI 有很高的需求,但定義什麼構成一個好的解釋仍是臨時性的,而提供適當的解釋仍然具有挑戰性。為了充分發揮 AI 的潛力,對於安全關鍵型 AI 應用(例如健康 AI)的解釋,探討兩個基本問題至關重要:(1) 什麼是健康 AI 中的解釋?以及 (2) 健康 AI 中一個好的解釋有哪些屬性?在本研究中,我們檢視了已發表的文獻,並透過兩輪德爾菲研究收集了專家意見。研究成果包括:(1) 健康 AI 中什麼構成解釋的定義,以及 (2) 健康 AI 中一個好解釋的屬性清單。 +##### **Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps** +2502.14829v1 by Martin Tutek, Fateme Hashemi Chaleshtori, Ana Marasović, Yonatan Belinkov -##### **Exploring the Effect of Explanation Content and Format on User Comprehension and Trust in Healthcare** -2408.17401v2 by Antonio Rago, Bence Palfi, Purin Sukpanichnant, Hannibal Nabli, Kavyesh Vivek, Olga Kostopoulou, James Kinross, Francesca Toni +When prompted to think step-by-step, language models (LMs) produce a chain of +thought (CoT), a sequence of reasoning steps that the model supposedly used to +produce its prediction. However, despite much work on CoT prompting, it is +unclear if CoT reasoning is faithful to the models' parametric beliefs. We +introduce a framework for measuring parametric faithfulness of generated +reasoning, and propose Faithfulness by Unlearning Reasoning steps (FUR), an +instance of this framework. 
FUR erases information contained in reasoning steps +from model parameters. We perform experiments unlearning CoTs of four LMs +prompted on four multi-choice question answering (MCQA) datasets. Our +experiments show that FUR is frequently able to change the underlying models' +prediction by unlearning key steps, indicating when a CoT is parametrically +faithful. Further analysis shows that CoTs generated by models post-unlearning +support different answers, hinting at a deeper effect of unlearning. +Importantly, CoT steps identified as important by FUR do not align well with +human notions of plausibility, emphasizing the need for specialized alignment -AI-driven tools for healthcare are widely acknowledged as potentially -beneficial to health practitioners and patients, e.g. the QCancer regression -tool for cancer risk prediction. However, for these tools to be trusted, they -need to be supplemented with explanations. We examine how explanations' content -and format affect user comprehension and trust when explaining QCancer's -predictions. Regarding content, we deploy SHAP and Occlusion-1. Regarding -format, we present SHAP explanations, conventionally, as charts (SC) and -Occlusion-1 explanations as charts (OC) as well as text (OT), to which their -simpler nature lends itself. We conduct experiments with two sets of -stakeholders: the general public (representing patients) and medical students -(representing healthcare practitioners). Our experiments showed higher -subjective comprehension and trust for Occlusion-1 over SHAP explanations based -on content. However, when controlling for format, only OT outperformed SC, -suggesting this trend is driven by preferences for text. Other findings -corroborated that explanation format, rather than content, is often the -critical factor. 
+摘要:当提示逐步思考时,语言模型 (LM) 会产生一系列思考 (CoT),这是模型用来产生预测的一系列推理步骤。然而,尽管在 CoT 提示上做了很多工作,但尚不清楚 CoT 推理是否符合模型的参数化信念。我们引入了一个框架来衡量生成推理的参数化保真度,并提出了通过取消学习推理步骤 (FUR) 的保真度,这是该框架的一个实例。FUR 从模型参数中擦除推理步骤中包含的信息。我们执行实验,取消学习提示在四个多项选择问答 (MCQA) 数据集上的四个 LM 的 CoT。我们的实验表明,FUR 经常能够通过取消学习关键步骤来改变底层模型的预测,表明 CoT 在参数上是保真的。进一步的分析表明,模型在取消学习后生成的 CoT 支持不同的答案,暗示取消学习具有更深层次的影响。重要的是,FUR 确定的 CoT 步骤与人类对合理性的概念不太一致,强调了专门对齐的必要性 -摘要:由 AI 驅動的醫療保健工具被廣泛認為對醫療從業者和患者有潛在好處,例如用於癌症風險預測的 QCancer 回歸工具。然而,對於這些工具,如果要讓人們信賴,就需要補充說明。我們研究了說明的內容和格式如何影響使用者在解釋 QCancer 預測時的理解和信任。關於內容,我們部署了 SHAP 和 Occlusion-1。關於格式,我們以圖表 (SC) 的形式呈現 SHAP 說明,以圖表 (OC) 和文字 (OT) 的形式呈現 Occlusion-1 說明,因為它們的性質較為簡單。我們對兩組利害關係人進行了實驗:一般民眾(代表患者)和醫學生(代表醫療從業者)。我們的實驗結果顯示,基於內容,Occlusion-1 比 SHAP 說明具有更高的主觀理解和信任。然而,在控制格式時,只有 OT 優於 SC,這表明這種趨勢是由對文字的偏好所驅動的。其他發現證實了說明格式,而不是內容,通常是關鍵因素。 +##### **Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison** +2502.14827v1 by Aiswarya Baby, Tintu Thankom Koshy -##### **A Survey for Large Language Models in Biomedicine** -2409.00133v1 by Chong Wang, Mengyao Li, Junjun He, Zhongruo Wang, Erfan Darzi, Zan Chen, Jin Ye, Tianbin Li, Yanzhou Su, Jing Ke, Kaili Qu, Shuxin Li, Yi Yu, Pietro Liò, Tianyun Wang, Yu Guang Wang, Yiqing Shen +Visual Question Answering (VQA) has emerged as a pivotal task in the +intersection of computer vision and natural language processing, requiring +models to understand and reason about visual content in response to natural +language questions. Analyzing VQA datasets is essential for developing robust +models that can handle the complexities of multimodal reasoning. Several +approaches have been developed to examine these datasets, each offering +distinct perspectives on question diversity, answer distribution, and +visual-textual correlations. Despite significant progress, existing VQA models +face challenges related to dataset bias, limited model complexity, commonsense +reasoning gaps, rigid evaluation methods, and generalization to real world +scenarios. 
This paper presents a comprehensive comparative study of five +advanced VQA models: ABC-CNN, KICNLE, Masked Vision and Language Modeling, +BLIP-2, and OFA, each employing distinct methodologies to address these +challenges. -Recent breakthroughs in large language models (LLMs) offer unprecedented -natural language understanding and generation capabilities. However, existing -surveys on LLMs in biomedicine often focus on specific applications or model -architectures, lacking a comprehensive analysis that integrates the latest -advancements across various biomedical domains. This review, based on an -analysis of 484 publications sourced from databases including PubMed, Web of -Science, and arXiv, provides an in-depth examination of the current landscape, -applications, challenges, and prospects of LLMs in biomedicine, distinguishing -itself by focusing on the practical implications of these models in real-world -biomedical contexts. Firstly, we explore the capabilities of LLMs in zero-shot -learning across a broad spectrum of biomedical tasks, including diagnostic -assistance, drug discovery, and personalized medicine, among others, with -insights drawn from 137 key studies. Then, we discuss adaptation strategies of -LLMs, including fine-tuning methods for both uni-modal and multi-modal LLMs to -enhance their performance in specialized biomedical contexts where zero-shot -fails to achieve, such as medical question answering and efficient processing -of biomedical literature. Finally, we discuss the challenges that LLMs face in -the biomedicine domain including data privacy concerns, limited model -interpretability, issues with dataset quality, and ethics due to the sensitive -nature of biomedical data, the need for highly reliable model outputs, and the -ethical implications of deploying AI in healthcare. 
To address these -challenges, we also identify future research directions of LLM in biomedicine -including federated learning methods to preserve data privacy and integrating -explainable AI methodologies to enhance the transparency of LLMs. +摘要:視覺問答 (VQA) 已成為電腦視覺與自然語言處理交會中的關鍵任務,要求模型理解和推理視覺內容以回應自然語言問題。分析 VQA 資料集對於開發健全的模型至關重要,這些模型能夠處理多模態推理的複雜性。已經開發出多種方法來檢驗這些資料集,每種方法都提供有關問題多樣性、答案分佈和視覺文本關聯性的不同觀點。儘管有顯著進展,現有的 VQA 模型仍面臨與資料集偏差、模型複雜性有限、常識推理差距、僵化的評估方法和推廣到現實世界場景相關的挑戰。本文對五個先進的 VQA 模型進行了全面的比較研究:ABC-CNN、KICNLE、Masked Vision and Language Modeling、BLIP-2 和 OFA,每個模型都採用不同的方法來應對這些挑戰。 -摘要:大型語言模型 (LLM) 的最新突破提供了前所未有的自然語言理解和生成能力。然而,現有關於生物醫學中 LLM 的調查通常專注於特定應用或模型架構,缺乏整合各種生物醫學領域最新進展的全面分析。本綜述基於對來自 PubMed、Web of Science 和 arXiv 等數據庫的 484 篇出版物的分析,深入探討了生物醫學中 LLM 的當前現況、應用、挑戰和前景,其特點是關注這些模型在現實世界生物醫學背景中的實際應用。首先,我們探討了 LLM 在廣泛的生物醫學任務中的零次學習能力,包括診斷輔助、藥物發現和個性化醫療等,並從 137 項關鍵研究中汲取見解。然後,我們討論了 LLM 的適應策略,包括單模態和多模態 LLM 的微調方法,以增強它們在零次學習無法實現的專業生物醫學背景中的性能,例如醫療問題解答和生物醫學文獻的有效處理。最後,我們討論了 LLM 在生物醫學領域面臨的挑戰,包括數據隱私問題、模型可解釋性有限、數據集質量問題以及由於生物醫學數據的敏感性、對高度可靠模型輸出的需求以及在醫療保健中部署 AI 的倫理影響而產生的倫理問題。為了應對這些挑戰,我們還確定了生物醫學中 LLM 未來的研究方向,包括用於保護數據隱私的聯合學習方法以及整合可解釋 AI 方法以增強 LLM 的透明度。 +##### **eC-Tab2Text: Aspect-Based Text Generation from e-Commerce Product Tables** +2502.14820v1 by Luis Antonio Gutiérrez Guanilo, Mir Tafseer Nayeem, Cristian López, Davood Rafiei -##### **Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis** -2408.15121v1 by Francesco Sovrano, Michael Lognoul, Giulia Vilone +Large Language Models (LLMs) have demonstrated exceptional versatility across +diverse domains, yet their application in e-commerce remains underexplored due +to a lack of domain-specific datasets. To address this gap, we introduce +eC-Tab2Text, a novel dataset designed to capture the intricacies of e-commerce, +including detailed product attributes and user-specific queries. 
Leveraging +eC-Tab2Text, we focus on text generation from product tables, enabling LLMs to +produce high-quality, attribute-specific product reviews from structured +tabular data. Fine-tuned models were rigorously evaluated using standard +Table2Text metrics, alongside correctness, faithfulness, and fluency +assessments. Our results demonstrate substantial improvements in generating +contextually accurate reviews, highlighting the transformative potential of +tailored datasets and fine-tuning methodologies in optimizing e-commerce +workflows. This work highlights the potential of LLMs in e-commerce workflows +and the essential role of domain-specific datasets in tailoring them to +industry-specific challenges. -Significant investment and development have gone into integrating Artificial -Intelligence (AI) in medical and healthcare applications, leading to advanced -control systems in medical technology. However, the opacity of AI systems -raises concerns about essential characteristics needed in such sensitive -applications, like transparency and trustworthiness. Our study addresses these -concerns by investigating a process for selecting the most adequate Explainable -AI (XAI) methods to comply with the explanation requirements of key EU -regulations in the context of smart bioelectronics for medical devices. The -adopted methodology starts with categorising smart devices by their control -mechanisms (open-loop, closed-loop, and semi-closed-loop systems) and delving -into their technology. Then, we analyse these regulations to define their -explainability requirements for the various devices and related goals. -Simultaneously, we classify XAI methods by their explanatory objectives. This -allows for matching legal explainability requirements with XAI explanatory -goals and determining the suitable XAI algorithms for achieving them. 
Our -findings provide a nuanced understanding of which XAI algorithms align better -with EU regulations for different types of medical devices. We demonstrate this -through practical case studies on different neural implants, from chronic -disease management to advanced prosthetics. This study fills a crucial gap in -aligning XAI applications in bioelectronics with stringent provisions of EU -regulations. It provides a practical framework for developers and researchers, -ensuring their AI innovations advance healthcare technology and adhere to legal -and ethical standards. +摘要:大型語言模型 (LLM) 在各種領域展現出非凡的多功能性,但由於缺乏特定領域的資料集,因此它們在電子商務中的應用仍未得到充分探索。為了解決這個差距,我們引入了 eC-Tab2Text,這是一個新穎的資料集,旨在捕捉電子商務的複雜性,包括詳細的產品屬性和使用者特定的查詢。利用 eC-Tab2Text,我們專注於從產品表格中產生文字,使 LLM 能夠從結構化的表格資料中產生高品質、特定屬性的產品評論。微調模型使用標準的 Table2Text 指標,以及正確性、忠實度和流利度評估進行嚴格評估。我們的結果證明在產生符合語境的準確評論方面有顯著的進步,突顯了客製化資料集和微調方法在最佳化電子商務工作流程中的轉型潛力。這項工作突顯了 LLM 在電子商務工作流程中的潛力,以及特定領域資料集在因應產業特定挑戰中至關重要的角色。 -摘要:人工智慧(AI)在醫療和保健應用中投入了大量的投資和開發,進而導致醫療技術中的先進控制系統。然而,AI 系統的不透明性引發了對此類敏感應用中所需基本特性的擔憂,例如透明度和可信度。我們的研究透過調查一個程序來解決這些問題,用於選擇最充分的可解釋 AI(XAI)方法,以符合歐盟法規在醫療器材的智慧型生物電子學中的說明要求。採用的方法從透過其控制機制(開迴路、閉迴路和半閉迴路系統)對智慧型裝置進行分類,並深入探討其技術開始。然後,我們分析這些法規以定義其對各種裝置和相關目標的可解釋性要求。同時,我們透過其說明目標對 XAI 方法進行分類。這允許將法律可解釋性要求與 XAI 說明目標相匹配,並確定適當的 XAI 演算法來達成它們。我們的研究結果提供了對哪些 XAI 演算法更符合歐盟法規以適用於不同類型的醫療器材的細緻理解。我們透過不同神經植入物的實際案例研究來證明這一點,從慢性疾病管理到先進的義肢。這項研究填補了將生物電子學中的 XAI 應用與歐盟法規的嚴格規定相符的重要空白。它為開發人員和研究人員提供了一個實用的架構,確保其 AI 創新能促進醫療技術並遵守法律和道德標準。 +##### **Optimizing Model Selection for Compound AI Systems** +2502.14815v1 by Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Matei Zaharia, James Zou, Ion Stoica -##### **Towards Case-based Interpretability for Medical Federated Learning** -2408.13626v1 by Laura Latorre, Liliana Petrychenko, Regina Beets-Tan, Taisiya Kopytova, Wilson Silva +Compound AI systems that combine multiple LLM calls, such as self-refine and +multi-agent-debate, achieve strong performance on many AI tasks. 
We address a +core question in optimizing compound systems: for each LLM call or module in +the system, how should one decide which LLM to use? We show that these LLM +choices have a large effect on quality, but the search space is exponential. We +propose LLMSelector, an efficient framework for model selection in compound +systems, which leverages two key empirical insights: (i) end-to-end performance +is often monotonic in how well each module performs, with all other modules +held fixed, and (ii) per-module performance can be estimated accurately by an +LLM. Building upon these insights, LLMSelector iteratively selects one module +and allocates to it the model with the highest module-wise performance, as +estimated by an LLM, until no further gain is possible. LLMSelector is +applicable to any compound system with a bounded number of modules, and its +number of API calls scales linearly with the number of modules, achieving +high-quality model allocation both empirically and theoretically. Experiments +with popular compound systems such as multi-agent debate and self-refine using +LLMs such as GPT-4o, Claude 3.5 Sonnet and Gemini 1.5 show that LLMSelector +confers 5%-70% accuracy gains compared to using the same LLM for all modules. -We explore deep generative models to generate case-based explanations in a -medical federated learning setting. Explaining AI model decisions through -case-based interpretability is paramount to increasing trust and allowing -widespread adoption of AI in clinical practice. However, medical AI training -paradigms are shifting towards federated learning settings in order to comply -with data protection regulations. In a federated scenario, past data is -inaccessible to the current user. Thus, we use a deep generative model to -generate synthetic examples that protect privacy and explain decisions. Our -proof-of-concept focuses on pleural effusion diagnosis and uses publicly -available Chest X-ray data. 
+摘要:複合式 AI 系統結合多個 LLM 呼叫,例如自我精煉和多代理辯論,在許多 AI 任務中都能獲得強大的效能。我們解決了最佳化複合式系統中的核心問題:對於系統中的每個 LLM 呼叫或模組,應該如何決定要使用哪個 LLM?我們表明這些 LLM 選擇對品質有很大的影響,但搜尋空間是呈指數增長的。我們提出 LLMSelector,一種用於複合式系統中模型選擇的有效架構,它利用了兩個主要的經驗見解:(i) 端對端效能通常會隨著每個模組執行得有多好而單調變化,而其他所有模組保持固定,以及 (ii) 每個模組的效能都可以由 LLM 精準估計。LLMSelector 建立在這些見解之上,反覆選擇一個模組,並根據 LLM 估計的模組最佳效能,將模型分配給它,直到無法再進一步提升為止。LLMSelector 適用於任何具有有限數量的模組的複合式系統,其 API 呼叫數量與模組數量成線性比例,在經驗和理論上都實現了高品質的模型配置。使用 GPT-4o、Claude 3.5 Sonnet 和 Gemini 1.5 等 LLM,對多代理辯論和自我精煉等熱門複合式系統進行的實驗表明,與對所有模組使用相同的 LLM 相比,LLMSelector 可帶來 5%-70% 的準確度提升。 -摘要:我們探索深度生成模型,在醫療聯邦學習設置中生成基於案例的說明。透過基於案例的可解釋性來解釋 AI 模型決策,對於增加信任並允許 AI 在臨床實務中廣泛採用至關重要。然而,醫療 AI 訓練範例正轉向聯邦學習設置,以符合資料保護法規。在聯邦情境中,過去的資料對目前的使用者而言是無法取得的。因此,我們使用深度生成模型來產生保護隱私和解釋決策的合成範例。我們的概念驗證著重於胸腔積液診斷,並使用公開可取得的胸部 X 光資料。 +##### **FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis** +2502.14807v1 by Fadillah Maani, Numan Saeed, Tausifa Saleem, Zaid Farooq, Hussain Alasmawi, Werner Diehl, Ameera Mohammad, Gareth Waring, Saudabi Valappi, Leanne Bricker, Mohammad Yaqub -##### **AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines** -2408.12491v1 by Douwe J. Spaanderman, Matthew Marzetti, Xinyi Wan, Andrew F. Scarsbrook, Philip Robinson, Edwin H. G. Oei, Jacob J. Visser, Robert Hemke, Kirsten van Langevelde, David F. Hanff, Geert J. L. H. van Leenders, Cornelis Verhoef, Dirk J. Gruühagen, Wiro J. Niessen, Stefan Klein, Martijn P. A. Starmans +Foundation models are becoming increasingly effective in the medical domain, +offering pre-trained models on large datasets that can be readily adapted for +downstream tasks. Despite progress, fetal ultrasound images remain a +challenging domain for foundation models due to their inherent complexity, +often requiring substantial additional training and facing limitations due to +the scarcity of paired multimodal data. 
To overcome these challenges, here we +introduce FetalCLIP, a vision-language foundation model capable of generating +universal representation of fetal ultrasound images. FetalCLIP was pre-trained +using a multimodal learning approach on a diverse dataset of 210,035 fetal +ultrasound images paired with text. This represents the largest paired dataset +of its kind used for foundation model development to date. This unique training +approach allows FetalCLIP to effectively learn the intricate anatomical +features present in fetal ultrasound images, resulting in robust +representations that can be used for a variety of downstream applications. In +extensive benchmarking across a range of key fetal ultrasound applications, +including classification, gestational age estimation, congenital heart defect +(CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all +baselines while demonstrating remarkable generalizability and strong +performance even with limited labeled data. We plan to release the FetalCLIP +model publicly for the benefit of the broader scientific community. -Soft-tissue and bone tumours (STBT) are rare, diagnostically challenging -lesions with variable clinical behaviours and treatment approaches. This -systematic review provides an overview of Artificial Intelligence (AI) methods -using radiological imaging for diagnosis and prognosis of these tumours, -highlighting challenges in clinical translation, and evaluating study alignment -with the Checklist for AI in Medical Imaging (CLAIM) and the FUTURE-AI -international consensus guidelines for trustworthy and deployable AI to promote -the clinical translation of AI methods. The review covered literature from -several bibliographic databases, including papers published before 17/07/2024. -Original research in peer-reviewed journals focused on radiology-based AI for -diagnosing or prognosing primary STBT was included. 
Exclusion criteria were -animal, cadaveric, or laboratory studies, and non-English papers. Abstracts -were screened by two of three independent reviewers for eligibility. Eligible -papers were assessed against guidelines by one of three independent reviewers. -The search identified 15,015 abstracts, from which 325 articles were included -for evaluation. Most studies performed moderately on CLAIM, averaging a score -of 28.9$\pm$7.5 out of 53, but poorly on FUTURE-AI, averaging 5.1$\pm$2.1 out -of 30. Imaging-AI tools for STBT remain at the proof-of-concept stage, -indicating significant room for improvement. Future efforts by AI developers -should focus on design (e.g. define unmet clinical need, intended clinical -setting and how AI would be integrated in clinical workflow), development (e.g. -build on previous work, explainability), evaluation (e.g. evaluating and -addressing biases, evaluating AI against best practices), and data -reproducibility and availability (making documented code and data publicly -available). Following these recommendations could improve clinical translation -of AI methods. 
+摘要:基礎模型在醫療領域正變得越來越有效, +提供在大型資料集上預先訓練的模型,可輕鬆適應 +下游任務。儘管有進展,但胎兒超音波影像仍然是 +基礎模型的挑戰領域,因為它們固有的複雜性, +通常需要大量的額外訓練,並且由於配對多模態數據的稀缺而面臨限制。為了克服這些挑戰,我們在此 +介紹 FetalCLIP,一種能夠產生 +胎兒超音波影像通用表示的視覺語言基礎模型。FetalCLIP 使用多模態學習方法在包含 210,035 張胎兒 +超音波影像與文字配對的多樣化資料集上進行預訓練。這代表迄今為止用於基礎模型開發的最大配對資料集。這種獨特的訓練 +方法使 FetalCLIP 能夠有效地學習胎兒超音波影像中存在的複雜解剖特徵,從而產生強大的 +表示,可應用於各種下游應用。在涵蓋一系列關鍵胎兒超音波應用(包括分類、胎齡估算、先天性心臟缺陷 +(CHD) 偵測和胎兒結構分割)的廣泛基準測試中,FetalCLIP 在展現出卓越的泛化能力和強勁的 +效能,即使標記資料有限,也優於所有基準。我們計畫公開發布 FetalCLIP 模型,造福廣大的科學界。 -摘要:軟組織和骨骼腫瘤(STBT)是罕見、診斷具有挑戰性的病灶,其臨床行為和治療方法各不相同。這篇系統性回顧提供了使用放射影像進行診斷和預後的人工智慧 (AI) 方法的概觀,重點說明了臨床轉譯的挑戰,並評估研究與醫療影像 AI 核查表 (CLAIM) 和 FUTURE-AI 可信賴且可部署 AI 的國際共識準則的一致性,以促進 AI 方法的臨床轉譯。這篇回顧涵蓋了幾個書目資料庫中的文獻,包括在 2024 年 7 月 17 日之前發表的論文。納入了以放射為基礎的 AI 診斷或預後原發性 STBT 的同行評審期刊中的原始研究。排除標準是動物、屍體或實驗室研究,以及非英文論文。摘要由三位獨立審查員中的兩位篩選資格。合格的論文由三位獨立審查員中的一位根據準則進行評估。搜索識別出 15,015 篇摘要,其中 325 篇文章被納入評估。大多數研究在 CLAIM 中表現中等,平均得分為 53 分中的 28.9±7.5 分,但在 FUTURE-AI 中表現不佳,平均得分為 30 分中的 5.1±2.1 分。STBT 的影像 AI 工具仍處於概念驗證階段,表明有顯著的改進空間。AI 開發人員未來的努力應集中在設計(例如定義未滿足的臨床需求、預期的臨床環境以及 AI 如何整合到臨床工作流程中)、開發(例如建立在先前的工作、可解釋性)、評估(例如評估和解決偏差、評估 AI 與最佳實務)、以及數據可複製性和可用性(公開提供文件化的代碼和數據)。遵循這些建議可以改善 AI 方法的臨床轉譯。 +##### **From RAG to Memory: Non-Parametric Continual Learning for Large Language Models** +2502.14802v1 by Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, Yu Su -##### **Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy** -2409.00001v1 by Kimji N. Pellano, Inga Strümke, Daniel Groos, Lars Adde, Espen Alexander F. Ihlen +Our ability to continuously acquire, organize, and leverage knowledge is a +key feature of human intelligence that AI systems must approximate to unlock +their full potential. Given the challenges in continual learning with large +language models (LLMs), retrieval-augmented generation (RAG) has become the +dominant way to introduce new information. However, its reliance on vector +retrieval hinders its ability to mimic the dynamic and interconnected nature of +human long-term memory. 
Recent RAG approaches augment vector embeddings with +various structures like knowledge graphs to address some of these gaps, namely +sense-making and associativity. However, their performance on more basic +factual memory tasks drops considerably below standard RAG. We address this +unintended deterioration and propose HippoRAG 2, a framework that outperforms +standard RAG comprehensively on factual, sense-making, and associative memory +tasks. HippoRAG 2 builds upon the Personalized PageRank algorithm used in +HippoRAG and enhances it with deeper passage integration and more effective +online use of an LLM. This combination pushes this RAG system closer to the +effectiveness of human long-term memory, achieving a 7% improvement in +associative memory tasks over the state-of-the-art embedding model while also +exhibiting superior factual knowledge and sense-making memory capabilities. +This work paves the way for non-parametric continual learning for LLMs. Our +code and data will be released at https://github.com/OSU-NLP-Group/HippoRAG. -Early detection of Cerebral Palsy (CP) is crucial for effective intervention -and monitoring. This paper tests the reliability and applicability of -Explainable AI (XAI) methods using a deep learning method that predicts CP by -analyzing skeletal data extracted from video recordings of infant movements. -Specifically, we use XAI evaluation metrics -- namely faithfulness and -stability -- to quantitatively assess the reliability of Class Activation -Mapping (CAM) and Gradient-weighted Class Activation Mapping (Grad-CAM) in this -specific medical application. We utilize a unique dataset of infant movements -and apply skeleton data perturbations without distorting the original dynamics -of the infant movements. Our CP prediction model utilizes an ensemble approach, -so we evaluate the XAI metrics performances for both the overall ensemble and -the individual models. 
Our findings indicate that both XAI methods effectively -identify key body points influencing CP predictions and that the explanations -are robust against minor data perturbations. Grad-CAM significantly outperforms -CAM in the RISv metric, which measures stability in terms of velocity. In -contrast, CAM performs better in the RISb metric, which relates to bone -stability, and the RRS metric, which assesses internal representation -robustness. Individual models within the ensemble show varied results, and -neither CAM nor Grad-CAM consistently outperform the other, with the ensemble -approach providing a representation of outcomes from its constituent models. +摘要:我們持續獲取、組織和利用知識的能力是人類智慧的一項關鍵特徵,而人工智慧系統必須近似於此才能發揮其全部潛力。由於大型語言模型 (LLM) 持續學習的挑戰,檢索增強生成 (RAG) 已成為引入新資訊的主流方式。然而,它依賴向量檢索阻礙了它模擬人類長期記憶的動態和相互連結的本質。最近的 RAG 方法用各種結構(如知識圖譜)增強向量嵌入,以解決其中一些差距,即意義建構和聯想性。然而,它們在更基本的實際記憶任務上的表現遠低於標準 RAG。我們解決了這種意外的惡化,並提出了 HippoRAG 2,這是一個在實際、意義建構和聯想記憶任務上全面優於標準 RAG 的框架。HippoRAG 2 建立在 HippoRAG 中使用的 Personalized PageRank 演算法之上,並透過更深入的段落整合和更有效的 LLM 線上使用來增強它。這種組合將此 RAG 系統推向更接近人類長期記憶的效能,在聯想記憶任務上比最先進的嵌入模型提升了 7%,同時也展現出優異的實際知識和意義建構記憶能力。這項工作為 LLM 的非參數持續學習鋪平了道路。我們的程式碼和資料將在 https://github.com/OSU-NLP-Group/HippoRAG 上發布。 -摘要:腦性麻痺 (CP) 的早期偵測對於有效的介入和監測至關重要。本文測試了可解釋 AI (XAI) 方法的可靠性和適用性,使用深度學習方法,透過分析從嬰兒動作影片記錄中提取的骨骼資料來預測 CP。具體來說,我們使用 XAI 評估指標(即忠實度和穩定性)來量化評估類別激活映射 (CAM) 和梯度加權類別激活映射 (Grad-CAM) 在這個特定醫療應用中的可靠性。我們利用一個獨特的嬰兒動作資料集,並應用骨骼資料擾動,而不會扭曲嬰兒動作的原始動力。我們的 CP 預測模型利用整體方法,因此我們評估了整體整體和個別模型的 XAI 指標表現。我們的研究結果表明,兩種 XAI 方法都能有效識別影響 CP 預測的關鍵身體部位,並且這些解釋對於微小的資料擾動具有魯棒性。Grad-CAM 在 RISv 指標中顯著優於 CAM,該指標衡量速度方面的穩定性。相比之下,CAM 在 RISb 指標中表現得更好,該指標與骨骼穩定性有關,而 RRS 指標則評估內部表示的魯棒性。整體中的個別模型顯示出不同的結果,CAM 和 Grad-CAM 都不一致地優於另一種,整體方法提供了其組成模型結果的表示。 +##### **A Survey on Text-Driven 360-Degree Panorama Generation** +2502.14799v1 by Hai Wang, Xiaoyu Xiang, Weihao Xia, Jing-Hao Xue -##### **MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy** -2408.11837v1 by Hanchen David Wang, Nibraas Khan, Anna Chen, Nilanjan 
Sarkar, Pamela Wisniewski, Meiyi Ma +The advent of text-driven 360-degree panorama generation, enabling the +synthesis of 360-degree panoramic images directly from textual descriptions, +marks a transformative advancement in immersive visual content creation. This +innovation significantly simplifies the traditionally complex process of +producing such content. Recent progress in text-to-image diffusion models has +accelerated the rapid development in this emerging field. This survey presents +a comprehensive review of text-driven 360-degree panorama generation, offering +an in-depth analysis of state-of-the-art algorithms and their expanding +applications in 360-degree 3D scene generation. Furthermore, we critically +examine current limitations and propose promising directions for future +research. A curated project page with relevant resources and research papers is +available at https://littlewhitesea.github.io/Text-Driven-Pano-Gen/. -Recent global estimates suggest that as many as 2.41 billion individuals have -health conditions that would benefit from rehabilitation services. Home-based -Physical Therapy (PT) faces significant challenges in providing interactive -feedback and meaningful observation for therapists and patients. To fill this -gap, we present MicroXercise, which integrates micro-motion analysis with -wearable sensors, providing therapists and patients with a comprehensive -feedback interface, including video, text, and scores. Crucially, it employs -multi-dimensional Dynamic Time Warping (DTW) and attribution-based explainable -methods to analyze the existing deep learning neural networks in monitoring -exercises, focusing on a high granularity of exercise. This synergistic -approach is pivotal, providing output matching the input size to precisely -highlight critical subtleties and movements in PT, thus transforming complex AI -analysis into clear, actionable feedback. 
By highlighting these micro-motions -in different metrics, such as stability and range of motion, MicroXercise -significantly enhances the understanding and relevance of feedback for -end-users. Comparative performance metrics underscore its effectiveness over -traditional methods, such as a 39% and 42% improvement in Feature Mutual -Information (FMI) and Continuity. MicroXercise is a step ahead in home-based -physical therapy, providing a technologically advanced and intuitively helpful -solution to enhance patient care and outcomes. +摘要:文字驅動 360 度全景圖生成技術的出現,使能從文字描述中直接合成 360 度全景圖像,標誌著沉浸式視覺內容創作的變革性進展。這項創新顯著簡化了傳統上複雜的製作此類內容的過程。最近在文字轉圖像擴散模型方面的進展加速了這個新興領域的快速發展。本調查提供了對文字驅動 360 度全景圖生成的全面回顧,深入分析了最先進的演算法及其在 360 度 3D 場景生成中的擴展應用。此外,我們批判性地審視了當前的限制,並提出了未來研究的有希望的方向。一個精選的專案頁面,其中包含相關資源和研究論文,可在 https://littlewhitesea.github.io/Text-Driven-Pano-Gen/ 獲得。 -摘要:最近的全球估計表明,多達 24.1 億人有 -健康狀況可從復健服務中受益。居家 -物理治療 (PT) 在提供互動式 -回饋和有意義的觀察方面面臨重大挑戰,供治療師和患者使用。為了填補這 -個缺口,我們提出 MicroXercise,它將微動作分析與 -可穿戴式感測器整合在一起,為治療師和患者提供一個全面的 -回饋介面,包括影片、文字和分數。至關重要的是,它採用 -多維動態時間規整 (DTW) 和基於歸因的可解釋 -方法來分析監控運動中現有的深度學習神經網路,專注於運動的高粒度。這種協同 -方法至關重要,提供與輸入大小匹配的輸出,以精確地 -突出 PT 中關鍵的細微差別和動作,從而將複雜的 AI -分析轉換為清晰、可操作的回饋。透過在不同指標中突顯這些微動作,例如穩定性和動作範圍,MicroXercise -顯著提升最終使用者對回饋的理解和相關性。比較效能指標強調其優於 -傳統方法的有效性,例如特徵互惠資訊 (FMI) 和連續性分別提升了 39% 和 42%。MicroXercise 在居家 -物理治療方面更進一步,提供技術先進且直覺有用的 -解決方案,以提升患者照護和結果。 +##### **Rapid Word Learning Through Meta In-Context Learning** +2502.14791v1 by Wentao Wang, Guangyuan Jiang, Tal Linzen, Brenden M. Lake -##### **The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development** -2408.05239v1 by Joshua Morriss, Tod Brindle, Jessica Bah Rösman, Daniel Reibsamen, Andreas Enz +Humans can quickly learn a new word from a few illustrative examples, and +then systematically and flexibly use it in novel contexts. 
Yet the abilities of +current language models for few-shot word learning, and methods for improving +these abilities, are underexplored. In this study, we introduce a novel method, +Meta-training for IN-context learNing Of Words (Minnow). This method trains +language models to generate new examples of a word's usage given a few +in-context examples, using a special placeholder token to represent the new +word. This training is repeated on many new words to develop a general +word-learning ability. We find that training models from scratch with Minnow on +human-scale child-directed language enables strong few-shot word learning, +comparable to a large language model (LLM) pre-trained on orders of magnitude +more data. Furthermore, through discriminative and generative evaluations, we +demonstrate that finetuning pre-trained LLMs with Minnow improves their ability +to discriminate between new words, identify syntactic categories of new words, +and generate reasonable new usages and definitions for new words, based on one +or a few in-context examples. These findings highlight the data efficiency of +Minnow and its potential to improve language model performance in word learning +tasks. -Systematic literature reviews are the highest quality of evidence in -research. However, the review process is hindered by significant resource and -data constraints. The Literature Review Network (LRN) is the first of its kind -explainable AI platform adhering to PRISMA 2020 standards, designed to automate -the entire literature review process. LRN was evaluated in the domain of -surgical glove practices using 3 search strings developed by experts to query -PubMed. A non-expert trained all LRN models. Performance was benchmarked -against an expert manual review. Explainability and performance metrics -assessed LRN's ability to replicate the experts' review. Concordance was -measured with the Jaccard index and confusion matrices. 
Researchers were -blinded to the other's results until study completion. Overlapping studies were -integrated into an LRN-generated systematic review. LRN models demonstrated -superior classification accuracy without expert training, achieving 84.78% and -85.71% accuracy. The highest performance model achieved high interrater -reliability (k = 0.4953) and explainability metrics, linking 'reduce', -'accident', and 'sharp' with 'double-gloving'. Another LRN model covered 91.51% -of the relevant literature despite diverging from the non-expert's judgments (k -= 0.2174), with the terms 'latex', 'double' (gloves), and 'indication'. LRN -outperformed the manual review (19,920 minutes over 11 months), reducing the -entire process to 288.6 minutes over 5 days. This study demonstrates that -explainable AI does not require expert training to successfully conduct -PRISMA-compliant systematic literature reviews like an expert. LRN summarized -the results of surgical glove studies and identified themes that were nearly -identical to the clinical researchers' findings. Explainable AI can accurately -expedite our understanding of clinical practices, potentially revolutionizing -healthcare research. 
+摘要:人類可以從幾個說明性的範例中快速學習一個新字詞,然後系統性且靈活地將其用於新的脈絡中。然而,目前語言模型在少量字詞學習中的能力,以及改善這些能力的方法,尚未得到充分探討。在這項研究中,我們引入了一種新方法,即「用於字詞情境學習的元訓練」(Minnow)。此方法訓練語言模型在給定幾個情境範例的情況下,產生字詞用法的範例,並使用特殊佔位符標記來表示新的字詞。此訓練會在許多新字詞上重複進行,以培養一般的字詞學習能力。我們發現,從頭開始使用 Minnow 在人類規模的兒童導向語言上訓練模型,可以實現強大的少量字詞學習能力,這與預先在大量資料上訓練的大型語言模型 (LLM) 相當。此外,透過區辨性和生成性評估,我們證明使用 Minnow 微調預先訓練的 LLM 可以提升其區辨新字詞、識別新字詞的句法類別,以及根據一個或幾個情境範例產生合理的新用法和定義的能力。這些發現突顯了 Minnow 的資料效率,以及它在字詞學習任務中提升語言模型效能的潛力。 -摘要:系統性文獻回顧是研究中證據品質最高的。然而,回顧過程受到顯著資源和資料限制的阻礙。文獻回顧網路 (LRN) 是第一個遵循 PRISMA 2020 標準的可解釋 AI 平台,旨在自動化整個文獻回顧過程。LRN 在外科手套實務領域中進行評估,使用專家開發的 3 個搜尋字串來查詢 PubMed。非專家訓練所有 LRN 模型。效能以專家手動回顧作為基準。可解釋性和效能指標評估 LRN 複製專家回顧的能力。一致性以 Jaccard 指數和混淆矩陣測量。研究人員在研究完成前對彼此的結果保密。重疊的研究整合到 LRN 生成的系統性回顧中。LRN 模型在沒有專家訓練的情況下展現出優異的分類準確率,達到 84.78% 和 85.71% 的準確率。效能最高的模型達到了高評分者間信賴度 (k = 0.4953) 和可解釋性指標,將「減少」、「意外」和「銳利」與「雙重戴手套」連結在一起。另一個 LRN 模型涵蓋了 91.51% 的相關文獻,儘管與非專家的判斷不同 (k = 0.2174),但包含了「乳膠」、「雙重」(手套)和「適應症」等詞彙。LRN 優於手動回顧(11 個月超過 19,920 分鐘),將整個過程縮短為 5 天超過 288.6 分鐘。這項研究顯示,可解釋的 AI 不需要專家訓練即可成功進行專家等級的 PRISMA 相容系統性文獻回顧。LRN 總結了外科手套研究的結果,並找出與臨床研究人員發現幾乎相同的主题。可解釋的 AI 可以準確地加快我們對臨床實務的理解,有潛力革新醫療保健研究。 +##### **SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features** +2502.14786v1 by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai -##### **Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns** -2408.02709v1 by Chi Him Ng +We introduce SigLIP 2, a family of new multilingual vision-language encoders +that build on the success of the original SigLIP. 
In this second iteration, we +extend the original image-text training objective with several prior, +independently developed techniques into a unified recipe -- this includes +captioning-based pretraining, self-supervised losses (self-distillation, masked +prediction) and online data curation. With these changes, SigLIP 2 models +outperform their SigLIP counterparts at all model scales in core capabilities, +including zero-shot classification, image-text retrieval, and transfer +performance when extracting visual representations for Vision-Language Models +(VLMs). Furthermore, the new training recipe leads to significant improvements +on localization and dense prediction tasks. We also train variants which +support multiple resolutions and preserve the input's native aspect ratio. +Finally, we train on a more diverse data-mixture that includes de-biasing +techniques, leading to much better multilingual understanding and improved +fairness. To allow users to trade off inference cost with performance, we +release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), +and g (1B). -This study analyzes hybrid AI systems' design patterns and their -effectiveness in clinical decision-making using the boxology framework. It -categorizes and compares various architectures combining machine learning and -rule-based reasoning to provide insights into their structural foundations and -healthcare applications. Addressing two main questions, how to categorize these -systems against established design patterns and how to extract insights through -comparative analysis, the study uses design patterns from software engineering -to understand and optimize healthcare AI systems. Boxology helps identify -commonalities and create reusable solutions, enhancing these systems' -scalability, reliability, and performance. Five primary architectures are -examined: REML, MLRB, RBML, RMLT, and PERML.
Each has unique strengths and -weaknesses, highlighting the need for tailored approaches in clinical tasks. -REML excels in high-accuracy prediction for datasets with limited data; MLRB in -handling large datasets and complex data integration; RBML in explainability -and trustworthiness; RMLT in managing high-dimensional data; and PERML, though -limited in analysis, shows promise in urgent care scenarios. The study -introduces four new patterns, creates five abstract categorization patterns, -and refines those five further to specific systems. These contributions enhance -Boxology's taxonomical organization and offer novel approaches to integrating -expert knowledge with machine learning. Boxology's structured, modular approach -offers significant advantages in developing and analyzing hybrid AI systems, -revealing commonalities, and promoting reusable solutions. In conclusion, this -study underscores hybrid AI systems' crucial role in advancing healthcare and -Boxology's potential to drive further innovation in AI integration, ultimately -improving clinical decision support and patient outcomes.
+摘要:我們推出了 SigLIP 2,這是一個新的多語言視覺語言編碼器系列,它建立在 SigLIP 的成功基礎上。在這個第二個版本中,我們將原來的圖像文字訓練目標與幾個先前獨立開發的技術擴展到一個統一的配方中,其中包括基於標題的預訓練、自我監督損失(自我蒸餾、遮罩預測)和線上數據策展。有了這些改變,SigLIP 2 模型在所有模型規模上都超越了 SigLIP 的對應模型,包括零次分類、圖像文字檢索和在為視覺語言模型 (VLM) 提取視覺表示時傳輸效能。此外,新的訓練配方也大幅改善了定位和密集預測任務。我們還訓練了支援多種解析度和保留輸入原生長寬比的變體。最後,我們在一個更為多樣化的數據組合上進行訓練,其中包括去偏見技術,從而大幅提升多語言理解力並改善公平性。為了讓使用者權衡推理成本與效能,我們發布了四種大小的模型檢查點:ViT-B (86M)、L (303M)、So400m (400M) 和 g (1B)。 -摘要:本研究使用盒子學框架分析混合人工智慧系統的設計模式及其在臨床決策中的有效性。它分類並比較結合機器學習和基於規則的推理的各種架構,以深入了解其結構基礎和醫療保健應用。針對兩個主要問題,如何根據既定的設計模式對這些系統進行分類,以及如何通過比較分析提取見解,本研究使用軟體工程中的設計模式來了解和優化醫療保健人工智慧系統。盒子學有助於識別共性並建立可重複使用的解決方案,從而增強這些系統的可擴充性、可靠性和效能。檢查了五種主要的架構:REML、MLRB、RBML、RMLT 和 PERML。每種架構都有獨特的優缺點,強調了在臨床任務中需要量身打造的方法。REML 在資料有限的資料集中表現出高精度的預測;MLRB 在處理大型資料集和複雜資料整合方面表現出色;RBML 在可解釋性和可信度方面表現出色;RMLT 在管理高維資料方面表現出色;而 PERML 儘管在分析方面有限,但在緊急照護場景中表現出潛力。本研究引入了四種新模式,建立了五種抽象分類模式,並進一步將這五種模式細化為具體的系統。這些貢獻增強了盒子學的分類組織,並提供了將專家知識與機器學習整合的新方法。盒子學的結構化、模組化方法在開發和分析混合人工智慧系統、揭示共性以及推廣可重複使用的解決方案方面具有顯著優勢。總之,本研究強調了混合人工智慧系統在推進醫療保健中的關鍵作用,以及盒子學在推動人工智慧整合進一步創新方面的潛力,最終改善臨床決策支援和患者的治療成果。 +##### **ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting** +2502.14780v1 by Abhijit Mishra, Richard Noh, Hsiang Fu, Mingda Li, Minji Kim -##### **Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability** -2408.02706v1 by Masoud Muhammed Hassan +Efficient and privacy-preserving multimodal interaction is essential as AR, +VR, and modern smartphones with powerful cameras become primary interfaces for +human-computer communication. Existing powerful large vision-language models +(VLMs) enabling multimodal interaction often rely on cloud-based processing, +raising significant concerns about (1) visual privacy by transmitting sensitive +vision data to servers, and (2) their limited real-time, on-device usability. 
+This paper explores Visual Instruction Rewriting, a novel approach that +transforms multimodal instructions into text-only commands, allowing seamless +integration of lightweight on-device instruction rewriter VLMs (250M +parameters) with existing conversational AI systems, enhancing vision data +privacy. To achieve this, we present a dataset of over 39,000 examples across +14 domains and develop a compact VLM, pretrained on image captioning datasets +and fine-tuned for instruction rewriting. Experimental results, evaluated +through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic +parsing analysis, demonstrate that even a quantized version of the model +(<500MB storage footprint) can achieve effective instruction rewriting, thus +enabling privacy-focused, multimodal AI applications. -Because of its strong predictive skills, deep learning has emerged as an -essential tool in many industries, including healthcare. Traditional deep -learning models, on the other hand, frequently lack interpretability and omit -to take prediction uncertainty into account two crucial components of clinical -decision making. In order to produce explainable and uncertainty aware -predictions, this study presents a novel framework called Bayesian Kolmogorov -Arnold Networks (BKANs), which combines the expressive capacity of Kolmogorov -Arnold Networks with Bayesian inference. We employ BKANs on two medical -datasets, which are widely used benchmarks for assessing machine learning -models in medical diagnostics: the Pima Indians Diabetes dataset and the -Cleveland Heart Disease dataset. Our method provides useful insights into -prediction confidence and decision boundaries and outperforms traditional deep -learning models in terms of prediction accuracy. Moreover, BKANs' capacity to -represent aleatoric and epistemic uncertainty guarantees doctors receive more -solid and trustworthy decision support. 
Our Bayesian strategy improves the -interpretability of the model and considerably minimises overfitting, which is -important for tiny and imbalanced medical datasets, according to experimental -results. We present possible expansions to further use BKANs in more -complicated multimodal datasets and address the significance of these -discoveries for future research in building reliable AI systems for healthcare. -This work paves the way for a new paradigm in deep learning model deployment in -vital sectors where transparency and reliability are crucial. +摘要:高效且重視隱私的多模態互動至關重要,因為 AR、VR 和配備強大相機的現代智慧型手機已成為人機溝通的主要介面。現有的強大大型視覺語言模型 (VLM) 能支援多模態互動,通常仰賴雲端處理,這引發了重大的疑慮,包括:(1) 將敏感的視覺資料傳輸至伺服器,會造成視覺隱私問題,以及 (2) 其有限的即時、裝置上可用性。本文探討視覺指令改寫,這是一種新穎的方法,可將多模態指令轉換為純文字指令,讓輕量級的裝置上指令改寫 VLM (250M 參數) 與現有的對話式 AI 系統無縫整合,進而強化視覺資料的隱私。為達成此目標,我們提供一個跨越 14 個領域、超過 39,000 個範例的資料集,並開發一個精簡的 VLM,在圖片標題資料集上進行預訓練,並針對指令改寫進行微調。實驗結果透過 NLG 指標(例如 BLEU、METEOR 和 ROUGE)以及語意解析分析進行評估,證明即使是模型的量化版本(<500MB 儲存空間佔用量)也能有效執行指令改寫,進而支援注重隱私的多模態 AI 應用程式。 + +##### **Harnessing PDF Data for Improving Japanese Large Multimodal Models** +2502.14778v1 by Jeonghun Baek, Akiko Aizawa, Kiyoharu Aizawa + +Large Multimodal Models (LMMs) have demonstrated strong performance in +English, but their effectiveness in Japanese remains limited due to the lack of +high-quality training data. Current Japanese LMMs often rely on translated +English datasets, restricting their ability to capture Japan-specific cultural +knowledge. To address this, we explore the potential of Japanese PDF data as a +training resource, an area that remains largely underutilized. We introduce a +fully automated pipeline that leverages pretrained models to extract image-text +pairs from PDFs through layout analysis, OCR, and vision-language pairing, +removing the need for manual annotation. Additionally, we construct instruction +data from extracted image-text pairs to enrich the training data. 
To evaluate +the effectiveness of PDF-derived data, we train Japanese LMMs and assess their +performance on the Japanese LMM Benchmark. Our results demonstrate substantial +improvements, with performance gains ranging from 3.9% to 13.8% on Heron-Bench. +Further analysis highlights the impact of PDF-derived data on various factors, +such as model size and language models, reinforcing its value as a multimodal +resource for Japanese LMMs. We plan to make the source code and data publicly +available upon acceptance. -摘要:由於其強大的預測能力,深度學習已成為許多產業中不可或缺的工具,包括醫療保健。然而,傳統的深度學習模型通常缺乏可解釋性,並且忽略了將預測不確定性納入考量,而這兩個因素是臨床決策制定的關鍵組成部分。為了產生可解釋且具有不確定性意識的預測,本研究提出了一個名為貝氏柯爾莫哥洛夫阿諾德網路 (BKAN) 的新架構,它結合了柯爾莫哥洛夫阿諾德網路的表達能力與貝氏推論。我們在兩個醫學資料集上使用 BKAN,這些資料集是評估機器學習模型在醫學診斷中的廣泛使用基準:皮馬印第安人糖尿病資料集和克里夫蘭心臟病資料集。我們的模型提供了對預測信心和決策邊界的有益見解,並且在預測準確度方面優於傳統的深度學習模型。此外,BKAN 表現隨機和認識不確定性的能力,可確保醫生獲得更可靠且值得信賴的決策支援。根據實驗結果,我們的貝氏策略提高了模型的可解釋性,並大幅減少了過度擬合,這對於小型且不平衡的醫學資料集非常重要。我們提出了可能的擴充功能,以進一步將 BKAN 用於更複雜的多模式資料集,並探討這些發現對於未來建立可靠的醫療保健 AI 系統研究的重要性。這項工作為深度學習模型部署在透明度和可靠性至關重要的重要領域中開啟了一個新的典範。 +摘要:大型多模態模型 (LMM) 已在英語中表現出強勁的效能,但由於缺乏高品質的訓練資料,它們在日語中的效能仍然有限。目前的日語 LMM 通常依賴於翻譯後的英語資料集,限制了它們擷取特定於日本的文化知識的能力。為了解決這個問題,我們探索了日語 PDF 資料作為訓練資源的潛力,這個領域在很大程度上仍然未被充分利用。我們引入了一個全自動的管道,利用預先訓練好的模型透過版面分析、光學字元辨識和視覺語言配對從 PDF 中擷取影像文字對,消除了手動註解的需要。此外,我們從擷取的影像文字對中建構說明資料,以豐富訓練資料。為了評估 PDF 衍生資料的效能,我們訓練了日語 LMM,並在日語 LMM 基準上評估它們的效能。我們的結果證明了顯著的進步,在 Heron-Bench 上的效能提升幅度從 3.9% 到 13.8%。進一步的分析重點說明了 PDF 衍生資料對各種因素的影響,例如模型大小和語言模型,加強了其作為日語 LMM 的多模態資源的價值。我們計畫在接受後公開原始程式碼和資料。 -##### **MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI** -2407.20284v1 by Shyam Dongre, Ritesh Chandra, Sonali Agarwal +##### **Making Universal Policies Universal** +2502.14777v1 by Niklas Höpner, David Kuric, Herke van Hoof -In modern healthcare, addressing the complexities of accurate disease -prediction and personalized recommendations is both crucial and challenging. 
-This research introduces MLtoGAI, which integrates Semantic Web technology with -Machine Learning (ML) to enhance disease prediction and offer user-friendly -explanations through ChatGPT. The system comprises three key components: a -reusable disease ontology that incorporates detailed knowledge about various -diseases, a diagnostic classification model that uses patient symptoms to -detect specific diseases accurately, and the integration of Semantic Web Rule -Language (SWRL) with ontology and ChatGPT to generate clear, personalized -health advice. This approach significantly improves prediction accuracy and -ensures results that are easy to understand, addressing the complexity of -diseases and diverse symptoms. The MLtoGAI system demonstrates substantial -advancements in accuracy and user satisfaction, contributing to developing more -intelligent and accessible healthcare solutions. This innovative approach -combines the strengths of ML algorithms with the ability to provide -transparent, human-understandable explanations through ChatGPT, achieving -significant improvements in prediction accuracy and user comprehension. By -leveraging semantic technology and explainable AI, the system enhances the -accuracy of disease prediction and ensures that the recommendations are -relevant and easily understood by individual patients. Our research highlights -the potential of integrating advanced technologies to overcome existing -challenges in medical diagnostics, paving the way for future developments in -intelligent healthcare systems. Additionally, the system is validated using 200 -synthetic patient data records, ensuring robust performance and reliability. +The development of a generalist agent capable of solving a wide range of +sequential decision-making tasks remains a significant challenge. We address +this problem in a cross-agent setup where agents share the same observation +space but differ in their action spaces. 
Our approach builds on the universal +policy framework, which decouples policy learning into two stages: a +diffusion-based planner that generates observation sequences and an inverse +dynamics model that assigns actions to these plans. We propose a method for +training the planner on a joint dataset composed of trajectories from all +agents. This method offers the benefit of positive transfer by pooling data +from different agents, while the primary challenge lies in adapting shared +plans to each agent's unique constraints. We evaluate our approach on the +BabyAI environment, covering tasks of varying complexity, and demonstrate +positive transfer across agents. Additionally, we examine the planner's +generalisation ability to unseen agents and compare our method to traditional +imitation learning approaches. By training on a pooled dataset from multiple +agents, our universal policy achieves an improvement of up to $42.20\%$ in task +completion accuracy compared to a policy trained on a dataset from a single +agent. 
-摘要:在現代醫療保健中,解決準確疾病預測和個性化建議的複雜性既至關重要又具有挑戰性。本研究引入了 MLtoGAI,它將語義網路技術與機器學習 (ML) 相結合,以增強疾病預測並透過 ChatGPT 提供使用者友善的說明。該系統包含三個關鍵組成部分:一個可重複使用的疾病本体,其中包含有關各種疾病的詳細知識;一個診斷分類模型,它使用患者症狀來準確檢測特定疾病;以及語義網路規則語言 (SWRL) 與本体和 ChatGPT 的整合,以產生清晰、個性化的健康建議。這種方法顯著提高了預測準確性,並確保了易於理解的結果,解決了疾病和不同症狀的複雜性。MLtoGAI 系統展示了準確性和使用者滿意度的實質性進步,有助於開發更智慧且更易於取得的醫療保健解決方案。這種創新的方法結合了 ML 演算法的優點,以及透過 ChatGPT 提供透明且人類可以理解的說明的能力,在預測準確性和使用者理解方面取得了顯著的進步。透過利用語義技術和可解釋的 AI,該系統提高了疾病預測的準確性,並確保了建議與個別患者相關且易於理解。我們的研究強調了整合先進技術以克服醫療診斷中現有挑戰的潛力,為智慧醫療保健系統的未來發展鋪路。此外,該系統使用 200 個合成患者資料記錄進行驗證,確保了穩健的效能和可靠性。 +摘要:開發一種能夠解決廣泛順序決策任務的通才代理仍然是一項重大挑戰。我們在跨代理設置中解決這個問題,其中代理共享相同的觀察空間,但在其動作空間中有所不同。我們的做法建立在通用策略框架之上,該框架將策略學習解耦為兩個階段:生成觀察序列的基於擴散的規劃器和將動作分配給這些計劃的逆動態模型。我們提出了一種在由所有代理的軌跡組成的聯合數據集上訓練規劃器的方法。這種方法提供了通過彙總來自不同代理的數據來進行正向傳輸的好處,而主要的挑戰在於將共享計劃適應於每個代理的唯一約束。我們在 BabyAI 環境中評估了我們的做法,涵蓋了不同複雜程度的任務,並展示了跨代理的正向傳輸。此外,我們檢查了規劃器對未見代理的概括能力,並將我們的做法與傳統的模仿學習方法進行了比較。通過在來自多個代理的彙總數據集上進行訓練,與在單個代理的數據集上訓練的策略相比,我們的通用策略在任務完成準確度方面實現了高達 42.20% 的改進。 -##### **Introducing δ-XAI: a novel sensitivity-based method for local AI explanations** -2407.18343v2 by Alessandro De Carlo, Enea Parimbelli, Nicola Melillo, Giovanna Nicora +##### **SurveyX: Academic Survey Automation via Large Language Models** +2502.14776v1 by Xun Liang, Jiawei Yang, Yezhaohui Wang, Chen Tang, Zifan Zheng, Simin Niu, Shichao Song, Hanyu Wang, Bo Tang, Feiyu Xiong, Keming Mao, Zhiyu li -Explainable Artificial Intelligence (XAI) is central to the debate on -integrating Artificial Intelligence (AI) and Machine Learning (ML) algorithms -into clinical practice. High-performing AI/ML models, such as ensemble learners -and deep neural networks, often lack interpretability, hampering clinicians' -trust in their predictions. To address this, XAI techniques are being developed -to describe AI/ML predictions in human-understandable terms. One promising -direction is the adaptation of sensitivity analysis (SA) and global sensitivity -analysis (GSA), which inherently rank model inputs by their impact on -predictions.
Here, we introduce a novel delta-XAI method that provides local -explanations of ML model predictions by extending the delta index, a GSA -metric. The delta-XAI index assesses the impact of each feature's value on the -predicted output for individual instances in both regression and classification -problems. We formalize the delta-XAI index and provide code for its -implementation. The delta-XAI method was evaluated on simulated scenarios using -linear regression models, with Shapley values serving as a benchmark. Results -showed that the delta-XAI index is generally consistent with Shapley values, -with notable discrepancies in models with highly impactful or extreme feature -values. The delta-XAI index demonstrated higher sensitivity in detecting -dominant features and handling extreme feature values. Qualitatively, the -delta-XAI provides intuitive explanations by leveraging probability density -functions, making feature rankings clearer and more explainable for -practitioners. Overall, the delta-XAI method appears promising for robustly -obtaining local explanations of ML model predictions. Further investigations in -real-world clinical settings will be conducted to evaluate its impact on -AI-assisted clinical workflows. +Large Language Models (LLMs) have demonstrated exceptional comprehension +capabilities and a vast knowledge base, suggesting that LLMs can serve as +efficient tools for automated survey generation. However, recent research +related to automated survey generation remains constrained by some critical +limitations like finite context window, lack of in-depth content discussion, +and absence of systematic evaluation frameworks. Inspired by human writing +processes, we propose SurveyX, an efficient and organized system for automated +survey generation that decomposes the survey composing process into two phases: +the Preparation and Generation phases. 
By innovatively introducing online +reference retrieval, a pre-processing method called AttributeTree, and a +re-polishing process, SurveyX significantly enhances the efficacy of survey +composition. Experimental evaluation results show that SurveyX outperforms +existing automated survey generation systems in content quality (0.259 +improvement) and citation quality (1.76 enhancement), approaching human expert +performance across multiple evaluation dimensions. Examples of surveys +generated by SurveyX are available on www.surveyx.cn -摘要:可解釋人工智慧 (XAI) 是將人工智慧 (AI) 和機器學習 (ML) 演算法整合到臨床實務中的辯論核心。高執行效能的 AI/ML 模型,例如整體學習器和深度神經網路,通常缺乏可解釋性,阻礙臨床醫生對其預測的信任。為了解決這個問題,正在開發 XAI 技術,以人類可以理解的術語描述 AI/ML 預測。一個有希望的方向是採用敏感度分析 (SA) 和全球敏感度分析 (GSA),它們本質上會依據模型輸入對預測的影響來對其進行排名。在此,我們介紹一種新的 delta-XAI 方法,透過擴充 GSA 指標 delta 指數來提供 ML 模型預測的局部解釋。delta-XAI 指數評估每個特徵值對回歸和分類問題中個別例項的預測輸出之影響。我們將 delta-XAI 指數形式化,並提供其實作的程式碼。使用線性回歸模型對模擬情境評估 delta-XAI 方法,並以 Shapley 值作為基準。結果顯示 delta-XAI 指數通常與 Shapley 值一致,但在具有高度影響力或極端特徵值的模型中存在顯著差異。delta-XAI 指數在偵測主要特徵和處理極端特徵值方面表現出更高的敏感度。定性地來說,delta-XAI 透過利用機率密度函數提供直觀的解釋,使特徵排名更清晰且對從業人員來說更具可解釋性。總體而言,delta-XAI 方法對於穩健地取得 ML 模型預測的局部解釋似乎很有希望。將在真實世界的臨床環境中進行進一步調查,以評估其對 AI 輔助臨床工作流程的影響。 +摘要:大型語言模型 (LLM) 已展現出卓越的理解能力和廣泛的知識庫,表示 LLM 可作為自動調查生成的有用工具。然而,與自動調查生成相關的最新研究仍受到一些關鍵限制的約束,例如有限的上下文視窗、缺乏深入的內容討論以及系統評估架構的缺失。受到人類寫作過程的啟發,我們提出 SurveyX,這是一個用於自動調查生成的有效且有組織的系統,它將調查組成過程分解為兩個階段:準備和生成階段。透過創新地引入線上參考檢索、一種稱為 AttributeTree 的預處理方法和重新潤飾過程,SurveyX 大幅提升了調查組成的效能。實驗評估結果顯示,SurveyX 在內容品質(提升 0.259)和引用品質(提升 1.76)方面優於現有的自動調查生成系統,在多個評估面向中接近人類專家的表現。由 SurveyX 生成的調查範例可在 www.surveyx.cn 取得 -##### **Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population** -2407.17324v2 by Nikolaos Ntampakis, Konstantinos Diamantaras, Ioanna Chouvarda, Vasileios Argyriou, Panagiotis Sarigianndis +##### **Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning** +2502.14768v1 by Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai 
Qiu, Zhirong Wu, Chong Luo -Dementia, a debilitating neurological condition affecting millions worldwide, -presents significant diagnostic challenges. In this work, we introduce a novel -methodology for the classification of demented and non-demented elderly -patients using 3D brain Magnetic Resonance Imaging (MRI) scans. Our approach -features a unique technique for selectively processing MRI slices, focusing on -the most relevant brain regions and excluding less informative sections. This -methodology is complemented by a confidence-based classification committee -composed of three custom deep learning models: Dem3D ResNet, Dem3D CNN, and -Dem3D EfficientNet. These models work synergistically to enhance -decision-making accuracy, leveraging their collective strengths. Tested on the -Open Access Series of Imaging Studies(OASIS) dataset, our method achieved an -impressive accuracy of 94.12%, surpassing existing methodologies. Furthermore, -validation on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset -confirmed the robustness and generalizability of our approach. The use of -explainable AI (XAI) techniques and comprehensive ablation studies further -substantiate the effectiveness of our techniques, providing insights into the -decision-making process and the importance of our methodology. This research -offers a significant advancement in dementia diagnosis, providing a highly -accurate and efficient tool for clinical applications. +Inspired by the success of DeepSeek-R1, we explore the potential of +rule-based reinforcement learning (RL) in large reasoning models. To analyze +reasoning dynamics, we use synthetic logic puzzles as training data due to +their controllable complexity and straightforward answer verification. 
We make +some key technical contributions that lead to effective and stable RL training: +a system prompt that emphasizes the thinking and answering process, a stringent +format reward function that penalizes outputs for taking shortcuts, and a +straightforward training recipe that achieves stable convergence. Our 7B model +develops advanced reasoning skills-such as reflection, verification, and +summarization-that are absent from the logic corpus. Remarkably, after training +on just 5K logic problems, it demonstrates generalization abilities to the +challenging math benchmarks AIME and AMC. -摘要:失智症是一種影響全球數百萬人的衰弱性神經疾病,在診斷上具有重大挑戰。在這項工作中,我們提出了一種新的方法,用於對失智和非失智老年患者進行分類,使用 3D 大腦磁振造影 (MRI) 掃描。我們的做法採用了一種獨特技術,用於選擇性處理 MRI 切片,重點關注最相關的大腦區域,並排除信息量較少的部分。這種方法由一個基於信心的分類委員會補充,該委員會由三個自定義深度學習模型組成:Dem3D ResNet、Dem3D CNN 和 Dem3D EfficientNet。這些模型協同工作以增強決策的準確性,利用它們的集體優勢。在影像研究開放存取系列 (OASIS) 資料集上進行測試,我們的模型達到了 94.12% 的驚人準確度,超過了現有方法。此外,在阿茲海默症神經影像倡議 (ADNI) 資料集上的驗證證實了我們方法的穩健性和普遍性。可解釋 AI (XAI) 技術和全面的消融研究進一步證實了我們技術的有效性,提供了對決策過程和我們方法重要性的見解。這項研究為失智症診斷提供了重大進展,為臨床應用提供了一個高度準確且高效的工具。 +摘要:在 DeepSeek-R1 成功案例的启发下,我们探索了基于规则的强化学习 (RL) 在大型推理模型中的潜力。为了分析推理动态,我们使用合成逻辑难题作为训练数据,因为它们的可控复杂性和直接的答案验证。我们做出了一些关键的技术贡献,这些贡献导致了有效且稳定的 RL 训练:一个强调思考和回答过程的系统提示、一个严格的格式奖励函数,用于惩罚采取捷径的输出,以及一个实现稳定收敛的直接训练配方。我们的 7B 模型发展了高级推理技能,例如反射、验证和总结,这些技能在逻辑语料库中是不存在的。值得注意的是,在仅对 5K 个逻辑问题进行训练后,它展示了对具有挑战性的数学基准 AIME 和 AMC 的泛化能力。 -##### **Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition** -2408.06352v1 by Michele Fiori, Gabriele Civitarese, Claudio Bettini +##### **Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis** +2502.14767v1 by Priyanka Kargupta, Ishika Agarwal, Tal August, Jiawei Han -Recognizing daily activities with unobtrusive sensors in smart environments -enables various healthcare applications. 
Monitoring how subjects perform -activities at home and their changes over time can reveal early symptoms of -health issues, such as cognitive decline. Most approaches in this field use -deep learning models, which are often seen as black boxes mapping sensor data -to activities. However, non-expert users like clinicians need to trust and -understand these models' outputs. Thus, eXplainable AI (XAI) methods for Human -Activity Recognition have emerged to provide intuitive natural language -explanations from these models. Different XAI methods generate different -explanations, and their effectiveness is typically evaluated through user -surveys, that are often challenging in terms of costs and fairness. This paper -proposes an automatic evaluation method using Large Language Models (LLMs) to -identify, in a pool of candidates, the best XAI approach for non-expert users. -Our preliminary results suggest that LLM evaluation aligns with user surveys. +With the exponential growth of research facilitated by modern technology and +improved accessibility, scientific discoveries have become increasingly +fragmented within and across fields. This makes it challenging to assess the +significance, novelty, incremental findings, and equivalent ideas between +related works, particularly those from different research communities. Large +language models (LLMs) have recently demonstrated strong quantitative and +qualitative reasoning abilities, and multi-agent LLM debates have shown promise +in handling complex reasoning tasks by exploring diverse perspectives and +reasoning paths. Inspired by this, we introduce Tree-of-Debate (ToD), a +framework which converts scientific papers into LLM personas that debate their +respective novelties. To emphasize structured, critical reasoning rather than +focusing solely on outcomes, ToD dynamically constructs a debate tree, enabling +fine-grained analysis of independent novelty arguments within scholarly +articles. 
Through experiments on scientific literature across various domains, +evaluated by expert researchers, we demonstrate that ToD generates informative +arguments, effectively contrasts papers, and supports researchers in their +literature review. -摘要:藉由智慧環境中不引人注目的感測器辨識日常活動,能啟用各種醫療保健應用。監控受試者在家中如何執行活動,以及其隨著時間的變化,可以揭示健康問題的早期症狀,例如認知能力下降。此領域中的大多數方法都使用深度學習模型,這些模型通常被視為將感測器資料對應至活動的黑盒子。然而,非專家使用者(例如臨床醫師)需要信任並了解這些模型的輸出。因此,人類活動辨識的可解釋 AI (XAI) 方法應運而生,以提供來自這些模型的直覺自然語言說明。不同的 XAI 方法會產生不同的說明,而其有效性通常透過使用者調查來評估,這在成本和公平性方面通常具有挑戰性。本文提出使用大型語言模型 (LLM) 的自動評估方法,以在候選者中找出最適合非專家使用者的 XAI 方法。我們的初步結果表明,LLM 評估與使用者調查一致。 +摘要:隨著現代科技促進的研究呈指數成長,加上可近性的提升,科學發現已在各領域內外變得越來越分散。這使得評估相關作品之間的重要性、新穎性、漸進式發現和等價概念變得具有挑戰性,特別是來自不同研究社群的作品。大型語言模型 (LLM) 近期已展現出強大的量化和質化推理能力,而多重代理 LLM 辯論已在處理複雜推理任務方面展現出潛力,方法是探索不同的觀點和推理路徑。受到此啟發,我們引入了辯論樹 (ToD),這是一個將科學論文轉換為 LLM 人格的架構,這些人格會辯論各自的新穎性。為了強調結構化、批判性推理,而非僅專注於結果,ToD 會動態建構一個辯論樹,讓使用者能夠深入分析學術文章中獨立的新穎性論點。透過在不同領域的科學文獻上進行實驗,並由專家研究員進行評估,我們證明了 ToD 能產生有見地的論點、有效對比論文,並在研究人員的文獻回顧中提供協助。 -##### **Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions** -2408.03335v1 by Naseem Khan, Kashif Ahmad, Aref Al Tamimi, Mohammed M. Alani, Amine Bermak, Issa Khalil +##### **Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning** +2502.14765v1 by Juraj Vladika, Ivana Hacajová, Florian Matthes -Industry 5.0, which focuses on human and Artificial Intelligence (AI) -collaboration for performing different tasks in manufacturing, involves a -higher number of robots, Internet of Things (IoTs) devices and -interconnections, Augmented/Virtual Reality (AR), and other smart devices. The -huge involvement of these devices and interconnection in various critical -areas, such as economy, health, education and defense systems, poses several -types of potential security flaws. 
AI itself has been proven a very effective -and powerful tool in different areas of cybersecurity, such as intrusion -detection, malware detection, and phishing detection, among others. Just as in -many application areas, cybersecurity professionals were reluctant to accept -black-box ML solutions for cybersecurity applications. This reluctance pushed -forward the adoption of eXplainable Artificial Intelligence (XAI) as a tool -that helps explain how decisions are made in ML-based systems. In this survey, -we present a comprehensive study of different XAI-based intrusion detection -systems for industry 5.0, and we also examine the impact of explainability and -interpretability on Cybersecurity practices through the lens of Adversarial -XIDS (Adv-XIDS) approaches. Furthermore, we analyze the possible opportunities -and challenges in XAI cybersecurity systems for industry 5.0 that elicit future -research toward XAI-based solutions to be adopted by high-stakes industry 5.0 -applications. We believe this rigorous analysis will establish a foundational -framework for subsequent research endeavors within the specified domain. +Fact verification (FV) aims to assess the veracity of a claim based on +relevant evidence. The traditional approach for automated FV includes a +three-part pipeline relying on short evidence snippets and encoder-only +inference models. More recent approaches leverage the multi-turn nature of LLMs +to address FV as a step-by-step problem where questions inquiring additional +context are generated and answered until there is enough information to make a +decision. This iterative method makes the verification process rational and +explainable. While these methods have been tested for encyclopedic claims, +exploration on domain-specific and realistic claims is missing. 
In this work, +we apply an iterative FV system on three medical fact-checking datasets and +evaluate it with multiple settings, including different LLMs, external web +search, and structured reasoning using logic predicates. We demonstrate +improvements in the final performance over traditional approaches and the high +potential of step-by-step FV systems for domain-specific claims. -摘要:工業 5.0 著重於人類與人工智慧 (AI) 合作執行製造中的不同任務,涉及更多機器人、物聯網 (IoT) 裝置和互連、擴增/虛擬實境 (AR) 和其他智慧裝置。這些裝置和互連在經濟、醫療保健、教育和國防系統等各種關鍵領域的廣泛參與,引發了多種類型的潛在安全漏洞。AI 本身已被證明是網路安全不同領域中非常有效且強大的工具,例如入侵偵測、惡意軟體偵測和網路釣魚偵測等。就像在許多應用領域一樣,網路安全專業人員不願意接受黑盒 ML 解決方案來應用於網路安全。這種不願意促使可解釋人工智慧 (XAI) 作為一種工具被採用,有助於說明在基於 ML 的系統中如何做出決策。在這項調查中,我們對工業 5.0 的不同基於 XAI 的入侵偵測系統進行了全面的研究,並且我們也透過對抗式 XIDS (Adv-XIDS) 方法的觀點來探討可解釋性和可詮釋性對網路安全實務的影響。此外,我們分析了工業 5.0 的 XAI 網路安全系統中可能存在的機會和挑戰,引發了未來針對 XAI 基礎解決方案的研究,以供高風險的工業 5.0 應用採用。我們相信這項嚴謹的分析將為指定領域內的後續研究工作建立基礎架構。 +摘要:事實驗證 (FV) 旨在根據相關證據評估主張的真實性。自動化 FV 的傳統方法包括依賴於短證據片段和僅編碼器推論模型的三部分管道。最近的方法利用 LLM 的多輪特性,將 FV 視為一個逐步問題,其中會產生問題來詢問額外背景並回答,直到有足夠的資訊可以做出決定。這種迭代方法使驗證過程合理且可解釋。雖然這些方法已針對百科全書式主張進行測試,但缺乏對特定領域和現實主張的探討。在這項工作中,我們在三個醫學事實查核資料集上應用了一個迭代 FV 系統,並使用多種設定對其進行評估,包括不同的 LLM、外部網路搜尋和使用邏輯謂詞的結構化推理。我們展示了傳統方法的最終效能改進,以及逐步 FV 系統對特定領域主張的高潛力。 -##### **A Comparative Study on Automatic Coding of Medical Letters with Explainability** -2407.13638v1 by Jamie Glen, Lifeng Han, Paul Rayson, Goran Nenadic +##### **EquivaMap: Leveraging LLMs for Automatic Equivalence Checking of Optimization Formulations** +2502.14760v1 by Haotian Zhai, Connor Lawless, Ellen Vitercik, Liu Leqi -This study aims to explore the implementation of Natural Language Processing -(NLP) and machine learning (ML) techniques to automate the coding of medical -letters with visualised explainability and light-weighted local computer -settings. Currently in clinical settings, coding is a manual process that -involves assigning codes to each condition, procedure, and medication in a -patient's paperwork (e.g., 56265001 heart disease using SNOMED CT code). 
There -are preliminary research on automatic coding in this field using -state-of-the-art ML models; however, due to the complexity and size of the -models, the real-world deployment is not achieved. To further facilitate the -possibility of automatic coding practice, we explore some solutions in a local -computer setting; in addition, we explore the function of explainability for -transparency of AI models. We used the publicly available MIMIC-III database -and the HAN/HLAN network models for ICD code prediction purposes. We also -experimented with the mapping between ICD and SNOMED CT knowledge bases. In our -experiments, the models provided useful information for 97.98\% of codes. The -result of this investigation can shed some light on implementing automatic -clinical coding in practice, such as in hospital settings, on the local -computers used by clinicians , project page -\url{https://github.com/Glenj01/Medical-Coding}. +A fundamental problem in combinatorial optimization is identifying equivalent +formulations, which can lead to more efficient solution strategies and deeper +insights into a problem's computational complexity. The need to automatically +identify equivalence between problem formulations has grown as optimization +copilots--systems that generate problem formulations from natural language +descriptions--have proliferated. However, existing approaches to checking +formulation equivalence lack grounding, relying on simple heuristics which are +insufficient for rigorous validation. Inspired by Karp reductions, in this work +we introduce quasi-Karp equivalence, a formal criterion for determining when +two optimization formulations are equivalent based on the existence of a +mapping between their decision variables. We propose EquivaMap, a framework +that leverages large language models to automatically discover such mappings, +enabling scalable and reliable equivalence verification. 
To evaluate our +approach, we construct the first open-source dataset of equivalent optimization +formulations, generated by applying transformations such as adding slack +variables or valid inequalities to existing formulations. Empirically, +EquivaMap significantly outperforms existing methods, achieving substantial +improvements in correctly identifying formulation equivalence. -摘要:本研究旨在探討將自然語言處理 (NLP) 和機器學習 (ML) 技術實作於醫療信函編碼自動化,並具備視覺化說明能力和輕量化的本地電腦設定。目前在臨床環境中,編碼是一種手動流程,涉及為病患文件中的每項病症、程序和藥物指派代碼 (例如,使用 SNOMED CT 代碼 56265001 表示心臟病)。此領域有使用最新 ML 模型進行自動編碼的初步研究;然而,由於模型的複雜性和大小,並未實現實際部署。為了進一步促進自動編碼實務的可能性,我們在本地電腦設定中探討了一些解決方案;此外,我們探討了說明功能在 AI 模型透明度中的功能。我們使用公開的 MIMIC-III 資料庫和 HAN/HLAN 網路模型進行 ICD 代碼預測。我們還試驗了 ICD 和 SNOMED CT 知識庫之間的對應。在我們的實驗中,這些模型提供了 97.98% 代碼的有用資訊。這項調查結果可以為實務中的自動臨床編碼實作提供一些見解,例如在醫院環境中,由臨床醫生使用的本地電腦,專案頁面 \url{https://github.com/Glenj01/Medical-Coding}。 +摘要:組合優化中的基本問題在於識別等效公式,這可能導致更有效的解決策略,並更深入地了解問題的計算複雜性。隨著優化輔助系統(從自然語言描述中產生問題公式的系統)的普及,自動識別問題公式之間等價性的需求也隨之增加。然而,現有的公式等價性檢查方法缺乏依據,依賴於簡單的啟發法,而這對於嚴格驗證來說是不夠的。受 Karp 遞減啟發,我們在這項工作中引入了準 Karp 等價性,這是一個正式標準,用於根據決策變數之間的映射存在性來確定兩個優化公式何時等效。我們提出了 EquivaMap,一個利用大型語言模型自動發現此類映射的框架,實現可擴充且可靠的等價性驗證。為了評估我們的做法,我們構建了第一個等效優化公式的開源資料集,該資料集是通過對現有公式套用轉換(例如添加鬆弛變數或有效不等式)產生的。根據經驗,EquivaMap 明顯優於現有方法,在正確識別公式等價性方面取得了顯著進展。 -##### **Explainable AI for Enhancing Efficiency of DL-based Channel Estimation** -2407.07009v1 by Abdul Karim Gizzini, Yahia Medjahdi, Ali J. Ghandour, Laurent Clavier +##### **On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation Systems** +2502.14759v1 by Juraj Vladika, Florian Matthes -The support of artificial intelligence (AI) based decision-making is a key -element in future 6G networks, where the concept of native AI will be -introduced. Moreover, AI is widely employed in different critical applications -such as autonomous driving and medical diagnosis. In such applications, using -AI as black-box models is risky and challenging. 
Hence, it is crucial to -understand and trust the decisions taken by these models. Tackling this issue -can be achieved by developing explainable AI (XAI) schemes that aim to explain -the logic behind the black-box model behavior, and thus, ensure its efficient -and safe deployment. Recently, we proposed a novel perturbation-based XAI-CHEST -framework that is oriented toward channel estimation in wireless -communications. The core idea of the XAI-CHEST framework is to identify the -relevant model inputs by inducing high noise on the irrelevant ones. This -manuscript provides the detailed theoretical foundations of the XAI-CHEST -framework. In particular, we derive the analytical expressions of the XAI-CHEST -loss functions and the noise threshold fine-tuning optimization problem. Hence -the designed XAI-CHEST delivers a smart input feature selection methodology -that can further improve the overall performance while optimizing the -architecture of the employed model. Simulation results show that the XAI-CHEST -framework provides valid interpretations, where it offers an improved bit error -rate performance while reducing the required computational complexity in -comparison to the classical DL-based channel estimation. +Retrieval-augmented generation (RAG) has emerged as an approach to augment +large language models (LLMs) by reducing their reliance on static knowledge and +improving answer factuality. RAG retrieves relevant context snippets and +generates an answer based on them. Despite its increasing industrial adoption, +systematic exploration of RAG components is lacking, particularly regarding the +ideal size of provided context, and the choice of base LLM and retrieval +method. To help guide development of robust RAG systems, we evaluate various +context sizes, BM25 and semantic search as retrievers, and eight base LLMs. 
+Moving away from the usual RAG evaluation with short answers, we explore the +more challenging long-form question answering in two domains, where a good +answer has to utilize the entire context. Our findings indicate that final QA +performance improves steadily with up to 15 snippets but stagnates or declines +beyond that. Finally, we show that the general-purpose LLMs that excel in the +biomedical domain differ from those that excel in the encyclopedic one, and that open-domain evidence +retrieval in large corpora is challenging. -摘要:人工智能 (AI) 支持的決策制定是未來 6G 網路中的關鍵元素,其中將引入原生 AI 的概念。此外,AI 廣泛用於不同的關鍵應用中,例如自動駕駛和醫療診斷。在這些應用中,使用 AI 作為黑盒模型是有風險且具有挑戰性的。因此,理解和信任這些模型做出的決策至關重要。解決此問題的方法是開發可解釋 AI (XAI) 架構,旨在解釋黑盒模型行為背後的邏輯,從而確保其有效且安全的部署。最近,我們提出了一個新的基於擾動的 XAI-CHEST 框架,該框架面向無線通信中的信道估計。XAI-CHEST 框架的核心思想是通過在無關輸入上引入高噪聲來識別相關模型輸入。這份手稿提供了 XAI-CHEST 框架的詳細理論基礎。特別是,我們推導了 XAI-CHEST 損失函數和噪聲閾值微調優化問題的解析表達式。因此,設計的 XAI-CHEST 提供了一種智能輸入特徵選擇方法,可以在優化所用模型的架構的同時進一步提高整體性能。模擬結果表明,XAI-CHEST 框架提供了有效的解釋,在降低所需的計算複雜度的同時,提供了改進的比特錯誤率性能,而這與基於傳統 DL 的信道估計相比。 +摘要:檢索增強生成 (RAG) 已成為一種方法,可透過減少大型語言模型 (LLM) 對靜態知識的依賴,並改善答案的真實性,來增強大型語言模型 (LLM)。RAG 會擷取相關的內容片段,並根據這些片段產生答案。儘管其產業採用率不斷提高,但缺乏對 RAG 組成的系統性探討,特別是在提供的內容的理想大小,以及基礎 LLM 和檢索方法的選擇方面。為了協助引導穩健 RAG 系統的開發,我們評估了各種內容大小、BM25 和語意搜尋作為檢索器,以及八個基礎 LLM。我們不再使用簡短答案進行常見的 RAG 評估,而是探討在兩個領域中更具挑戰性的長篇問答,其中一個好的答案必須利用整個內容。我們的研究結果指出,最終的問答效能會隨著多達 15 個片段而穩定提升,但在超過這個數量後就會停滯或下降。最後,我們表明不同的通用 LLM 在生物醫學領域比百科全書領域更為出色,而且在大型語料庫中進行開放領域證據檢索具有挑戰性。 -##### **Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification** -2407.05440v2 by P. N. 
Karthikayan, Yoga Sri Varshan V, Hitesh Gupta Kattamuri, Umarani Jayaraman +##### **MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders** +2502.14753v1 by Maya Varma, Ashwin Kumar, Rogier van der Sluijs, Sophie Ostmeier, Louis Blankemeier, Pierre Chambon, Christian Bluethgen, Jip Prince, Curtis Langlotz, Akshay Chaudhari -This paper presents dilated Residual Network (ResNet) models for disease -classification from retinal fundus images. Dilated convolution filters are used -to replace normal convolution filters in the higher layers of the ResNet model -(dilated ResNet) in order to improve the receptive field compared to the normal -ResNet model for disease classification. This study introduces -computer-assisted diagnostic tools that employ deep learning, enhanced with -explainable AI techniques. These techniques aim to make the tool's -decision-making process transparent, thereby enabling medical professionals to -understand and trust the AI's diagnostic decision. They are particularly -relevant in today's healthcare landscape, where there is a growing demand for -transparency in AI applications to ensure their reliability and ethical use. -The dilated ResNet is used as a replacement for the normal ResNet to enhance -the classification accuracy of retinal eye diseases and reduce the required -computing time. The dataset used in this work is the Ocular Disease Intelligent -Recognition (ODIR) dataset which is a structured ophthalmic database with eight -classes covering most of the common retinal eye diseases. The evaluation -metrics used in this work include precision, recall, accuracy, and F1 score. In -this work, a comparative study has been made between normal ResNet models and -dilated ResNet models on five variants namely ResNet-18, ResNet-34, ResNet-50, -ResNet-101, and ResNet-152. 
The dilated ResNet model shows promising results as -compared to normal ResNet with an average F1 score of 0.71, 0.70, 0.69, 0.67, -and 0.70 respectively for the above respective variants in ODIR multiclass -disease classification. +Medical images are acquired at high resolutions with large fields of view in +order to capture fine-grained features necessary for clinical decision-making. +Consequently, training deep learning models on medical images can incur large +computational costs. In this work, we address the challenge of downsizing +medical images in order to improve downstream computational efficiency while +preserving clinically-relevant features. We introduce MedVAE, a family of six +large-scale 2D and 3D autoencoders capable of encoding medical images as +downsized latent representations and decoding latent representations back to +high-resolution images. We train MedVAE autoencoders using a novel two-stage +training approach with 1,052,730 medical images. Across diverse tasks obtained +from 20 medical image datasets, we demonstrate that (1) utilizing MedVAE latent +representations in place of high-resolution images when training downstream +models can lead to efficiency benefits (up to 70x improvement in throughput) +while simultaneously preserving clinically-relevant features and (2) MedVAE can +decode latent representations back to high-resolution images with high +fidelity. Our work demonstrates that large-scale, generalizable autoencoders +can help address critical efficiency challenges in the medical domain. Our code +is available at https://github.com/StanfordMIMI/MedVAE. 
-摘要:这篇论文提出了用于从视网膜眼底图像进行疾病分类的扩张残差网络 (ResNet) 模型。扩张卷积滤波器用于替换 ResNet 模型较高层中的正常卷积滤波器(扩张 ResNet),以改善感知场,从而针对疾病分类对正常 ResNet 模型进行改进。本研究引入了采用深度学习的计算机辅助诊断工具,并通过可解释的 AI 技术进行了增强。这些技术旨在使该工具的决策过程透明化,从而使医学专业人士能够理解和信任 AI 的诊断决策。它们与当今的医疗保健领域尤为相关,在该领域,对 AI 应用的透明度需求不断增长,以确保其可靠性和合乎道德的使用。扩张 ResNet 用作正常 ResNet 的替代品,以提高视网膜眼部疾病的分类准确性并减少所需的计算时间。本工作中使用的数据集是眼科疾病智能识别 (ODIR) 数据集,这是一个结构化的眼科数据库,包含八类涵盖大多数常见视网膜眼部疾病。本工作中使用的评估指标包括精确度、召回率、准确度和 F1 得分。在这项工作中,对 ResNet-18、ResNet-34、ResNet-50、ResNet-101 和 ResNet-152 五个变体的正常 ResNet 模型和扩张 ResNet 模型进行了比较研究。与正常 ResNet 相比,扩张 ResNet 模型显示出有希望的结果,在 ODIR 多类疾病分类中,上述各个变体的平均 F1 得分为 0.71、0.70、0.69、0.67 和 0.70。 +摘要:医学影像以高解析度和广阔的视野获取,以便捕捉临床决策所需的细微特征。因此,在医学影像上训练深度学习模型可能会产生巨大的计算成本。在这项工作中,我们解决了缩小医学影像以提高下游计算效率同时保留临床相关特征的挑战。我们介绍了 MedVAE,这是一个由六个大型 2D 和 3D 自动编码器组成的系列,能够将医学影像编码为缩小的潜在表示,并将潜在表示解码回高分辨率影像。我们使用一种新颖的两阶段训练方法,利用 1,052,730 张医学影像来训练 MedVAE 自动编码器。在从 20 个医学影像数据集获得的不同任务中,我们证明了 (1) 在训练下游模型时,利用 MedVAE 潜在表示代替高分辨率影像可以带来效率优势(吞吐量提高高达 70 倍),同时保留临床相关特征;(2) MedVAE 可以将潜在表示解码回高分辨率影像,且保真度高。我们的工作表明,大规模、可推广的自动编码器可以帮助解决医学领域的重大效率挑战。我们的代码可在 https://github.com/StanfordMIMI/MedVAE 获得。 -##### **A Survey on Trustworthiness in Foundation Models for Medical Image Analysis** -2407.15851v2 by Congzhen Shi, Ryan Rezai, Jiaxi Yang, Qi Dou, Xiaoxiao Li +##### **TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators** +2502.14752v1 by Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, Zhiyuan Liu, Maosong Sun -The rapid advancement of foundation models in medical imaging represents a -significant leap toward enhancing diagnostic accuracy and personalized -treatment. However, the deployment of foundation models in healthcare -necessitates a rigorous examination of their trustworthiness, encompassing -privacy, robustness, reliability, explainability, and fairness. 
The current -body of survey literature on foundation models in medical imaging reveals -considerable gaps, particularly in the area of trustworthiness. Additionally, -existing surveys on the trustworthiness of foundation models do not adequately -address their specific variations and applications within the medical imaging -domain. This survey aims to fill that gap by presenting a novel taxonomy of -foundation models used in medical imaging and analyzing the key motivations for -ensuring their trustworthiness. We review current research on foundation models -in major medical imaging applications, focusing on segmentation, medical report -generation, medical question and answering (Q\&A), and disease diagnosis. These -areas are highlighted because they have seen a relatively mature and -substantial number of foundation models compared to other applications. We -focus on literature that discusses trustworthiness in medical image analysis -manuscripts. We explore the complex challenges of building trustworthy -foundation models for each application, summarizing current concerns and -strategies for enhancing trustworthiness. Furthermore, we examine the potential -of these models to revolutionize patient care. Our analysis underscores the -imperative for advancing towards trustworthy AI in medical image analysis, -advocating for a balanced approach that fosters innovation while ensuring -ethical and equitable healthcare delivery. +Triton, a high-level Python-like language designed for building efficient GPU +kernels, is widely adopted in deep learning frameworks due to its portability, +flexibility, and accessibility. However, programming and parallel optimization +still require considerable trial and error from Triton developers. 
Despite +advances in large language models (LLMs) for conventional code generation, +these models struggle to generate accurate, performance-optimized Triton code, +as they lack awareness of its specifications and the complexities of GPU +programming. More critically, there is an urgent need for systematic +evaluations tailored to Triton. In this work, we introduce TritonBench, the +first comprehensive benchmark for Triton operator generation. TritonBench +features two evaluation channels: a curated set of 184 real-world operators +from GitHub and a collection of operators aligned with PyTorch interfaces. +Unlike conventional code benchmarks prioritizing functional correctness, +TritonBench also profiles efficiency performance on widely deployed GPUs +aligned with industry applications. Our study reveals that current +state-of-the-art code LLMs struggle to generate efficient Triton operators, +highlighting a significant gap in high-performance code generation. TritonBench +will be available at https://github.com/thunlp/TritonBench. 
-摘要:基礎模型在醫學影像方面的快速進展,代表著在加強診斷準確性和個人化治療方面邁出一大步。然而,基礎模型在醫療保健中的部署需要對其可信度進行嚴格的審查,包括隱私、穩健性、可靠性、可解釋性和公平性。目前關於醫學影像中基礎模型的調查文獻中顯示出相當大的差距,特別是在可信度方面。此外,現有關於基礎模型可信度的調查並未充分解決其在醫學影像領域中的特定變化和應用。本調查旨在通過提出醫學影像中使用的基礎模型的新分類法並分析確保其可信度的關鍵動機,來填補這一空白。我們回顧了基礎模型在主要醫學影像應用中的當前研究,重點關注分割、醫療報告生成、醫療問題和回答 (Q&A) 以及疾病診斷。這些領域之所以被強調,是因為與其他應用相比,它們已經看到相對成熟且大量的基礎模型。我們專注於探討醫學影像分析手稿中可信度的文獻。我們探討了為每個應用構建可信基礎模型的複雜挑戰,總結了當前關注點和增強可信度的策略。此外,我們探討了這些模型在革新患者護理方面的潛力。我們的分析強調了在醫學影像分析中朝著可信賴的人工智慧邁進的必要性,並倡導一種平衡的方法,既能促進創新,又能確保道德和公平的醫療保健服務。 +摘要:Triton 是一種高階的類 Python 語言,專門用於建構高效的 GPU 核心,由於其可移植性、靈活性及可存取性,已廣泛採用於深度學習框架中。然而,編程和並行最佳化仍需要 Triton 開發人員進行大量的試驗和錯誤。儘管大型語言模型 (LLM) 在傳統程式碼產生方面取得了進展,但這些模型在產生準確且效能最佳化的 Triton 程式碼時仍面臨困難,因為它們缺乏對其規格和 GPU 編程複雜性的認識。更重要的是,迫切需要針對 Triton 量身打造的系統性評估。在這項工作中,我們介紹 TritonBench,這是第一個針對 Triton 算子產生進行全面評比的基準。TritonBench 具有兩個評估管道:一組來自 GitHub 的 184 個真實世界算子,以及一組與 PyTorch 介面對齊的算子。與優先考慮功能正確性的傳統程式碼基準不同,TritonBench 還剖析了與產業應用對齊的廣泛部署 GPU 上的效能表現。我們的研究表明,目前最先進的程式碼 LLM 難以產生高效的 Triton 算子,突顯了高性能程式碼產生中的重大差距。TritonBench 將在 https://github.com/thunlp/TritonBench 提供。 -##### **The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data** -2407.06206v1 by Ximing Wen, Rosina O. Weber, Anik Sen, Darryl Hannan, Steven C. Nesbit, Vincent Chan, Alberto Goffi, Michael Morris, John C. Hunninghake, Nicholas E. Villalobos, Edward Kim, Christopher J. MacLellan +##### **Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of LLMs** +2502.14748v1 by Zongxia Li, Lorena Calvo-Bartolomé, Alexander Hoyle, Paiheng Xu, Alden Dima, Juan Francisco Fung, Jordan Boyd-Graber -Point-of-Care Ultrasound (POCUS) is the practice of clinicians conducting and -interpreting ultrasound scans right at the patient's bedside. However, the -expertise needed to interpret these images is considerable and may not always -be present in emergency situations. This reality makes algorithms such as -machine learning classifiers extremely valuable to augment human decisions. 
-POCUS devices are becoming available at a reasonable cost in the size of a -mobile phone. The challenge of turning POCUS devices into life-saving tools is -that interpretation of ultrasound images requires specialist training and -experience. Unfortunately, the difficulty to obtain positive training images -represents an important obstacle to building efficient and accurate -classifiers. Hence, the problem we try to investigate is how to explore -strategies to increase accuracy of classifiers trained with scarce data. We -hypothesize that training with a few data instances may not suffice for -classifiers to generalize causing them to overfit. Our approach uses an -Explainable AI-Augmented approach to help the algorithm learn more from less -and potentially help the classifier better generalize. +A common use of NLP is to facilitate the understanding of large document +collections, with a shift from using traditional topic models to Large Language +Models. Yet the effectiveness of using LLM for large corpus understanding in +real-world applications remains under-explored. This study measures the +knowledge users acquire with unsupervised, supervised LLM-based exploratory +approaches or traditional topic models on two datasets. While LLM-based methods +generate more human-readable topics and show higher average win probabilities +than traditional models for data exploration, they produce overly generic +topics for domain-specific datasets that do not easily allow users to learn +much about the documents. Adding human supervision to the LLM generation +process improves data exploration by mitigating hallucination and +over-genericity but requires greater human effort. In contrast, traditional +models like Latent Dirichlet Allocation (LDA) remain effective for exploration +but are less user-friendly. 
We show that LLMs struggle to describe the haystack +of large corpora without human help, particularly domain-specific data, and +face scaling and hallucination limitations due to context length constraints. +Dataset available at https://huggingface.co/datasets/zli12321/Bills. -摘要:床邊超音波 (POCUS) 是臨床醫師在患者床邊進行和解讀超音波掃描的實務。然而,解讀這些影像所需的專業知識相當可觀,而且在緊急情況下可能並非隨時具備。這種現實情況使得機器學習分類器等演算法對於加強人類決策變得極為有價值。POCUS 裝置正以合理成本推出,尺寸為手機大小。將 POCUS 裝置轉變為救生工具的挑戰在於,解讀超音波影像需要專門訓練和經驗。不幸的是,取得正向訓練影像的困難度代表著建置有效率且準確的分類器的一大障礙。因此,我們嘗試探討的問題是如何探索策略,以提高使用稀疏資料訓練的分類器的準確度。我們假設使用少數資料實例進行訓練可能不足以讓分類器概括,導致它們過度擬合。我們的做法使用可解釋 AI 增強方法,以協助演算法從較少的資料中學習更多,並潛在協助分類器更好地概括。 +摘要:NLP 的常見用途是促進對大型文件集合的理解,從使用傳統主題模型轉向大型語言模型。然而,在現實世界的應用中使用 LLM 了解大型語料庫的有效性仍未得到充分探索。本研究衡量了使用者在兩個資料集上使用無監督、監督的基於 LLM 的探索性方法或傳統主題模型獲得的知識。雖然基於 LLM 的方法會產生更多人類可讀的主題,並且顯示出比傳統模型更高的平均獲勝機率,但它們會為特定領域的資料集產生過於通用的主題,而這些主題不容易讓使用者對文件有深入了解。在 LLM 生成過程中加入人類監督可透過減輕幻覺和過度泛化來改善資料探索,但需要更多的人力。相反地,傳統模型(如潛在狄利克雷配置 (LDA))仍然有效於探索,但使用者友善度較低。我們表明,LLM 難以在沒有人類幫助的情況下描述大型語料庫的乾草堆,特別是特定領域的資料,並且會因上下文長度限制而面臨擴充性和幻覺限制。資料集可於 https://huggingface.co/datasets/zli12321/Bills 取得。 -##### **Can GPT-4 Help Detect Quit Vaping Intentions? 
An Exploration of Automatic Data Annotation Approach** -2407.00167v1 by Sai Krishna Revanth Vuruma, Dezhi Wu, Saborny Sen Gupta, Lucas Aust, Valerie Lookingbill, Wyatt Bellamy, Yang Ren, Erin Kasson, Li-Shiun Chen, Patricia Cavazos-Rehg, Dian Hu, Ming Huang +##### **HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States** +2502.14744v1 by Yilei Jiang, Xinyan Gao, Tianshuo Peng, Yingshui Tan, Xiaoyong Zhu, Bo Zheng, Xiangyu Yue -In recent years, the United States has witnessed a significant surge in the -popularity of vaping or e-cigarette use, leading to a notable rise in cases of -e-cigarette and vaping use-associated lung injury (EVALI) that caused -hospitalizations and fatalities during the EVALI outbreak in 2019, highlighting -the urgency to comprehend vaping behaviors and develop effective strategies for -cessation. Due to the ubiquity of social media platforms, over 4.7 billion -users worldwide use them for connectivity, communications, news, and -entertainment with a significant portion of the discourse related to health, -thereby establishing social media data as an invaluable organic data resource -for public health research. In this study, we extracted a sample dataset from -one vaping sub-community on Reddit to analyze users' quit-vaping intentions. -Leveraging OpenAI's latest large language model GPT-4 for sentence-level quit -vaping intention detection, this study compares the outcomes of this model -against layman and clinical expert annotations. Using different prompting -strategies such as zero-shot, one-shot, few-shot and chain-of-thought -prompting, we developed 8 prompts with varying levels of detail to explain the -task to GPT-4 and also evaluated the performance of the strategies against each -other. These preliminary findings emphasize the potential of GPT-4 in social -media data analysis, especially in identifying users' subtle intentions that -may elude human detection. 
+The integration of additional modalities increases the susceptibility of +large vision-language models (LVLMs) to safety risks, such as jailbreak +attacks, compared to their language-only counterparts. While existing research +primarily focuses on post-hoc alignment techniques, the underlying safety +mechanisms within LVLMs remain largely unexplored. In this work, we +investigate whether LVLMs inherently encode safety-relevant signals within +their internal activations during inference. Our findings reveal that LVLMs +exhibit distinct activation patterns when processing unsafe prompts, which can +be leveraged to detect and mitigate adversarial inputs without requiring +extensive fine-tuning. Building on this insight, we introduce HiddenDetect, a +novel tuning-free framework that harnesses internal model activations to +enhance safety. Experimental results show that HiddenDetect surpasses +state-of-the-art methods in detecting jailbreak attacks against LVLMs. By +utilizing intrinsic safety-aware patterns, our method provides an efficient and +scalable solution for strengthening LVLM robustness against multimodal threats. +Our code will be released publicly at +https://github.com/leigest519/HiddenDetect. 
-摘要:近年來,美國見證了電子煙或電子香菸使用率大幅激增,導致電子煙和電子煙使用相關肺損傷 (EVALI) 病例顯著增加,在 2019 年 EVALI 爆發期間造成住院和死亡,凸顯了理解電子煙行為和制定有效戒菸策略的迫切性。由於社群媒體平台的普及,全球超過 47 億使用者使用它們進行連結、溝通、新聞和娛樂,其中很大一部分與健康相關,因此將社群媒體資料建立為公共衛生研究中無價的有機資料資源。在本研究中,我們從 Reddit 上一個電子煙子社群中提取一個範例資料集,以分析使用者的戒電子煙意圖。利用 OpenAI 最新的大型語言模型 GPT-4 進行句子層級的戒電子煙意圖偵測,本研究比較了此模型的結果與外行人和臨床專家註解。使用不同的提示策略,例如零次學習、一次學習、少次學習和思考鏈提示,我們開發了 8 個提示,詳細程度不同,向 GPT-4 解釋任務,並評估這些策略彼此之間的效能。這些初步發現強調了 GPT-4 在社群媒體資料分析中的潛力,特別是在識別人類偵測可能無法察覺的使用者微妙意圖方面。 +摘要:整合其他模态会增加大型视觉语言模型 (LVLMs) 对安全风险的敏感性,例如越狱攻击,与仅语言的对应模型相比。虽然现有的研究主要集中于事后对齐技术,但 LVLMs 内部的基本安全机制在很大程度上仍未得到探索。在这项工作中,我们调查了 LVLMs 在推理过程中是否在其内部激活中固有地编码了与安全相关的信号。我们的研究结果表明,LVLMs 在处理不安全提示时表现出不同的激活模式,这可以用来检测和缓解对抗性输入,而无需进行广泛的微调。基于这一见解,我们引入了 HiddenDetect,这是一个新颖的无调优框架,利用内部模型激活来增强安全性。实验结果表明,HiddenDetect 在检测针对 LVLMs 的越狱攻击方面超越了最先进的方法。通过利用内在的安全感知模式,我们的方法为加强 LVLM 对多模态威胁的鲁棒性提供了一种高效且可扩展的解决方案。我们的代码将在 https://github.com/leigest519/HiddenDetect 公开发布。 -##### **Towards Compositional Interpretability for XAI** -2406.17583v1 by Sean Tull, Robin Lorenz, Stephen Clark, Ilyas Khan, Bob Coecke +##### **Multi-Agent Coordination across Diverse Applications: A Survey** +2502.14743v1 by Lijun Sun, Yijun Yang, Qiqi Duan, Yuhui Shi, Chao Lyu, Yu-Cheng Chang, Chin-Teng Lin, Yang Shen -Artificial intelligence (AI) is currently based largely on black-box machine -learning models which lack interpretability. The field of eXplainable AI (XAI) -strives to address this major concern, being critical in high-stakes areas such -as the finance, legal and health sectors. - We present an approach to defining AI models and their interpretability based -on category theory. For this we employ the notion of a compositional model, -which sees a model in terms of formal string diagrams which capture its -abstract structure together with its concrete implementation. This -comprehensive view incorporates deterministic, probabilistic and quantum -models. 
We compare a wide range of AI models as compositional models, including -linear and rule-based models, (recurrent) neural networks, transformers, VAEs, -and causal and DisCoCirc models. - Next we give a definition of interpretation of a model in terms of its -compositional structure, demonstrating how to analyse the interpretability of a -model, and using this to clarify common themes in XAI. We find that what makes -the standard 'intrinsically interpretable' models so transparent is brought out -most clearly diagrammatically. This leads us to the more general notion of -compositionally-interpretable (CI) models, which additionally include, for -instance, causal, conceptual space, and DisCoCirc models. - We next demonstrate the explainability benefits of CI models. Firstly, their -compositional structure may allow the computation of other quantities of -interest, and may facilitate inference from the model to the modelled -phenomenon by matching its structure. Secondly, they allow for diagrammatic -explanations for their behaviour, based on influence constraints, diagram -surgery and rewrite explanations. Finally, we discuss many future directions -for the approach, raising the question of how to learn such meaningfully -structured models in practice. +Multi-agent coordination studies the underlying mechanism enabling the +trending spread of diverse multi-agent systems (MAS) and has received +increasing attention, driven by the expansion of emerging applications and +rapid AI advances. This survey outlines the current state of coordination +research across applications through a unified understanding that answers four +fundamental coordination questions: (1) what is coordination; (2) why +coordination; (3) who to coordinate with; and (4) how to coordinate. Our +purpose is to explore existing ideas and expertise in coordination and their +connections across diverse applications, while identifying and highlighting +emerging and promising research directions. 
First, general coordination +problems that are essential to varied applications are identified and analyzed. +Second, a number of MAS applications are surveyed, ranging from widely studied +domains, e.g., search and rescue, warehouse automation and logistics, and +transportation systems, to emerging fields including humanoid and +anthropomorphic robots, satellite systems, and large language models (LLMs). +Finally, open challenges about the scalability, heterogeneity, and learning +mechanisms of MAS are analyzed and discussed. In particular, we identify the +hybridization of hierarchical and decentralized coordination, human-MAS +coordination, and LLM-based MAS as promising future directions. -摘要:人工智慧(AI)目前在很大程度上依賴於缺乏可解釋性的黑盒機器學習模型。可解釋性人工智慧(XAI)領域致力於解決這個主要問題,這在金融、法律和健康等高風險領域至關重要。 -我們提出了一種基於範疇論定義 AI 模型及其可解釋性的方法。為此,我們採用組合模型的概念,它以形式弦圖的形式看待模型,這些弦圖捕獲了模型的抽象結構及其具體實現。這種綜合觀點包含了確定性、概率性和量子模型。我們將各種 AI 模型作為組合模型進行比較,包括線性和基於規則的模型、(遞迴)神經網路、Transformer、VAE,以及因果和 DisCoCirc 模型。 -接下來,我們根據模型的組合結構給出模型解釋的定義,展示如何分析模型的可解釋性,並使用它來澄清 XAI 中的常見主題。我們發現,讓標準的「內在可解釋」模型如此透明的原因在圖表中表現得最為清楚。這引導我們得出更一般的組合可解釋(CI)模型概念,它另外還包括因果、概念空間和 DisCoCirc 模型。 -接下來,我們展示了 CI 模型的可解釋性優勢。首先,它們的組合結構允許計算其他感興趣的量,並可能通過匹配模型的結構來促進從模型到被建模現象的推理。其次,它們允許對其行為進行圖解說明,這些說明基於影響約束、圖解手術和重寫說明。最後,我們討論了這種方法的許多未來方向,提出了如何在實踐中學習這種有意義的結構化模型的問題。 +摘要:多智能體協調研究探討了促成各種多智能體系統 (MAS) 流行擴散的底層機制,並隨著新興應用擴展和 AI 快速進展而受到越來越多的關注。這項調查透過統一的理解來概述協調研究的現狀,回答了四個基本的協調問題:(1) 什麼是協調;(2) 為什麼協調;(3) 與誰協調;以及 (4) 如何協調。我們的目的是探索協調中現有的想法和專業知識,以及它們在不同應用中的關聯,同時找出並強調新興且有前景的研究方向。首先,找出並分析了對各種應用至關重要的協調問題。其次,調查了許多 MAS 應用,範圍從廣泛研究的領域(例如搜尋和救援、倉庫自動化和物流,以及運輸系統),到新興領域,包括人形機器人和擬人機器人、衛星系統和大語言模型 (LLM)。最後,分析並討論了有關 MAS 的可擴充性、異質性和學習機制的開放挑戰。特別是,我們將分層協調和分散式協調、人類-MAS 協調和基於 LLM 的 MAS 的混合視為有前景的未來方向。 -##### **Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods** -2406.12142v2 by Vincent Olesen, Nina Weng, Aasa Feragen, Eike Petersen +##### **YOLOv12: A Breakdown of the Key Architectural Features** +2502.14740v1 by Mujadded Al Rabbani Alif, Muhammad 
Hussain -Machine learning models have achieved high overall accuracy in medical image -analysis. However, performance disparities on specific patient groups pose -challenges to their clinical utility, safety, and fairness. This can affect -known patient groups - such as those based on sex, age, or disease subtype - as -well as previously unknown and unlabeled groups. Furthermore, the root cause of -such observed performance disparities is often challenging to uncover, -hindering mitigation efforts. In this paper, to address these issues, we -leverage Slice Discovery Methods (SDMs) to identify interpretable -underperforming subsets of data and formulate hypotheses regarding the cause of -observed performance disparities. We introduce a novel SDM and apply it in a -case study on the classification of pneumothorax and atelectasis from chest -x-rays. Our study demonstrates the effectiveness of SDMs in hypothesis -formulation and yields an explanation of previously observed but unexplained -performance disparities between male and female patients in widely used chest -X-ray datasets and models. Our findings indicate shortcut learning in both -classification tasks, through the presence of chest drains and ECG wires, -respectively. Sex-based differences in the prevalence of these shortcut -features appear to cause the observed classification performance gap, -representing a previously underappreciated interaction between shortcut -learning and model fairness analyses. +This paper presents an architectural analysis of YOLOv12, a significant +advancement in single-stage, real-time object detection building upon the +strengths of its predecessors while introducing key improvements. The model +incorporates an optimised backbone (R-ELAN), 7x7 separable convolutions, and +FlashAttention-driven area-based attention, improving feature extraction, +efficiency, and detection robustness. 
With multiple model variants, +similar to its predecessors, YOLOv12 offers scalable solutions for both +latency-sensitive and high-accuracy applications. Experimental results manifest +consistent gains in mean average precision (mAP) and inference speed, making +YOLOv12 a compelling choice for applications in autonomous systems, security, +and real-time analytics. By achieving an optimal balance between computational +efficiency and performance, YOLOv12 sets a new benchmark for real-time computer +vision, facilitating deployment across diverse hardware platforms, from edge +devices to high-performance clusters. -摘要:機器學習模型在醫學影像分析中已達到整體高準確度。然而,特定患者群體的效能差異對其臨床效用、安全性與公平性構成挑戰。這可能會影響已知的患者群體(例如基於性別、年齡或疾病亞型)以及先前未知且未標籤的群體。此外,此類觀察到的效能差異的根本原因通常難以發現,阻礙了緩解措施。在本文中,為了解決這些問題,我們利用切片發現方法 (SDM) 來識別可解釋的資料效能不佳子集,並針對觀察到的效能差異原因制定假設。我們引入一種新的 SDM,並在胸部 X 光片中肺炎和肺不張分類的案例研究中應用它。我們的研究證明了 SDM 在假設制定中的有效性,並對廣泛使用的胸部 X 光片資料集和模型中先前觀察到但無法解釋的男性和女性患者之間的效能差異提供了解釋。我們的發現表明,在分類任務中,透過胸腔引流管和心電圖導線的存在,存在捷徑學習。這些捷徑特徵的盛行率存在基於性別的差異,似乎會導致觀察到的分類效能差距,這代表捷徑學習和模型公平性分析之間先前未受到重視的交互作用。 +摘要:本文提出 YOLOv12 的架構分析,這是在單階段即時物件偵測領域的重大進展,它建立在前任的優勢之上,同時引入了關鍵改進。該模型結合了最佳化的主幹 (R-ELAN)、7x7 可分離卷積和 FlashAttention 驅動的基於區域的注意力,改進了特徵提取、增強了效率和穩健的偵測。與其前身類似,YOLOv12 具有多種模型變體,為低延遲敏感型和高準確度應用程式提供了可擴充的解決方案。實驗結果顯示在平均準確度 (mAP) 和推論速度方面都有顯著的提升,這使得 YOLOv12 成為自動化系統、安全性和即時分析應用程式的理想選擇。透過在運算效率和效能之間取得最佳平衡,YOLOv12 為即時電腦視覺樹立了新的基準,促進了在各種硬體平台(從邊緣裝置到高性能叢集)上的部署。 -##### **Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health** -2406.07114v2 by Fatemeh Ebrahimzadeh, Ramin Safa +##### **SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines** +2502.14739v1 by M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixing Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming Xu, Zhenzhu Yang, Zekun 
Moore Wang, Junting Zhou, Yuelin Bai, Xingyuan Bu, Chenglin Cai, Liang Chen, Yifan Chen, Chengtuo Cheng, Tianhao Cheng, Keyi Ding, Siming Huang, Yun Huang, Yaoru Li, Yizhe Li, Zhaoqun Li, Tianhao Liang, Chengdong Lin, Hongquan Lin, Yinghao Ma, Zhongyuan Peng, Zifan Peng, Qige Qi, Shi Qiu, Xingwei Qu, Yizhou Tan, Zili Wang, Chenqing Wang, Hao Wang, Yiya Wang, Yubo Wang, Jiajun Xu, Kexin Yang, Ruibin Yuan, Yuanhao Yue, Tianyang Zhan, Chun Zhang, Jingyang Zhang, Xiyue Zhang, Xingjian Zhang, Yue Zhang, Yongchi Zhao, Xiangyu Zheng, Chenghua Zhong, Yang Gao, Zhoujun Li, Dayiheng Liu, Qian Liu, Tianyu Liu, Shiwen Ni, Junran Peng, Yujia Qin, Wenbo Su, Guoyin Wang, Shi Wang, Jian Yang, Min Yang, Meng Cao, Xiang Yue, Zhaoxiang Zhang, Wangchunshu Zhou, Jiaheng Liu, Qunshu Lin, Wenhao Huang, Ge Zhang -The concept of Metaverse has attracted a lot of attention in various fields -and one of its important applications is health and treatment. The Metaverse -has enormous potential to transform healthcare by changing patient care, -medical education, and the way teaching/learning and research are done. The -purpose of this research is to provide an introduction to the basic concepts -and fundamental technologies of the Metaverse. This paper examines the pros and -cons of the Metaverse in healthcare context and analyzes its potential from the -technology and AI perspective. In particular, the role of machine learning -methods is discussed; We will explain how machine learning algorithms can be -applied to the Metaverse generated data to gain better insights in healthcare -applications. Additionally, we examine the future visions of the Metaverse in -health delivery, by examining emerging technologies such as blockchain and also -addressing privacy concerns. The findings of this study contribute to a deeper -understanding of the applications of Metaverse in healthcare and its potential -to revolutionize the delivery of medical services. 
+Large language models (LLMs) have demonstrated remarkable proficiency in +mainstream academic disciplines such as mathematics, physics, and computer +science. However, human knowledge encompasses over 200 specialized disciplines, +far exceeding the scope of existing benchmarks. The capabilities of LLMs in +many of these specialized fields-particularly in light industry, agriculture, +and service-oriented disciplines-remain inadequately evaluated. To address this +gap, we present SuperGPQA, a comprehensive benchmark that evaluates +graduate-level knowledge and reasoning capabilities across 285 disciplines. Our +benchmark employs a novel Human-LLM collaborative filtering mechanism to +eliminate trivial or ambiguous questions through iterative refinement based on +both LLM responses and expert feedback. Our experimental results reveal +significant room for improvement in the performance of current state-of-the-art +LLMs across diverse knowledge domains (e.g., the reasoning-focused model +DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting +the considerable gap between current model capabilities and artificial general +intelligence. Additionally, we present comprehensive insights from our +management of a large-scale annotation process, involving over 80 expert +annotators and an interactive Human-LLM collaborative system, offering valuable +methodological guidance for future research initiatives of comparable scope. 
-摘要:元宇宙的概念在各個領域都備受關注,其重要應用之一便是醫療保健。元宇宙有巨大的潛力透過改變病患照護、醫學教育,以及教學/學習和研究的方式來轉型醫療保健。本研究的目的是提供元宇宙基本概念和基礎技術的介紹。本文探討了元宇宙在醫療保健背景下的優缺點,並從技術和 AI 的角度分析其潛力。特別是,討論了機器學習方法的角色;我們將說明如何將機器學習演算法應用於元宇宙產生的資料,以獲得醫療保健應用方面的更佳見解。此外,我們透過探討區塊鏈等新興技術,並解決隱私問題,來探討元宇宙在醫療保健方面的未來願景。本研究的發現有助於更深入地了解元宇宙在醫療保健中的應用,以及其在醫療服務提供方面發揮革命性變革的潛力。 +摘要:大型語言模型 (LLM) 已展現出在主流學術領域(如數學、物理和電腦科學)的卓越能力。然而,人類知識包含超過 200 個專業領域,遠遠超過現有基準的範圍。LLM 在許多這些專業領域(特別是在輕工業、農業和服務導向領域)的能力仍未得到充分評估。為了解決這個差距,我們提出了 SuperGPQA,這是一個綜合基準,用於評估 285 個領域的研究生級知識和推理能力。我們的基準採用新穎的人類-LLM 協同過濾機制,透過基於 LLM 回應和專家回饋的迭代改進,來消除瑣碎或模稜兩可的問題。我們的實驗結果顯示,當前最先進的 LLM 在不同知識領域的表現仍有很大的改進空間(例如,以推理為重點的模型 DeepSeek-R1 在 SuperGPQA 上達到了 61.82% 的最高準確度),突顯了當前模型能力與人工通用智慧之間的巨大差距。此外,我們從管理大型註釋過程(涉及 80 多位專家註釋者和一個互動式人類-LLM 協作系統)中提出了全面的見解,為未來具有可比規模的研究計畫提供了寶貴的方法論指導。 -##### **AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI** -2406.06728v2 by K M Tawsik Jawad, Anusha Verma, Fathi Amsaad, Lamia Ashraf +##### **EAGER-LLM: Enhancing Large Language Models as Recommenders through Exogenous Behavior-Semantic Integration** +2502.14735v1 by Minjie Hong, Yan Xia, Zehan Wang, Jieming Zhu, Ye Wang, Sihang Cai, Xiaoda Yang, Quanyu Dai, Zhenhua Dong, Zhimeng Zhang, Zhou Zhao -Chronic Kidney Disease (CKD) is one of the widespread Chronic diseases with -no known ultimo cure and high morbidity. Research demonstrates that progressive -Chronic Kidney Disease (CKD) is a heterogeneous disorder that significantly -impacts kidney structure and functions, eventually leading to kidney failure. -With the progression of time, chronic kidney disease has moved from a -life-threatening disease affecting few people to a common disorder of varying -severity. The goal of this research is to visualize dominating features, -feature scores, and values exhibited for early prognosis and detection of CKD -using ensemble learning and explainable AI. 
For that, an AI-driven predictive -analytics approach is proposed to aid clinical practitioners in prescribing -lifestyle modifications for individual patients to reduce the rate of -progression of this disease. Our dataset is collected on body vitals from -individuals with CKD and healthy subjects to develop our proposed AI-driven -solution accurately. In this regard, blood and urine test results are provided, -and ensemble tree-based machine-learning models are applied to predict unseen -cases of CKD. Our research findings are validated after lengthy consultations -with nephrologists. Our experiments and interpretation results are compared -with existing explainable AI applications in various healthcare domains, -including CKD. The comparison shows that our developed AI models, particularly -the Random Forest model, have identified more features as significant -contributors than XgBoost. Interpretability (I), which measures the ratio of -important to masked features, indicates that our XgBoost model achieved a -higher score, specifically a Fidelity of 98\%, in this metric and naturally in -the FII index compared to competing models. +Large language models (LLMs) are increasingly leveraged as foundational +backbones in the development of advanced recommender systems, offering enhanced +capabilities through their extensive knowledge and reasoning. Existing +LLM-based recommender systems (RSs) often face challenges due to the +significant differences between the linguistic semantics of pre-trained LLMs +and the collaborative semantics essential for RSs. These systems use +pre-trained linguistic semantics but learn collaborative semantics from scratch +via the LLM backbone. However, LLMs are not designed for recommendations, +leading to inefficient collaborative learning, weak result correlations, and +poor integration of traditional RS features. 
To address these challenges, we +propose EAGER-LLM, a decoder-only LLM-based generative recommendation framework +that integrates endogenous and exogenous behavioral and semantic information in +a non-intrusive manner. Specifically, we propose 1) dual-source knowledge-rich +item indices that integrate indexing sequences for exogenous signals, enabling +efficient link-wide processing; 2) non-invasive multiscale alignment +reconstruction tasks that guide the model toward a deeper understanding of both +collaborative and semantic signals; 3) an annealing adapter designed to finely +balance the model's recommendation performance with its comprehension +capabilities. We demonstrate EAGER-LLM's effectiveness through rigorous testing +on three public benchmarks. -摘要:慢性腎臟病 (CKD) 是一種廣泛的慢性疾病,目前尚未找到最終的治療方法,且發病率很高。研究表明,進行性慢性腎臟病 (CKD) 是一種異質性疾病,會顯著影響腎臟結構和功能,最終導致腎衰竭。隨著時間的推移,慢性腎臟病已從影響少數人的致命疾病演變成一種嚴重程度不一的常見疾病。本研究的目標是使用整體學習和可解釋的 AI 來視覺化支配性特徵、特徵分數和值,以進行 CKD 的早期預後和檢測。為此,提出了一種 AI 驅動的預測分析方法,以幫助臨床醫生為個別患者開具生活方式的修改建議,以降低此疾病的進展速度。我們的數據集是從 CKD 患者和健康受試者的身體生命徵象中收集的,以準確開發我們提出的 AI 驅動的解決方案。在這方面,提供了血液和尿液檢測結果,並應用基於集成樹的機器學習模型來預測未見的 CKD 病例。我們的研究結果在與腎臟科醫師進行長時間諮詢後得到驗證。我們的實驗和解釋結果與各種醫療保健領域中現有的可解釋 AI 應用進行了比較,包括 CKD。比較表明,我們開發的 AI 模型,特別是隨機森林模型,已經確定了比 XgBoost 更多的特徵作為顯著的貢獻者。可解釋性 (I) 衡量重要特徵與被遮蔽特徵的比率,表明我們的 XgBoost 模型在此指標中取得了更高的分數,特別是 98% 的保真度,並且在 FII 指數中自然高於競爭模型。 +摘要:大型語言模型(LLM)正日益被用作先進推薦系統開發中的基礎主幹,透過其廣泛的知識和推理能力提供增強功能。現有的基於 LLM 的推薦系統(RS)通常會因為預先訓練的 LLM 語言語義與 RS 必備的協作語義之間的顯著差異而面臨挑戰。這些系統使用預先訓練的語言語義,但透過 LLM 主幹從頭學習協作語義。然而,LLM 並非專為推薦而設計,導致協作學習效率低落、結果關聯性薄弱,以及與傳統 RS 功能整合不佳。為了應對這些挑戰,我們提出 EAGER-LLM,這是一種僅解碼器、基於 LLM 的生成推薦架構,能以非侵入性方式整合內生和外生行為和語義資訊。具體來說,我們提出 1) 雙來源、知識豐富的項目索引,它整合了外生訊號的索引序列,實現了高效的鏈路廣泛處理;2) 非侵入式多尺度對齊重建任務引導模型更深入地理解協作和語義訊號;3) 退火適配器旨在精細地平衡模型的推薦效能與其理解能力。我們透過在三個公共基準上的嚴格測試證明了 EAGER-LLM 的有效性。 -##### **Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook** -2406.05984v1 by Yusif Ibrahimov, Tarique Anwar, Tommy Yuan +##### **Sentence Smith: Formally Controllable Text Transformation and its 
Application to Evaluation of Text Embedding Models** +2502.14734v1 by Hongji Li, Andrianos Michail, Reto Gubelmann, Simon Clematide, Juri Opitz -Mental health constitutes a complex and pervasive global challenge, affecting -millions of lives and often leading to severe consequences. In this paper, we -conduct a thorough survey to explore the intersection of data science, -artificial intelligence, and mental healthcare, focusing on the recent -developments of mental disorder detection through online social media (OSM). A -significant portion of the population actively engages in OSM platforms, -creating a vast repository of personal data that holds immense potential for -mental health analytics. The paper navigates through traditional diagnostic -methods, state-of-the-art data- and AI-driven research studies, and the -emergence of explainable AI (XAI) models for mental healthcare. We review -state-of-the-art machine learning methods, particularly those based on modern -deep learning, while emphasising the need for explainability in healthcare AI -models. The experimental design section provides insights into prevalent -practices, including available datasets and evaluation approaches. We also -identify key issues and challenges in the field and propose promising future -research directions. As mental health decisions demand transparency, -interpretability, and ethical considerations, this paper contributes to the -ongoing discourse on advancing XAI in mental healthcare through social media. -The comprehensive overview presented here aims to guide researchers, -practitioners, and policymakers in developing the area of mental disorder -detection. +We propose the Sentence Smith framework that enables controlled and specified +manipulation of text meaning. It consists of three main steps: 1. Parsing a +sentence into a semantic graph, 2. Applying human-designed semantic +manipulation rules, and 3. Generating text from the manipulated graph. A final +filtering step (4.) 
ensures the validity of the applied transformation. To +demonstrate the utility of Sentence Smith in an application study, we use it to +generate hard negative pairs that challenge text embedding models. Since the +controllable generation makes it possible to clearly isolate different types of +semantic shifts, we can gain deeper insights into the specific strengths and +weaknesses of widely used text embedding models, also addressing an issue in +current benchmarking where linguistic phenomena remain opaque. Human validation +confirms that the generations produced by Sentence Smith are highly accurate. -摘要:心理健康構成了一項複雜且普遍的全球挑戰,影響了數百萬人的生活,並經常導致嚴重的後果。在本文中,我們進行了一項徹底的調查,以探索數據科學、人工智慧和心理保健的交集,重點關注通過線上社交媒體 (OSM) 進行心理疾病檢測的最新發展。很大一部分人口積極參與 OSM 平台,創造了一個龐大的人員資料庫,對心理健康分析具有巨大的潛力。本文探討了傳統的診斷方法、最先進的資料和 AI 驅動的研究,以及心理保健中可解釋 AI (XAI) 模型的出現。我們回顧了最先進的機器學習方法,特別是那些基於現代深度學習的方法,同時強調了醫療保健 AI 模型中可解釋性的必要性。實驗設計部分提供了對普遍做法的見解,包括可用的資料集和評估方法。我們還找出該領域的主要問題和挑戰,並提出了有希望的未來研究方向。由於心理健康決策需要透明度、可解釋性和道德考量,本文有助於推進心理保健中透過社交媒體推進 XAI 的持續討論。這裡提出的全面概述旨在引導研究人員、從業人員和政策制定者發展心理疾病檢測領域。 +摘要:我們提出 Sentence Smith 框架,它能控制並指定文本含義的處理。它包含三個主要步驟:1. 將句子解析成語義圖形,2. 套用人為設計的語義處理規則,3. 從處理過的圖形生成文本。最後的過濾步驟 (4.) 確保套用轉換的有效性。為了在應用研究中展示 Sentence Smith 的效用,我們使用它來產生挑戰文本嵌入模型的困難負面對。由於可控生成能清楚地隔離不同類型的語義轉移,我們能更深入地了解廣泛使用的文本嵌入模型的具體優點和缺點,同時也解決了語言現象在當前基準測試中仍然不透明的問題。人為驗證確認 Sentence Smith 產生的生成高度準確。 -##### **Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance** -2406.05746v1 by Zhan Zhang, Qin Zhang, Yang Jiao, Lin Lu, Lin Ma, Aihua Liu, Xiao Liu, Juan Zhao, Yajun Xue, Bing Wei, Mingxia Zhang, Ru Gao, Hong Zhao, Jie Lu, Fan Li, Yang Zhang, Yiming Wang, Lei Zhang, Fengwei Tian, Jie Hu, Xin Gou +##### **WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models** +2502.14727v1 by Yifu Chen, Shengpeng Ji, Haoxiao Wang, Ziqing Wang, Siyu Chen, Jinzheng He, Jin Xu, Zhou Zhao -AI-aided clinical diagnosis is desired in medical care. 
Existing deep -learning models lack explainability and mainly focus on image analysis. The -recently developed Dynamic Uncertain Causality Graph (DUCG) approach is -causality-driven, explainable, and invariant across different application -scenarios, without problems of data collection, labeling, fitting, privacy, -bias, generalization, high cost and high energy consumption. Through close -collaboration between clinical experts and DUCG technicians, 46 DUCG models -covering 54 chief complaints were constructed. Over 1,000 diseases can be -diagnosed without triage. Before being applied in real-world, the 46 DUCG -models were retrospectively verified by third-party hospitals. The verified -diagnostic precisions were no less than 95%, in which the diagnostic precision -for every disease including uncommon ones was no less than 80%. After -verifications, the 46 DUCG models were applied in the real-world in China. Over -one million real diagnosis cases have been performed, with only 17 incorrect -diagnoses identified. Due to DUCG's transparency, the mistakes causing the -incorrect diagnoses were found and corrected. The diagnostic abilities of the -clinicians who applied DUCG frequently were improved significantly. Following -the introduction to the earlier presented DUCG methodology, the recommendation -algorithm for potential medical checks is presented and the key idea of DUCG is -extracted. +Retrieval Augmented Generation (RAG) has gained widespread adoption owing to +its capacity to empower large language models (LLMs) to integrate external +knowledge. However, existing RAG frameworks are primarily designed for +text-based LLMs and rely on Automatic Speech Recognition to process speech +input, which discards crucial audio information, risks transcription errors, +and increases computational overhead. Therefore, we introduce WavRAG, the first +retrieval augmented generation framework with native, end-to-end audio support. 
+WavRAG offers two key features: 1) Bypassing ASR, WavRAG directly processes raw +audio for both embedding and retrieval. 2) WavRAG integrates audio and text +into a unified knowledge representation. Specifically, we propose the +WavRetriever to facilitate the retrieval from a text-audio hybrid knowledge +base, and further enhance the in-context capabilities of spoken dialogue models +through the integration of chain-of-thought reasoning. In comparison to +state-of-the-art ASR-Text RAG pipelines, WavRAG achieves comparable retrieval +performance while delivering a 10x acceleration. Furthermore, WavRAG's unique +text-audio hybrid retrieval capability extends the boundaries of RAG to the +audio modality. + +摘要:檢索增強生成 (RAG) 因其賦能大型語言模型 (LLM) 整合外部知識的能力而獲得廣泛採用。然而,現有的 RAG 框架主要設計用於基於文字的 LLM,並依賴自動語音辨識處理語音輸入,這會捨棄重要的音訊資訊、有轉錄錯誤的風險,並增加運算負擔。因此,我們引入了 WavRAG,這是第一個具備原生端對端音訊支援的檢索增強生成框架。WavRAG 提供兩個主要功能:1) 繞過 ASR,WavRAG 直接處理原始音訊以進行嵌入和檢索。2) WavRAG 將音訊和文字整合到統一的知識表示中。具體來說,我們提出了 WavRetriever 以利於從文字音訊混合知識庫中進行檢索,並透過整合思考鏈推理進一步增強對話模型的語境能力。與最先進的 ASR 文字 RAG 管線相比,WavRAG 達到了相當的檢索效能,同時提供了 10 倍的加速。此外,WavRAG 獨特的文字音訊混合檢索能力將 RAG 的界線延伸到音訊模式。 + +##### **Entity Framing and Role Portrayal in the News** +2502.14718v1 by Tarek Mahmoud, Zhuohan Xie, Dimitar Dimitrov, Nikolaos Nikolaidis, Purificação Silvano, Roman Yangarber, Shivam Sharma, Elisa Sartori, Nicolas Stefanovitch, Giovanni Da San Martino, Jakub Piskorski, Preslav Nakov + +We introduce a novel multilingual hierarchical corpus annotated for entity +framing and role portrayal in news articles. The dataset uses a unique taxonomy +inspired by storytelling elements, comprising 22 fine-grained roles, or +archetypes, nested within three main categories: protagonist, antagonist, and +innocent. Each archetype is carefully defined, capturing nuanced portrayals of +entities such as guardian, martyr, and underdog for protagonists; tyrant, +deceiver, and bigot for antagonists; and victim, scapegoat, and exploited for +innocents. 
The dataset includes 1,378 recent news articles in five languages +(Bulgarian, English, Hindi, European Portuguese, and Russian) focusing on two +critical domains of global significance: the Ukraine-Russia War and Climate +Change. Over 5,800 entity mentions have been annotated with role labels. This +dataset serves as a valuable resource for research into role portrayal and has +broader implications for news analysis. We describe the characteristics of the +dataset and the annotation process, and we report evaluation results on +fine-tuned state-of-the-art multilingual transformers and hierarchical +zero-shot learning using LLMs at the level of a document, a paragraph, and a +sentence. -摘要:醫療照護中需要 AI 輔助的臨床診斷。現有的深度學習模型缺乏可解釋性,並且主要專注於影像分析。最近開發的動態不確定因果關係圖 (DUCG) 方法是因果驅動的、可解釋的,並且在不同的應用場景中是不變的,沒有資料收集、標記、擬合、隱私、偏見、概化、高成本和高能耗的問題。通過臨床專家和 DUCG 技術人員之間的密切合作,構建了涵蓋 54 個主訴的 46 個 DUCG 模型。可以在沒有分流的情況下診斷出 1,000 多種疾病。在應用於實際世界之前,46 個 DUCG 模型已由第三方醫院回溯性驗證。驗證的診斷精度不低於 95%,其中包括罕見疾病在內的每種疾病的診斷精度不低於 80%。驗證後,46 個 DUCG 模型已在中國實際應用。已經執行了超過一百萬個真實診斷案例,僅發現 17 個不正確的診斷。由於 DUCG 的透明性,發現並糾正了導致不正確診斷的錯誤。頻繁應用 DUCG 的臨床醫生的診斷能力得到了顯著提高。在介紹了前面提出的 DUCG 方法論之後,提出了潛在健康檢查的推薦演算法,並提取了 DUCG 的關鍵思想。 +摘要:我們引進一個新穎的多語言層級語料庫,其中註解了新聞文章中的實體框架和角色描繪。此資料集使用了一個獨特的分類法,其靈感來自講故事元素,包含 22 個細緻的角色或原型,嵌套在三個主要類別中:主角、對手和無辜者。每個原型都經過仔細定義,捕捉了實體的細微描繪,例如主角的監護人、烈士和弱者;對手的暴君、欺騙者和偏執狂;以及無辜者的受害者、替罪羊和被剝削者。該資料集包括五種語言(保加利亞語、英語、印地語、歐洲葡萄牙語和俄語)中的 1,378 篇近期新聞文章,重點關注兩個具有全球意義的關鍵領域:烏克蘭-俄羅斯戰爭和氣候變遷。超過 5,800 個實體提及已註解為角色標籤。此資料集作為角色描繪研究的寶貴資源,並對新聞分析有更廣泛的影響。我們描述了資料集的特徵和註解過程,並報告了對使用 LLM 在文件、段落和句子層級進行微調的最新多語言轉換器和層級零次學習的評估結果。 -##### **Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability** -2406.12897v1 by Faseela Abdullakutty, Younes Akbari, Somaya Al-Maadeed, Ahmed Bouridane, Rifat Hamoudi +##### **From Knowledge Generation to Knowledge Verification: Examining the BioMedical Generative Capabilities of ChatGPT** +2502.14714v1 by Ahmed Abdeen Hamed, Byung Suk Lee -It is imperative that breast cancer is detected precisely and 
timely to -improve patient outcomes. Diagnostic methodologies have traditionally relied on -unimodal approaches; however, medical data analytics is integrating diverse -data sources beyond conventional imaging. Using multi-modal techniques, -integrating both image and non-image data, marks a transformative advancement -in breast cancer diagnosis. The purpose of this review is to explore the -burgeoning field of multimodal techniques, particularly the fusion of -histopathology images with non-image data. Further, Explainable AI (XAI) will -be used to elucidate the decision-making processes of complex algorithms, -emphasizing the necessity of explainability in diagnostic processes. This -review utilizes multi-modal data and emphasizes explainability to enhance -diagnostic accuracy, clinician confidence, and patient engagement, ultimately -fostering more personalized treatment strategies for breast cancer, while also -identifying research gaps in multi-modality and explainability, guiding future -studies, and contributing to the strategic direction of the field. +The generative capabilities of LLM models present opportunities in +accelerating tasks and concerns with the authenticity of the knowledge it +produces. To address the concerns, we present a computational approach that +systematically evaluates the factual accuracy of biomedical knowledge that an +LLM model has been prompted to generate. Our approach encompasses two +processes: the generation of disease-centric associations and the verification +of them using the semantic knowledge of the biomedical ontologies. Using +ChatGPT as the select LLM model, we designed a set of prompt-engineering +processes to generate linkages between diseases, drugs, symptoms, and genes to +establish grounds for assessments. Experimental results demonstrate high +accuracy in identifying disease terms (88%-97%), drug names (90%-91%), and +genetic information (88%-98%). 
The symptom term identification accuracy was +notably lower (49%-61%), as verified against the DOID, ChEBI, SYMPTOM, and GO +ontologies accordingly. The verification of associations reveals literature +coverage rates of (89%-91%) among disease-drug and disease-gene associations. +The low identification accuracy for symptom terms also contributed to the +verification of symptom-related associations (49%-62%). -摘要:精確且及時地偵測乳癌對於改善患者預後至關重要。診斷方法傳統上依賴於單一模式方法;然而,醫療資料分析正在整合超越傳統影像的各種資料來源。使用整合影像和非影像資料的多模式技術,標誌著乳癌診斷的變革性進展。本篇綜述的目的是探討多模式技術的新興領域,特別是將組織病理學影像與非影像資料融合。此外,可解釋人工智慧 (XAI) 將用於闡明複雜演算法的決策過程,強調診斷過程中可解釋性的必要性。本綜述利用多模式資料並強調可解釋性,以提高診斷準確性、臨床醫師的信心和患者參與度,最終促進乳癌更個人化的治療策略,同時也找出多模式和可解釋性的研究差距,引導未來的研究,並為該領域的策略方向做出貢獻。 +摘要:LLM 模型的生成能力為加速任務和對其產生的知識真實性的疑慮提供了機會。為了解決這些疑慮,我們提出了計算方法,系統性評估 LLM 模型受提示而產生的生物醫學知識的事實準確性。我們的做法包括兩個過程:生成以疾病為中心的關聯,並使用生物醫學本体的語義知識驗證它們。使用 ChatGPT 作為選定的 LLM 模型,我們設計了一組提示工程流程,以生成疾病、藥物、症狀和基因之間的關聯,作為評估的依據。實驗結果證明在識別疾病術語 (88%-97%)、藥物名稱 (90%-91%) 和遺傳資訊 (88%-98%) 方面具有很高的準確性。症狀術語識別準確性顯著較低 (49%-61%),並根據 DOID、ChEBI、SYMPTOM 和 GO 本体進行驗證。關聯驗證顯示疾病-藥物和疾病-基因關聯的文獻覆蓋率為 (89%-91%)。症狀術語的低識別準確性也影響了症狀相關關聯的驗證 (49%-62%)。 -##### **Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection** -2406.16908v3 by Dinuka Sandun Udayantha, Kavindu Weerasinghe, Nima Wickramasinghe, Akila Abeyratne, Kithmin Wickremasinghe, Jithangi Wanigasinghe, Anjula De Silva, Chamira U. S. Edussooriya +##### **Data-Efficient Pretraining with Group-Level Data Influence Modeling** +2502.14709v1 by Zichun Yu, Fei Peng, Jie Lei, Arnold Overwijk, Wen-tau Yih, Chenyan Xiong -The neonatal period is the most vulnerable time for the development of -seizures. Seizures in the immature brain lead to detrimental consequences, -therefore require early diagnosis. The gold-standard for neonatal seizure -detection currently relies on continuous video-EEG monitoring; which involves -recording multi-channel electroencephalogram (EEG) alongside real-time video -monitoring within a neonatal intensive care unit (NICU). 
However, video-EEG -monitoring technology requires clinical expertise and is often limited to -technologically advanced and resourceful settings. Cost-effective new -techniques could help the medical fraternity make an accurate diagnosis and -advocate treatment without delay. In this work, a novel explainable deep -learning model to automate the neonatal seizure detection process with a -reduced EEG montage is proposed, which employs convolutional nets, graph -attention layers, and fully connected layers. Beyond its ability to detect -seizures in real-time with a reduced montage, this model offers the unique -advantage of real-time interpretability. By evaluating the performance on the -Zenodo dataset with 10-fold cross-validation, the presented model achieves an -absolute improvement of 8.31% and 42.86% in area under curve (AUC) and recall, -respectively. +Data-efficient pretraining has shown tremendous potential to elevate scaling +laws. This paper argues that effective pretraining data should be curated at +the group level, treating a set of data points as a whole rather than as +independent contributors. To achieve that, we propose Group-Level Data +Influence Modeling (Group-MATES), a novel data-efficient pretraining method +that captures and optimizes group-level data utility. Specifically, Group-MATES +collects oracle group-level influences by locally probing the pretraining model +with data sets. It then fine-tunes a relational data influence model to +approximate oracles as relationship-weighted aggregations of individual +influences. The fine-tuned model selects the data subset by maximizing its +group-level influence prediction, with influence-aware clustering to enable +efficient inference. Experiments on the DCLM benchmark demonstrate that +Group-MATES achieves a 10% relative core score improvement on 22 downstream +tasks over DCLM-Baseline and 5% over individual-influence-based methods, +establishing a new state-of-the-art. 
Further analyses highlight the +effectiveness of relational data influence models in capturing intricate +interactions between data points. -摘要:新生兒期是大腦發育最脆弱的時期,容易出現癲癇發作。大腦發育不成熟時出現癲癇發作會造成不良後果,因此需要及早診斷。目前新生兒癲癇發作的黃金標準依賴於連續的視訊腦電圖 (EEG) 監測;其中包括在新生兒加護病房 (NICU) 內同時進行多頻道腦電圖 (EEG) 記錄和即時視訊監控。然而,視訊腦電圖監控技術需要臨床專業知識,而且通常僅限於技術先進且資源豐富的環境。具成本效益的新技術可以幫助醫療界準確診斷並立即提倡治療。在這項工作中,提出了一個新穎的可解釋深度學習模型,以自動化新生兒癲癇發作偵測過程,並採用減少的腦電圖裝置,其中採用了卷積神經網路、圖形注意力層和全連接層。除了能夠使用減少的裝置即時偵測癲癇發作外,此模型還提供了即時可解釋性的獨特優勢。透過在 Zenodo 資料集上使用 10 倍交叉驗證評估效能,所提出的模型在曲線下面積 (AUC) 和召回率方面分別達到了 8.31% 和 42.86% 的絕對改善。 +摘要:資料有效的預訓練已展現出提升規模化定律的巨大潛力。本文認為,有效的預訓練資料應在群組層級中進行策展,將資料點集合視為一個整體,而非獨立的貢獻者。為達成此目的,我們提出群組層級資料影響建模(Group-MATES),這是一種新穎的資料有效預訓練方法,可擷取和最佳化群組層級資料效用。具體而言,Group-MATES 透過使用資料集在區域探測預訓練模型,收集神諭群組層級影響。接著,微調關係資料影響模型,以關係加權聚合個別影響來近似神諭。微調模型透過最大化其群組層級影響預測,選取資料子集,並透過考量影響的群集,啟用有效率的推論。在 DCLM 基準上的實驗證明,與 DCLM-Baseline 相比,Group-MATES 在 22 個下游任務上達成 10% 的相對核心分數提升,並比基於個別影響的方法高出 5%,建立了新的技術水準。進一步的分析強調了關係資料影響模型在擷取資料點之間的複雜互動上的有效性。 -##### **Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques** -2406.00532v1 by Samita Bai, Sidra Nasir, Rizwan Ahmed Khan, Sheeraz Arif, Alexandre Meyer, Hubert Konik +##### **Human Misperception of Generative-AI Alignment: A Laboratory Experiment** +2502.14708v1 by Kevin He, Ran Shorrer, Mengjia Xia -Breast cancer (BC) stands as one of the most common malignancies affecting -women worldwide, necessitating advancements in diagnostic methodologies for -better clinical outcomes. This article provides a comprehensive exploration of -the application of Explainable Artificial Intelligence (XAI) techniques in the -detection and diagnosis of breast cancer. As Artificial Intelligence (AI) -technologies continue to permeate the healthcare sector, particularly in -oncology, the need for transparent and interpretable models becomes imperative -to enhance clinical decision-making and patient care. 
This review discusses the -integration of various XAI approaches, such as SHAP, LIME, Grad-CAM, and -others, with machine learning and deep learning models utilized in breast -cancer detection and classification. By investigating the modalities of breast -cancer datasets, including mammograms, ultrasounds and their processing with -AI, the paper highlights how XAI can lead to more accurate diagnoses and -personalized treatment plans. It also examines the challenges in implementing -these techniques and the importance of developing standardized metrics for -evaluating XAI's effectiveness in clinical settings. Through detailed analysis -and discussion, this article aims to highlight the potential of XAI in bridging -the gap between complex AI models and practical healthcare applications, -thereby fostering trust and understanding among medical professionals and -improving patient outcomes. +We conduct an incentivized laboratory experiment to study people's perception +of generative artificial intelligence (GenAI) alignment in the context of +economic decision-making. Using a panel of economic problems spanning the +domains of risk, time preference, social preference, and strategic +interactions, we ask human subjects to make choices for themselves and to +predict the choices made by GenAI on behalf of a human user. We find that +people overestimate the degree of alignment between GenAI's choices and human +choices. In every problem, human subjects' average prediction about GenAI's +choice is substantially closer to the average human-subject choice than it is +to the GenAI choice. At the individual level, different subjects' predictions +about GenAI's choice in a given problem are highly correlated with their own +choices in the same problem. We explore the implications of people +overestimating GenAI alignment in a simple theoretical model. 
-摘要:乳癌 (BC) 是影響全球女性最常見的惡性腫瘤之一,因此需要進步的診斷方法,以改善臨床結果。本文全面探討了可解釋人工智慧 (XAI) 技術在乳癌偵測和診斷中的應用。隨著人工智慧 (AI) 技術持續滲透醫療保健領域,特別是在腫瘤學中,透明且可解釋的模型需求變得勢在必行,以增強臨床決策制定和患者照護。此篇評論探討了各種 XAI 方法的整合,例如 SHAP、LIME、Grad-CAM 等,以及用於乳癌偵測和分類的機器學習和深度學習模型。透過探討乳癌資料集的模式,包括乳房攝影、超音波及其在 AI 中的處理,本文重點說明 XAI 如何能導致更準確的診斷和個人化治療計畫。它也探討了實施這些技術的挑戰,以及制定標準化評量指標以評估 XAI 在臨床環境中的有效性的重要性。透過詳細的分析和討論,本文旨在強調 XAI 在縮小複雜 AI 模型與實務醫療保健應用之間差距的潛力,進而促進醫療專業人員之間的信任與理解,並改善患者的結果。 +摘要:我們進行一項誘因實驗室實驗,以研究人們對生成式人工智慧 (GenAI) 在經濟決策制定中的對齊認知。使用涵蓋風險、時間偏好、社會偏好和策略性互動領域的經濟問題小組,我們要求受試者為自己做出選擇,並預測 GenAI 代表人類使用者做出的選擇。我們發現人們高估了 GenAI 選擇和人類選擇之間的對齊程度。在每個問題中,受試者對 GenAI 選擇的平均預測都比對 GenAI 選擇的預測更接近於平均人類受試者選擇。在個人層面上,不同受試者對特定問題中 GenAI 選擇的預測與他們在同一個問題中的選擇高度相關。我們在一個簡單的理論模型中探討了人們高估 GenAI 對齊的影響。 -##### **Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition** -2406.01624v2 by Alaa Nfissi, Wassim Bouachir, Nizar Bouguila, Brian Mishara +##### **Not All Data are Good Labels: On the Self-supervised Labeling for Time Series Forecasting** +2502.14704v1 by Yuxuan Yang, Dalin Zhang, Yuxuan Liang, Hua Lu, Huan Li, Gang Chen -Speech emotion recognition (SER) has gained significant attention due to its -several application fields, such as mental health, education, and -human-computer interaction. However, the accuracy of SER systems is hindered by -high-dimensional feature sets that may contain irrelevant and redundant -information. To overcome this challenge, this study proposes an iterative -feature boosting approach for SER that emphasizes feature relevance and -explainability to enhance machine learning model performance. Our approach -involves meticulous feature selection and analysis to build efficient SER -systems. In addressing our main problem through model explainability, we employ -a feature evaluation loop with Shapley values to iteratively refine feature -sets. This process strikes a balance between model performance and -transparency, which enables a comprehensive understanding of the model's -predictions. 
The proposed approach offers several advantages, including the -identification and removal of irrelevant and redundant features, leading to a -more effective model. Additionally, it promotes explainability, facilitating -comprehension of the model's predictions and the identification of crucial -features for emotion determination. The effectiveness of the proposed method is -validated on the SER benchmarks of the Toronto emotional speech set (TESS), -Berlin Database of Emotional Speech (EMO-DB), Ryerson Audio-Visual Database of -Emotional Speech and Song (RAVDESS), and Surrey Audio-Visual Expressed Emotion -(SAVEE) datasets, outperforming state-of-the-art methods. To the best of our -knowledge, this is the first work to incorporate model explainability into an -SER framework. The source code of this paper is publicly available via this -https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition. +Time Series Forecasting (TSF) is a crucial task in various domains, yet +existing TSF models rely heavily on high-quality data and insufficiently +exploit all available data. This paper explores a novel self-supervised +approach to re-label time series datasets by inherently constructing candidate +datasets. During the optimization of a simple reconstruction network, +intermediates are used as pseudo labels in a self-supervised paradigm, +improving generalization for any predictor. We introduce the Self-Correction +with Adaptive Mask (SCAM), which discards overfitted components and selectively +replaces them with pseudo labels generated from reconstructions. Additionally, +we incorporate Spectral Norm Regularization (SNR) to further suppress +overfitting from a loss landscape perspective. Our experiments on eleven +real-world datasets demonstrate that SCAM consistently improves the performance +of various backbone models. 
This work offers a new perspective on constructing +datasets and enhancing the generalization of TSF models through self-supervised +learning. -摘要:語音情緒辨識 (SER) 由於其在心理健康、教育和人機互動等多個應用領域而備受關注。然而,SER 系統的準確性受到高維特徵集的阻礙,這些特徵集可能包含不相關和冗餘的資訊。為了克服這個挑戰,本研究提出了一種用於 SER 的迭代特徵提升方法,該方法強調特徵相關性和可解釋性,以增強機器學習模型的效能。我們的做法涉及仔細的特徵選擇和分析,以建立高效的 SER 系統。為了透過模型可解釋性解決我們的核心問題,我們採用了具有 Shapley 值的特徵評估迴圈,以反覆改善特徵集。這個過程在模型效能和透明度之間取得平衡,這使得我們能夠全面了解模型的預測。所提出的方法提供了多項優點,包括識別和移除不相關和冗餘的特徵,從而建立更有效的模型。此外,它促進了可解釋性,有助於理解模型的預測以及識別情緒決定的關鍵特徵。所提出的方法的有效性已在多倫多情緒語音集 (TESS)、柏林情緒語音資料庫 (EMO-DB)、賴爾森音訊視覺情緒語音和歌曲資料庫 (RAVDESS) 和薩里音訊視覺表達情緒 (SAVEE) 資料集的 SER 基準上得到驗證,其效能優於現有方法。據我們所知,這是第一個將模型可解釋性納入 SER 架構的研究。本文的原始碼可透過此連結公開取得:https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition。 +摘要:時間序列預測 (TSF) 在各個領域中都是一項重要的任務,但現有的 TSF 模型極度依賴高品質的資料,且無法充分利用所有可用的資料。本文探討了一種新穎的自監督方法,藉由內建地建構候選資料集來重新標記時間序列資料集。在最佳化一個簡單的重建網路過程中,中間產物會在自監督範例中作為偽標籤,進而改善任何預測器的概化能力。我們引入了帶有自適應遮罩 (SCAM) 的自我修正,它會捨棄過度擬合的組成,並選擇性地以從重建產生的偽標籤取代它們。此外,我們納入了頻譜範數正規化 (SNR) 來進一步抑制從損失景觀觀點來看產生的過度擬合。我們在 11 個真實世界的資料集上進行的實驗,證明 SCAM 持續改善各種主幹模型的效能。這項工作提供了建構資料集和透過自監督學習來提升 TSF 模型概化能力的新觀點。 -##### **The Explanation Necessity for Healthcare AI** -2406.00216v1 by Michail Mamalakis, Héloïse de Vareilles, Graham Murray, Pietro Lio, John Suckling +##### **I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search** +2502.14693v1 by Zujie Liang, Feng Wei, Wujiang Xu, Lin Chen, Yuxi Qian, Xinhui Wu -Explainability is often critical to the acceptable implementation of -artificial intelligence (AI). Nowhere is this more important than healthcare -where decision-making directly impacts patients and trust in AI systems is -essential. This trust is often built on the explanations and interpretations -the AI provides. Despite significant advancements in AI interpretability, there -remains the need for clear guidelines on when and to what extent explanations -are necessary in the medical context. 
We propose a novel categorization system -with four distinct classes of explanation necessity, guiding the level of -explanation required: patient or sample (local) level, cohort or dataset -(global) level, or both levels. We introduce a mathematical formulation that -distinguishes these categories and offers a practical framework for researchers -to determine the necessity and depth of explanations required in medical AI -applications. Three key factors are considered: the robustness of the -evaluation protocol, the variability of expert observations, and the -representation dimensionality of the application. In this perspective, we -address the question: When does an AI medical application need to be explained, -and at what level of detail? +Recent advancements in large language models (LLMs) have shown remarkable +potential in automating machine learning tasks. However, existing LLM-based +agents often struggle with low-diversity and suboptimal code generation. While +recent work has introduced Monte Carlo Tree Search (MCTS) to address these +issues, limitations persist in the quality and diversity of thoughts generated, +as well as in the scalar value feedback mechanisms used for node selection. In +this study, we introduce Introspective Monte Carlo Tree Search (I-MCTS), a +novel approach that iteratively expands tree nodes through an introspective +process that meticulously analyzes solutions and results from parent and +sibling nodes. This facilitates a continuous refinement of the node in the +search tree, thereby enhancing the overall decision-making process. Furthermore, +we integrate a Large Language Model (LLM)-based value model to facilitate +direct evaluation of each node's solution prior to conducting comprehensive +computational rollouts. A hybrid rewarding mechanism is implemented to +seamlessly transition the Q-value from LLM-estimated scores to actual +performance scores. 
This allows higher-quality nodes to be traversed +earlier. Applied to the various ML tasks, our approach demonstrates a 6% +absolute improvement in performance compared to the strong open-source AutoML +agents, showcasing its effectiveness in enhancing agentic AutoML systems. -摘要:可解释性通常对于人工智能 (AI) 的可接受实施至关重要。在医疗保健领域,这一点尤为重要,因为决策直接影响患者,并且对 AI 系统的信任至关重要。这种信任通常建立在 AI 提供的解释和诠释之上。尽管 AI 可解释性取得了重大进展,但仍然需要明确的指导方针,说明在医疗环境中何时以及在多大程度上需要解释。我们提出了一种新颖的分类系统,该系统具有四种不同的解释必要性类别,指导所需的解释级别:患者或样本(局部)级别、队列或数据集(全局)级别,或两个级别。我们引入了一个数学公式,该公式区分了这些类别,并为研究人员提供了一个实用框架,以确定医疗 AI 应用中所需的解释的必要性和深度。考虑了三个关键因素:评估协议的稳健性、专家观察的可变性以及应用程序的表示维数。从这个角度来看,我们解决了这个问题:AI 医疗应用何时需要解释,以及需要解释到何种程度? +摘要:大型語言模型 (LLM) 的最新進展已展現出自動化機器學習任務的顯著潛力。然而,現有的基於 LLM 的代理通常會遇到低多樣性和次優代碼生成的問題。雖然最近的工作已引入蒙地卡羅樹搜尋 (MCTS) 來解決這些問題,但仍存在於所產生想法的品質和多樣性,以及用於節點選擇的標量值回饋機制中。在本研究中,我們介紹了內省蒙地卡羅樹搜尋 (I-MCTS),這是一種透過內省過程反覆擴展樹節點的新方法,該過程會細緻地分析來自父節點和同層節點的解決方案和結果。這有助於持續改善搜尋樹中的節點,進而增強整體決策制定過程。此外,我們整合了一個基於大型語言模型 (LLM) 的值模型,以便在進行全面運算展開之前直接評估每個節點的解決方案。實作了一種混合獎勵機制,以無縫地將 Q 值從 LLM 估計分數轉換為實際效能分數。這允許較高品質的節點更早被遍歷。應用於各種 ML 任務,我們的做法展示出比強大的開源 AutoML 代理高出 6% 的絕對效能提升,證明了其在增強代理式 AutoML 系統方面的有效性。 -##### **Interdisciplinary Expertise to Advance Equitable Explainable AI** -2406.18563v1 by Chloe R. Bennett, Heather Cole-Lewis, Stephanie Farquhar, Naama Haamel, Boris Babenko, Oran Lang, Mat Fleck, Ilana Traynis, Charles Lau, Ivor Horn, Courtney Lyles +##### **Bridging the Gap: Transforming Natural Language Questions into SQL Queries via Abstract Query Pattern and Contextual Schema Markup** +2502.14682v1 by Yonghui Kong, Hongbing Hu, Dan Zhang, Siyuan Chai, Fan Zhang, Wei Wang -The field of artificial intelligence (AI) is rapidly influencing health and -healthcare, but bias and poor performance persists for populations who face -widespread structural oppression. Previous work has clearly outlined the need -for more rigorous attention to data representativeness and model performance to -advance equity and reduce bias. 
However, there is an opportunity to also -improve the explainability of AI by leveraging best practices of social -epidemiology and health equity to help us develop hypotheses for associations -found. In this paper, we focus on explainable AI (XAI) and describe a framework -for interdisciplinary expert panel review to discuss and critically assess AI -model explanations from multiple perspectives and identify areas of bias and -directions for future research. We emphasize the importance of the -interdisciplinary expert panel to produce more accurate, equitable -interpretations which are historically and contextually informed. -Interdisciplinary panel discussions can help reduce bias, identify potential -confounders, and identify opportunities for additional research where there are -gaps in the literature. In turn, these insights can suggest opportunities for -AI model improvement. +Large language models have demonstrated excellent performance in many tasks, +including Text-to-SQL, due to their powerful in-context learning capabilities. +They are becoming the mainstream approach for Text-to-SQL. However, these +methods still have a significant gap compared to human performance, especially +on complex questions. As the complexity of questions increases, the gap between +questions and SQLs increases. We identify two important gaps: the structural +mapping gap and the lexical mapping gap. To tackle these two gaps, we propose +PAS-SQL, an efficient SQL generation pipeline based on LLMs, which alleviates +gaps through Abstract Query Pattern (AQP) and Contextual Schema Markup (CSM). +AQP aims to obtain the structural pattern of the question by removing +database-related information, which enables us to find structurally similar +demonstrations. CSM aims to associate database-related text span in the +question with specific tables or columns in the database, which alleviates the +lexical mapping gap. 
Experimental results on the Spider and BIRD datasets +demonstrate the effectiveness of our proposed method. Specifically, PAS-SQL + +GPT-4o sets a new state-of-the-art on the Spider benchmark with an execution +accuracy of 87.9%, and achieves leading results on the BIRD dataset with an +execution accuracy of 64.67%. + +摘要:大型語言模型在許多任務中表現出色,包括文字轉 SQL,這歸功於它們強大的情境學習能力。它們正成為文字轉 SQL 的主流方法。然而,這些方法與人類的表現仍有顯著差距,特別是在複雜的問題上。隨著問題的複雜性增加,問題和 SQL 之間的差距也隨之增加。我們找出兩個重要的差距:結構對應差距和詞彙對應差距。為了解決這兩個差距,我們提出 PAS-SQL,一種基於 LLM 的高效 SQL 產生管道,它透過抽象查詢模式 (AQP) 和情境架構標記 (CSM) 來縮小差距。AQP 旨在透過移除與資料庫相關的資訊來取得問題的結構模式,這使我們能夠找到結構上相似的範例。CSM 旨在將問題中與資料庫相關的文字範圍與資料庫中的特定表格或欄位關聯起來,這可以縮小詞彙對應差距。在 Spider 和 BIRD 資料集上的實驗結果證明了我們所提出的方法的有效性。具體來說,PAS-SQL + GPT-4o 在 Spider 基準測試中設定了一個新的技術水準,執行準確度為 87.9%,並在 BIRD 資料集上取得領先的結果,執行準確度為 64.67%。 + +##### **How to Get Your LLM to Generate Challenging Problems for Evaluation** +2502.14678v1 by Arkil Patel, Siva Reddy, Dzmitry Bahdanau + +The pace of evolution of Large Language Models (LLMs) necessitates new +approaches for rigorous and comprehensive evaluation. Traditional human +annotation is increasingly impracticable due to the complexities and costs +involved in generating high-quality, challenging problems. In this work, we +introduce CHASE, a unified framework to synthetically generate challenging +problems using LLMs without human involvement. For a given task, our approach +builds a hard problem in a bottom-up manner from simpler components. Moreover, +our framework decomposes the generation process into independently verifiable +sub-tasks, thereby ensuring a high level of quality and correctness. We +implement CHASE to create evaluation benchmarks across three diverse domains: +(1) document-based question answering, (2) repository-level code completion, +and (3) math reasoning. 
The performance of state-of-the-art LLMs on these +synthetic benchmarks lies in the range of 40-60% accuracy, thereby +demonstrating the effectiveness of our framework at generating challenging +problems. We publicly release our benchmarks and code. -摘要:人工智慧 (AI) 領域正快速影響著健康與醫療保健,但對於面臨廣泛結構性壓迫的人群來說,偏見和不良表現依然存在。先前的研究已清楚說明,需要更嚴格地注意資料代表性和模型效能,以促進公平性並減少偏見。然而,我們有機會透過運用社會流行病學和健康公平的最佳實務,來改善 AI 的可解釋性,以幫助我們針對發現的關聯性,發展假設。在本文中,我們專注於可解釋 AI (XAI),並描述一個跨領域專家小組審查架構,以從多重觀點討論和批判性評估 AI 模型的解釋,並找出偏見領域和未來研究的方向。我們強調跨領域專家小組對於產生更準確、公平的詮釋至關重要,而這些詮釋是根據歷史和脈絡而來的。跨領域小組討論有助於減少偏見、找出潛在的混淆因素,並在文獻中有缺口時找出額外研究的機會。反過來,這些見解可以建議 AI 模型改進的機會。 +摘要:大型語言模型 (LLM) 的演化速度需要新的方法來進行嚴謹且全面的評估。由於產生高品質、具挑戰性的問題所涉及的複雜性和成本,傳統的人工標註正變得越來越不可行。在這項工作中,我們介紹了 CHASE,一個統一的框架,用於使用 LLM 合成產生具有挑戰性的問題,而無需人工參與。對於給定的任務,我們的做法是以自下而上的方式從更簡單的組成部分來建立一個困難的問題。此外,我們的框架將生成過程分解為獨立可驗證的子任務,從而確保高品質和正確性。我們實作 CHASE 來建立三個不同領域的評估基準:(1) 基於文件的問答、(2) 儲存庫層級的程式碼完成,以及 (3) 數學推理。最先進的 LLM 在這些合成基準上的效能落在 40-60% 的準確度範圍內,從而證明了我們的框架在產生具有挑戰性的問題上的有效性。我們公開發布我們的基準和程式碼。 -##### **"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts** -2407.11978v1 by Hubert D. Zając, Jorge M. N. Ribeiro, Silvia Ingala, Simona Gentile, Ruth Wanjohi, Samuel N. Gitau, Jonathan F. Carlsen, Michael B. Nielsen, Tariq O. Andersen +##### **Data-Constrained Synthesis of Training Data for De-Identification** +2502.14677v1 by Thomas Vakili, Aron Henriksson, Hercules Dalianis -Artificial Intelligence (AI) repeatedly match or outperform radiologists in -lab experiments. However, real-world implementations of radiological AI-based -systems are found to provide little to no clinical value. This paper explores -how to design AI for clinical usefulness in different contexts. We conducted 19 -design sessions and design interventions with 13 radiologists from 7 clinical -sites in Denmark and Kenya, based on three iterations of a functional AI-based -prototype. Ten sociotechnical dependencies were identified as crucial for the -design of AI in radiology. 
We conceptualised four technical dimensions that -must be configured to the intended clinical context of use: AI functionality, -AI medical focus, AI decision threshold, and AI Explainability. We present four -design recommendations on how to address dependencies pertaining to the medical -knowledge, clinic type, user expertise level, patient context, and user -situation that condition the configuration of these technical dimensions. +Many sensitive domains -- such as the clinical domain -- lack widely +available datasets due to privacy risks. The increasing generative capabilities +of large language models (LLMs) have made synthetic datasets a viable path +forward. In this study, we domain-adapt LLMs to the clinical domain and +generate synthetic clinical texts that are machine-annotated with tags for +personally identifiable information using capable encoder-based NER models. The +synthetic corpora are then used to train synthetic NER models. The results show +that training NER models using synthetic corpora incurs only a small drop in +predictive performance. The limits of this process are investigated in a +systematic ablation study -- using both Swedish and Spanish data. Our analysis +shows that smaller datasets can be sufficient for domain-adapting LLMs for data +synthesis. Instead, the effectiveness of this process is almost entirely +contingent on the performance of the machine-annotating NER models trained +using the original data. 
-摘要:人工智慧(AI)在實驗室實驗中不斷地與放射科醫師匹敵或表現得更出色。然而,發現放射科 AI 為基礎系統的實際執行幾乎沒有提供臨床價值。本文探討如何為 AI 設計在不同情境中臨床上的效用。我們根據功能性 AI 為基礎原型的三次迭代,在丹麥和肯亞的 7 個臨床場域與 13 位放射科醫師進行了 19 次設計會議和設計介入。十個社會技術依賴關係被認為對於放射科中 AI 的設計至關重要。我們概念化了四個技術面向,必須根據預期的臨床使用情境進行設定:AI 功能、AI 醫療重點、AI 決策門檻,以及 AI 可解釋性。我們提出四項設計建議,說明如何處理與醫療知識、診所類型、使用者專業知識等級、患者情境,以及影響這些技術面向設定的使用者情境相關的依賴關係。 +摘要:許多敏感領域(例如臨床領域)由於隱私風險而缺乏廣泛可用的資料集。大型語言模型 (LLM) 不斷增強的生成能力已使合成資料集成為可行的途徑。在這項研究中,我們將領域適應 LLM 應用於臨床領域,並生成使用具備編碼器功能的 NER 模型以個人可識別資訊標籤進行機器標註的合成臨床文本。然後使用合成語料庫來訓練合成 NER 模型。結果顯示,使用合成語料庫訓練 NER 模型僅會導致預測效能略微下降。在系統消融研究中調查此程序的限制,同時使用瑞典語和西班牙語資料。我們的分析顯示,較小的資料集足以用於領域適應 LLM 以進行資料合成。相反地,此程序的有效性幾乎完全取決於使用原始資料訓練的機器標註 NER 模型的效能。 -##### **Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making** -2405.16424v1 by Min Hun Lee, Silvana Xin Yi Choo, Shamala D/O Thilarajah +##### **BP-SGCN: Behavioral Pseudo-Label Informed Sparse Graph Convolution Network for Pedestrian and Heterogeneous Trajectory Prediction** +2502.14676v1 by Ruochen Li, Stamos Katsigiannis, Tae-Kyun Kim, Hubert P. H. Shum -With advanced AI/ML, there has been growing research on explainable AI (XAI) -and studies on how humans interact with AI and XAI for effective human-AI -collaborative decision-making. However, we still have a lack of understanding -of how AI systems and XAI should be first presented to users without technical -backgrounds. In this paper, we present the findings of semi-structured -interviews with health professionals (n=12) and students (n=4) majoring in -medicine and health to study how to improve onboarding with AI and XAI. For the -interviews, we built upon human-AI interaction guidelines to create onboarding -materials of an AI system for stroke rehabilitation assessment and AI -explanations and introduce them to the participants. 
Our findings reveal that -beyond presenting traditional performance metrics on AI, participants desired -benchmark information, the practical benefits of AI, and interaction trials to -better contextualize AI performance, and refine the objectives and performance -of AI. Based on these findings, we highlight directions for improving -onboarding with AI and XAI and human-AI collaborative decision-making. +Trajectory prediction allows better decision-making in applications of +autonomous vehicles or surveillance by predicting the short-term future +movement of traffic agents. It is classified into pedestrian or heterogeneous +trajectory prediction. The former exploits the relatively consistent behavior +of pedestrians, but is limited in real-world scenarios with heterogeneous +traffic agents such as cyclists and vehicles. The latter typically relies on +extra class label information to distinguish the heterogeneous agents, but such +labels are costly to annotate and cannot be generalized to represent different +behaviors within the same class of agents. In this work, we introduce the +behavioral pseudo-labels that effectively capture the behavior distributions of +pedestrians and heterogeneous agents solely based on their motion features, +significantly improving the accuracy of trajectory prediction. To implement the +framework, we propose the Behavioral Pseudo-Label Informed Sparse Graph +Convolution Network (BP-SGCN) that learns pseudo-labels and informs to a +trajectory predictor. For optimization, we propose a cascaded training scheme, +in which we first learn the pseudo-labels in an unsupervised manner, and then +perform end-to-end fine-tuning on the labels in the direction of increasing the +trajectory prediction accuracy. Experiments show that our pseudo-labels +effectively model different behavior clusters and improve trajectory +prediction. 
Our proposed BP-SGCN outperforms existing methods using both +pedestrian (ETH/UCY, pedestrian-only SDD) and heterogeneous agent datasets +(SDD, Argoverse 1). -摘要:隨著先進的 AI/ML,對可解釋 AI (XAI) 的研究不斷增加,以及關於人類如何與 AI 和 XAI 互動以進行有效的人工智慧協作決策制定。然而,我們仍然缺乏對 AI 系統和 XAI 應如何首先呈現給沒有技術背景的用戶的了解。在本文中,我們展示了與醫療專業人員 (n=12) 和主修醫學和健康的學生 (n=4) 進行半結構化訪談的結果,以研究如何改善 AI 和 XAI 的入門。對於訪談,我們建立在人機互動準則之上,為中風康復評估和 AI 解釋的 AI 系統創建入門材料,並將它們介紹給參與者。我們的研究結果表明,除了呈現傳統的 AI 性能指標外,參與者還希望基准信息、AI 的實際好處以及交互試驗,以更好地將 AI 性能情境化,並完善 AI 的目標和性能。根據這些發現,我們強調了改進 AI 和 XAI 以及人機協作決策制定的入門方向。 +摘要:軌跡預測允許在自動駕駛車輛或監視應用中做出更好的決策,藉由預測交通代理的短期未來移動。它被分類為行人或異質軌跡預測。前者利用行人相對一致的行為,但受限於與自行車騎士和車輛等異質交通代理的真實世界場景。後者通常依賴額外的類別標籤資訊來區分異質代理,但此類標籤的註解成本很高,且無法概括為表示同一類別代理中的不同行為。在這項工作中,我們引入了行為偽標籤,它僅根據行人和異質代理的運動特徵有效捕捉行為分佈,顯著提升軌跡預測的準確度。為實作架構,我們提出了行為偽標籤告知稀疏圖形卷積網路 (BP-SGCN),它學習偽標籤並告知軌跡預測器。針對最佳化,我們提出了一種串聯訓練方案,其中我們首先以非監督的方式學習偽標籤,然後在標籤上執行端到端微調,朝著提升軌跡預測準確度的方向進行。實驗顯示我們的偽標籤有效建模不同的行為叢集,並提升軌跡預測。我們提出的 BP-SGCN 使用行人 (ETH/UCY,僅限行人的 SDD) 和異質代理資料集 (SDD,Argoverse 1) 都優於現有方法。 -##### **Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach** -2405.17502v1 by Ziming Liu, Longjian Liu, Robert E. Heidel, Xiaopeng Zhao +##### **Explanations of Deep Language Models Explain Language Representations in the Brain** +2502.14671v1 by Maryam Rahimi, Yadollah Yaghoobzadeh, Mohammad Reza Daliri -This article uses machine learning (ML) and explainable artificial -intelligence (XAI) techniques to investigate the relationship between -nutritional status and mortality rates associated with Alzheimers disease (AD). -The Third National Health and Nutrition Examination Survey (NHANES III) -database is employed for analysis. The random forest model is selected as the -base model for XAI analysis, and the Shapley Additive Explanations (SHAP) -method is used to assess feature importance. The results highlight significant -nutritional factors such as serum vitamin B12 and glycated hemoglobin. 
The -study demonstrates the effectiveness of random forests in predicting AD -mortality compared to other diseases. This research provides insights into the -impact of nutrition on AD and contributes to a deeper understanding of disease -progression. +Recent advances in artificial intelligence have given rise to large language +models (LLMs) that not only achieve human-like performance but also share +computational principles with the brain's language processing mechanisms. While +previous research has primarily focused on aligning LLMs' internal +representations with neural activity, we introduce a novel approach that +leverages explainable AI (XAI) methods to forge deeper connections between the +two domains. Using attribution methods, we quantified how preceding words +contribute to an LLM's next-word predictions and employed these explanations to +predict fMRI recordings from participants listening to the same narratives. Our +findings demonstrate that attribution methods robustly predict brain activity +across the language network, surpassing traditional internal representations in +early language areas. This alignment is hierarchical: early-layer explanations +correspond to the initial stages of language processing in the brain, while +later layers align with more advanced stages. Moreover, the layers more +influential on LLM next-word prediction$\unicode{x2014}$those with higher +attribution scores$\unicode{x2014}$exhibited stronger alignment with neural +activity. This work establishes a bidirectional bridge between AI and +neuroscience. First, we demonstrate that attribution methods offer a powerful +lens for investigating the neural mechanisms of language comprehension, +revealing how meaning emerges from preceding context. Second, we propose using +brain alignment as a metric to evaluate the validity of attribution methods, +providing a framework for assessing their biological plausibility. 
-摘要:本文使用機器學習 (ML) 和可解釋人工智慧 (XAI) 技術來探討營養狀況與阿茲海默症 (AD) 相關的死亡率之間的關係。採用第三次全國健康與營養檢查調查 (NHANES III) 資料庫進行分析。選擇隨機森林模型作為 XAI 分析的基礎模型,並使用 Shapley Additive Explanations (SHAP) 方法來評估特徵重要性。結果突顯了重要的營養因素,例如血清維生素 B12 和糖化血紅蛋白。該研究證明了隨機森林在預測 AD 死亡率方面相較於其他疾病的有效性。本研究提供了營養對 AD 的影響的見解,並有助於更深入地了解疾病的進展。 +摘要:最近的人工智能的進展產生了大型語言模型 (LLM),它不僅達到類似人類的表現,還與大腦的語言處理機制共享計算原理。雖然先前的研究主要集中於將 LLM 的內部表徵與神經活動對齊,但我們引入了一種新穎的方法,該方法利用可解釋 AI (XAI) 方法在兩個域之間建立更深層的聯繫。使用歸因方法,我們量化了前一個單詞如何促成 LLM 的下一個單詞預測,並利用這些解釋來預測參與者在聆聽相同敘述時的大腦功能性磁共振造影 (fMRI) 記錄。我們的發現表明,歸因方法可以穩健地預測整個語言網路中的大腦活動,超越了早期語言區域中的傳統內部表徵。這種對齊是分層的:早期層次解釋對應於大腦中語言處理的初始階段,而後續層次則與更進階的階段對齊。此外,對 LLM 下一個單詞預測影響力較大的層次(即歸因分數較高的層次)表現出與神經活動更強的對齊。這項工作在 AI 與神經科學之間建立了一個雙向橋樑。首先,我們證明歸因方法提供了一個強大的視角,用於研究語言理解的神經機制,揭示意義如何從先前的脈絡中產生。其次,我們建議使用大腦對齊作為評估歸因方法有效性的指標,提供了一個評估其生物學合理性的框架。 -##### **Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone** -2407.11974v1 by Catalina Gomez, Ruolin Wang, Katharina Breininger, Corinne Casey, Chris Bradley, Mitchell Pavlak, Alex Pham, Jithin Yohannan, Mathias Unberath +##### **AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO** +2502.14669v1 by Alan Dao, Dinh Bach Vu -Primary care providers are vital for initial triage and referrals to -specialty care. In glaucoma, asymptomatic and fast progression can lead to -vision loss, necessitating timely referrals to specialists. However, primary -eye care providers may not identify urgent cases, potentially delaying care. -Artificial Intelligence (AI) offering explanations could enhance their referral -decisions. We investigate how various AI explanations help providers -distinguish between patients needing immediate or non-urgent specialist -referrals. We built explainable AI algorithms to predict glaucoma surgery needs -from routine eyecare data as a proxy for identifying high-risk patients. 
We -incorporated intrinsic and post-hoc explainability and conducted an online -study with optometrists to assess human-AI team performance, measuring referral -accuracy and analyzing interactions with AI, including agreement rates, task -time, and user experience perceptions. AI support enhanced referral accuracy -among 87 participants (59.9%/50.8% with/without AI), though Human-AI teams -underperformed compared to AI alone. Participants believed they included AI -advice more when using the intrinsic model, and perceived it more useful and -promising. Without explanations, deviations from AI recommendations increased. -AI support did not increase workload, confidence, and trust, but reduced -challenges. On a separate test set, our black-box and intrinsic models achieved -an accuracy of 77% and 71%, respectively, in predicting surgical outcomes. We -identify opportunities of human-AI teaming for glaucoma management in primary -eye care, noting that while AI enhances referral accuracy, it also shows a -performance gap compared to AI alone, even with explanations. Human involvement -remains essential in medical decision making, underscoring the need for future -research to optimize collaboration, ensuring positive experiences and safe AI -use. +Large Language Models (LLMs) have demonstrated impressive capabilities in +language processing, yet they often struggle with tasks requiring genuine +visual spatial reasoning. In this paper, we introduce a novel two-stage +training framework designed to equip standard LLMs with visual reasoning +abilities for maze navigation. First, we leverage Supervised Fine Tuning (SFT) +on a curated dataset of tokenized maze representations to teach the model to +predict step-by-step movement commands. Next, we apply Group Relative Policy +Optimization (GRPO)-a technique used in DeepSeekR1-with a carefully crafted +reward function to refine the model's sequential decision-making and encourage +emergent chain-of-thought behaviors. 
Experimental results on synthetically +generated mazes show that while a baseline model fails to navigate the maze, +the SFT-trained model achieves 86% accuracy, and further GRPO fine-tuning +boosts accuracy to 93%. Qualitative analyses reveal that GRPO fosters more +robust and self-corrective reasoning, highlighting the potential of our +approach to bridge the gap between language models and visual spatial tasks. +These findings offer promising implications for applications in robotics, +autonomous navigation, and other domains that require integrated visual and +sequential reasoning. -摘要:初級保健提供者對於最初的分流和轉診到專科照護至關重要。在青光眼的情況下,無症狀且快速惡化可能導致視力喪失,因此需要及時轉診給專家。然而,初級眼科保健提供者可能無法識別緊急情況,可能會延誤照護。提供解釋的人工智慧 (AI) 可以加強他們的轉診決策。我們研究各種 AI 解釋如何幫助提供者區分需要立即或非緊急專科轉診的患者。我們建立了解釋性 AI 演算法,以從例行眼科護理資料預測青光眼手術需求,作為識別高風險患者的代理。我們納入了內在和事後解釋性,並與驗光師進行了一項線上研究,以評估人機團隊的表現,衡量轉診準確度並分析與 AI 的互動,包括同意率、任務時間和使用者體驗感知。在 87 名參與者中,AI 支援提高了轉診準確度(使用 AI/未使用的比例為 59.9%/50.8%),儘管人機團隊的表現不如單獨使用 AI。參與者認為他們在使用內在模型時更多地納入了 AI 建議,並認為它更有用且更有希望。沒有解釋,AI 建議的偏差會增加。AI 支援並未增加工作量、信心和信任,但減少了挑戰。在一個單獨的測試集中,我們的黑盒子和內在模型在預測手術結果方面分別達到了 77% 和 71% 的準確度。我們找出在初級眼科保健中,人機團隊合作管理青光眼的機會,並注意到雖然 AI 提高了轉診準確度,但即使有解釋,它也顯示出與單獨使用 AI 相比的效能差距。人類參與在醫療決策中仍然至關重要,這強調了未來研究優化協作、確保正面經驗和安全使用 AI 的必要性。 +摘要:大型語言模型(LLM)在語言處理方面展現出令人印象深刻的能力,但它們經常難以應付需要真正視覺空間推理的任務。在本文中,我們介紹了一種新穎的兩階段訓練架構,旨在為標準 LLM 提供迷宮導航的視覺推理能力。首先,我們在標記化迷宮表示的策展資料集上利用監督微調(SFT)來教導模型預測逐步移動指令。接下來,我們使用 DeepSeekR1 中使用的技術,即群體相對策略最佳化(GRPO),並搭配精心設計的獎勵函數來優化模型的順序決策制定,並鼓勵出現連貫的思考行為。在合成產生的迷宮上進行的實驗結果顯示,雖然基準模型無法導航迷宮,但經過 SFT 訓練的模型達到 86% 的準確度,而進一步的 GRPO 微調將準確度提升至 93%。定性分析顯示,GRPO 促進更強健且自我修正的推理,凸顯了我們的方法在彌合語言模型與視覺空間任務之間差距的潛力。這些發現為機器人、自主導航和其他需要整合視覺和順序推理的領域的應用提供了有希望的啟示。 -##### **Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery** -2406.18552v1 by Yingying Fang, Zihao Jin, Xiaodan Xing, Simon Walsh, Guang Yang +##### **InstructAgent: Building User Controllable Recommender via LLM Agent** +2502.14662v1 by Wujiang Xu, Yunxiao Shi, Zujie Liang, Xuying Ning, Kai Mei, Kun Wang, Xi Zhu, Min Xu, Yongfeng 
Zhang -In medical imaging, particularly in early disease detection and prognosis -tasks, discerning the rationale behind an AI model's predictions is crucial for -evaluating the reliability of its decisions. Conventional explanation methods -face challenges in identifying discernible decisive features in medical image -classifications, where discriminative features are subtle or not immediately -apparent. To bridge this gap, we propose an explainable model that is equipped -with both decision reasoning and feature identification capabilities. Our -approach not only detects influential image patterns but also uncovers the -decisive features that drive the model's final predictions. By implementing our -method, we can efficiently identify and visualise class-specific features -leveraged by the data-driven model, providing insights into the decision-making -processes of deep learning models. We validated our model in the demanding -realm of medical prognosis task, demonstrating its efficacy and potential in -enhancing the reliability of AI in healthcare and in discovering new knowledge -in diseases where prognostic understanding is limited. +Traditional recommender systems usually take the user-platform paradigm, +where users are directly exposed under the control of the platform's +recommendation algorithms. However, the defect of recommendation algorithms may +put users in very vulnerable positions under this paradigm. First, many +sophisticated models are often designed with commercial objectives in mind, +focusing on the platform's benefits, which may hinder their ability to protect +and capture users' true interests. Second, these models are typically optimized +using data from all users, which may overlook individual user's preferences. 
+Due to these shortcomings, users may experience several disadvantages under the +traditional user-platform direct exposure paradigm, such as lack of control +over the recommender system, potential manipulation by the platform, echo +chamber effects, or lack of personalization for less active users due to the +dominance of active users during collaborative learning. Therefore, there is an +urgent need to develop a new paradigm to protect user interests and alleviate +these issues. Recently, some researchers have introduced LLM agents to simulate +user behaviors, these approaches primarily aim to optimize platform-side +performance, leaving core issues in recommender systems unresolved. To address +these limitations, we propose a new user-agent-platform paradigm, where agent +serves as the protective shield between user and recommender system that +enables indirect exposure. To this end, we first construct four recommendation +datasets, denoted as $\dataset$, along with user instructions for each record. 
-摘要:在醫學影像中,特別是在早期疾病檢測和預後任務中,辨別 AI 模型預測背後的原理對於評估其決策的可靠性至關重要。傳統的解釋方法在識別醫學影像分類中可識別的決定性特徵時面臨挑戰,其中區別性特徵很微妙或並不明顯。為了彌合這一差距,我們提出了一個可解釋的模型,該模型具備決策推理和特徵識別能力。我們的做法不僅檢測有影響力的影像模式,還揭示了推動模型最終預測的決定性特徵。通過實施我們的模型,我們可以有效識別和視覺化由數據驅動模型利用的類特定特徵,從而深入了解深度學習模型的決策過程。我們在要求嚴格的醫學預後任務領域驗證了我們的模型,展示了其在提高 AI 在醫療保健中的可靠性和發現預後理解受限疾病的新知識方面的功效和潛力。 +摘要:傳統推薦系統通常採用使用者-平台範例, +其中使用者直接暴露在平台推薦演算法的控制之下。然而,推薦演算法的缺陷可能會讓使用者在這個範例中處於非常脆弱的位置。首先,許多精密的模型通常在設計時就考慮到商業目標,專注於平台的利益,這可能會阻礙它們保護和掌握使用者真正興趣的能力。其次,這些模型通常使用所有使用者的資料進行最佳化,這可能會忽略個別使用者的偏好。由於這些缺點,使用者可能會在傳統使用者-平台直接暴露範例中遇到一些缺點,例如缺乏對推薦系統的控制、平台的潛在操縱、同溫層效應,或由於活躍使用者在協作學習中的主導地位而缺乏針對較不活躍使用者的個人化。因此,迫切需要開發一種新的範例來保護使用者利益並緩解這些問題。最近,一些研究人員引入了 LLM 代理程式來模擬使用者行為,這些方法主要旨在最佳化平台端的效能,而未解決推薦系統中的核心問題。為了解決這些限制,我們提出了一種新的使用者-代理程式-平台範例,其中代理程式作為使用者和推薦系統之間的保護盾,實現間接暴露。為此,我們首先構建了四個推薦資料集,表示為 $\dataset$,以及每條記錄的使用者說明。 -##### **The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach** -2405.13099v1 by Mohsen Jozani, Jason A. Williams, Ahmed Aleroud, Sarbottam Bhagat +##### **Edit Once, Update Everywhere: A Simple Framework for Cross-Lingual Knowledge Synchronization in LLMs** +2502.14645v1 by Yuchen Wu, Liang Ding, Li Shen, Dacheng Tao -This study explores the relationship between informational support seeking -questions, responses, and helpfulness ratings in online health communities. We -created a labeled data set of question-response pairs and developed multimodal -machine learning and deep learning models to reliably predict informational -support questions and responses. We employed explainable AI to reveal the -emotions embedded in informational support exchanges, demonstrating the -importance of emotion in providing informational support. This complex -interplay between emotional and informational support has not been previously -researched. The study refines social support theory and lays the groundwork for -the development of user decision aids. Further implications are discussed. 
+Knowledge editing allows for efficient adaptation of large language models +(LLMs) to new information or corrections without requiring full retraining. +However, prior methods typically focus on either single-language editing or +basic multilingual editing, failing to achieve true cross-linguistic knowledge +synchronization. To address this, we present a simple and practical +state-of-the-art (SOTA) recipe Cross-Lingual Knowledge Democracy Edit (X-KDE), +designed to propagate knowledge from a dominant language to other languages +effectively. Our X-KDE comprises two stages: (i) Cross-lingual Edition +Instruction Tuning (XE-IT), which fine-tunes the model on a curated parallel +dataset to modify in-scope knowledge while preserving unrelated information, +and (ii) Target-language Preference Optimization (TL-PO), which applies +advanced optimization techniques to ensure consistency across languages, +fostering the transfer of updates. Additionally, we contribute a high-quality, +cross-lingual dataset, specifically designed to enhance knowledge transfer +across languages. Extensive experiments on the Bi-ZsRE and MzsRE benchmarks +show that X-KDE significantly enhances cross-lingual performance, achieving an +average improvement of +8.19%, while maintaining high accuracy in monolingual +settings. 
+ +摘要:知識編輯允許大語言模型 (LLM) 有效地適應新資訊或修正,而無需進行完整的再訓練。 +然而,先前的做法通常專注於單一語言編輯或基本的語音編輯,未能實現真正的跨語言知識同步。為了解決這個問題,我們提出了一個簡單且實用的最先進 (SOTA) 配方,即跨語言知識民主編輯 (X-KDE),旨在有效地從主導語言傳播知識到其他語言。我們的 X-KDE 包含兩個階段:(i) 跨語言版本指令調整 (XE-IT),它微調模型,在經過整理的平行資料集上修改範圍內的知識,同時保留不相關的資訊,以及 (ii) 目標語言偏好最佳化 (TL-PO),它應用先進的最佳化技術,以確保跨語言的一致性,促進更新的傳輸。此外,我們貢獻了一個高品質的跨語言資料集,特別設計用於增強跨語言的知識傳輸。在 Bi-ZsRE 和 MzsRE 基準上的廣泛實驗表明,X-KDE 大幅提升了跨語言效能,在單語言設定中維持高準確度的同時,平均提升了 +8.19%。 -摘要:本研究探討線上健康社群中尋求資訊支持的問題、回應,以及有幫助的評分之間的關係。我們建立了一組標記的問答配對資料集,並開發了多模態機器學習和深度學習模型,以可靠地預測資訊支持問題和回應。我們採用可解釋的 AI 來揭示資訊支持交流中蘊含的情緒,證明情緒在提供資訊支持中的重要性。這種情緒支持和資訊支持之間的複雜交互作用以前並未被研究過。本研究改進了社會支持理論,並為使用者決策輔助工具的開發奠定了基礎。討論了進一步的影響。 +##### **LIFT: Improving Long Context Understanding of Large Language Models through Long Input Fine-Tuning** +2502.14644v1 by Yansheng Mao, Yufei Xu, Jiaqi Li, Fanxu Meng, Haotong Yang, Zilong Zheng, Xiyuan Wang, Muhan Zhang -##### **ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education** -2405.10645v1 by Harris Bin Munawar, Nikolaos Misirlis +Long context understanding remains challenging for large language models due +to their limited context windows. This paper presents Long Input Fine-Tuning +(LIFT), a novel framework for long-context modeling that can improve the +long-context performance of arbitrary (short-context) LLMs by dynamically +adapting model parameters based on the long input. Importantly, LIFT, rather +than endlessly extending the context window size to accommodate increasingly +longer inputs in context, chooses to store and absorb the long input in +parameter. By fine-tuning the long input into model parameters, LIFT allows +short-context LLMs to answer questions even when the required information is +not provided in the context during inference. Furthermore, to enhance LIFT +performance while maintaining the original in-context learning (ICL) +capabilities, we introduce Gated Memory, a specialized attention adapter that +automatically balances long input memorization and ICL. 
We provide a +comprehensive analysis of the strengths and limitations of LIFT on long context +understanding, offering valuable directions for future research. -In the era of exponential technology growth, one unexpected guest has claimed -a seat in classrooms worldwide, Artificial Intelligence. Generative AI, such as -ChatGPT, promises a revolution in education, yet it arrives with a double-edged -sword. Its potential for personalized learning is offset by issues of cheating, -inaccuracies, and educators struggling to incorporate it effectively into their -lesson design. We are standing on the brink of this educational frontier, and -it is clear that we need to navigate this terrain with a lot of care. This is a -major challenge that could undermine the integrity and value of our educational -process. So, how can we turn these challenges into opportunities? When used -inappropriately, AI tools can become the perfect tool for the cut copy paste -mentality, and quickly begin to corrode critical thinking, creativity, and deep -understanding, the most important skills in our rapidly changing world. -Teachers feel that they are not equipped to leverage this technology, widening -the digital divide among educators and institutions. Addressing these concerns -calls for an in depth research approach. We will employ empirical research, -drawing on the Technology Acceptance Model, to assess the attitudes toward -generative AI among educators and students. Understanding their perceptions, -usage patterns, and hurdles is the first crucial step in creating an effective -solution. 
The present study will be used as a process manual for future -researchers to apply, running their own data, based on the steps explained here +摘要:由於大型語言模型的上下文視窗有限,因此對於它們而言,長語境理解仍然具有挑戰性。本文提出了長輸入微調 (LIFT),這是一個用於長語境建模的新穎架構,它可以通過根據長輸入動態調整模型參數來改善任意(短語境)LLM 的長語境效能。重要的是,LIFT 沒有無限擴充上下文視窗大小以容納語境中越來越長的輸入,而是選擇將長輸入儲存在參數中並吸收它。通過將長輸入微調到模型參數中,LIFT 允許短語境 LLM 回答問題,即使在推理期間語境中沒有提供所需資訊也是如此。此外,為了在保持原始語境中學習 (ICL) 能力的同時增強 LIFT 效能,我們引入了閘控記憶體,這是一個自動平衡長輸入記憶和 ICL 的特殊注意力適配器。我們對 LIFT 在長語境理解方面的優缺點進行了全面的分析,為未來的研究提供了有價值的方向。 -摘要:在科技飛速發展的時代,一位意外的訪客已在全球教室中佔有一席之地,那就是人工智慧。生成式 AI,例如 ChatGPT,承諾在教育領域掀起一場革命,但它卻是一把雙面刃。它在個人化學習方面的潛力,卻因作弊、不準確以及教育工作者難以將其有效融入教學設計等問題而抵銷。我們正站在這教育前沿的邊緣,顯然我們需要非常小心地探索這片領域。這是一個重大的挑戰,可能會損害我們教育過程的完整性和價值。那麼,我們如何將這些挑戰轉化為機遇?當不適當地使用時,AI 工具可能會成為複製貼上心態的完美工具,並迅速腐蝕批判性思維、創造力和深入理解,這些都是我們快速變化的世界中最重要的技能。教師們覺得他們沒有能力利用這項技術,這擴大了教育工作者和機構之間的數位鴻溝。解決這些問題需要深入的研究方法。我們將採用實證研究,借鑑技術接受模型,來評估教育工作者和學生對生成式 AI 的態度。了解他們的看法、使用模式和障礙是創造有效解決方案的第一個關鍵步驟。本研究將作為未來研究人員應用的流程手冊,根據此處說明的步驟運行他們自己的數據 +##### **Length-Controlled Margin-Based Preference Optimization without Reference Model** +2502.14643v1 by Gengxu Li, Tingyu Xia, Yi Chang, Yuan Wu -##### **Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data** -2405.07590v1 by Camelia Oprea, Mike Grüne, Mateusz Buglowski, Lena Olivier, Thorsten Orlikowsky, Stefan Kowalewski, Mark Schoberer, André Stollenwerk +Direct Preference Optimization (DPO) is a widely adopted offline algorithm +for preference-based reinforcement learning from human feedback (RLHF), +designed to improve training simplicity and stability by redefining reward +functions. However, DPO is hindered by several limitations, including length +bias, memory inefficiency, and probability degradation. To address these +challenges, we propose Length-Controlled Margin-Based Preference Optimization +(LMPO), a more efficient and robust alternative. 
LMPO introduces a uniform +reference model as an upper bound for the DPO loss, enabling a more accurate +approximation of the original optimization objective. Additionally, an average +log-probability optimization strategy is employed to minimize discrepancies +between training and inference phases. A key innovation of LMPO lies in its +Length-Controlled Margin-Based loss function, integrated within the +Bradley-Terry framework. This loss function regulates response length while +simultaneously widening the margin between preferred and rejected outputs. By +doing so, it mitigates probability degradation for both accepted and discarded +responses, addressing a significant limitation of existing methods. We evaluate +LMPO against state-of-the-art preference optimization techniques on two +open-ended large language models, Mistral and LLaMA3, across six conditional +benchmarks. Our experimental results demonstrate that LMPO effectively controls +response length, reduces probability degradation, and outperforms existing +approaches. The code is available at \url{https://github.com/gengxuli/LMPO}. -With the digitalization of health care systems, artificial intelligence -becomes more present in medicine. Especially machine learning shows great -potential for complex tasks such as time series classification, usually at the -cost of transparency and comprehensibility. This leads to a lack of trust by -humans and thus hinders its active usage. Explainable artificial intelligence -tries to close this gap by providing insight into the decision-making process, -the actual usefulness of its different methods is however unclear. This paper -proposes a user study based evaluation of the explanation method Grad-CAM with -application to a neural network for the classification of breaths in time -series neonatal ventilation data. 
We present the perceived usefulness of the -explainability method by different stakeholders, exposing the difficulty to -achieve actual transparency and the wish for more in-depth explanations by many -of the participants. +摘要:直接偏好優化 (DPO) 是一種廣泛採用的離線演算法,用於從人類回饋 (RLHF) 中進行基於偏好的強化學習,旨在透過重新定義獎勵函數來提升訓練的簡潔性和穩定性。然而,DPO 受到若干限制的阻礙,包括長度偏差、記憶體效率低下和機率下降。為了解決這些挑戰,我們提出長度控制邊際偏好優化 (LMPO),一種更有效率且穩健的替代方案。LMPO 引入統一參考模型作為 DPO 損失的上限,能夠更準確地近似原始最佳化目標。此外,採用平均對數機率最佳化策略來最小化訓練和推論階段之間的差異。LMPO 的一項關鍵創新在於其長度控制邊際損失函數,整合在 Bradley-Terry 架構中。此損失函數調節回應長度,同時擴大偏好和拒絕輸出之間的邊際。藉由這麼做,它減輕了已接受和已捨棄回應的機率下降,解決了現有方法的重大限制。我們在兩個開放式大型語言模型 Mistral 和 LLaMA3 上,針對六個條件基準,評估 LMPO 與最先進的偏好優化技術。我們的實驗結果證明,LMPO 有效控制回應長度,減少機率下降,並優於現有方法。程式碼可在 \url{https://github.com/gengxuli/LMPO} 取得。 -摘要:隨著醫療保健系統的數位化,人工智慧在醫學領域中變得更加普及。特別是機器學習在時間序列分類等複雜任務中展現出極大的潛力,但通常是以透明度和可理解性為代價。這導致人類缺乏信任,從而阻礙了其積極使用。可解釋的人工智慧試圖通過提供對決策過程的洞察來彌補這一差距,但其不同方法的實際效用尚不清楚。本文提出了一個基於使用者研究的評估,其中包含了 Grad-CAM 解釋方法,並將其應用於神經網路以分類時間序列新生兒呼吸數據中的呼吸。我們展示了不同利益相關者對可解釋性方法的感知效用,揭示了實現實際透明度的難度,以及許多參與者希望獲得更深入的解釋。 +##### **How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation** +2502.14642v1 by Rui Li, Heming Xia, Xinfeng Yuan, Qingxiu Dong, Lei Sha, Wenjie Li, Zhifang Sui -##### **XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare** -2405.06270v3 by Fatemeh Nazary, Yashar Deldjoo, Tommaso Di Noia, Eugenio di Sciascio +Recently, LLMs have garnered increasing attention across academic disciplines +for their potential as human digital twins, virtual proxies designed to +replicate individuals and autonomously perform tasks such as decision-making, +problem-solving, and reasoning on their behalf. However, current evaluations of +LLMs primarily emphasize dialogue simulation while overlooking human behavior +simulation, which is crucial for digital twins. To address this gap, we +introduce BehaviorChain, the first benchmark for evaluating LLMs' ability to +simulate continuous human behavior. 
BehaviorChain comprises diverse, +high-quality, persona-based behavior chains, totaling 15,846 distinct behaviors +across 1,001 unique personas, each with detailed history and profile metadata. +For evaluation, we integrate persona metadata into LLMs and employ them to +iteratively infer contextually appropriate behaviors within dynamic scenarios +provided by BehaviorChain. Comprehensive evaluation results demonstrated that +even state-of-the-art models struggle with accurately simulating continuous +human behavior. -The integration of Large Language Models (LLMs) into healthcare diagnostics -offers a promising avenue for clinical decision-making. This study outlines the -development of a novel method for zero-shot/few-shot in-context learning (ICL) -by integrating medical domain knowledge using a multi-layered structured -prompt. We also explore the efficacy of two communication styles between the -user and LLMs: the Numerical Conversational (NC) style, which processes data -incrementally, and the Natural Language Single-Turn (NL-ST) style, which -employs long narrative prompts. - Our study systematically evaluates the diagnostic accuracy and risk factors, -including gender bias and false negative rates, using a dataset of 920 patient -records in various few-shot scenarios. Results indicate that traditional -clinical machine learning (ML) models generally outperform LLMs in zero-shot -and few-shot settings. However, the performance gap narrows significantly when -employing few-shot examples alongside effective explainable AI (XAI) methods as -sources of domain knowledge. Moreover, with sufficient time and an increased -number of examples, the conversational style (NC) nearly matches the -performance of ML models. Most notably, LLMs demonstrate comparable or superior -cost-sensitive accuracy relative to ML models. 
- This research confirms that, with appropriate domain knowledge and tailored -communication strategies, LLMs can significantly enhance diagnostic processes. -The findings highlight the importance of optimizing the number of training -examples and communication styles to improve accuracy and reduce biases in LLM -applications. +摘要:最近,LLM 在各個學科中備受關注,因為它們具有作為人類數位雙胞胎的潛力,也就是虛擬代理人,旨在複製個人並自主執行任務,例如代表他們進行決策、解決問題和推理。然而,LLM 目前的評估主要強調對話模擬,同時忽視了人類行為模擬,這對數位雙胞胎至關重要。為了解決這個差距,我們引入了 BehaviorChain,這是第一個用於評估 LLM 模擬連續人類行為能力的基準。BehaviorChain 包含多樣化、高品質、基於角色的行為鏈,總共涵蓋 1,001 個獨特角色的 15,846 種不同行為,每個角色都有詳細的歷史和個人資料元數據。在評估中,我們將角色元數據整合到 LLM 中,並使用它們在 BehaviorChain 提供的動態場景中反覆推斷出在情境中適當的行為。全面的評估結果表明,即使是最先進的模型在準確模擬連續人類行為方面也存在困難。 -摘要:大型語言模型 (LLM) 與醫療診斷整合 -為臨床決策提供了一個有前景的途徑。本研究概述了一種新穎方法的開發,用於零次學習/少量學習情境學習 (ICL),方法是使用多層結構化提示整合醫療領域知識。我們還探討了使用者與 LLM 之間兩種溝通方式的功效:數值對話 (NC) 方式,它會逐步處理資料,以及自然語言單回合 (NL-ST) 方式,它會使用長篇敘事提示。 -我們的研究系統性地評估了診斷準確性和風險因子,包括性別偏見和假陰性率,使用了一個包含 920 個患者記錄的資料集,採用各種少量學習情境。結果表明,傳統的臨床機器學習 (ML) 模型通常在零次學習和少量學習設定中表現優於 LLM。然而,當使用少量學習範例以及有效的可解釋 AI (XAI) 方法作為領域知識來源時,效能差距會顯著縮小。此外,隨著時間充足和範例數量增加,對話方式 (NC) 幾乎可以媲美 ML 模型的效能。最值得注意的是,LLM 相對於 ML 模型展現出相當或更佳的成本敏感準確度。 -本研究證實,透過適當的領域知識和量身打造的溝通策略,LLM 可以顯著增強診斷程序。這些發現突顯了最佳化訓練範例數量和溝通方式的重要性,以提高準確度並減少 LLM 應用中的偏差。 +##### **NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization** +2502.14638v1 by Zheyuan Zhang, Runze Li, Tasnim Kabir, Jordan Boyd-Graber -##### **To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems** -2405.05766v1 by Miquel Miró-Nicolau, Gabriel Moyà-Alcover, Antoni Jaume-i-Capó, Manuel González-Hidalgo, Maria Gemma Sempere Campello, Juan Antonio Palmer Sancho +Image geo-localization is the task of predicting the specific location of an +image and requires complex reasoning across visual, geographical, and cultural +contexts. While prior Vision Language Models (VLMs) have the best accuracy at +this task, there is a dearth of high-quality datasets and models for analytical +reasoning. 
We first create NaviClues, a high-quality dataset derived from +GeoGuessr, a popular geography game, to supply examples of expert reasoning +from language. Using this dataset, we present Navig, a comprehensive image +geo-localization framework integrating global and fine-grained image +information. By reasoning with language, Navig reduces the average distance +error by 14% compared to previous state-of-the-art models while requiring fewer +than 1000 training samples. Our dataset and code are available at +https://github.com/SparrowZheyuan18/Navig/. -The increasing reliance on Deep Learning models, combined with their inherent -lack of transparency, has spurred the development of a novel field of study -known as eXplainable AI (XAI) methods. These methods seek to enhance the trust -of end-users in automated systems by providing insights into the rationale -behind their decisions. This paper presents a novel approach for measuring user -trust in XAI systems, allowing their refinement. Our proposed metric combines -both performance metrics and trust indicators from an objective perspective. To -validate this novel methodology, we conducted a case study in a realistic -medical scenario: the usage of XAI system for the detection of pneumonia from -x-ray images. 
+摘要:影像地理定位是預測影像特定位置的任務,需要跨視覺、地理和文化脈絡進行複雜的推理。雖然先前的視覺語言模型 (VLM) 在此任務中擁有最佳準確度,但缺乏高品質的資料集和分析推理模型。我們首先建立 NaviClues,這是一個源自 GeoGuessr 的高品質資料集,GeoGuessr 是一款流行的地理遊戲,可提供來自語言的專家推理範例。使用此資料集,我們提出 Navig,這是一個綜合性的影像地理定位架構,整合了全球和細緻的影像資訊。透過語言推理,Navig 將平均距離誤差減少了 14%,與先前的最先進模型相比,同時只需要不到 1000 個訓練樣本。我們的資料集和程式碼可在 https://github.com/SparrowZheyuan18/Navig/ 取得。 -摘要:隨著對深度學習模型依賴性的增加,加上其固有的透明度不足,促使一個新的研究領域發展,稱為可解釋 AI (XAI) 方法。這些方法旨在透過深入了解決策背後的原理,來提升最終使用者對自動化系統的信賴。本文提出了一種衡量使用者對 XAI 系統信賴度的新穎方法,允許對其進行改進。我們提出的指標結合了客觀觀點下的效能指標和信賴指標。為了驗證這個新穎的方法,我們在一個真實的醫療場景中進行了一個案例研究:使用 XAI 系統從 X 光影像中偵測肺炎。 +##### **ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation** +2502.14637v1 by Angxiao Yue, Zichong Wang, Hongteng Xu -##### **Region-specific Risk Quantification for Interpretable Prognosis of COVID-19** -2405.02815v1 by Zhusi Zhong, Jie Li, Zhuoqi Ma, Scott Collins, Harrison Bai, Paul Zhang, Terrance Healey, Xinbo Gao, Michael K. Atalay, Zhicheng Jiao +Protein backbone generation plays a central role in de novo protein design +and is significant for many biological and medical applications. Although +diffusion and flow-based generative models provide potential solutions to this +challenging task, they often generate proteins with undesired designability and +suffer computational inefficiency. In this study, we propose a novel rectified +quaternion flow (ReQFlow) matching method for fast and high-quality protein +backbone generation. In particular, our method generates a local translation +and a 3D rotation from random noise for each residue in a protein chain, which +represents each 3D rotation as a unit quaternion and constructs its flow by +spherical linear interpolation (SLERP) in an exponential format. We train the +model by quaternion flow (QFlow) matching with guaranteed numerical stability +and rectify the QFlow model to accelerate its inference and improve the +designability of generated protein backbones, leading to the proposed ReQFlow +model. 
Experiments show that ReQFlow achieves state-of-the-art performance in +protein backbone generation while requiring much fewer sampling steps and +significantly less inference time (e.g., being 37x faster than RFDiffusion and +62x faster than Genie2 when generating a backbone of length 300), demonstrating +its effectiveness and efficiency. The code is available at +https://github.com/AngxiaoYue/ReQFlow. -The COVID-19 pandemic has strained global public health, necessitating -accurate diagnosis and intervention to control disease spread and reduce -mortality rates. This paper introduces an interpretable deep survival -prediction model designed specifically for improved understanding and trust in -COVID-19 prognosis using chest X-ray (CXR) images. By integrating a large-scale -pretrained image encoder, Risk-specific Grad-CAM, and anatomical region -detection techniques, our approach produces regional interpretable outcomes -that effectively capture essential disease features while focusing on rare but -critical abnormal regions. Our model's predictive results provide enhanced -clarity and transparency through risk area localization, enabling clinicians to -make informed decisions regarding COVID-19 diagnosis with better understanding -of prognostic insights. We evaluate the proposed method on a multi-center -survival dataset and demonstrate its effectiveness via quantitative and -qualitative assessments, achieving superior C-indexes (0.764 and 0.727) and -time-dependent AUCs (0.799 and 0.691). These results suggest that our -explainable deep survival prediction model surpasses traditional survival -analysis methods in risk prediction, improving interpretability for clinical -decision making and enhancing AI system trustworthiness. 
+摘要:蛋白骨架生成在從頭蛋白質設計中扮演核心角色,且對於許多生物和醫學應用來說意義重大。儘管擴散和基於流的生成模型提供了解決此項挑戰性任務的潛在方案,但它們經常生成具有不受歡迎的可設計性的蛋白質,且遭受運算效率不彰之苦。在本研究中,我們提出了一種新穎的修正四元數流 (ReQFlow) 匹配方法,用於快速且高品質的蛋白質骨架生成。特別是,我們的模型會為蛋白質鏈中的每個殘基從隨機雜訊中生成一個局部平移和一個 3D 旋轉,將每個 3D 旋轉表示為單位四元數,並以指數格式透過球面線性插值 (SLERP) 建構其流。我們透過四元數流 (QFlow) 匹配訓練模型,並保證數值穩定性,並修正 QFlow 模型以加速其推論並改善生成蛋白質骨架的可設計性,進而提出建議的 ReQFlow 模型。實驗顯示,ReQFlow 在蛋白質骨架生成中達成最先進的效能,同時所需採樣步驟少得多,且推論時間大幅減少(例如,在生成長度為 300 的骨架時比 RFDiffusion 快 37 倍,比 Genie2 快 62 倍),證明其有效性和效率。程式碼可在 https://github.com/AngxiaoYue/ReQFlow 取得。 -摘要:COVID-19 疫情對全球公共衛生造成壓力,必須進行準確的診斷和干預,以控制疾病傳播並降低死亡率。本文介紹了一個可解釋的深度生存預測模型,專門設計用於透過胸部 X 光 (CXR) 影像改善對 COVID-19 預後的理解和信賴。透過整合大規模預訓練影像編碼器、風險特定 Grad-CAM 和解剖區域偵測技術,我們的做法產生區域可解釋的結果,有效捕捉必要的疾病特徵,同時專注於罕見但關鍵的異常區域。我們的模型預測結果透過風險區域定位提供增強的清晰度和透明度,讓臨床醫生能夠在更了解預後見解的情況下,就 COVID-19 診斷做出明智的決策。我們在多中心生存資料集上評估所提出的方法,並透過量化和質化評估證明其有效性,達到優異的 C 指數(0.764 和 0.727)和時間相關 AUC(0.799 和 0.691)。這些結果表明,我們可解釋的深度生存預測模型在風險預測方面超越傳統的生存分析方法,提升臨床決策的解釋性,並增強 AI 系統的信賴度。 +##### **PEARL: Towards Permutation-Resilient LLMs** +2502.14628v1 by Liang Chen, Li Shen, Yang Deng, Xiaoyan Zhao, Bin Liang, Kam-Fai Wong -##### **Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics** -2405.02334v2 by Francesco Prinzi, Carmelo Militello, Calogero Zarcaro, Tommaso Vincenzo Bartolotta, Salvatore Gaglio, Salvatore Vitabile +The in-context learning (ICL) capability of large language models (LLMs) +enables them to perform challenging tasks using provided demonstrations. +However, ICL is highly sensitive to the ordering of demonstrations, leading to +instability in predictions. This paper shows that this vulnerability can be +exploited to design a natural attack - difficult for model providers to detect +- that achieves nearly 80% success rate on LLaMA-3 by simply permuting the +demonstrations. 
Existing mitigation methods primarily rely on post-processing +and fail to enhance the model's inherent robustness to input permutations, +raising concerns about safety and reliability of LLMs. To address this issue, +we propose Permutation-resilient learning (PEARL), a novel framework based on +distributionally robust optimization (DRO), which optimizes model performance +against the worst-case input permutation. Specifically, PEARL consists of a +permutation-proposal network (P-Net) and the LLM. The P-Net generates the most +challenging permutations by treating it as an optimal transport problem, which +is solved using an entropy-constrained Sinkhorn algorithm. Through minimax +optimization, the P-Net and the LLM iteratively optimize against each other, +progressively improving the LLM's robustness. Experiments on synthetic +pre-training and real-world instruction tuning tasks demonstrate that PEARL +effectively mitigates permutation attacks and enhances performance. Notably, +despite being trained on fewer shots and shorter contexts, PEARL achieves +performance gains of up to 40% when scaled to many-shot and long-context +scenarios, highlighting its efficiency and generalization capabilities. -In recent years, machine learning-based clinical decision support systems -(CDSS) have played a key role in the analysis of several medical conditions. -Despite their promising capabilities, the lack of transparency in AI models -poses significant challenges, particularly in medical contexts where -reliability is a mandatory aspect. However, it appears that explainability is -inversely proportional to accuracy. For this reason, achieving transparency -without compromising predictive accuracy remains a key challenge. This paper -presents a novel method, namely Rad4XCNN, to enhance the predictive power of -CNN-derived features with the inherent interpretability of radiomic features. 
-Rad4XCNN diverges from conventional methods based on saliency maps, by -associating intelligible meaning to CNN-derived features by means of Radiomics, -offering new perspectives on explanation methods beyond visualization maps. -Using a breast cancer classification task as a case study, we evaluated -Rad4XCNN on ultrasound imaging datasets, including an online dataset and two -in-house datasets for internal and external validation. Some key results are: -i) CNN-derived features guarantee more robust accuracy when compared against -ViT-derived and radiomic features; ii) conventional visualization map methods -for explanation present several pitfalls; iii) Rad4XCNN does not sacrifice -model accuracy for their explainability; iv) Rad4XCNN provides a global -explanation enabling the physician to extract global insights and findings. Our -method can mitigate some concerns related to the explainability-accuracy -trade-off. This study highlighted the importance of proposing new methods for -model explanation without affecting their accuracy. +摘要:大型語言模型 (LLM) 的語境學習 (ICL) 能力使其能夠透過提供的示範來執行具有挑戰性的任務。然而,ICL 對示範的排序非常敏感,導致預測不穩定。本文顯示,可以利用此漏洞來設計一種自然攻擊,讓模型提供者難以偵測,透過簡單地排列示範,在 LLaMA-3 上達到近 80% 的成功率。現有的緩解方法主要依賴後處理,且無法增強模型對輸入排列的固有穩健性,引發了對 LLM 的安全性與可靠性的疑慮。為了解決此問題,我們提出了一種基於分配穩健最佳化 (DRO) 的新型架構,稱為排列彈性學習 (PEARL),它針對最差情況的輸入排列來最佳化模型效能。具體來說,PEARL 包含排列建議網路 (P-Net) 和 LLM。P-Net 將其視為最優傳輸問題來產生最具挑戰性的排列,並使用熵約束 Sinkhorn 演算法來解決。透過極小極大最佳化,P-Net 和 LLM 迭代地相互最佳化,逐步改善 LLM 的穩健性。在合成預訓練和真實世界指令調整任務上的實驗證明,PEARL 有效地減輕了排列攻擊並增強了效能。值得注意的是,儘管在較少的次數和較短的語境中進行訓練,但 PEARL 在擴展到多重次數和長語境場景時仍可獲得高達 40% 的效能提升,突顯了其效率和泛化能力。 + +##### **ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors** +2502.14627v1 by Yuguo Yin, Yuxin Xie, Wenyuan Yang, Dongchao Yang, Jinghan Ru, Xianwei Zhuang, Liming Liang, Yuexian Zou + +Multilingual audio-text retrieval (ML-ATR) is a challenging task that aims to +retrieve audio clips or multilingual texts from databases. 
However, existing +ML-ATR schemes suffer from inconsistencies for instance similarity matching +across languages. We theoretically analyze the inconsistency in terms of both +multilingual modal alignment direction error and weight error, and propose the +theoretical weight error upper bound for quantifying the inconsistency. Based +on the analysis of the weight error upper bound, we find that the inconsistency +problem stems from the data distribution error caused by random sampling of +languages. We propose a consistent ML-ATR scheme using 1-to-k contrastive +learning and audio-English co-anchor contrastive learning, aiming to mitigate +the negative impact of data distribution error on recall and consistency in +ML-ATR. Experimental results on the translated AudioCaps and Clotho datasets +show that our scheme achieves state-of-the-art performance on recall and +consistency metrics for eight mainstream languages, including English. Our code +will be available at https://github.com/ATRI-ACL/ATRI-ACL. 
-摘要:近年来,基于机器学习的临床决策支持系统 (CDSS) 在多种疾病的分析中扮演了关键角色。尽管它们具有广阔的前景,但 AI 模型缺乏透明度,尤其在医疗领域,可靠性是强制性方面,这带来了重大挑战。然而,解释性似乎与准确性成反比。因此,在不影响预测准确性的情况下实现透明度仍然是一个关键挑战。本文提出了一种新方法,即 Rad4XCNN,以通过放射组学的内在可解释性来增强 CNN 衍生特征的预测能力。Rad4XCNN 通过放射组学将可理解的含义与 CNN 衍生特征关联起来,从而偏离了基于显着性图的传统方法,为超越可视化图的解释方法提供了新的视角。使用乳腺癌分类任务作为案例研究,我们在超声成像数据集上评估了 Rad4XCNN,包括一个在线数据集和两个用于内部和外部验证的内部数据集。一些关键结果是:i) 与 ViT 衍生和放射组学特征相比,CNN 衍生特征保证了更稳健的准确性;ii) 用于解释的传统可视化图方法存在一些缺陷;iii) Rad4XCNN 不会为了可解释性而牺牲模型准确性;iv) Rad4XCNN 提供全局解释,使医生能够提取全局见解和发现。我们的方法可以减轻一些与可解释性-准确性权衡相关的担忧。本研究强调了提出新方法来解释模型而不影响其准确性的重要性。 +摘要:多模態多語言音訊文字檢索 (ML-ATR) 是一項具有挑戰性的任務,旨在從資料庫中檢索音訊片段或多語言文字。然而,現有的 ML-ATR 架構存在不一致的情況,例如跨語言的相似性比對。我們在理論上分析了不一致性,包括多模態多語言對齊方向誤差和權重誤差,並提出理論權重誤差上限以量化不一致性。根據權重誤差上限的分析,我們發現不一致性問題源於由語言隨機取樣造成的資料分佈誤差。我們提出一個一致的 ML-ATR 架構,採用 1 對 k 對比學習和音訊-英語共同錨點對比學習,旨在減輕資料分佈誤差對 ML-ATR 中召回率和一致性的負面影響。在已翻譯的 AudioCaps 和 Clotho 資料集上的實驗結果顯示,我們的架構在包括英語在內的八種主流語言的召回率和一致性指標上達到了最先進的效能。我們的程式碼將在 https://github.com/ATRI-ACL/ATRI-ACL 中提供。 -##### **Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability** -2404.16957v1 by Yunfei Ge, Quanyan Zhu +##### **Multi-Record Web Page Information Extraction From News Websites** +2502.14625v1 by Alexander Kustenkov, Maksim Varlamov, Alexander Yatskov -The pervasive integration of Artificial Intelligence (AI) has introduced -complex challenges in the responsibility and accountability in the event of -incidents involving AI-enabled systems. The interconnectivity of these systems, -ethical concerns of AI-induced incidents, coupled with uncertainties in AI -technology and the absence of corresponding regulations, have made traditional -responsibility attribution challenging. To this end, this work proposes a -Computational Reflective Equilibrium (CRE) approach to establish a coherent and -ethically acceptable responsibility attribution framework for all stakeholders. 
-The computational approach provides a structured analysis that overcomes the -limitations of conceptual approaches in dealing with dynamic and multifaceted -scenarios, showcasing the framework's explainability, coherence, and adaptivity -properties in the responsibility attribution process. We examine the pivotal -role of the initial activation level associated with claims in equilibrium -computation. Using an AI-assisted medical decision-support system as a case -study, we illustrate how different initializations lead to diverse -responsibility distributions. The framework offers valuable insights into -accountability in AI-induced incidents, facilitating the development of a -sustainable and resilient system through continuous monitoring, revision, and -reflection. +In this paper, we focused on the problem of extracting information from web +pages containing many records, a task of growing importance in the era of +massive web data. Recently, the development of neural network methods has +improved the quality of information extraction from web pages. Nevertheless, +most of the research and datasets are aimed at studying detailed pages. This +has left multi-record "list pages" relatively understudied, despite their +widespread presence and practical significance. + To address this gap, we created a large-scale, open-access dataset +specifically designed for list pages. This is the first dataset for this task +in the Russian language. Our dataset contains 13,120 web pages with news lists, +significantly exceeding existing datasets in both scale and complexity. Our +dataset contains attributes of various types, including optional and +multi-valued, providing a realistic representation of real-world list pages. +These features make our dataset a valuable resource for studying information +extraction from pages containing many records. + Furthermore, we proposed our own multi-stage information extraction methods. 
+In this work, we explore and demonstrate several strategies for applying +MarkupLM to the specific challenges of multi-record web pages. Our experiments +validate the advantages of our methods. + By releasing our dataset to the public, we aim to advance the field of +information extraction from multi-record pages. -摘要:隨著人工智慧 (AI) 的普及整合,在涉及 AI 驅動系統的事故中,責任和義務歸屬產生了複雜的挑戰。這些系統的互連性、AI 引發事故的倫理問題,加上 AI 技術的不確定性和缺乏相應法規,使得傳統責任歸屬面臨挑戰。為此,本研究提出了一種計算反思均衡 (CRE) 方法,以建立一個連貫且在倫理上可接受的責任歸屬架構,適用於所有利害關係人。計算方法提供了結構化的分析,克服了概念方法在處理動態且多面向情境時的限制,展示了該架構在責任歸屬過程中具備的可解釋性、連貫性和適應性。我們探討了與均衡計算中索賠相關的初始啟動層級的關鍵作用。我們以 AI 輔助醫療決策支援系統為案例研究,說明不同的初始化如何導致不同的責任分配。該架構提供了對 AI 引發事故中問責制的寶貴見解,透過持續監控、修訂和反思,促進了永續且有韌性的系統發展。 +摘要:在本文中,我們專注於從包含大量記錄的網頁中提取資訊的問題,這項任務在海量網路資料的時代中越來越重要。最近,神經網路方法的發展已改善從網頁中提取資訊的品質。儘管如此,大多數的研究和資料集都旨在研究詳細的網頁。儘管多記錄「清單網頁」廣泛存在且具有實用意義,但它們相對來說研究較少。 +為了解決這個差距,我們建立了一個專門針對清單網頁設計的大規模、開放存取的資料集。這是俄語中第一個針對此任務的資料集。我們的資料集包含 13,120 個包含新聞清單的網頁,在規模和複雜度上都遠遠超過現有的資料集。我們的資料集包含各種類型的屬性,包括可選和多值,提供真實世界清單網頁的實際表示。這些特點使我們的資料集成為研究從包含大量記錄的網頁中提取資訊的寶貴資源。 +此外,我們提出了我們自己的多階段資訊提取方法。在這項工作中,我們探討並展示了將 MarkupLM 應用於多記錄網頁特定挑戰的幾種策略。我們的實驗驗證了我們方法的優點。 +透過向公眾發布我們的資料集,我們旨在推進從多記錄網頁中提取資訊的領域。 -##### **Explainable AI for Fair Sepsis Mortality Predictive Model** -2404.13139v1 by Chia-Hsuan Chang, Xiaoyang Wang, Christopher C. Yang +##### **Exploring RWKV for Sentence Embeddings: Layer-wise Analysis and Baseline Comparison for Semantic Similarity** +2502.14620v1 by Xinghan Pan -Artificial intelligence supports healthcare professionals with predictive -modeling, greatly transforming clinical decision-making. This study addresses -the crucial need for fairness and explainability in AI applications within -healthcare to ensure equitable outcomes across diverse patient demographics. By -focusing on the predictive modeling of sepsis-related mortality, we propose a -method that learns a performance-optimized predictive model and then employs -the transfer learning process to produce a model with better fairness. 
Our -method also introduces a novel permutation-based feature importance algorithm -aiming at elucidating the contribution of each feature in enhancing fairness on -predictions. Unlike existing explainability methods concentrating on explaining -feature contribution to predictive performance, our proposed method uniquely -bridges the gap in understanding how each feature contributes to fairness. This -advancement is pivotal, given sepsis's significant mortality rate and its role -in one-third of hospital deaths. Our method not only aids in identifying and -mitigating biases within the predictive model but also fosters trust among -healthcare stakeholders by improving the transparency and fairness of model -predictions, thereby contributing to more equitable and trustworthy healthcare -delivery. +This paper investigates the efficacy of RWKV, a novel language model +architecture known for its linear attention mechanism, for generating sentence +embeddings in a zero-shot setting. I conduct a layer-wise analysis to evaluate +the semantic similarity captured by embeddings from different hidden layers of +a pre-trained RWKV model. The performance is assessed on the Microsoft Research +Paraphrase Corpus (MRPC) dataset using Spearman correlation and compared +against a GloVe-based baseline. My results indicate that while RWKV embeddings +capture some semantic relatedness, they underperform compared to the GloVe +baseline in terms of Spearman correlation. I also analyze the inference time +and GPU memory usage, highlighting the computational trade-offs associated with +RWKV embeddings. The findings suggest that while RWKV offers potential +advantages in terms of linear scaling, its zero-shot sentence embedding quality +for semantic similarity tasks requires further investigation and potential +task-specific fine-tuning to match or exceed simpler baselines. 
-摘要:人工智慧透過預測模型協助醫療專業人員,大幅轉變了臨床決策制定。本研究探討了在醫療保健中使用人工智慧應用程式時公平性和可解釋性的關鍵需求,以確保在不同的患者人口統計資料中獲得公平的結果。透過專注於敗血症相關死亡率的預測模型,我們提出了一種方法,該方法會學習一個效能最佳化的預測模型,然後採用轉移學習過程來產生一個具有更好公平性的模型。我們的模型還引入了一種新穎的基於排列的特徵重要性演算法,旨在闡明每個特徵在增強預測公平性方面的貢獻。與現有的可解釋性方法專注於解釋特徵對預測效能的貢獻不同,我們提出的方法獨特地彌補了理解每個特徵如何有助於公平性的差距。這項進展至關重要,因為敗血症的死亡率很高,且在三分之一的醫院死亡中扮演著角色。我們的模型不僅有助於識別和減輕預測模型中的偏差,還能透過提高模型預測的透明度和公平性來培養醫療保健利益相關者之間的信任,進而有助於提供更公平且值得信賴的醫療保健服務。 +摘要:本文探討 RWKV 的效能,這是一種以線性注意力機制聞名的語言模型架構,可用於在零次學習設定中產生句子嵌入。我進行逐層分析,以評估預先訓練的 RWKV 模型中不同隱藏層的嵌入所擷取的語義相似性。效能評估使用 Microsoft Research Paraphrase Corpus (MRPC) 資料集,採用 Spearman 相關係數,並與基於 GloVe 的基準進行比較。我的結果顯示,雖然 RWKV 嵌入可以擷取一些語義相關性,但與 GloVe 基準相比,在 Spearman 相關係數方面表現不佳。我也分析了推論時間和 GPU 記憶體使用量,強調與 RWKV 嵌入相關的運算折衷。這些發現表明,雖然 RWKV 在線性縮放方面具有潛在優勢,但其在語義相似性任務中的零次學習句子嵌入品質需要進一步探討,並需要潛在的特定任務微調,才能達到或超越較簡單的基準。 -##### **Multi Class Depression Detection Through Tweets using Artificial Intelligence** -2404.13104v1 by Muhammad Osama Nusrat, Waseem Shahzad, Saad Ahmed Jamal +##### **Reward Models Identify Consistency, Not Causality** +2502.14619v1 by Yuhui Xu, Hanze Dong, Lei Wang, Caiming Xiong, Junnan Li -Depression is a significant issue nowadays. As per the World Health -Organization (WHO), in 2023, over 280 million individuals are grappling with -depression. This is a huge number; if not taken seriously, these numbers will -increase rapidly. About 4.89 billion individuals are social media users. People -express their feelings and emotions on platforms like Twitter, Facebook, -Reddit, Instagram, etc. These platforms contain valuable information which can -be used for research purposes. Considerable research has been conducted across -various social media platforms. However, certain limitations persist in these -endeavors. Particularly, previous studies were only focused on detecting -depression and the intensity of depression in tweets. Also, there existed -inaccuracies in dataset labeling. 
In this research work, five types of -depression (Bipolar, major, psychotic, atypical, and postpartum) were predicted -using tweets from the Twitter database based on lexicon labeling. Explainable -AI was used to provide reasoning by highlighting the parts of tweets that -represent type of depression. Bidirectional Encoder Representations from -Transformers (BERT) was used for feature extraction and training. Machine -learning and deep learning methodologies were used to train the model. The BERT -model presented the most promising results, achieving an overall accuracy of -0.96. +Reward models (RMs) play a crucial role in aligning large language models +(LLMs) with human preferences and enhancing reasoning quality. Traditionally, +RMs are trained to rank candidate outputs based on their correctness and +coherence. However, in this work, we present several surprising findings that +challenge common assumptions about RM behavior. Our analysis reveals that +state-of-the-art reward models prioritize structural consistency over causal +correctness. Specifically, removing the problem statement has minimal impact on +reward scores, whereas altering numerical values or disrupting the reasoning +flow significantly affects RM outputs. Furthermore, RMs exhibit a strong +dependence on complete reasoning trajectories: truncated or incomplete steps +lead to significant variations in reward assignments, indicating that RMs +primarily rely on learned reasoning patterns rather than explicit problem +comprehension. These findings hold across multiple architectures, datasets, and +tasks, leading to three key insights: (1) RMs primarily assess coherence rather +than true reasoning quality; (2) The role of explicit problem comprehension in +reward assignment is overstated; (3) Current RMs may be more effective at +ranking responses than verifying logical validity.
Our results suggest a +fundamental limitation in existing reward modeling approaches, emphasizing the +need for a shift toward causality-aware reward models that go beyond +consistency-driven evaluation. -摘要:現今,憂鬱症是一個重要的議題。根據世界衛生組織 (WHO) 的資料,在 2023 年,超過 2.8 億人正在與憂鬱症搏鬥。這是一個龐大的數字;如果不認真看待,這些數字將會快速增加。大約有 48.9 億人是社群媒體使用者。人們在 Twitter、Facebook、Reddit、Instagram 等平台上表達自己的感受和情緒。這些平台包含有價值的資訊,可用於研究目的。已經在各種社群媒體平台上進行了大量的研究。然而,這些努力仍存在某些限制。特別是,先前的研究僅專注於偵測推文中的憂鬱症和憂鬱症的強度。此外,資料集標籤中存在不準確的情況。在這項研究工作中,使用基於詞彙標籤的 Twitter 資料庫中的推文預測了五種類型的憂鬱症(雙極型、重度、精神病型、非典型和產後)。可解釋的 AI 用於透過強調代表憂鬱症類型的推文部分來提供推理。從 Transformers(BERT)中提取的雙向編碼器表示用於特徵提取和訓練。機器學習和深度學習方法用於訓練模型。BERT 模型呈現出最有希望的結果,達到 0.96 的整體準確度。 +摘要:獎勵模型 (RM) 在將大型語言模型 (LLM) 與人類偏好對齊並提升推理品質方面扮演至關重要的角色。傳統上,RM 會訓練來根據候選輸出的正確性和一致性進行排名。然而,在這項工作中,我們提出幾個令人驚訝的發現,挑戰了關於 RM 行為的常見假設。我們的分析顯示,最先進的獎勵模型優先考慮結構一致性,而不是因果正確性。具體來說,移除問題陳述對獎勵分數的影響很小,而改變數值或中斷推理流程則會顯著影響 RM 輸出。此外,RM 表現出對完整推理軌跡的強烈依賴性,截斷或不完整的步驟會導致獎勵分配產生重大變化,這表示 RM 主要依賴於學習到的推理模式,而不是明確的問題理解。這些發現適用於多種架構、資料集和任務,得出三個關鍵見解:(1) RM 主要評估一致性,而不是真正的推理品質;(2) 在獎勵分配中,明確問題理解的角色被誇大了;(3) 目前的 RM 在排名回應方面可能比驗證邏輯有效性更有效。我們的結果表明現有獎勵建模方法存在根本限制,強調需要轉向因果感知獎勵模型,超越以一致性為導向的評估。 -##### **COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images** -2404.12832v2 by Dmytro Shvetsov, Joonas Ariva, Marharyta Domnich, Raul Vicente, Dmytro Fishman +##### **FIND: Fine-grained Information Density Guided Adaptive Retrieval-Augmented Generation for Disease Diagnosis** +2502.14614v1 by Mingyi Jia, Junwen Duan, Yan Song, Jianxin Wang -Deep learning is dramatically transforming the field of medical imaging and -radiology, enabling the identification of pathologies in medical images, -including computed tomography (CT) and X-ray scans. However, the performance of -deep learning models, particularly in segmentation tasks, is often limited by -the need for extensive annotated datasets. 
To address this challenge, the -capabilities of weakly supervised semantic segmentation are explored through -the lens of Explainable AI and the generation of counterfactual explanations. -The scope of this research is development of a novel counterfactual inpainting -approach (COIN) that flips the predicted classification label from abnormal to -normal by using a generative model. For instance, if the classifier deems an -input medical image X as abnormal, indicating the presence of a pathology, the -generative model aims to inpaint the abnormal region, thus reversing the -classifier's original prediction label. The approach enables us to produce -precise segmentations for pathologies without depending on pre-existing -segmentation masks. Crucially, image-level labels are utilized, which are -substantially easier to acquire than creating detailed segmentation masks. The -effectiveness of the method is demonstrated by segmenting synthetic targets and -actual kidney tumors from CT images acquired from Tartu University Hospital in -Estonia. The findings indicate that COIN greatly surpasses established -attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an -alternative counterfactual explanation method introduced by Singla et al. This -evidence suggests that COIN is a promising approach for semantic segmentation -of tumors in CT images, and presents a step forward in making deep learning -applications more accessible and effective in healthcare, where annotated data -is scarce. +Retrieval-Augmented Large Language Models (LLMs), which integrate external +knowledge into LLMs, have shown remarkable performance in various medical +domains, including clinical diagnosis. However, existing RAG methods struggle +to effectively assess task difficulty to make retrieval decisions, thereby +failing to meet the clinical requirements for balancing efficiency and +accuracy. 
So in this paper, we propose FIND (**F**ine-grained +**In**formation **D**ensity Guided Adaptive RAG), a novel framework +that improves the reliability of RAG in disease diagnosis scenarios. FIND +incorporates a fine-grained adaptive control module to determine whether +retrieval is necessary based on the information density of the input. By +optimizing the retrieval process and implementing a knowledge filtering module, +FIND ensures that the retrieval is better suited to clinical scenarios. +Experiments on three Chinese electronic medical record datasets demonstrate +that FIND significantly outperforms various baseline methods, highlighting its +effectiveness in clinical diagnosis tasks. -摘要:深度学习正大幅轉變醫學影像和放射線學領域,能辨識醫學影像中的病理,包括電腦斷層掃描 (CT) 和 X 光掃描。然而,深度學習模型的效能,特別是在分割任務中,常常受到廣泛註解資料集需求的限制。為了應對此挑戰,透過可解釋 AI 和反事實解釋的產生,探索弱監督語意分割的能力。本研究的範圍是開發一種新的反事實內插方法 (COIN),該方法使用生成模型將預測的分類標籤從異常翻轉為正常。例如,如果分類器將輸入的醫學影像 X 視為異常,表示存在病理,則生成模型旨在內插異常區域,從而逆轉分類器的原始預測標籤。此方法使我們能夠產生病理的精確分割,而無需依賴於預先存在的分割遮罩。至關重要的是,利用影像層級標籤,這比建立詳細的分割遮罩容易取得。該方法的有效性透過分割合成目標和從愛沙尼亞塔爾圖大學醫院取得的 CT 影像中的實際腎臟腫瘤來證明。研究結果表明,COIN 遠遠超過已建立的歸因方法,例如 RISE、ScoreCAM 和 LayerCAM,以及 Singla 等人提出的另一種反事實解釋方法。此證據表明,COIN 是一種很有前途的 CT 影像中腫瘤語意分割方法,並在醫療保健中讓深度學習應用更易於取得和更有效率邁進一步,其中註解資料很稀少。 +摘要:檢索增強大型語言模型 (LLM),將外部知識整合至 LLM,已於各種醫療領域展現出卓越效能,包括臨床診斷。然而,現有的 RAG 方法難以有效評估任務難度以做出檢索決策,因此無法滿足平衡效率和精確度的臨床需求。因此,我們在本文中提出 FIND(**F**ine-grained **In**formation **D**ensity Guided Adaptive RAG),一種新穎架構,可提升 RAG 在疾病診斷場景中的可靠性。FIND 整合一個細緻化的自適應控制模組,根據輸入的資訊密度判斷是否需要檢索。透過最佳化檢索程序並實作一個知識過濾模組,FIND 確保檢索更適合臨床場景。在三個中文電子病歷資料集上的實驗顯示,FIND 明顯優於各種基線方法,突顯其在臨床診斷任務中的有效性。 -##### **Hybrid Intelligence for Digital Humanities** -2406.15374v1 by Victor de Boer, Lise Stork +##### **Behavioral Analysis of Information Salience in Large Language Models** +2502.14613v1 by Jan Trienes, Jörg Schlötterer, Junyi Jessy Li, Christin Seifert -In this paper, we explore the synergies between Digital Humanities (DH) as a -discipline and Hybrid Intelligence (HI) as a research paradigm.
In DH research, -the use of digital methods and specifically that of Artificial Intelligence is -subject to a set of requirements and constraints. We argue that these are -well-supported by the capabilities and goals of HI. Our contribution includes -the identification of five such DH requirements: Successful AI systems need to -be able to 1) collaborate with the (human) scholar; 2) support data criticism; -3) support tool criticism; 4) be aware of and cater to various perspectives and -5) support distant and close reading. We take the CARE principles of Hybrid -Intelligence (collaborative, adaptive, responsible and explainable) as -theoretical framework and map these to the DH requirements. In this mapping, we -include example research projects. We finally address how insights from DH can -be applied to HI and discuss open challenges for the combination of the two -disciplines. +Large Language Models (LLMs) excel at text summarization, a task that +requires models to select content based on its importance. However, the exact +notion of salience that LLMs have internalized remains unclear. To bridge this +gap, we introduce an explainable framework to systematically derive and +investigate information salience in LLMs through their summarization behavior. +Using length-controlled summarization as a behavioral probe into the content +selection process, and tracing the answerability of Questions Under Discussion +throughout, we derive a proxy for how models prioritize information. Our +experiments on 13 models across four datasets reveal that LLMs have a nuanced, +hierarchical notion of salience, generally consistent across model families and +sizes. While models show highly consistent behavior and hence salience +patterns, this notion of salience cannot be accessed through introspection, and +only weakly correlates with human perceptions of information salience. 
-摘要:在本文中,我們探討數位人文學科 (DH) 作為一門學科與混合智能 (HI) 作為一個研究典範之間的協同作用。在 DH 研究中,數位方法的使用,特別是人工智慧的使用,受到一系列要求和限制。我們認為這些要求和限制獲得 HI 的能力和目標的充分支持。我們的貢獻包括找出五個這樣的 DH 要求:成功的 AI 系統需要能夠 1) 與(人類)學者合作;2) 支援資料批評;3) 支援工具批評;4) 察覺並迎合各種觀點;5) 支援遠距和近距離閱讀。我們將混合智能的 CARE 原則(協作、適應、負責和可解釋)作為理論架構,並將這些原則對應到 DH 要求。在此對應中,我們納入範例研究專案。最後,我們探討如何將 DH 的見解應用於 HI,並討論結合這兩個學科的開放挑戰。 +摘要:大型語言模型 (LLM) 在文字摘要方面表現出色,這項任務需要模型根據重要性來選擇內容。然而,LLM 內化的顯著性準確概念仍不清楚。為了彌補這個差距,我們引入了一個可解釋的架構,透過摘要行為系統性地推導和調查 LLM 中的資訊顯著性。使用長度控制摘要作為行為探測來探討內容選擇過程,並追蹤討論中問題的可回答性,我們推導出一個模型優先處理資訊的方式代理。我們針對四個資料集中的 13 個模型進行的實驗揭示,LLM 具有細緻入微、階層式的顯著性概念,通常在模型系列和大小之間保持一致。雖然模型表現出高度一致的行為,因此具有顯著性模式,但這個顯著性概念無法透過內省來存取,而且與人類對資訊顯著性的認知僅有微弱相關性。 -##### **Ethical Framework for Responsible Foundational Models in Medical Imaging** -2406.11868v1 by Abhijit Das, Debesh Jha, Jasmer Sanjotra, Onkar Susladkar, Suramyaa Sarkar, Ashish Rauniyar, Nikhil Tomar, Vanshali Sharma, Ulas Bagci +##### **A Theory for Conditional Generative Modeling on Multiple Data Sources** +2502.14583v1 by Rongzhen Wang, Yan Zhang, Chenyu Zheng, Chongxuan Li, Guoqiang Wu -Foundational models (FMs) have tremendous potential to revolutionize medical -imaging. However, their deployment in real-world clinical settings demands -extensive ethical considerations. This paper aims to highlight the ethical -concerns related to FMs and propose a framework to guide their responsible -development and implementation within medicine. We meticulously examine ethical -issues such as privacy of patient data, bias mitigation, algorithmic -transparency, explainability and accountability. The proposed framework is -designed to prioritize patient welfare, mitigate potential risks, and foster -trust in AI-assisted healthcare. +The success of large generative models has driven a paradigm shift, +leveraging massive multi-source data to enhance model capabilities. However, +the interaction among these sources remains theoretically underexplored. 
This +paper takes the first step toward a rigorous analysis of multi-source training +in conditional generative modeling, where each condition represents a distinct +data source. Specifically, we establish a general distribution estimation error +bound in average total variation distance for conditional maximum likelihood +estimation based on the bracketing number. Our result shows that when source +distributions share certain similarities and the model is expressive enough, +multi-source training guarantees a sharper bound than single-source training. +We further instantiate the general theory on conditional Gaussian estimation +and deep generative models including autoregressive and flexible energy-based +models, by characterizing their bracketing numbers. The results highlight that +the number of sources and similarity among source distributions improve the +advantage of multi-source training. Simulations and real-world experiments +validate our theory. Code is available at: +https://github.com/ML-GSAI/Multi-Source-GM.
-摘要:基礎模型 (FM) 具有徹底改變醫學影像的巨大潛力。然而,它們在現實世界臨床環境中的部署需要廣泛的倫理考量。本文旨在強調與 FM 相關的倫理問題,並提出一個框架來指導它們在醫學中的負責任開發和實施。我們仔細審查了倫理問題,例如患者數據隱私、偏差緩解、演算法透明度、可解釋性和問責制。所提出的框架旨在優先考慮患者福利、減輕潛在風險,並培養對 AI 輔助醫療保健的信任。 +摘要:大型生成模型的成功推動了範例轉移,利用大量多來源資料來增強模型功能。然而,這些來源之間的互動在理論上仍未得到充分探討。本文踏出了嚴謹分析條件生成模型中多來源訓練的第一步,其中每個條件代表一個不同的資料來源。具體來說,我們建立了一個基於括號數的條件最大似然估計的平均總變異距離中的通用分佈估計誤差界限。我們的結果表明,當來源分佈具有一定的相似性且模型具有足夠的表達力時,多來源訓練保證了比單來源訓練更嚴格的界限。我們進一步在條件高斯估計和深度生成模型(包括自迴歸和靈活的基於能量的模型)上例證了通用理論,通過表徵它們的括號數。結果強調了來源數和來源分佈之間的相似性提高了多來源訓練的優勢。模擬和真實世界的實驗驗證了我們的理論。程式碼可在以下網址取得:\url{https://github.com/ML-GSAI/Multi-Source-GM}。 -##### **Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis** -2404.07239v1 by Milad Yousefi, Shadi Farabi Maleki, Ali Jafarizadeh, Mahya Ahmadpour Youshanlui, Aida Jafari, Siamak Pedrammehr, Roohallah Alizadehsani, Ryszard Tadeusiewicz, Pawel Plawiak +##### **A Statistical Case Against Empirical Human-AI Alignment** +2502.14581v1 by Julian Rodemann, Esteban Garces Arias, Christoph Luther, Christoph Jansen, Thomas Augustin -Thyroid cancer is an increasing global health concern that requires advanced -diagnostic methods. The application of AI and radiomics to thyroid cancer -diagnosis is examined in this review. A review of multiple databases was -conducted in compliance with PRISMA guidelines until October 2023. A -combination of keywords led to the discovery of an English academic publication -on thyroid cancer and related subjects. 267 papers were returned from the -original search after 109 duplicates were removed. Relevant studies were -selected according to predetermined criteria after 124 articles were eliminated -based on an examination of their abstract and title. After the comprehensive -analysis, an additional six studies were excluded. Among the 28 included -studies, radiomics analysis, which incorporates ultrasound (US) images, -demonstrated its effectiveness in diagnosing thyroid cancer. 
Various results -were noted, some of the studies presenting new strategies that outperformed the -status quo. The literature has emphasized various challenges faced by AI -models, including interpretability issues, dataset constraints, and operator -dependence. The synthesized findings of the 28 included studies mentioned the -need for standardization efforts and prospective multicenter studies to address -these concerns. Furthermore, approaches to overcome these obstacles were -identified, such as advances in explainable AI technology and personalized -medicine techniques. The review focuses on how AI and radiomics could transform -the diagnosis and treatment of thyroid cancer. Despite challenges, future -research on multidisciplinary cooperation, clinical applicability validation, -and algorithm improvement holds the potential to improve patient outcomes and -diagnostic precision in the treatment of thyroid cancer. +Empirical human-AI alignment aims to make AI systems act in line with +observed human behavior. While noble in its goals, we argue that empirical +alignment can inadvertently introduce statistical biases that warrant caution. +This position paper thus advocates against naive empirical alignment, offering +prescriptive alignment and a posteriori empirical alignment as alternatives. We +substantiate our principled argument by tangible examples like human-centric +decoding of language models. 
-摘要:甲狀腺癌是一種日益嚴重的全球健康問題,需要先進的診斷方法。本篇評論探討了人工智能與放射特徵分析在甲狀腺癌診斷中的應用。在符合 PRISMA 指南的情況下,對多個資料庫進行了回顧,直到 2023 年 10 月。通過結合關鍵字,發現了一篇關於甲狀腺癌和相關主題的英文學術出版物。在移除 109 篇重複文獻後,原始搜尋共回傳 267 篇論文。在根據預先確定的標準,淘汰了 124 篇文章的摘要和標題後,選出了相關研究。在進行全面分析後,額外排除了六項研究。在納入的 28 項研究中,結合超音波 (US) 影像的放射特徵分析,證明了其在診斷甲狀腺癌方面的有效性。研究結果不一,有些研究提出了優於現狀的新策略。文獻強調了人工智能模型面臨的各種挑戰,包括可解釋性問題、資料集限制和操作員依賴性。28 項納入研究的綜合發現提到,需要標準化工作和前瞻性多中心研究來解決這些問題。此外,還確定了克服這些障礙的方法,例如可解釋人工智能技術和個人化醫療技術的進步。本篇評論重點探討了人工智能和放射特徵分析如何轉變甲狀腺癌的診斷和治療。儘管存在挑戰,但未來對多學科合作、臨床適用性驗證和演算法改進的研究,仍有潛力改善甲狀腺癌治療中的患者預後和診斷精準度。 +摘要:經驗主義的人工智慧校準旨在使人工智慧系統根據觀察到的人類行為採取行動。儘管目標崇高,我們認為經驗主義校準可能會無意中引入需要謹慎對待的統計偏差。因此,本立場文件主張反對天真的經驗主義校準,提供規範性校準和後驗經驗主義校準作為替代方案。我們以具體的例子(例如以人為中心的語言模型解碼)來證明我們的原則性論點。 -##### **Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI** -2404.04686v1 by Taminul Islam, Md. Alif Sheakh, Mst. Sazia Tahosin, Most. Hasna Hena, Shopnil Akash, Yousef A. Bin Jardan, Gezahign Fentahun Wondmie, Hiba-Allah Nafidi, Mohammed Bourhia +##### **ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification** +2502.14565v1 by Hyunseok Lee, Seunghyuk Oh, Jaehyung Kim, Jinwoo Shin, Jihoon Tack -Breast cancer has rapidly increased in prevalence in recent years, making it -one of the leading causes of mortality worldwide. Among all cancers, it is by -far the most common. Diagnosing this illness manually requires significant time -and expertise. Since detecting breast cancer is a time-consuming process, -preventing its further spread can be aided by creating machine-based forecasts. -Machine learning and Explainable AI are crucial in classification as they not -only provide accurate predictions but also offer insights into how the model -arrives at its decisions, aiding in the understanding and trustworthiness of -the classification results. 
In this study, we evaluate and compare the -classification accuracy, precision, recall, and F-1 scores of five different -machine learning methods using a primary dataset (500 patients from Dhaka -Medical College Hospital). Five different supervised machine learning -techniques, including decision tree, random forest, logistic regression, naive -bayes, and XGBoost, have been used to achieve optimal results on our dataset. -Additionally, this study applied SHAP analysis to the XGBoost model to -interpret the model's predictions and understand the impact of each feature on -the model's output. We compared the accuracy with which several algorithms -classified the data, as well as contrasted with other literature in this field. -After final evaluation, this study found that XGBoost achieved the best model -accuracy, which is 97%. +Self-awareness, i.e., the ability to assess and correct one's own generation, +is a fundamental aspect of human intelligence, making its replication in large +language models (LLMs) an important yet challenging task. Previous works tackle +this by employing extensive reinforcement learning or rather relying on large +external verifiers. In this work, we propose Refine via Intrinsic +Self-Verification (ReVISE), an efficient and effective framework that enables +LLMs to self-correct their outputs through self-verification. The core idea of +ReVISE is to enable LLMs to verify their reasoning processes and continually +rethink reasoning trajectories based on its verification. We introduce a +structured curriculum based upon online preference learning to implement this +efficiently. Specifically, as ReVISE involves two challenging tasks (i.e., +self-verification and reasoning correction), we tackle each task sequentially +using curriculum learning, collecting both failed and successful reasoning +paths to construct preference pairs for efficient training. 
During inference, +our approach enjoys natural test-time scaling by integrating self-verification +and correction capabilities, further enhanced by our proposed confidence-aware +decoding mechanism. Our experiments on various reasoning tasks demonstrate that +ReVISE achieves efficient self-correction and significantly improves reasoning +performance. -摘要:近年來,乳癌的盛行率迅速增加,使其成為全球主要的死亡原因之一。在所有癌症中,乳癌迄今為止是最常見的。手動診斷此疾病需要大量的時間和專業知識。由於乳癌的檢測過程耗時,因此透過建立機器學習模型來預測,有助於防止其進一步擴散。機器學習和可解釋 AI 在分類中至關重要,因為它們不僅可以提供準確的預測,還可以深入了解模型如何做出決策,有助於理解和信賴分類結果。在此研究中,我們評估並比較了五種不同的機器學習方法的分類準確度、精確度、召回率和 F1 分數,使用了一個主要的資料集(達卡醫學院醫院的 500 名患者)。五種不同的監督式機器學習技術,包括決策樹、隨機森林、邏輯迴歸、朴素貝氏和 XGBoost,已用於在我們的資料集上取得最佳結果。此外,本研究將 SHAP 分析應用於 XGBoost 模型,以解釋模型的預測並了解每個特徵對模型輸出的影響。我們比較了幾種演算法對資料進行分類的準確度,並與該領域的其他文獻進行對比。在最後評估後,本研究發現 XGBoost 達到了最佳的模型準確度,為 97%。 +摘要:自我覺察,亦即評估和修正自身產出的能力,是人類智慧的基本面向,使其能在大型語言模型 (LLM) 中複製,是一項重要且具挑戰性的任務。先前的研究透過採用廣泛的強化學習或依賴大型外部驗證器來解決這個問題。在這項研究中,我們提出透過內在自我驗證 (ReVISE) 進行精煉,一個有效率且有效的架構,使 LLM 能透過自我驗證來自我修正其產出。ReVISE 的核心概念是讓 LLM 能驗證其推理過程,並根據驗證結果持續重新思考推理軌跡。我們導入一個建構於線上偏好學習的結構化課程,以有效率地實作這項功能。具體來說,由於 ReVISE 涉及兩項具有挑戰性的任務(即自我驗證和推理修正),我們使用課程學習循序漸進地處理每一項任務,收集失敗和成功的推理路徑,以建構偏好對,進行有效率的訓練。在推論期間,我們的作法透過整合自我驗證和修正功能,享有自然的測試時間擴充,並進一步透過我們提出的具備信心感知的解碼機制進行強化。我們在各種推理任務上的實驗顯示,ReVISE 達到有效率的自我修正,並顯著提升推理效能。 -##### **Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI** -2404.03892v3 by Maryam Ahmed, Tooba Bibi, Rizwan Ahmed Khan, Sidra Nasir +##### **Plan-over-Graph: Towards Parallelable LLM Agent Schedule** +2502.14563v1 by Shiqi Zhang, Xinbei Ma, Zouying Cao, Zhuosheng Zhang, Hai Zhao -The Deep learning (DL) models for diagnosing breast cancer from mammographic -images often operate as "black boxes", making it difficult for healthcare -professionals to trust and understand their decision-making processes. 
The -study presents an integrated framework combining Convolutional Neural Networks -(CNNs) and Explainable Artificial Intelligence (XAI) for the enhanced diagnosis -of breast cancer using the CBIS-DDSM dataset. The methodology encompasses an -elaborate data preprocessing pipeline and advanced data augmentation techniques -to counteract dataset limitations and transfer learning using pre-trained -networks such as VGG-16, Inception-V3 and ResNet was employed. A focal point of -our study is the evaluation of XAI's effectiveness in interpreting model -predictions, highlighted by utilizing the Hausdorff measure to assess the -alignment between AI-generated explanations and expert annotations -quantitatively. This approach is critical for XAI in promoting trustworthiness -and ethical fairness in AI-assisted diagnostics. The findings from our research -illustrate the effective collaboration between CNNs and XAI in advancing -diagnostic methods for breast cancer, thereby facilitating a more seamless -integration of advanced AI technologies within clinical settings. By enhancing -the interpretability of AI driven decisions, this work lays the groundwork for -improved collaboration between AI systems and medical practitioners, ultimately -enriching patient care. Furthermore, the implications of our research extended -well beyond the current methodologies. It encourages further research into how -to combine multimodal data and improve AI explanations to meet the needs of -clinical practice. +Large Language Models (LLMs) have demonstrated exceptional abilities in +reasoning for task planning. However, challenges remain under-explored for +parallel schedules. This paper introduces a novel paradigm, plan-over-graph, in +which the model first decomposes a real-life textual task into executable +subtasks and constructs an abstract task graph. The model then understands this +task graph as input and generates a plan for parallel execution. 
To enhance the +planning capability of complex, scalable graphs, we design an automated and +controllable pipeline to generate synthetic graphs and propose a two-stage +training scheme. Experimental results show that our plan-over-graph method +significantly improves task performance on both API-based LLMs and trainable +open-sourced LLMs. By normalizing complex tasks as graphs, our method naturally +supports parallel execution, demonstrating global efficiency. The code and data +are available at https://github.com/zsq259/Plan-over-Graph. -摘要:深度學習 (DL) 用於從乳房攝影術影像診斷乳癌的模型通常以「黑盒子」方式運作,這使得醫療保健專業人員難以信任和理解其決策過程。本研究提出一個整合架構,結合卷積神經網路 (CNN) 和可解釋人工智慧 (XAI),以使用 CBIS-DDSM 資料集增強乳癌的診斷。方法包含一個精細的資料前處理管線和進階資料擴充技術,以對抗資料集限制,並採用預先訓練的網路(例如 VGG-16、Inception-V3 和 ResNet)進行遷移學習。我們研究的重點是評估 XAI 在解釋模型預測中的有效性,重點利用豪斯多夫測度量化評估 AI 生成的解釋和專家註解之間的一致性。這種方法對於 XAI 在促進 AI 輔助診斷中的可信度和倫理公平性至關重要。我們研究的發現說明了 CNN 和 XAI 在推進乳癌診斷方法中的有效協作,從而促進了先進 AI 技術在臨床環境中的更順暢整合。透過增強 AI 驅動決策的可解釋性,這項工作為 AI 系統和醫療從業人員之間的改善協作奠定了基礎,最終豐富了患者照護。此外,我們研究的影響遠遠超出了目前的技術。它鼓勵進一步研究如何結合多模式資料並改善 AI 解釋,以滿足臨床實務的需求。 +摘要:大型語言模型 (LLM) 已展現出在任務規劃推理方面的非凡能力。然而,對於並行時程表的挑戰仍未充分探討。本文介紹了一個新穎的範例,即圖形規劃,其中模型首先將現實生活中的文字任務分解為可執行的子任務,並建構一個抽象任務圖。然後,模型將此任務圖理解為輸入,並產生一個並行執行的計畫。為了增強複雜、可擴充圖形的規劃能力,我們設計了一個自動化且可控的管道來產生合成圖形,並提出了一個兩階段訓練方案。實驗結果表明,我們的圖形規劃方法顯著提升了基於 API 的 LLM 和可訓練的開源 LLM 的任務效能。透過將複雜任務標準化為圖形,我們的模型自然支援並行執行,展現出整體效率。程式碼和資料可在 https://github.com/zsq259/Plan-over-Graph 取得。 -##### **Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives** -2404.00320v2 by Xingrui Gu, Zhixuan Wang, Irisa Jin, Zekun Wu +##### **Can LLMs Predict Citation Intent? An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs** +2502.14561v1 by Paris Koloveas, Serafeim Chatzopoulos, Thanasis Vergoulis, Christos Tryfonopoulos -This research presents a novel multimodal data fusion methodology for pain -behavior recognition, integrating statistical correlation analysis with -human-centered insights. 
Our approach introduces two key innovations: 1) -integrating data-driven statistical relevance weights into the fusion strategy -to effectively utilize complementary information from heterogeneous modalities, -and 2) incorporating human-centric movement characteristics into multimodal -representation learning for detailed modeling of pain behaviors. Validated -across various deep learning architectures, our method demonstrates superior -performance and broad applicability. We propose a customizable framework that -aligns each modality with a suitable classifier based on statistical -significance, advancing personalized and effective multimodal fusion. -Furthermore, our methodology provides explainable analysis of multimodal data, -contributing to interpretable and explainable AI in healthcare. By highlighting -the importance of data diversity and modality-specific representations, we -enhance traditional fusion techniques and set new standards for recognizing -complex pain behaviors. Our findings have significant implications for -promoting patient-centered healthcare interventions and supporting explainable -clinical decision-making. +This work investigates the ability of open Large Language Models (LLMs) to +predict citation intent through in-context learning and fine-tuning. Unlike +traditional approaches that rely on pre-trained models like SciBERT, which +require extensive domain-specific pretraining and specialized architectures, we +demonstrate that general-purpose LLMs can be adapted to this task with minimal +task-specific data. We evaluate twelve model variations across five prominent +open LLM families using zero, one, few, and many-shot prompting to assess +performance across scenarios. Our experimental study identifies the +top-performing model through extensive experimentation of in-context +learning-related parameters, which we fine-tune to further enhance task +performance. 
The results highlight the strengths and limitations of LLMs in +recognizing citation intents, providing valuable insights for model selection +and prompt engineering. Additionally, we make our end-to-end evaluation +framework and models openly available for future use. -摘要:本研究提出了一種創新的多模態數據融合方法,用於疼痛行為識別,將統計相關分析與以人為中心的見解相結合。我們的做法引入了兩項關鍵創新:1) 將數據驅動的統計相關權重整合到融合策略中,以有效利用來自異質模態的補充信息,以及 2) 將以人為中心的運動特徵納入多模態表示學習中,以詳細建模疼痛行為。我們的模型在各種深度學習架構中得到驗證,展示了卓越的性能和廣泛的適用性。我們提出了一個可自定義的框架,根據統計顯著性將每個模態與合適的分類器對齊,推進個性化和有效的多模態融合。此外,我們的模型提供對多模態數據的可解釋分析,有助於醫療保健中的可解釋和可解釋 AI。通過強調數據多樣性和模態特定表示的重要性,我們增強了傳統的融合技術,並為識別複雜的疼痛行為設定了新的標準。我們的發現對促進以患者為中心的醫療保健干預和支持可解釋的臨床決策制定具有重要意義。 +摘要:本研究探討開放式大型語言模型 (LLM) 透過情境學習和微調來預測引文意圖的能力。與依賴於預訓練模型(例如 SciBERT)的傳統方法不同,後者需要廣泛的特定領域預訓練和專業架構,我們證明了通用 LLM 可以使用最少的特定任務數據來適應此任務。我們使用零次、一次、少次和多次提示評估五個著名的開放式 LLM 家族中的十二個模型變體,以評估不同場景的效能。我們的實驗研究透過廣泛的實驗來識別情境學習相關參數中效能最佳的模型,我們微調這些參數以進一步增強任務效能。結果突顯了 LLM 在識別引文意圖方面的優點和限制,為模型選擇和提示工程提供了有價值的見解。此外,我們將端到端評估架構和模型公開供未來使用。 -##### **Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach** -2403.17873v1 by Andrea Ferrario, Alberto Termine, Alessandro Facchini +##### **Less is More: Improving LLM Alignment via Preference Data Selection** +2502.14560v1 by Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, Xiangnan He -Human-centered explainable AI (HCXAI) advocates for the integration of social -aspects into AI explanations. Central to the HCXAI discourse is the Social -Transparency (ST) framework, which aims to make the socio-organizational -context of AI systems accessible to their users. In this work, we suggest -extending the ST framework to address the risks of social misattributions in -Large Language Models (LLMs), particularly in sensitive areas like mental -health. 
In fact LLMs, which are remarkably capable of simulating roles and -personas, may lead to mismatches between designers' intentions and users' -perceptions of social attributes, risking to promote emotional manipulation and -dangerous behaviors, cases of epistemic injustice, and unwarranted trust. To -address these issues, we propose enhancing the ST framework with a fifth -'W-question' to clarify the specific social attributions assigned to LLMs by -its designers and users. This addition aims to bridge the gap between LLM -capabilities and user perceptions, promoting the ethically responsible -development and use of LLM-based technology. +Direct Preference Optimization (DPO) has emerged as a promising approach for +aligning large language models with human preferences. While prior work mainly +extends DPO from the aspect of the objective function, we instead improve DPO +from the largely overlooked but critical aspect of data selection. +Specifically, we address the issue of parameter shrinkage caused by noisy data +by proposing a novel margin-maximization principle for dataset curation in DPO +training. To accurately estimate margins for data selection, we propose a +dual-margin guided approach that considers both external reward margins and +implicit DPO reward margins. Extensive experiments demonstrate that our method +reduces computational cost dramatically while improving performance. +Remarkably, by using just 10\% of the Ultrafeedback dataset, our approach +achieves 3\% to 8\% improvements across various Llama and Mistral series models +on the AlpacaEval 2.0 benchmark. Furthermore, our approach seamlessly extends +to iterative DPO, yielding a roughly 3\% improvement with 25\% online data, +while further reducing training time. These results highlight the potential of +data selection strategies for advancing preference optimization. 
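The margin-based data selection described in the abstract above can be illustrated with a small sketch. The implicit DPO reward used here is the standard one from DPO, `beta * (log pi(y|x) - log pi_ref(y|x))`. Everything else is an assumption made for illustration and is not taken from the paper: the field names in each preference pair, the way the two margins are combined (a plain sum of z-scores), and the `keep_fraction` parameter. The authors' actual dual-margin rule may differ.

```python
import math


def implicit_dpo_margin(logp_chosen, logp_rejected,
                        ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Implicit DPO reward margin between the chosen and rejected responses.

    The implicit reward of a response is beta * (log pi - log pi_ref);
    the margin is the chosen reward minus the rejected reward.
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return chosen_reward - rejected_reward


def zscore(values):
    """Standardize a list of floats; a zero spread falls back to std = 1."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0
    return [(v - mean) / std for v in values]


def select_by_margin(pairs, keep_fraction=0.10, beta=0.1):
    """Keep the preference pairs with the largest combined margin.

    ASSUMPTION: the two margins are combined by summing their z-scores.
    This is a hypothetical rule chosen for the sketch, not the paper's.
    """
    external = [p["reward_chosen"] - p["reward_rejected"] for p in pairs]
    implicit = [
        implicit_dpo_margin(p["logp_chosen"], p["logp_rejected"],
                            p["ref_logp_chosen"], p["ref_logp_rejected"],
                            beta=beta)
        for p in pairs
    ]
    combined = [e + i for e, i in zip(zscore(external), zscore(implicit))]
    k = max(1, round(keep_fraction * len(pairs)))
    ranked = sorted(range(len(pairs)), key=lambda i: combined[i], reverse=True)
    return [pairs[i] for i in ranked[:k]]
```

With `keep_fraction=0.10` the function keeps roughly one pair in ten, mirroring the 10% subset mentioned in the abstract; the scores themselves would come from an external reward model and from log-probabilities under the policy and reference models.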
-摘要:以人为本的可解释 AI (HCXAI) 倡导将社会层面整合到 AI 解释中。HCXAI 话语的核心是社会透明度 (ST) 框架,其目标是让 AI 系统的社会组织背景对用户来说是可理解的。在这项工作中,我们建议扩展 ST 框架以解决大型语言模型 (LLM) 中社会错误归因的风险,尤其是在心理健康等敏感领域。事实上,LLM 能够出色地模拟角色和人格,这可能导致设计者的意图和用户对社会属性的认知之间出现错配,从而有风险促进情绪操纵和危险行为、认知不公正和不合理的信任。为了解决这些问题,我们建议用第五个“W 问题”来增强 ST 框架,以明确设计者和用户赋予 LLM 的具体社会属性。此补充旨在弥合 LLM 能力和用户认知之间的差距,促进基于 LLM 的技术在道德上负责任地开发和使用。 +摘要:直接偏好最佳化 (DPO) 已成為一種有希望的方法,可將大型語言模型與人類偏好保持一致。雖然先前的研究主要從目標函數的角度延伸 DPO,但我們反而從資料選擇這個極易被忽略但至關重要的角度改進 DPO。 +具體來說,我們透過提出一個用於 DPO 訓練中資料集整理的新邊際最大化原則,來解決由雜訊資料造成的參數收縮問題。為了準確估計資料選擇的邊際,我們提出一個雙邊際引導方法,它同時考慮外部獎勵邊際和隱含 DPO 獎勵邊際。大規模的實驗證明,我們的這種方法大幅降低了運算成本,同時改善了效能。 +值得注意的是,我們的這種方法僅使用 Ultrafeedback 資料集的 10%,便在 AlpacaEval 2.0 基準上,在各種 Llama 和 Mistral 系列模型中取得了 3% 到 8% 的改進。此外,我們的這種方法可以無縫地延伸到迭代 DPO,在使用 25% 線上資料的情況下產生了大約 3% 的改進,同時進一步減少了訓練時間。這些結果突顯了資料選擇策略在推進偏好最佳化方面的潛力。 -##### **Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification** -2403.18871v1 by Han Yuan, Chuan Hong, Pengtao Jiang, Gangming Zhao, Nguyen Tuan Anh Tran, Xinxing Xu, Yet Yen Yan, Nan Liu +##### **FUIA: Model Inversion Attack against Federated Unlearning** +2502.14558v1 by Lei Zhou, Youwen Zhu -Background: Pneumothorax is an acute thoracic disease caused by abnormal air -collection between the lungs and chest wall. To address the opaqueness often -associated with deep learning (DL) models, explainable artificial intelligence -(XAI) methods have been introduced to outline regions related to pneumothorax -diagnoses made by DL models. However, these explanations sometimes diverge from -actual lesion areas, highlighting the need for further improvement. Method: We -propose a template-guided approach to incorporate the clinical knowledge of -pneumothorax into model explanations generated by XAI methods, thereby -enhancing the quality of these explanations. Utilizing one lesion delineation -created by radiologists, our approach first generates a template that -represents potential areas of pneumothorax occurrence. 
This template is then -superimposed on model explanations to filter out extraneous explanations that -fall outside the template's boundaries. To validate its efficacy, we carried -out a comparative analysis of three XAI methods with and without our template -guidance when explaining two DL models in two real-world datasets. Results: The -proposed approach consistently improved baseline XAI methods across twelve -benchmark scenarios built on three XAI methods, two DL models, and two -datasets. The average incremental percentages, calculated by the performance -improvements over the baseline performance, were 97.8% in Intersection over -Union (IoU) and 94.1% in Dice Similarity Coefficient (DSC) when comparing model -explanations and ground-truth lesion areas. Conclusions: In the context of -pneumothorax diagnoses, we proposed a template-guided approach for improving AI -explanations. We anticipate that our template guidance will forge a fresh -approach to elucidating AI models by integrating clinical domain expertise. +With the introduction of regulations related to the "right to be forgotten", +federated learning (FL) is facing new privacy compliance challenges. To address +these challenges, researchers have proposed federated unlearning (FU). However, +existing FU research has primarily focused on improving the efficiency of +unlearning, with less attention paid to the potential privacy vulnerabilities +inherent in these methods. To address this gap, we draw inspiration from +gradient inversion attacks in FL and propose the federated unlearning inversion +attack (FUIA). The FUIA is specifically designed for the three types of FU +(sample unlearning, client unlearning, and class unlearning), aiming to provide +a comprehensive analysis of the privacy leakage risks associated with FU. 
In +FUIA, the server acts as an honest-but-curious attacker, recording and +exploiting the model differences before and after unlearning to expose the +features and labels of forgotten data. FUIA significantly leaks the privacy of +forgotten data and can target all types of FU. This attack contradicts the goal +of FU to eliminate specific data influence, instead exploiting its +vulnerabilities to recover forgotten data and expose its privacy flaws. +Extensive experimental results show that FUIA can effectively reveal the +private information of forgotten data. To mitigate this privacy leakage, we +also explore two potential defense methods, although these come at the cost of +reduced unlearning effectiveness and the usability of the unlearned model. -摘要:背景:氣胸是一種因肺部與胸壁之間異常集氣所引起的急性胸腔疾病。為了解決深度學習(DL)模型經常伴隨的不透明性,可解釋人工智慧(XAI)方法已被引入,用於概述與 DL 模型做出的氣胸診斷相關的區域。然而,這些解釋有時會與實際病灶區域有所出入,突顯出進一步改進的必要性。方法:我們提出了一種模板引導式方法,將氣胸的臨床知識納入 XAI 方法產生的模型解釋中,從而提升這些解釋的品質。利用放射科醫師建立的病灶描繪,我們的做法首先產生一個模板,用於表示氣胸可能發生的區域。然後將此模板疊加在模型解釋上,以篩選出超出模板邊界的無關解釋。為了驗證其效力,我們對三種 XAI 方法進行了比較分析,在兩個真實世界資料集中解釋兩個 DL 模型時,分別採用和不採用我們的模板引導。結果:所提出的方法在建立於三種 XAI 方法、兩個 DL 模型和兩個資料集的十二種基準情境中,始終改善了基準 XAI 方法。在比較模型解釋和真實病灶區域時,透過基準效能的效能改進計算出的平均增量百分比為交集比(IoU)的 97.8% 和骰子相似性係數(DSC)的 94.1%。結論:在氣胸診斷的背景下,我們提出了一種模板引導式方法,用於改善 AI 解釋。我們預期我們的模板引導將透過整合臨床領域專業知識,為闡明 AI 模型建立一種新方法。 +摘要:隨著「被遺忘權」相關法規的推出, +聯盟學習 (FL) 面臨新的隱私合規挑戰。為了應對 +這些挑戰,研究人員提出了聯盟取消學習 (FU)。然而, +現有的 FU 研究主要集中在提高取消學習的效率,較少關注這些方法中固有的潛在隱私漏洞。為了解決這個差距,我們從 +FL 中的梯度反演攻擊中汲取靈感,並提出聯盟取消學習反演 +攻擊 (FUIA)。FUIA 專門設計用於三種類型的 FU +(樣本取消學習、客戶端取消學習和類別取消學習),旨在提供 +對與 FU 相關的隱私洩露風險的全面分析。在 +FUIA 中,伺服器充當誠實但好奇的攻擊者,記錄並 +利用取消學習前後的模型差異來揭露遺忘資料的功能和標籤。FUIA 大幅洩露遺忘資料的隱私,並且可以針對所有類型的 FU。此攻擊與 FU 消除特定資料影響的目標相矛盾,而是利用其 +漏洞來恢復遺忘資料並揭露其隱私缺陷。廣泛的實驗結果表明 FUIA 可以有效揭露遺忘資料的私人資訊。為了減輕這種隱私洩露,我們 +還探索了兩種潛在的防禦方法,儘管這些方法以降低取消學習的有效性和已取消學習模型的可用性為代價。 -##### **Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures** -2403.01580v1 by Séamus Lankford +##### **Multiscale Byte 
Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling** +2502.14553v1 by Eric Egli, Matteo Manica, Jannis Born -In the current machine translation (MT) landscape, the Transformer -architecture stands out as the gold standard, especially for high-resource -language pairs. This research delves into its efficacy for low-resource -language pairs including both the English$\leftrightarrow$Irish and -English$\leftrightarrow$Marathi language pairs. Notably, the study identifies -the optimal hyperparameters and subword model type to significantly improve the -translation quality of Transformer models for low-resource language pairs. - The scarcity of parallel datasets for low-resource languages can hinder MT -development. To address this, gaHealth was developed, the first bilingual -corpus of health data for the Irish language. Focusing on the health domain, -models developed using this in-domain dataset exhibited very significant -improvements in BLEU score when compared with models from the LoResMT2021 -Shared Task. A subsequent human evaluation using the multidimensional quality -metrics error taxonomy showcased the superior performance of the Transformer -system in reducing both accuracy and fluency errors compared to an RNN-based -counterpart. - Furthermore, this thesis introduces adaptNMT and adaptMLLM, two open-source -applications streamlined for the development, fine-tuning, and deployment of -neural machine translation models. These tools considerably simplify the setup -and evaluation process, making MT more accessible to both developers and -translators. Notably, adaptNMT, grounded in the OpenNMT ecosystem, promotes -eco-friendly natural language processing research by highlighting the -environmental footprint of model development. 
Fine-tuning of MLLMs by adaptMLLM -demonstrated advancements in translation performance for two low-resource -language pairs: English$\leftrightarrow$Irish and -English$\leftrightarrow$Marathi, compared to baselines from the LoResMT2021 -Shared Task. +Bytes form the basis of the digital world and thus are a promising building +block for multimodal foundation models. Recently, Byte Language Models (BLMs) +have emerged to overcome tokenization, yet the excessive length of bytestreams +requires new architectural paradigms. Therefore, we present the Multiscale Byte +Language Model (MBLM), a model-agnostic hierarchical decoder stack that allows +training with context windows of $5$M bytes on single GPU in full model +precision. We thoroughly examine MBLM's performance with Transformer and Mamba +blocks on both unimodal and multimodal tasks. Our experiments demonstrate that +hybrid architectures are efficient in handling extremely long byte sequences +during training while achieving near-linear generational efficiency. To the +best of our knowledge, we present the first evaluation of BLMs on visual Q\&A +tasks and find that, despite serializing images and the absence of an encoder, +a MBLM with pure next token prediction can match custom CNN-LSTM architectures +with designated classification heads. We show that MBLMs exhibit strong +adaptability in integrating diverse data representations, including pixel and +image filestream bytes, underlining their potential toward omnimodal foundation +models. 
Source code is publicly available at: +https://github.com/ai4sd/multiscale-byte-lm -摘要:在當前機器翻譯 (MT) 領域中,Transformer 架構脫穎而出,成為黃金標準,特別是對於高資源語言對。本研究探討其對低資源語言對的效能,包括英語↔愛爾蘭語和英語↔馬拉地語語言對。值得注意的是,本研究識別出最佳超參數和子詞模型類型,以顯著提高 Transformer 模型對低資源語言對的翻譯品質。 -低資源語言的平行資料集的稀缺會阻礙 MT 的發展。為了解決這個問題,開發了 gaHealth,這是愛爾蘭語的第一個雙語健康資料語料庫。專注於健康領域,使用此域內資料集開發的模型在 BLEU 得分方面表現出非常顯著的進步,與 LoResMT2021 共享任務中的模型相比。隨後使用多維品質指標錯誤分類法進行的人工評估顯示,與基於 RNN 的對應模型相比,Transformer 系統在減少準確性和流暢性錯誤方面表現出優異的性能。 -此外,本論文介紹了 adaptNMT 和 adaptMLLM,這兩個開源應用程式簡化了神經機器翻譯模型的開發、微調和部署。這些工具大幅簡化了設定和評估流程,讓 MT 更容易讓開發人員和翻譯人員使用。值得注意的是,adaptNMT 以 OpenNMT 生態系統為基礎,通過強調模型開發的環境足跡來促進生態友好的自然語言處理研究。與 LoResMT2021 共享任務中的基準相比,adaptMLLM 對 MLLM 的微調證明了英語↔愛爾蘭語和英語↔馬拉地語這兩個低資源語言對的翻譯性能進步。 +摘要:位元組構成數位世界的基礎,因此是多模態基礎模型的一個有前途的建構模組。最近,位元組語言模型 (BLM) 已應運而生,以克服標記化,但位元組串流的過長需要新的架構範例。因此,我們提出多尺度位元組語言模型 (MBLM),這是一個與模型無關的分層解碼器堆疊,允許在單一 GPU 上以完整的模型精度訓練 500 萬位元組的內容視窗。我們徹底檢驗了 MBLM 在單模態和多模態任務上使用 Transformer 和 Mamba 區塊的效能。我們的實驗證明,混合架構在處理訓練期間極長的位元組序列時很有效率,同時達到近乎線性的生成效率。據我們所知,我們提出在視覺問答任務上對 BLM 的首次評估,並發現,儘管序列化影像且沒有編碼器,但具有純粹下一個標記預測的 MBLM 可以匹配具有指定分類標頭的客製化 CNN-LSTM 架構。我們表明,MBLM 在整合各種資料表示形式方面表現出強大的適應性,包括像素和影像檔案串流位元組,強調它們朝向全模態基礎模型的潛力。原始碼已公開於: +https://github.com/ai4sd/multiscale-byte-lm -##### **Cause and Effect: Can Large Language Models Truly Understand Causality?** -2402.18139v3 by Swagata Ashwani, Kshiteesh Hegde, Nishith Reddy Mannuru, Mayank Jindal, Dushyant Singh Sengar, Krishna Chaitanya Rao Kathala, Dishant Banga, Vinija Jain, Aman Chadha +##### **Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks** +2502.14546v1 by Maya Bechler-Speicher, Ben Finkelshtein, Fabrizio Frasca, Luis Müller, Jan Tönshoff, Antoine Siraudin, Viktor Zaverkin, Michael M. Bronstein, Mathias Niepert, Bryan Perozzi, Mikhail Galkin, Christopher Morris -With the rise of Large Language Models(LLMs), it has become crucial to -understand their capabilities and limitations in deciphering and explaining the -complex web of causal relationships that language entails. 
Current methods use -either explicit or implicit causal reasoning, yet there is a strong need for a -unified approach combining both to tackle a wide array of causal relationships -more effectively. This research proposes a novel architecture called Context -Aware Reasoning Enhancement with Counterfactual Analysis(CARE CA) framework to -enhance causal reasoning and explainability. The proposed framework -incorporates an explicit causal detection module with ConceptNet and -counterfactual statements, as well as implicit causal detection through LLMs. -Our framework goes one step further with a layer of counterfactual explanations -to accentuate LLMs understanding of causality. The knowledge from ConceptNet -enhances the performance of multiple causal reasoning tasks such as causal -discovery, causal identification and counterfactual reasoning. The -counterfactual sentences add explicit knowledge of the not caused by scenarios. -By combining these powerful modules, our model aims to provide a deeper -understanding of causal relationships, enabling enhanced interpretability. -Evaluation of benchmark datasets shows improved performance across all metrics, -such as accuracy, precision, recall, and F1 scores. We also introduce -CausalNet, a new dataset accompanied by our code, to facilitate further -research in this domain. +While machine learning on graphs has demonstrated promise in drug design and +molecular property prediction, significant benchmarking challenges hinder its +further progress and relevance. Current benchmarking practices often lack focus +on transformative, real-world applications, favoring narrow domains like +two-dimensional molecular graphs over broader, impactful areas such as +combinatorial optimization, relational databases, or chip design. Additionally, +many benchmark datasets poorly represent the underlying data, leading to +inadequate abstractions and misaligned use cases. 
Fragmented evaluations and an +excessive focus on accuracy further exacerbate these issues, incentivizing +overfitting rather than fostering generalizable insights. These limitations +have prevented the development of truly useful graph foundation models. This +position paper calls for a paradigm shift toward more meaningful benchmarks, +rigorous evaluation protocols, and stronger collaboration with domain experts +to drive impactful and reliable advances in graph learning research, unlocking +the potential of graph learning. -摘要:隨著大型語言模型 (LLM) 的興起,了解它們在解碼和解釋語言所蘊含的複雜因果關係網路中的能力和限制變得至關重要。目前的技術使用明確或隱含的因果推理,但強烈需要一種統一的方法,結合兩者以更有效地處理廣泛的因果關係。本研究提出了一種稱為情境感知推理增強與反事實分析 (CARE CA) 框架的新架構,以增強因果推理和可解釋性。提出的框架結合了使用 ConceptNet 和反事實陳述的明確因果檢測模組,以及透過 LLM 進行的隱含因果檢測。我們的框架更進一步,加入一層反事實解釋,以強調 LLM 對因果關係的理解。來自 ConceptNet 的知識增強了多項因果推理任務的執行,例如因果發現、因果識別和反事實推理。反事實句加入了未由情境造成的明確知識。透過結合這些強大的模組,我們的模型旨在提供對因果關係更深入的理解,實現增強的可解釋性。基準資料集的評估顯示在所有指標(例如準確度、精確度、召回率和 F1 分數)上都有所提升。我們還引入了 CausalNet,一個新的資料集,並附上了我們的程式碼,以促進在這個領域的進一步研究。 +摘要:儘管圖形上的機器學習在藥物設計和分子屬性預測方面已展現潛力,但顯著的基準挑戰阻礙了其進一步進展和相關性。目前的基準實務往往缺乏對轉型性、真實世界應用的關注,偏好於狹窄的領域,例如二維分子圖形,而不是組合最佳化、關係資料庫或晶片設計等更廣泛、更有影響力的領域。此外,許多基準資料集無法充分表示基礎資料,導致抽象化不充分和使用案例錯位。支離破碎的評估和過度關注準確性進一步加劇了這些問題,激勵過度擬合,而不是培養可概括的見解。這些限制阻礙了真正有用的圖形基礎模型的開發。這篇立場文件呼籲將範例轉變為更有意義的基準、嚴格的評估協定,以及與領域專家的更強大合作,以推動圖形學習研究中具有影響力和可靠性的進展,釋放圖形學習的潛力。 -##### **Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina** -2402.18600v1 by Yasin Sadeghi Bazargani, Majid Mirzaei, Navid Sobhi, Mirsaeed Abdollahi, Ali Jafarizadeh, Siamak Pedrammehr, Roohallah Alizadehsani, Ru San Tan, Sheikh Mohammed Shariful Islam, U. Rajendra Acharya +##### **LLM-based User Profile Management for Recommender System** +2502.14541v1 by Seunghwan Bang, Hwanjun Song -Diabetes mellitus (DM) predisposes patients to vascular complications. -Retinal images and vasculature reflect the body's micro- and macrovascular -health. 
They can be used to diagnose DM complications, including diabetic -retinopathy (DR), neuropathy, nephropathy, and atherosclerotic cardiovascular -disease, as well as forecast the risk of cardiovascular events. Artificial -intelligence (AI)-enabled systems developed for high-throughput detection of DR -using digitized retinal images have become clinically adopted. Beyond DR -screening, AI integration also holds immense potential to address challenges -associated with the holistic care of the patient with DM. In this work, we aim -to comprehensively review the literature for studies on AI applications based -on retinal images related to DM diagnosis, prognostication, and management. We -will describe the findings of holistic AI-assisted diabetes care, including but -not limited to DR screening, and discuss barriers to implementing such systems, -including issues concerning ethics, data privacy, equitable access, and -explainability. With the ability to evaluate the patient's health status vis a -vis DM complication as well as risk prognostication of future cardiovascular -complications, AI-assisted retinal image analysis has the potential to become a -central tool for modern personalized medicine in patients with DM. +The rapid advancement of Large Language Models (LLMs) has opened new +opportunities in recommender systems by enabling zero-shot recommendation +without conventional training. Despite their potential, most existing works +rely solely on users' purchase histories, leaving significant room for +improvement by incorporating user-generated textual data, such as reviews and +product descriptions. Addressing this gap, we propose PURE, a novel LLM-based +recommendation framework that builds and maintains evolving user profiles by +systematically extracting and summarizing key information from user reviews. 
+PURE consists of three core components: a Review Extractor for identifying user +preferences and key product features, a Profile Updater for refining and +updating user profiles, and a Recommender for generating personalized +recommendations using the most current profile. To evaluate PURE, we introduce +a continuous sequential recommendation task that reflects real-world scenarios +by adding reviews over time and updating predictions incrementally. Our +experimental results on Amazon datasets demonstrate that PURE outperforms +existing LLM-based methods, effectively leveraging long-term user information +while managing token limitations. -摘要:糖尿病(DM)使患者容易出現血管併發症。 -視網膜影像和血管反映身體的微血管和巨血管健康狀況。它們可用於診斷糖尿病併發症,包括糖尿病視網膜病變(DR)、神經病變、腎病和動脈粥樣硬化性心血管疾病,以及預測心血管事件的風險。為使用數位化視網膜影像進行高通量 DR 檢測而開發的人工智慧(AI)啟用系統已在臨床採用。除了 DR 篩檢外,AI 整合也具有巨大的潛力來應對與糖尿病患者整體照護相關的挑戰。在這項工作中,我們旨在全面回顧基於視網膜影像的 AI 應用相關研究的文獻,這些研究與糖尿病的診斷、預後和管理有關。我們將描述整體 AI 輔助糖尿病照護的發現,包括但不限於 DR 篩檢,並討論實施此類系統的障礙,包括與倫理、資料隱私、公平存取和可解釋性有關的問題。透過評估患者的健康狀況,同時考量糖尿病併發症以及未來心血管併發症的風險預後,AI 輔助視網膜影像分析有潛力成為糖尿病患者現代化個人化醫療的中心工具。 +摘要:大型語言模型 (LLM) 的快速進步為推薦系統開啟了新的機會,它能實現零次學習推薦,而無需傳統訓練。儘管有潛力,但現有的大部分工作僅依賴於使用者的購買記錄,透過納入使用者產生的文字資料,例如評論和產品說明,仍有很大的改進空間。針對此差距,我們提出 PURE,一個新穎的基於 LLM 的推薦架構,透過系統性地從使用者評論中提取和總結關鍵資訊,建立並維護不斷演進的使用者檔案。PURE 由三個核心組成部分組成:一個評論萃取器,用於識別使用者的喜好和產品主要功能;一個檔案更新器,用於精煉和更新使用者檔案;一個推薦器,用於使用最新的檔案產生個人化推薦。為了評估 PURE,我們引入一個連續順序推薦任務,透過隨著時間新增評論和遞增更新預測,反映真實世界的場景。我們在 Amazon 資料集上的實驗結果證明,PURE 優於現有的基於 LLM 的方法,在管理符號限制的同時,有效地利用長期使用者資訊。 +##### **LoRA-GGPO: Mitigating Double Descent in LoRA Fine-Tuning via Gradient-Guided Perturbation Optimization** +2502.14538v1 by Yupeng Chang, Chenlu Guo, Yi Chang, Yuan Wu -### Medical -|Publish Date|Title|Authors|Homepage|Code| -| :---: | :---: | :---: | :---: | :---: | -|**2025-02-20**|**FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis**|Fadillah Maani et.al.|[2502.14807v1](http://arxiv.org/abs/2502.14807v1)|null| -|**2025-02-20**|**Step-by-Step Fact Verification System for Medical Claims with 
Explainable Reasoning**|Juraj Vladika et.al.|[2502.14765v1](http://arxiv.org/abs/2502.14765v1)|null| -|**2025-02-20**|**MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders**|Maya Varma et.al.|[2502.14753v1](http://arxiv.org/abs/2502.14753v1)|null| -|**2025-02-20**|**Data-Constrained Synthesis of Training Data for De-Identification**|Thomas Vakili et.al.|[2502.14677v1](http://arxiv.org/abs/2502.14677v1)|null| -|**2025-02-20**|**ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation**|Angxiao Yue et.al.|[2502.14637v1](http://arxiv.org/abs/2502.14637v1)|[link](https://github.com/AngxiaoYue/ReQFlow)| -|**2025-02-20**|**MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models**|Shrey Pandit et.al.|[2502.14302v1](http://arxiv.org/abs/2502.14302v1)|null| -|**2025-02-20**|**EyeBench: A Call for More Rigorous Evaluation of Retinal Image Enhancement**|Wenhui Zhu et.al.|[2502.14260v1](http://arxiv.org/abs/2502.14260v1)|null| -|**2025-02-19**|**Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning**|Cole Gawin et.al.|[2502.14086v1](http://arxiv.org/abs/2502.14086v1)|null| -|**2025-02-19**|**Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging**|Shansong Wang et.al.|[2502.14064v1](http://arxiv.org/abs/2502.14064v1)|null| -|**2025-02-19**|**VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare**|Anudeex Shetty et.al.|[2502.13775v1](http://arxiv.org/abs/2502.13775v1)|null| -|**2025-02-19**|**PeerQA: A Scientific Question Answering Dataset from Peer Reviews**|Tim Baumgärtner et.al.|[2502.13668v1](http://arxiv.org/abs/2502.13668v1)|[link](https://github.com/ukplab/peerqa)| -|**2025-02-19**|**Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs**|Yushi Feng 
et.al.|[2502.13555v1](http://arxiv.org/abs/2502.13555v1)|[link](https://github.com/ys-feng/DemoGraph)| -|**2025-02-19**|**MobileViM: A Light-weight and Dimension-independent Vision Mamba for 3D Medical Image Analysis**|Wei Dai et.al.|[2502.13524v1](http://arxiv.org/abs/2502.13524v1)|[link](https://github.com/anthonyweidai/MobileViM_3D)| -|**2025-02-19**|**Unlocking Multimodal Integration in EHRs: A Prompt Learning Framework for Language and Time Series Fusion**|Shuai Niu et.al.|[2502.13509v1](http://arxiv.org/abs/2502.13509v1)|null| -|**2025-02-19**|**Towards a perturbation-based explanation for medical AI as differentiable programs**|Takeshi Abe et.al.|[2502.14001v1](http://arxiv.org/abs/2502.14001v1)|null| -|**2025-02-19**|**RGAR: Recurrence Generation-augmented Retrieval for Factual-aware Medical Question Answering**|Sichu Liang et.al.|[2502.13361v1](http://arxiv.org/abs/2502.13361v1)|null| -|**2025-02-18**|**Adjust for Trust: Mitigating Trust-Induced Inappropriate Reliance on AI Assistance**|Tejas Srinivasan et.al.|[2502.13321v1](http://arxiv.org/abs/2502.13321v1)|null| -|**2025-02-18**|**Prediction of Clinical Complication Onset using Neural Point Processes**|Sachini Weerasekara et.al.|[2502.13290v1](http://arxiv.org/abs/2502.13290v1)|null| -|**2025-02-18**|**SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?**|Yucheng Shi et.al.|[2502.13233v1](http://arxiv.org/abs/2502.13233v1)|null| -|**2025-02-18**|**Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions**|Taedong Yun et.al.|[2502.13135v1](http://arxiv.org/abs/2502.13135v1)|null| -|**2025-02-18**|**Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization**|Priyaranjan Pattnayak et.al.|[2502.13108v1](http://arxiv.org/abs/2502.13108v1)|null| -|**2025-02-18**|**Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved 
Text-Based Learning for LGE Detection**|Athira J Jacob et.al.|[2502.12948v1](http://arxiv.org/abs/2502.12948v1)|null| -|**2025-02-18**|**Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models**|Rubing Li et.al.|[2502.12825v2](http://arxiv.org/abs/2502.12825v2)|null| -|**2025-02-18**|**LLM Safety for Children**|Prasanjit Rath et.al.|[2502.12552v1](http://arxiv.org/abs/2502.12552v1)|[link](https://github.com/Avenge-PRC777/LLM-Safety-For-Children-Code)| -|**2025-02-17**|**Classifiers of Data Sharing Statements in Clinical Trial Records**|Saber Jelodari Mamaghani et.al.|[2502.12362v1](http://arxiv.org/abs/2502.12362v1)|null| -|**2025-02-17**|**Relational Norms for Human-AI Cooperation**|Brian D. Earp et.al.|[2502.12102v1](http://arxiv.org/abs/2502.12102v1)|null| -|**2025-02-17**|**Deep Spatio-Temporal Neural Network for Air Quality Reanalysis**|Ammar Kheder et.al.|[2502.11941v1](http://arxiv.org/abs/2502.11941v1)|[link](https://github.com/AmmarKheder/AQ-Net)| -|**2025-02-17**|**Proactive Depot Discovery: A Generative Framework for Flexible Location-Routing**|Site Qu et.al.|[2502.11715v1](http://arxiv.org/abs/2502.11715v1)|null| -|**2025-02-17**|**LLM Agents Making Agent Tools**|Georg Wölflein et.al.|[2502.11705v1](http://arxiv.org/abs/2502.11705v1)|null| -|**2025-02-17**|**MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression**|Linjie Mu et.al.|[2502.11651v1](http://arxiv.org/abs/2502.11651v1)|[link](https://github.com/linjiemu/mmxu)| -|**2025-02-17**|**A Survey of Personalized Large Language Models: Progress and Future Directions**|Jiahong Liu et.al.|[2502.11528v1](http://arxiv.org/abs/2502.11528v1)|null| -|**2025-02-17**|**Variable-frame CNNLSTM for Breast Nodule Classification using Ultrasound Videos**|Xiangxiang Cui et.al.|[2502.11481v1](http://arxiv.org/abs/2502.11481v1)|null| -|**2025-02-17**|**Leveraging Labelled Data Knowledge: A Cooperative Rectification Learning 
Network for Semi-supervised 3D Medical Image Segmentation**|Yanyan Wang et.al.|[2502.11456v1](http://arxiv.org/abs/2502.11456v1)|[link](https://github.com/Yaan-Wang/CRLN)| -|**2025-02-16**|**A Survey of LLM-based Agents in Medicine: How far are we from Baymax?**|Wenxuan Wang et.al.|[2502.11211v1](http://arxiv.org/abs/2502.11211v1)|null| -|**2025-02-16**|**RT-DEMT: A hybrid real-time acupoint detection model combining mamba and transformer**|Shilong Yang et.al.|[2502.11179v1](http://arxiv.org/abs/2502.11179v1)|[link](https://github.com/sohyu1/rt-demt)| -|**2025-02-16**|**Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications**|Alexandru Lecu et.al.|[2502.11108v1](http://arxiv.org/abs/2502.11108v1)|[link](https://github.com/alexlecu/llmkgraph)| -|**2025-02-16**|**Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration**|Xianbing Zhao et.al.|[2502.12204v1](http://arxiv.org/abs/2502.12204v1)|null| -|**2025-02-16**|**CL-MFAP: A Contrastive Learning-Based Multimodal Foundation Model for Molecular Property Prediction and Antibiotic Screening**|Gen Zhou et.al.|[2502.11001v1](http://arxiv.org/abs/2502.11001v1)|[link](https://github.com/clmfap/clmfap)| -|**2025-02-15**|**Automatic Quality Assessment of First Trimester Crown-Rump-Length Ultrasound Images**|Sevim Cengiz et.al.|[2502.10908v1](http://arxiv.org/abs/2502.10908v1)|null| -|**2025-02-15**|**Breaking Down the Hierarchy: A New Approach to Leukemia Classification**|Ibraheem Hamdi et.al.|[2502.10899v1](http://arxiv.org/abs/2502.10899v1)|null| -|**2025-02-15**|**An Empirical Analysis of Uncertainty in Large Language Model Evaluations**|Qiujie Xie et.al.|[2502.10709v1](http://arxiv.org/abs/2502.10709v1)|[link](https://github.com/hasakixie123/llm-evaluator-uncertainty)| -|**2025-02-15**|**Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model**|Jiarui Jin 
et.al.|[2502.10707v1](http://arxiv.org/abs/2502.10707v1)|[link](https://github.com/pkudigitalhealth/heartlang)| -|**2025-02-15**|**Self-Explaining Hypergraph Neural Networks for Diagnosis Prediction**|Leisheng Yu et.al.|[2502.10689v1](http://arxiv.org/abs/2502.10689v1)|null| -|**2025-02-15**|**ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis**|Xueshen Li et.al.|[2502.10620v1](http://arxiv.org/abs/2502.10620v1)|null| -|**2025-02-15**|**Optimizing CNN Architectures for Advanced Thoracic Disease Classification**|Tejas Mirthipati et.al.|[2502.10614v1](http://arxiv.org/abs/2502.10614v1)|null| -|**2025-02-14**|**PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation**|Faruk Ahmed et.al.|[2502.10536v1](http://arxiv.org/abs/2502.10536v1)|null| -|**2025-02-14**|**Tempo: Helping Data Scientists and Domain Experts Collaboratively Specify Predictive Modeling Tasks**|Venkatesh Sivaraman et.al.|[2502.10526v2](http://arxiv.org/abs/2502.10526v2)|null| -|**2025-02-14**|**A Robust Attack: Displacement Backdoor Attack**|Yong Li et.al.|[2502.10490v1](http://arxiv.org/abs/2502.10490v1)|null| -|**2025-02-14**|**3D ReX: Causal Explanations in 3D Neuroimaging Classification**|Melane Navaratnarajah et.al.|[2502.12181v1](http://arxiv.org/abs/2502.12181v1)|null| -|**2025-02-14**|**Analyzing Patient Daily Movement Behavior Dynamics Using Two-Stage Encoding Model**|Jin Cui et.al.|[2502.09947v1](http://arxiv.org/abs/2502.09947v1)|null| -|**2025-02-14**|**TransGUNet: Transformer Meets Graph-based Skip Connection for Medical Image Segmentation**|Ju-Hyeon Nam et.al.|[2502.09931v1](http://arxiv.org/abs/2502.09931v1)|null| -|**2025-02-14**|**Video2Policy: Scaling up Manipulation Tasks in Simulation through Internet Videos**|Weirui Ye et.al.|[2502.09886v1](http://arxiv.org/abs/2502.09886v1)|null| -|**2025-02-14**|**HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation 
via Heterogeneous Knowledge Adaptation**|Tianwei Lin et.al.|[2502.09838v2](http://arxiv.org/abs/2502.09838v2)|[link](https://github.com/dcdmllm/healthgpt)| -|**2025-02-13**|**Incentivize without Bonus: Provably Efficient Model-based Online Multi-agent RL for Markov Games**|Tong Yang et.al.|[2502.09780v1](http://arxiv.org/abs/2502.09780v1)|null| -|**2025-02-13**|**The AI-Therapist Duo: Exploring the Potential of Human-AI Collaboration in Personalized Art Therapy for PICS Intervention**|Bereket A. Yilma et.al.|[2502.09757v1](http://arxiv.org/abs/2502.09757v1)|null| -|**2025-02-13**|**A CNN Approach to Automated Detection and Classification of Brain Tumors**|Md. Zahid Hasan et.al.|[2502.09731v1](http://arxiv.org/abs/2502.09731v1)|null| -|**2025-02-13**|**Evaluating GPT's Capability in Identifying Stages of Cognitive Impairment from Electronic Health Data**|Yu Leng et.al.|[2502.09715v1](http://arxiv.org/abs/2502.09715v1)|null| -|**2025-02-13**|**Metamorphic Testing for Pose Estimation Systems**|Matias Duran et.al.|[2502.09460v1](http://arxiv.org/abs/2502.09460v1)|null| -|**2025-02-13**|**Towards Virtual Clinical Trials of Radiology AI with Conditional Generative Modeling**|Benjamin D. 
Killeen et.al.|[2502.09688v1](http://arxiv.org/abs/2502.09688v1)|null| -|**2025-02-13**|**Mind What You Ask For: Emotional and Rational Faces of Persuasion by Large Language Models**|Wiktoria Mieleszczenko-Kowszewicz et.al.|[2502.09687v1](http://arxiv.org/abs/2502.09687v1)|null| -|**2025-02-13**|**The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**|Danni Feng et.al.|[2502.09247v1](http://arxiv.org/abs/2502.09247v1)|null| -|**2025-02-13**|**From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**|Lukas Buess et.al.|[2502.09242v1](http://arxiv.org/abs/2502.09242v1)|null| -|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null| -|**2025-02-13**|**Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**|Sanskar Sehgal et.al.|[2502.09204v1](http://arxiv.org/abs/2502.09204v1)|null| -|**2025-02-13**|**Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**|Jin Cui et.al.|[2502.09173v1](http://arxiv.org/abs/2502.09173v1)|null| -|**2025-02-13**|**TastepepAI, An artificial intelligence platform for taste peptide de novo design**|Jianda Yue et.al.|[2502.12167v1](http://arxiv.org/abs/2502.12167v1)|null| -|**2025-02-12**|**HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**|Valentina Vadori et.al.|[2502.08754v1](http://arxiv.org/abs/2502.08754v1)|[link](https://github.com/Vadori/CytoArk)| -|**2025-02-12**|**Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion**|Lemuel Puglisi et.al.|[2502.08560v1](http://arxiv.org/abs/2502.08560v1)|[link](https://github.com/lemuelpuglisi/brlp)| 
-|**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null| -|**2025-02-12**|**EEG Artifact Detection and Correction with Deep Autoencoders**|David Aquilué-Llorens et.al.|[2502.08686v1](http://arxiv.org/abs/2502.08686v1)|null| -|**2025-02-12**|**SycEval: Evaluating LLM Sycophancy**|Aaron Fanous et.al.|[2502.08177v1](http://arxiv.org/abs/2502.08177v1)|null| -|**2025-02-12**|**Cancer Vaccine Adjuvant Name Recognition from Biomedical Literature using Large Language Models**|Hasin Rehana et.al.|[2502.09659v1](http://arxiv.org/abs/2502.09659v1)|null| -|**2025-02-11**|**Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?**|Hye Sun Yun et.al.|[2502.07963v1](http://arxiv.org/abs/2502.07963v1)|null| -|**2025-02-11**|**An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating**|Mohammad Ali Labbaf Khaniki et.al.|[2502.07755v1](http://arxiv.org/abs/2502.07755v1)|null| -|**2025-02-11**|**Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension**|Wenbo Gong et.al.|[2502.07752v2](http://arxiv.org/abs/2502.07752v2)|null| -|**2025-02-11**|**The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation**|Raman Dutt et.al.|[2502.07516v2](http://arxiv.org/abs/2502.07516v2)|[link](https://github.com/Raman1121/diffusion_memorization)| -|**2025-02-11**|**KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level**|Ruining Deng et.al.|[2502.07288v1](http://arxiv.org/abs/2502.07288v1)|[link](https://github.com/agaldran/kpis)| -|**2025-02-11**|**Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**|Jiaying Lu et.al.|[2502.07158v2](http://arxiv.org/abs/2502.07158v2)|null| -|**2025-02-11**|**Explaining 
3D Computed Tomography Classifiers with Counterfactuals**|Joseph Paul Cohen et.al.|[2502.07156v1](http://arxiv.org/abs/2502.07156v1)|[link](https://github.com/ieee8023/ct-counterfactuals)| -|**2025-02-10**|**Interactive Data Harmonization with LLM Agents**|Aécio Santos et.al.|[2502.07132v1](http://arxiv.org/abs/2502.07132v1)|null| -|**2025-02-10**|**Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML**|Mohammad Amir Salari et.al.|[2502.07026v1](http://arxiv.org/abs/2502.07026v1)|null| -|**2025-02-10**|**AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements**|Adriana Eufrosiana Bora et.al.|[2502.07022v1](http://arxiv.org/abs/2502.07022v1)|null| -|**2025-02-10**|**Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium**|Amin Adibi et.al.|[2502.06693v1](http://arxiv.org/abs/2502.06693v1)|null| -|**2025-02-10**|**Automatic Evaluation of Healthcare LLMs Beyond Question-Answering**|Anna Arias-Duart et.al.|[2502.06666v1](http://arxiv.org/abs/2502.06666v1)|null| -|**2025-02-10**|**Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging**|Mohammed Abdul Hafeez Khan et.al.|[2502.06632v1](http://arxiv.org/abs/2502.06632v1)|null| -|**2025-02-10**|**Illegal Waste Detection in Remote Sensing Images: A Case Study**|Federico Gibellini et.al.|[2502.06607v2](http://arxiv.org/abs/2502.06607v2)|null| -|**2025-02-10**|**FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model**|Anna Tegon et.al.|[2502.06438v1](http://arxiv.org/abs/2502.06438v1)|null| -|**2025-02-10**|**Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?**|Qingshan Hou et.al.|[2502.06289v1](http://arxiv.org/abs/2502.06289v1)|null| -|**2025-02-10**|**Integrating Sequence and Image Modeling in Irregular Medical Time Series Through 
Self-Supervised Learning**|Liuqing Chen et.al.|[2502.06134v1](http://arxiv.org/abs/2502.06134v1)|null| -|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null| -|**2025-02-10**|**Can ChatGPT Diagnose Alzheimer's Disease?**|Quoc-Toan Nguyen et.al.|[2502.06907v1](http://arxiv.org/abs/2502.06907v1)|null| -|**2025-02-09**|**Protecting Intellectual Property of EEG-based Neural Networks with Watermarking**|Ahmed Abdelaziz et.al.|[2502.05931v1](http://arxiv.org/abs/2502.05931v1)|[link](https://github.com/Prog-Jacob/watermarking-eeg-models)| -|**2025-02-09**|**Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models**|Shiyu Teng et.al.|[2502.05879v1](http://arxiv.org/abs/2502.05879v1)|null| -|**2025-02-09**|**LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison**|Gabriele De Vito et.al.|[2502.06890v1](http://arxiv.org/abs/2502.06890v1)|null| -|**2025-02-09**|**Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)**|Lokesh Koli et.al.|[2502.07815v1](http://arxiv.org/abs/2502.07815v1)|null| -|**2025-02-09**|**WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch**|Ying Lei et.al.|[2502.05783v1](http://arxiv.org/abs/2502.05783v1)|null| -|**2025-02-09**|**RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care**|Ziqi Yang et.al.|[2502.05740v1](http://arxiv.org/abs/2502.05740v1)|null| -|**2025-02-08**|**4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis**|An Zhao et.al.|[2502.05713v1](http://arxiv.org/abs/2502.05713v1)|null| -|**2025-02-08**|**KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy**|Hyunjong Kim 
et.al.|[2502.05651v1](http://arxiv.org/abs/2502.05651v1)|null| -|**2025-02-08**|**ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports**|Aynur Guluzade et.al.|[2502.05638v1](http://arxiv.org/abs/2502.05638v1)|[link](https://gitlab.cc-asp.fraunhofer.de/health-open/elmtex)| +Large Language Models (LLMs) have achieved remarkable success in natural +language processing, but their full fine-tuning remains resource-intensive. +Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation +(LoRA), have emerged as a practical solution by approximating parameter updates +with low-rank matrices. However, LoRA often exhibits a "double descent" +phenomenon during fine-tuning, where model performance degrades due to +overfitting and limited expressiveness caused by low-rank constraints. To +address this issue, we propose LoRA-GGPO (Gradient-Guided Perturbation +Optimization), a novel method that leverages gradient and weight norms to +generate targeted perturbations. By optimizing the sharpness of the loss +landscape, LoRA-GGPO guides the model toward flatter minima, mitigating the +double descent problem and improving generalization. Extensive experiments on +natural language understanding (NLU) and generation (NLG) tasks demonstrate +that LoRA-GGPO outperforms LoRA and its state-of-the-art variants. Furthermore, +extended experiments specifically designed to analyze the double descent +phenomenon confirm that LoRA-GGPO effectively alleviates this issue, producing +more robust and generalizable models. Our work provides a robust and efficient +solution for fine-tuning LLMs, with broad applicability in real-world +scenarios. The code is available at https://github.com/llm172/LoRA-GGPO. 
-#### Abstracts -##### **FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis** -2502.14807v1 by Fadillah Maani, Numan Saeed, Tausifa Saleem, Zaid Farooq, Hussain Alasmawi, Werner Diehl, Ameera Mohammad, Gareth Waring, Saudabi Valappi, Leanne Bricker, Mohammad Yaqub +摘要:大型語言模型 (LLM) 在自然語言處理方面取得了顯著的成功,但它們的完全微調仍然需要大量資源。參數高效微調 (PEFT) 方法(例如低秩適應 (LoRA))已成為一種實用的解決方案,它通過低秩矩陣近似參數更新。然而,LoRA 在微調過程中經常表現出「雙重下降」現象,其中模型性能會因過度擬合和低秩約束導致的表達能力有限而下降。為了解決這個問題,我們提出了 LoRA-GGPO(梯度引導擾動優化),這是一種利用梯度和權重範數來產生目標擾動的新方法。通過優化損失函數曲面的陡度,LoRA-GGPO 引導模型朝向更平坦的最小值,從而減輕雙重下降問題並改善泛化能力。在自然語言理解 (NLU) 和生成 (NLG) 任務中進行的廣泛實驗表明,LoRA-GGPO 優於 LoRA 及其最先進的變體。此外,專門設計用於分析雙重下降現象的延伸實驗證實,LoRA-GGPO 有效地緩解了這個問題,產生了更強大且更具泛化能力的模型。我們的研究為微調 LLM 提供了一個強大且高效的解決方案,在現實世界場景中具有廣泛的適用性。代碼可在 https://github.com/llm172/LoRA-GGPO 獲得。 -Foundation models are becoming increasingly effective in the medical domain, -offering pre-trained models on large datasets that can be readily adapted for -downstream tasks. Despite progress, fetal ultrasound images remain a -challenging domain for foundation models due to their inherent complexity, -often requiring substantial additional training and facing limitations due to -the scarcity of paired multimodal data. To overcome these challenges, here we -introduce FetalCLIP, a vision-language foundation model capable of generating -universal representation of fetal ultrasound images. FetalCLIP was pre-trained -using a multimodal learning approach on a diverse dataset of 210,035 fetal -ultrasound images paired with text. This represents the largest paired dataset -of its kind used for foundation model development to date. This unique training -approach allows FetalCLIP to effectively learn the intricate anatomical -features present in fetal ultrasound images, resulting in robust -representations that can be used for a variety of downstream applications. 
In -extensive benchmarking across a range of key fetal ultrasound applications, -including classification, gestational age estimation, congenital heart defect -(CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all -baselines while demonstrating remarkable generalizability and strong -performance even with limited labeled data. We plan to release the FetalCLIP -model publicly for the benefit of the broader scientific community. +##### **CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models** +2502.14529v1 by Zhenhong Zhou, Zherui Li, Jie Zhang, Yuanhe Zhang, Kun Wang, Yang Liu, Qing Guo -摘要:基礎模型在醫療領域正變得越來越有效, -提供在大型資料集上預先訓練的模型,可輕鬆適應 -下游任務。儘管有進展,但胎兒超音波影像仍然是 -基礎模型的挑戰領域,因為它們固有的複雜性, -通常需要大量的額外訓練,並且由於配對多模態數據的稀缺而面臨限制。為了克服這些挑戰,我們在此 -介紹 FetalCLIP,一種能夠產生 -胎兒超音波影像通用表示的視覺語言基礎模型。FetalCLIP 使用多模態學習方法在包含 210,035 張胎兒 -超音波影像與文字配對的多樣化資料集上進行預訓練。這代表迄今為止用於基礎模型開發的最大配對資料集。這種獨特的訓練 -方法使 FetalCLIP 能夠有效地學習胎兒超音波影像中存在的複雜解剖特徵,從而產生強大的 -表示,可應用於各種下游應用。在涵蓋一系列關鍵胎兒超音波應用(包括分類、胎齡估算、先天性心臟缺陷 -(CHD) 偵測和胎兒結構分割)的廣泛基準測試中,FetalCLIP 在展現出卓越的泛化能力和強勁的 -效能,即使標記資料有限,也優於所有基準。我們計畫公開發布 FetalCLIP 模型,造福廣大的科學界。 +Large Language Model-based Multi-Agent Systems (LLM-MASs) have demonstrated +remarkable real-world capabilities, effectively collaborating to complete +complex tasks. While these systems are designed with safety mechanisms, such as +rejecting harmful instructions through alignment, their security remains +largely unexplored. This gap leaves LLM-MASs vulnerable to targeted +disruptions. In this paper, we introduce Contagious Recursive Blocking Attacks +(Corba), a novel and simple yet highly effective attack that disrupts +interactions between agents within an LLM-MAS. Corba leverages two key +properties: its contagious nature allows it to propagate across arbitrary +network topologies, while its recursive property enables sustained depletion of +computational resources. 
Notably, these blocking attacks often involve +seemingly benign instructions, making them particularly challenging to mitigate +using conventional alignment methods. We evaluate Corba on two widely-used +LLM-MASs, namely, AutoGen and Camel across various topologies and commercial +models. Additionally, we conduct more extensive experiments in open-ended +interactive LLM-MASs, demonstrating the effectiveness of Corba in complex +topology structures and open-source models. Our code is available at: +https://github.com/zhrli324/Corba. -##### **Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning** -2502.14765v1 by Juraj Vladika, Ivana Hacajová, Florian Matthes +摘要:基於大型語言模型的多主體系統(LLM-MAS)已展現出卓越的真實世界能力,有效地協作以完成複雜任務。儘管這些系統設計有安全機制,例如透過對齊拒絕有害指令,但其安全性仍未得到充分探討。此一缺口讓 LLM-MAS 易受針對性的破壞。在本文中,我們介紹了傳染性遞迴封鎖攻擊(Corba),這是一種新穎且簡單但極為有效的攻擊,會破壞 LLM-MAS 中主體之間的互動。Corba 利用了兩個關鍵特性:其傳染性使其能夠在任意網路拓撲中傳播,而其遞迴特性則能持續耗盡運算資源。值得注意的是,這些封鎖攻擊通常涉及看似良性的指令,這使得使用傳統對齊方法來減輕攻擊特別具有挑戰性。我們在兩個廣泛使用的 LLM-MAS,即 AutoGen 和 Camel 上評估了 Corba,涵蓋了各種拓撲和商業模型。此外,我們在開放式互動 LLM-MAS 中進行了更廣泛的實驗,證明了 Corba 在複雜拓撲結構和開源模型中的有效性。我們的程式碼可在以下網址取得:https://github.com/zhrli324/Corba。 -Fact verification (FV) aims to assess the veracity of a claim based on -relevant evidence. The traditional approach for automated FV includes a -three-part pipeline relying on short evidence snippets and encoder-only -inference models. More recent approaches leverage the multi-turn nature of LLMs -to address FV as a step-by-step problem where questions inquiring additional -context are generated and answered until there is enough information to make a -decision. This iterative method makes the verification process rational and -explainable. While these methods have been tested for encyclopedic claims, -exploration on domain-specific and realistic claims is missing. 
In this work, -we apply an iterative FV system on three medical fact-checking datasets and -evaluate it with multiple settings, including different LLMs, external web -search, and structured reasoning using logic predicates. We demonstrate -improvements in the final performance over traditional approaches and the high -potential of step-by-step FV systems for domain-specific claims. +##### **Small Graph Is All You Need: DeepStateGNN for Scalable Traffic Forecasting** +2502.14525v1 by Yannick Wölker, Arash Hajisafi, Cyrus Shahabi, Matthias Renz -摘要:事實驗證 (FV) 旨在根據相關證據評估主張的真實性。自動化 FV 的傳統方法包括依賴於短證據片段和僅編碼器推論模型的三部分管道。最近的方法利用 LLM 的多輪特性,將 FV 視為一個逐步問題,其中會產生問題來詢問額外背景並回答,直到有足夠的資訊可以做出決定。這種迭代方法使驗證過程合理且可解釋。雖然這些方法已針對百科全書式主張進行測試,但缺乏對特定領域和現實主張的探討。在這項工作中,我們在三個醫學事實查核資料集上應用了一個迭代 FV 系統,並使用多種設定對其進行評估,包括不同的 LLM、外部網路搜尋和使用邏輯謂詞的結構化推理。我們展示了傳統方法的最終效能改進,以及逐步 FV 系統對特定領域主張的高潛力。 +We propose a novel Graph Neural Network (GNN) model, named DeepStateGNN, for +analyzing traffic data, demonstrating its efficacy in two critical tasks: +forecasting and reconstruction. Unlike typical GNN methods that treat each +traffic sensor as an individual graph node, DeepStateGNN clusters sensors into +higher-level graph nodes, dubbed Deep State Nodes, based on various similarity +criteria, resulting in a fixed number of nodes in a Deep State graph. The term +"Deep State" nodes is a play on words, referencing hidden networks of power +that, like these nodes, secretly govern traffic independently of visible +sensors. These Deep State Nodes are defined by several similarity factors, +including spatial proximity (e.g., sensors located nearby in the road network), +functional similarity (e.g., sensors on similar types of freeways), and +behavioral similarity under specific conditions (e.g., traffic behavior during +rain). This clustering approach allows for dynamic and adaptive node grouping, +as sensors can belong to multiple clusters and clusters may evolve over time. 
+Our experimental results show that DeepStateGNN offers superior scalability and +faster training, while also delivering more accurate results than competitors. +It effectively handles large-scale sensor networks, outperforming other methods +in both traffic forecasting and reconstruction accuracy. -##### **MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders** -2502.14753v1 by Maya Varma, Ashwin Kumar, Rogier van der Sluijs, Sophie Ostmeier, Louis Blankemeier, Pierre Chambon, Christian Bluethgen, Jip Prince, Curtis Langlotz, Akshay Chaudhari +摘要:我們提出一個名為 DeepStateGNN 的新穎圖形神經網路 (GNN) 模型,用於分析交通數據,並展示其在兩個關鍵任務中的效能:預測和重建。與將每個交通感測器視為個別圖形節點的典型 GNN 方法不同,DeepStateGNN 會根據各種相似性準則將感測器群集到較高層級的圖形節點中,稱為 Deep State 節點,這會在 Deep State 圖形中產生固定數量的節點。「Deep State」節點這個術語是文字遊戲,指的是隱藏的權力網路,就像這些節點一樣,秘密地獨立於可見感測器管理交通。這些 Deep State 節點由幾個相似性因素定義,包括空間接近性(例如,位於道路網路中附近的感測器)、功能相似性(例如,位於類似類型高速公路上的感測器)以及特定條件下的行為相似性(例如,雨中的交通行為)。這種群集方法允許動態和自適應節點分組,因為感測器可以屬於多個群集,而且群集可能會隨著時間演變。我們的實驗結果顯示,DeepStateGNN 提供了卓越的可擴充性和更快的訓練速度,同時也比競爭對手提供了更準確的結果。它有效地處理了大規模感測器網路,在交通預測和重建準確度方面都優於其他方法。 -Medical images are acquired at high resolutions with large fields of view in -order to capture fine-grained features necessary for clinical decision-making. -Consequently, training deep learning models on medical images can incur large -computational costs. In this work, we address the challenge of downsizing -medical images in order to improve downstream computational efficiency while -preserving clinically-relevant features. We introduce MedVAE, a family of six -large-scale 2D and 3D autoencoders capable of encoding medical images as -downsized latent representations and decoding latent representations back to -high-resolution images. We train MedVAE autoencoders using a novel two-stage -training approach with 1,052,730 medical images. 
Across diverse tasks obtained -from 20 medical image datasets, we demonstrate that (1) utilizing MedVAE latent -representations in place of high-resolution images when training downstream -models can lead to efficiency benefits (up to 70x improvement in throughput) -while simultaneously preserving clinically-relevant features and (2) MedVAE can -decode latent representations back to high-resolution images with high -fidelity. Our work demonstrates that large-scale, generalizable autoencoders -can help address critical efficiency challenges in the medical domain. Our code -is available at https://github.com/StanfordMIMI/MedVAE. +##### **Generative adversarial networks vs large language models: a comparative study on synthetic tabular data generation** +2502.14523v1 by Austin A. Barr, Robert Rozman, Eddie Guo -摘要:医学影像以高解析度和广阔的视野获取,以便捕捉临床决策所需的细微特征。因此,在医学影像上训练深度学习模型可能会产生巨大的计算成本。在这项工作中,我们解决了缩小医学影像以提高下游计算效率同时保留临床相关特征的挑战。我们介绍了 MedVAE,这是一个由六个大型 2D 和 3D 自动编码器组成的系列,能够将医学影像编码为缩小的潜在表示,并将潜在表示解码回高分辨率影像。我们使用一种新颖的两阶段训练方法,利用 1,052,730 张医学影像来训练 MedVAE 自动编码器。在从 20 个医学影像数据集获得的不同任务中,我们证明了 (1) 在训练下游模型时,利用 MedVAE 潜在表示代替高分辨率影像可以带来效率优势(吞吐量提高高达 70 倍),同时保留临床相关特征;(2) MedVAE 可以将潜在表示解码回高分辨率影像,且保真度高。我们的工作表明,大规模、可推广的自动编码器可以帮助解决医学领域的重大效率挑战。我们的代码可在 https://github.com/StanfordMIMI/MedVAE 获得。 +We propose a new framework for zero-shot generation of synthetic tabular +data. Using the large language model (LLM) GPT-4o and plain-language prompting, +we demonstrate the ability to generate high-fidelity tabular data without +task-specific fine-tuning or access to real-world data (RWD) for pre-training. +To benchmark GPT-4o, we compared the fidelity and privacy of LLM-generated +synthetic data against data generated with the conditional tabular generative +adversarial network (CTGAN), across three open-access datasets: Iris, Fish +Measurements, and Real Estate Valuation. 
Despite the zero-shot approach, GPT-4o +outperformed CTGAN in preserving means, 95% confidence intervals, bivariate +correlations, and data privacy of RWD, even at amplified sample sizes. Notably, +correlations between parameters were consistently preserved with appropriate +direction and strength. However, refinement is necessary to better retain +distributional characteristics. These findings highlight the potential of LLMs +in tabular data synthesis, offering an accessible alternative to generative +adversarial networks and variational autoencoders. -##### **Data-Constrained Synthesis of Training Data for De-Identification** -2502.14677v1 by Thomas Vakili, Aron Henriksson, Hercules Dalianis +摘要:我們提出一個新的架構,用於合成表格資料的零次學習產生。利用大型語言模型 (LLM) GPT-4o 和自然語言提示,我們證明了在沒有特定任務微調或取得真實世界資料 (RWD) 進行預訓練的情況下,產生高保真表格資料的能力。為了對 GPT-4o 進行基準測試,我們比較了 LLM 生成的合成資料與使用條件表格生成對抗網路 (CTGAN) 生成的資料在保真度和隱私性方面的表現,比較對象是三個開放取用的資料集:鳶尾花、魚類測量和房地產估價。儘管採用零次學習方法,GPT-4o 在保留平均值、95% 信賴區間、二元關聯和 RWD 的資料隱私方面都優於 CTGAN,即使在擴增的樣本大小下也是如此。值得注意的是,參數之間的關聯始終保持適當的方向和強度。然而,需要進行改進以更好地保留分佈特徵。這些發現突顯了 LLM 在表格資料合成中的潛力,為生成對抗網路和變異自動編碼器提供了可行的替代方案。 -Many sensitive domains -- such as the clinical domain -- lack widely -available datasets due to privacy risks. The increasing generative capabilities -of large language models (LLMs) have made synthetic datasets a viable path -forward. In this study, we domain-adapt LLMs to the clinical domain and -generate synthetic clinical texts that are machine-annotated with tags for -personally identifiable information using capable encoder-based NER models. The -synthetic corpora are then used to train synthetic NER models. The results show -that training NER models using synthetic corpora incurs only a small drop in -predictive performance. The limits of this process are investigated in a -systematic ablation study -- using both Swedish and Spanish data. Our analysis -shows that smaller datasets can be sufficient for domain-adapting LLMs for data -synthesis. 
Instead, the effectiveness of this process is almost entirely -contingent on the performance of the machine-annotating NER models trained -using the original data. +##### **MultiSlav: Using Cross-Lingual Knowledge Transfer to Combat the Curse of Multilinguality** +2502.14509v1 by Artur Kot, Mikołaj Koszowski, Wojciech Chojnowski, Mieszko Rutkowski, Artur Nowakowski, Kamil Guttmann, Mikołaj Pokrywka -摘要:許多敏感領域(例如臨床領域)由於隱私風險而缺乏廣泛可用的資料集。大型語言模型 (LLM) 不斷增強的生成能力已使合成資料集成為可行的途徑。在這項研究中,我們將領域適應 LLM 應用於臨床領域,並生成使用具備編碼器功能的 NER 模型以個人可識別資訊標籤進行機器標註的合成臨床文本。然後使用合成語料庫來訓練合成 NER 模型。結果顯示,使用合成語料庫訓練 NER 模型僅會導致預測效能略微下降。在系統消融研究中調查此程序的限制,同時使用瑞典語和西班牙語資料。我們的分析顯示,較小的資料集足以用於領域適應 LLM 以進行資料合成。相反地,此程序的有效性幾乎完全取決於使用原始資料訓練的機器標註 NER 模型的效能。 +Does multilingual Neural Machine Translation (NMT) lead to The Curse of the +Multlinguality or provides the Cross-lingual Knowledge Transfer within a +language family? In this study, we explore multiple approaches for extending +the available data-regime in NMT and we prove cross-lingual benefits even in +0-shot translation regime for low-resource languages. With this paper, we +provide state-of-the-art open-source NMT models for translating between +selected Slavic languages. We released our models on the HuggingFace Hub +(https://hf.co/collections/allegro/multislav-6793d6b6419e5963e759a683) under +the CC BY 4.0 license. Slavic language family comprises morphologically rich +Central and Eastern European languages. Although counting hundreds of millions +of native speakers, Slavic Neural Machine Translation is under-studied in our +opinion. Recently, most NMT research focuses either on: high-resource languages +like English, Spanish, and German - in WMT23 General Translation Task 7 out of +8 task directions are from or to English; massively multilingual models +covering multiple language groups; or evaluation techniques. 
-##### **ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation** -2502.14637v1 by Angxiao Yue, Zichong Wang, Hongteng Xu +摘要:多語言神經機器翻譯 (NMT) 是否會導致多語言的詛咒,或在語言家族中提供跨語言知識轉移?在這項研究中,我們探討了多種擴展 NMT 中可用資料範圍的方法,並證明了即使在低資源語言的零次學習翻譯中也有跨語言的優點。透過這篇論文,我們提供了最先進的開源 NMT 模型,用於翻譯選定的斯拉夫語。我們在 HuggingFace Hub (https://hf.co/collections/allegro/multislav-6793d6b6419e5963e759a683) 下根據 CC BY 4.0 授權發布我們的模型。斯拉夫語系包含形態豐富的中歐和東歐語言。儘管擁有數億母語人士,但我們認為斯拉夫神經機器翻譯的研究不足。最近,大多數 NMT 研究都專注於:高資源語言,例如英語、西班牙語和德語 - 在 WMT23 一般翻譯任務中,8 個任務方向中有 7 個來自英語或翻譯成英語;涵蓋多個語言群組的大規模多語言模型;或評估技術。 -Protein backbone generation plays a central role in de novo protein design -and is significant for many biological and medical applications. Although -diffusion and flow-based generative models provide potential solutions to this -challenging task, they often generate proteins with undesired designability and -suffer computational inefficiency. In this study, we propose a novel rectified -quaternion flow (ReQFlow) matching method for fast and high-quality protein -backbone generation. In particular, our method generates a local translation -and a 3D rotation from random noise for each residue in a protein chain, which -represents each 3D rotation as a unit quaternion and constructs its flow by -spherical linear interpolation (SLERP) in an exponential format. We train the -model by quaternion flow (QFlow) matching with guaranteed numerical stability -and rectify the QFlow model to accelerate its inference and improve the -designability of generated protein backbones, leading to the proposed ReQFlow -model. Experiments show that ReQFlow achieves state-of-the-art performance in -protein backbone generation while requiring much fewer sampling steps and -significantly less inference time (e.g., being 37x faster than RFDiffusion and -62x faster than Genie2 when generating a backbone of length 300), demonstrating -its effectiveness and efficiency. 
The code is available at -https://github.com/AngxiaoYue/ReQFlow. +##### **Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases** +2502.14507v1 by Rena Gao, Xuetong Wu, Tatsuki Kuribayashi, Mingrui Ye, Siya Qi, Carsten Roever, Yuanxing Liu, Zheng Yuan, Jey Han Lau + +This study evaluates Large Language Models' (LLMs) ability to simulate +non-native-like English use observed in human second language (L2) learners +interfered with by their native first language (L1). In dialogue-based +interviews, we prompt LLMs to mimic L2 English learners with specific L1s +(e.g., Japanese, Thai, Urdu) across seven languages, comparing their outputs to +real L2 learner data. Our analysis examines L1-driven linguistic biases, such +as reference word usage and avoidance behaviors, using information-theoretic +and distributional density measures. Results show that modern LLMs (e.g., +Qwen2.5, LLAMA3.3, DeepseekV3, GPT-4o) replicate L1-dependent patterns observed +in human L2 data, with distinct influences from various languages (e.g., +Japanese, Korean, and Mandarin significantly affect tense agreement, and Urdu +influences noun-verb collocations). Our results reveal the potential of LLMs +for L2 dialogue generation and evaluation for future educational applications. + +摘要:本研究評估大型語言模型 (LLM) 模擬非母語英語使用者的能力,這些使用者會受到母語 (L1) 干擾,而母語是第二語言 (L2) 學習者。在基於對話的訪談中,我們提示 LLM 模仿具有特定 L1(例如日語、泰語、烏爾都語)的 L2 英語學習者,並比較七種語言的輸出與真實的 L2 學習者資料。我們的分析使用資訊理論和分佈密度測量來檢視 L1 驅動的語言偏差,例如參考詞使用和避免行為。結果顯示,現代 LLM(例如 Qwen2.5、LLAMA3.3、DeepseekV3、GPT-4o)複製了在人類 L2 資料中觀察到的 L1 相依模式,並受到各種語言的明顯影響(例如,日語、韓語和普通話顯著影響時態一致性,而烏爾都語影響名詞動詞搭配)。我們的結果揭示了 LLM 在 L2 對話產生和評估方面的潛力,可供未來教育應用使用。 + +##### **PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models** +2502.14504v1 by Yu Meng, Kaiyuan Li, Chenran Huang, Chen Gao, Xinlei Chen, Yong Li, Xiaoping Zhang + +Large Vision-Language Models (LVLMs) have demonstrated remarkable +capabilities across a range of multimodal tasks. 
However, their inference +efficiency is constrained by the large number of visual tokens processed during +decoding. To address this challenge, we propose Per-Layer Per-Head Vision Token +Pruning (PLPHP), a two-level fine-grained pruning method including Layer-Level +Retention Rate Allocation and Head-Level Vision Token Pruning. Motivated by the +Vision Token Re-attention phenomenon across decoder layers, we dynamically +adjust token retention rates layer by layer. Layers that exhibit stronger +attention to visual information preserve more vision tokens, while layers with +lower vision attention are aggressively pruned. Furthermore, PLPHP applies +pruning at the attention head level, enabling different heads within the same +layer to independently retain critical context. Experiments on multiple +benchmarks demonstrate that PLPHP delivers an 18% faster decoding speed and +reduces the Key-Value Cache (KV Cache) size by over 50%, all at the cost of +0.46% average performance drop, while also achieving notable performance +improvements in multi-image tasks. These results highlight the effectiveness of +fine-grained token pruning and contribute to advancing the efficiency and +scalability of LVLMs. Our source code will be made publicly available. 
+ +摘要:大型視覺語言模型 (LVLMs) 已在各種多模態任務中展現出非凡的能力。然而,其推理效率受到解碼過程中處理的大量視覺符號的限制。為了應對這一挑戰,我們提出逐層逐頭視覺符號剪枝 (PLPHP),這是一種包括層級保留率分配和頭級視覺符號剪枝的兩級細粒度剪枝方法。受解碼器層中視覺符號重新關注現象的啟發,我們動態地逐層調整符號保留率。對視覺資訊表現出更強關注力的層保留更多視覺符號,而視覺關注力較低的層則被積極剪枝。此外,PLPHP 在關注頭級別應用剪枝,使同一層中的不同頭部可以獨立保留關鍵上下文。在多個基準測試上的實驗表明,PLPHP 的解碼速度提高了 18%,且將鍵值快取 (KV 快取) 大小減少了 50% 以上,而代價僅為平均效能下降 0.46%,同時還在多影像任務中實現了顯著的效能提升。這些結果突顯了細粒度符號剪枝的有效性,並有助於提升 LVLMs 的效率和可擴充性。我們的原始碼將公開提供。 + +##### **How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?** +2502.14502v1 by Sergey Pletenev, Maria Marina, Daniil Moskovskiy, Vasily Konovalov, Pavel Braslavski, Alexander Panchenko, Mikhail Salnikov + +The performance of Large Language Models (LLMs) on many tasks is greatly +limited by the knowledge learned during pre-training and stored in the model's +parameters. Low-rank adaptation (LoRA) is a popular and efficient training +technique for updating or domain-specific adaptation of LLMs. In this study, we +investigate how new facts can be incorporated into the LLM using LoRA without +compromising the previously learned knowledge. We fine-tuned +Llama-3.1-8B-instruct using LoRA with varying amounts of new knowledge. Our +experiments have shown that the best results are obtained when the training +data contains a mixture of known and new facts. However, this approach is still +potentially harmful because the model's performance on external +question-answering benchmarks declines after such fine-tuning. When the +training data is biased towards certain entities, the model tends to regress to +few overrepresented answers. In addition, we found that the model becomes more +confident and refuses to provide an answer in only few cases. These findings +highlight the potential pitfalls of LoRA-based LLM updates and underscore the +importance of training data composition and tuning parameters to balance new +knowledge integration and general model capabilities. 
-摘要:蛋白骨架生成在從頭蛋白質設計中扮演核心角色,且對於許多生物和醫學應用來說意義重大。儘管擴散和基於流的生成模型提供了解決此項挑戰性任務的潛在方案,但它們經常生成具有不受歡迎的可設計性的蛋白質,且遭受運算效率不彰之苦。在本研究中,我們提出了一種新穎的修正四元數流 (ReQFlow) 匹配方法,用於快速且高品質的蛋白質骨架生成。特別是,我們的模型會為蛋白質鏈中的每個殘基從隨機雜訊中生成一個局部平移和一個 3D 旋轉,將每個 3D 旋轉表示為單位四元數,並以指數格式透過球面線性插值 (SLERP) 建構其流。我們透過四元數流 (QFlow) 匹配訓練模型,並保證數值穩定性,並修正 QFlow 模型以加速其推論並改善生成蛋白質骨架的可設計性,進而提出建議的 ReQFlow 模型。實驗顯示,ReQFlow 在蛋白質骨架生成中達成最先進的效能,同時所需採樣步驟少得多,且推論時間大幅減少(例如,在生成長度為 300 的骨架時比 RFDiffusion 快 37 倍,比 Genie2 快 62 倍),證明其有效性和效率。程式碼可在 https://github.com/AngxiaoYue/ReQFlow 取得。 +摘要:大型語言模型 (LLM) 在許多任務上的表現受到預訓練期間學到的知識和儲存在模型參數中的知識的極大限制。低階適應 (LoRA) 是一種流行且有效的訓練技術,用於更新或 LLM 的特定領域適應。在這項研究中,我們探討如何使用 LoRA 將新事實納入 LLM,同時不損害先前學到的知識。我們使用不同數量的知識微調 Llama-3.1-8B-instruct。我們的實驗表明,當訓練資料包含已知和新事實的混合時,會獲得最佳結果。然而,這種方法仍然具有潛在的危害性,因為模型在外部問答基準上的表現會在這種微調後下降。當訓練資料偏向於某些實體時,模型傾向於回歸到少數過度表示的答案。此外,我們發現模型變得更有信心,並且在極少數情況下拒絕提供答案。這些發現突顯了基於 LoRA 的 LLM 更新的潛在缺點,並強調了訓練資料組成和調整參數以平衡新知識整合和一般模型能力的重要性。 -##### **MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models** -2502.14302v1 by Shrey Pandit, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, Ying Ding +##### **Towards a Perspectivist Turn in Argument Quality Assessment** +2502.14501v1 by Julia Romberg, Maximilian Maurer, Henning Wachsmuth, Gabriella Lapesa -Advancements in Large Language Models (LLMs) and their increasing use in -medical question-answering necessitate rigorous evaluation of their -reliability. A critical challenge lies in hallucination, where models generate -plausible yet factually incorrect outputs. In the medical domain, this poses -serious risks to patient safety and clinical decision-making. To address this, -we introduce MedHallu, the first benchmark specifically designed for medical -hallucination detection. MedHallu comprises 10,000 high-quality question-answer -pairs derived from PubMedQA, with hallucinated answers systematically generated -through a controlled pipeline. 
Our experiments show that state-of-the-art LLMs, -including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, -struggle with this binary hallucination detection task, with the best model -achieving an F1 score as low as 0.625 for detecting "hard" category -hallucinations. Using bidirectional entailment clustering, we show that -harder-to-detect hallucinations are semantically closer to ground truth. -Through experiments, we also show incorporating domain-specific knowledge and -introducing a "not sure" category as one of the answer categories improves the -precision and F1 scores by up to 38% relative to baselines. +The assessment of argument quality depends on well-established logical, +rhetorical, and dialectical properties that are unavoidably subjective: +multiple valid assessments may exist, there is no unequivocal ground truth. +This aligns with recent paths in machine learning, which embrace the +co-existence of different perspectives. However, this potential remains largely +unexplored in NLP research on argument quality. One crucial reason seems to be +the yet unexplored availability of suitable datasets. We fill this gap by +conducting a systematic review of argument quality datasets. We assign them to +a multi-layered categorization targeting two aspects: (a) What has been +annotated: we collect the quality dimensions covered in datasets and +consolidate them in an overarching taxonomy, increasing dataset comparability +and interoperability. (b) Who annotated: we survey what information is given +about annotators, enabling perspectivist research and grounding our +recommendations for future actions. To this end, we discuss datasets suitable +for developing perspectivist models (i.e., those containing individual, +non-aggregated annotations), and we showcase the importance of a controlled +selection of annotators in a pilot study. 
-摘要:大型語言模型 (LLM) 的進步及其在醫療問答中的使用日益增加,因此需要嚴格評估其可靠性。一個關鍵的挑戰在於幻覺,模型會產生看似合理但事實上不正確的輸出。在醫療領域,這對患者安全和臨床決策構成嚴重風險。為了解決此問題,我們推出了 MedHallu,這是第一個專門設計用於檢測醫療幻覺的基準。MedHallu 包含 10,000 個從 PubMedQA 衍生的高品質問答對,並透過受控管道系統性地產生幻覺答案。我們的實驗顯示,包括 GPT-4o、Llama-3.1 和經過醫學微調的 UltraMedical 在內的最新 LLM 難以執行這個二元幻覺檢測任務,最佳模型在檢測「困難」類別幻覺時達到的 F1 分數低至 0.625。使用雙向蘊涵聚類,我們表明較難檢測的幻覺在語義上更接近真實。透過實驗,我們還表明,納入特定領域的知識並將「不確定」類別作為其中一個答案類別,可以將精確度和 F1 分數相對於基線提高多達 38%。 +摘要:論證品質的評估取決於根深蒂固的邏輯、修辭和辯證屬性,這些屬性難免具有主觀性:可能存在多種有效的評估,沒有明確的真實依據。這與機器學習中最近的途徑一致,這些途徑接受了不同觀點的共存。然而,這種潛力在論證品質的 NLP 研究中仍然很大程度上未被探索。一個關鍵原因似乎是尚未探索合適的資料集的可用性。我們通過對論證品質資料集進行系統性回顧來填補這一空白。我們將它們分配到一個多層次分類,針對兩個方面:(a) 已註釋的內容:我們收集資料集中涵蓋的品質維度,並將它們整合到一個總體分類法中,提高資料集的可比性和互操作性。(b) 誰做了註釋:我們調查了關於註釋者的哪些資訊,使觀點主義研究成為可能,並為我們對未來行動的建議奠定基礎。為此,我們討論了適合開發觀點主義模型的資料集(即那些包含個別、非聚合註釋的資料集),並在試驗研究中展示了受控選擇註釋者的重要性。 -##### **EyeBench: A Call for More Rigorous Evaluation of Retinal Image Enhancement** -2502.14260v1 by Wenhui Zhu, Xuanzhao Dong, Xin Li, Yujian Xiong, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Zhangsihao Yang, Yi Su, Oana Dumitrascu, Yalin Wang +##### **MLGym: A New Framework and Benchmark for Advancing AI Research Agents** +2502.14499v1 by Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, Roberta Raileanu -Over the past decade, generative models have achieved significant success in -enhancement fundus images.However, the evaluation of these models still -presents a considerable challenge. A comprehensive evaluation benchmark for -fundus image enhancement is indispensable for three main reasons: 1) The -existing denoising metrics (e.g., PSNR, SSIM) are hardly to extend to -downstream real-world clinical research (e.g., Vessel morphology consistency). 
-2) There is a lack of comprehensive evaluation for both paired and unpaired -enhancement methods, along with the need for expert protocols to accurately -assess clinical value. 3) An ideal evaluation system should provide insights to -inform future developments of fundus image enhancement. To this end, we propose -a novel comprehensive benchmark, EyeBench, to provide insights that align -enhancement models with clinical needs, offering a foundation for future work -to improve the clinical relevance and applicability of generative models for -fundus image enhancement. EyeBench has three appealing properties: 1) -multi-dimensional clinical alignment downstream evaluation: In addition to -evaluating the enhancement task, we provide several clinically significant -downstream tasks for fundus images, including vessel segmentation, DR grading, -denoising generalization, and lesion segmentation. 2) Medical expert-guided -evaluation design: We introduce a novel dataset that promote comprehensive and -fair comparisons between paired and unpaired methods and includes a manual -evaluation protocol by medical experts. 3) Valuable insights: Our benchmark -study provides a comprehensive and rigorous evaluation of existing methods -across different downstream tasks, assisting medical experts in making informed -choices. Additionally, we offer further analysis of the challenges faced by -existing methods. The code is available at -\url{https://github.com/Retinal-Research/EyeBench} +We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for +evaluating and developing LLM agents on AI research tasks. This is the first +Gym environment for machine learning (ML) tasks, enabling research on +reinforcement learning (RL) algorithms for training such agents. MLGym-bench +consists of 13 diverse and open-ended AI research tasks from diverse domains +such as computer vision, natural language processing, reinforcement learning, +and game theory. 
Solving these tasks requires real-world AI research skills +such as generating new ideas and hypotheses, creating and processing data, +implementing ML methods, training models, running experiments, analyzing the +results, and iterating through this process to improve on a given task. We +evaluate a number of frontier large language models (LLMs) on our benchmarks +such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 +Pro. Our MLGym framework makes it easy to add new tasks, integrate and evaluate +models or agents, generate synthetic data at scale, as well as develop new +learning algorithms for training agents on AI research tasks. We find that +current frontier models can improve on the given baselines, usually by finding +better hyperparameters, but do not generate novel hypotheses, algorithms, +architectures, or substantial improvements. We open-source our framework and +benchmark to facilitate future research in advancing the AI research +capabilities of LLM agents. 
-摘要:在過去的十年中,生成模型在增強眼底影像方面取得了顯著的成功。然而,這些模型的評估仍然是一個相當大的挑戰。一個全面的眼底影像增強評估基準對於三個主要原因是不可或缺的:1) 現有的去噪指標(例如 PSNR、SSIM)很難擴展到下游的真實世界臨床研究(例如血管形態一致性)。2) 缺乏對配對和非配對增強方法的全面評估,以及需要專家協議來準確評估臨床價值。3) 一個理想的評估系統應該提供見解,以告知眼底影像增強的未來發展。為此,我們提出了一個新的綜合基準 EyeBench,以提供見解,將增強模型與臨床需求相結合,為未來的研究奠定基礎,以提高生成模型在眼底影像增強方面的臨床相關性和適用性。EyeBench 有三個吸引人的特性:1) 多維臨床對齊下游評估:除了評估增強任務外,我們還為眼底影像提供了幾個臨床上重要的下游任務,包括血管分割、DR 分級、去噪泛化和病灶分割。2) 醫學專家指導的評估設計:我們引入了一個新的數據集,以促進對配對和非配對方法的全面和公平比較,並包括由醫學專家進行的手動評估協議。3) 有價值的見解:我們的基準研究提供了對現有方法在不同下游任務中的全面且嚴格的評估,協助醫學專家做出明智的選擇。此外,我們還進一步分析了現有方法面臨的挑戰。程式碼可在 \url{https://github.com/Retinal-Research/EyeBench} 獲得 +摘要:我們推出 Meta MLGym 和 MLGym-Bench,一個用於評估和開發 AI 研究任務中 LLM 代理的新架構和基準。這是第一個用於機器學習 (ML) 任務的 Gym 環境,可針對訓練此類代理的強化學習 (RL) 演算法進行研究。MLGym-bench 包含 13 項來自不同領域的開放式 AI 研究任務,例如電腦視覺、自然語言處理、強化學習和博弈論。解決這些任務需要實際的 AI 研究技能,例如產生新想法和假設、建立和處理資料、實作 ML 方法、訓練模型、執行實驗、分析結果,並透過此流程反覆運算來改善特定任務。我們在基準上評估許多前沿大型語言模型 (LLM),例如 Claude-3.5-Sonnet、Llama-3.1 405B、GPT-4o、o1-preview 和 Gemini-1.5 Pro。我們的 MLGym 架構讓新增任務、整合和評估模型或代理、大規模產生合成資料,以及開發新的學習演算法以訓練 AI 研究任務中的代理變得容易。我們發現目前的邊界模型可以改善既定的基準,通常是透過尋找更好的超參數,但不會產生新穎的假設、演算法、架構或實質性的改進。我們開放原始碼架構和基準,以促進未來在提升 LLM 代理的 AI 研究能力方面的研究。 -##### **Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning** -2502.14086v1 by Cole Gawin, Yidan Sun, Mayank Kejriwal +##### **Stories that (are) Move(d by) Markets: A Causal Exploration of Market Shocks and Semantic Shifts across Different Partisan Groups** +2502.14497v1 by Felix Drinkall, Stefan Zohren, Michael McMahon, Janet B. Pierrehumbert -Large language models (LLMs) have achieved remarkable performance in -generating human-like text and solving reasoning tasks of moderate complexity, -such as question-answering and mathematical problem-solving. However, their -capabilities in tasks requiring deeper cognitive skills, such as common-sense -understanding and abstract reasoning, remain under-explored. 
In this paper, we -systematically evaluate abstract common-sense reasoning in LLMs using the -ConceptNet knowledge graph. We propose two prompting approaches: instruct -prompting, where models predict plausible semantic relationships based on -provided definitions, and few-shot prompting, where models identify relations -using examples as guidance. Our experiments with the gpt-4o-mini model show -that in instruct prompting, consistent performance is obtained when ranking -multiple relations but with substantial decline when the model is restricted to -predicting only one relation. In few-shot prompting, the model's accuracy -improves significantly when selecting from five relations rather than the full -set, although with notable bias toward certain relations. These results suggest -significant gaps still, even in commercially used LLMs' abstract common-sense -reasoning abilities, compared to human-level understanding. However, the -findings also highlight the promise of careful prompt engineering, based on -selective retrieval, for obtaining better performance. +Macroeconomic fluctuations and the narratives that shape them form a mutually +reinforcing cycle: public discourse can spur behavioural changes leading to +economic shifts, which then result in changes in the stories that propagate. We +show that shifts in semantic embedding space can be causally linked to +financial market shocks -- deviations from the expected market behaviour. +Furthermore, we show how partisanship can influence the predictive power of +text for market fluctuations and shape reactions to those same shocks. We also +provide some evidence that text-based signals are particularly salient during +unexpected events such as COVID-19, highlighting the value of language data as +an exogenous variable in economic forecasting. Our findings underscore the +bidirectional relationship between news outlets and market shocks, offering a +novel empirical approach to studying their effect on each other. 
-摘要:大型語言模型 (LLM) 在生成類人文本和解決中等複雜度推理任務方面取得了顯著的成果,例如問答和數學問題解決。然而,它們在需要更深層認知技能的任務中的能力,例如常識理解和抽象推理,仍然處於探索不足的階段。在本文中,我們使用 ConceptNet 知識圖系統地評估了 LLM 中的抽象常識推理。我們提出了兩種提示方法:指導提示,其中模型根據提供的定義預測合理的語義關係,以及少次提示,其中模型使用示例作為指導來識別關係。我們使用 gpt-4o-mini 模型進行的實驗表明,在指導提示中,在對多個關係進行排名時獲得了一致的性能,但在模型僅限於預測一個關係時大幅下降。在少次提示中,模型在從五個關係中選擇而不是從完整集合中選擇時,其準確性顯著提高,儘管對某些關係存在顯著偏差。這些結果表明,與人類層面的理解相比,即使在商業使用的 LLM 中,抽象常識推理能力仍然存在顯著差距。然而,這些發現也強調了基於選擇性檢索的仔細提示工程的希望,以獲得更好的性能。 +摘要:宏觀經濟波動與形塑它們的敘事形成一個相互強化的循環:公共論述可能激發導致經濟變化的行為改變,進而導致宣傳故事的改變。我們表明,語義嵌入空間的轉變可能與金融市場震盪(與預期的市場行為的偏差)有因果關係。此外,我們展示了黨派立場如何影響文字對市場波動的預測能力,以及如何形塑對這些震盪的反應。我們還提供了一些證據,證明在 COVID-19 等意外事件期間,基於文字的信號特別顯著,突顯了語言資料在經濟預測中作為外生變數的價值。我們的研究結果強調了新聞媒體與市場震盪之間的雙向關係,提供了一種研究它們對彼此影響的新穎實證方法。 -##### **Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging** -2502.14064v1 by Shansong Wang, Mojtaba Safari, Qiang Li, Chih-Wei Chang, Richard LJ Qiu, Justin Roper, David S. Yu, Xiaofeng Yang +##### **Enhancing Language Multi-Agent Learning with Multi-Agent Credit Re-Assignment for Interactive Environment Generalization** +2502.14496v1 by Zhitao He, Zijun Liu, Peng Li, May Fung, Ming Yan, Ji Zhang, Fei Huang, Yang Liu -Vision foundation models (VFMs) are pre-trained on extensive image datasets -to learn general representations for diverse types of data. These models can -subsequently be fine-tuned for specific downstream tasks, significantly -boosting performance across a broad range of applications. However, existing -vision foundation models that claim to be applicable to various radiology tasks -are mostly pre-trained on 3D computed tomography (CT), which benefits from the -availability of extensive 3D CT databases. Significant differences between CT -and magnetic resonance imaging (MRI) in imaging principles, signal -characteristics, and data distribution may hinder their practical performance -and versatility in MRI-specific applications. Here, we propose Triad, a vision -foundation model for 3D MRI. 
Triad adopts a widely used autoencoder -architecture to learn robust representations from 131,170 3D MRI volumes and -uses organ-independent imaging descriptions to constrain the semantic -distribution of the visual modality. The above pre-training dataset is called -Triad-131K, which is currently the largest 3D MRI pre-training dataset. We -evaluate Triad across three tasks, namely, organ/tumor segmentation, -organ/cancer classification, and medical image registration, in two data -modalities (within-domain and out-of-domain) settings using 25 downstream -datasets. By initializing models with Triad's pre-trained weights, nnUNet-Triad -improves segmentation performance by 6.88% compared to nnUNet-Scratch across 17 -datasets. Swin-B-Triad achieves a 3.97% improvement over Swin-B-Scratch in -classification tasks across five datasets. SwinUNETR-Triad improves by 4.00% -compared to SwinUNETR-Scratch in registration tasks across two datasets. Our -study demonstrates that pre-training can maximize performance when the data -modalities and organs of upstream and downstream tasks are consistent. +LLM-based agents have made significant advancements in interactive +environments, such as mobile operations and web browsing, and other domains +beyond computer using. Current multi-agent systems universally excel in +performance, compared to single agents, but struggle with generalization across +environments due to predefined roles and inadequate strategies for generalizing +language agents. The challenge of achieving both strong performance and good +generalization has hindered the progress of multi-agent systems for interactive +environments. 
To address these issues, we propose CollabUIAgents, a multi-agent +reinforcement learning framework with a novel multi-agent credit re-assignment +(CR) strategy, assigning process rewards with LLMs rather than +environment-specific rewards and learning with synthesized preference data, in +order to foster generalizable, collaborative behaviors among the role-free +agents' policies. Empirical results show that our framework improves both +performance and cross-environment generalizability of multi-agent systems. +Moreover, our 7B-parameter system achieves results on par with or exceed strong +closed-source models, and the LLM that guides the CR. We also provide insights +in using granular CR rewards effectively for environment generalization, and +accommodating trained LLMs in multi-agent systems. -摘要:視覺基礎模型 (VFM) 在廣泛的影像資料集上進行預訓練,以學習各種資料類型的通用表示。這些模型隨後可以針對特定的下游任務進行微調,大幅提升各種應用程式的效能。然而,現有的視覺基礎模型聲稱適用於各種放射學任務,但大多是針對 3D 電腦斷層攝影 (CT) 進行預訓練,這得利於廣泛的 3D CT 資料庫。CT 和磁振造影 (MRI) 在影像原理、訊號特性和資料分佈上的顯著差異,可能會阻礙其在 MRI 特定應用中的實際效能和多功能性。在此,我們提出 Triad,一個適用於 3D MRI 的視覺基礎模型。Triad 採用廣泛使用的自動編碼器架構,從 131,170 個 3D MRI 體積中學習穩健的表示,並使用與器官無關的影像描述來約束視覺模式的語義分佈。上述預訓練資料集稱為 Triad-131K,目前是最大的 3D MRI 預訓練資料集。我們在三個任務中評估 Triad,即器官/腫瘤分割、器官/癌症分類和醫學影像配準,在兩個資料模式(域內和域外)設定中使用 25 個下游資料集。透過使用 Triad 的預訓練權重初始化模型,nnUNet-Triad 在 17 個資料集中的分割效能比 nnUNet-Scratch 提升了 6.88%。Swin-B-Triad 在五個資料集的分類任務中,比 Swin-B-Scratch 提升了 3.97%。SwinUNETR-Triad 在兩個資料集的配準任務中,比 SwinUNETR-Scratch 提升了 4.00%。我們的研究證明,當上游和下游任務的資料模式和器官一致時,預訓練可以最大化效能。 +摘要:基於 LLM 的代理在互動式環境中取得重大進展,例如行動運算和網頁瀏覽,以及電腦使用以外的其他領域。與單一代理相比,目前的 Multi-Agent 系統在效能上普遍表現出色,但由於預先定義的角色和不適當的語言代理概化策略,導致難以跨環境概化。在互動式環境中,同時達成強大效能和良好概化的挑戰,阻礙了 Multi-Agent 系統的進展。為了解決這些問題,我們提出 CollabUIAgents,這是一個 Multi-Agent 強化學習架構,具備創新的 Multi-Agent 信用重新分配 (CR) 策略,使用 LLM 而不是特定於環境的獎勵來分配程序獎勵,並透過綜合偏好資料進行學習,以促進無角色代理政策之間可概化的協作行為。經驗結果顯示,我們的架構同時改善了 Multi-Agent 系統的效能和跨環境概化能力。此外,我們的 7B 參數系統在效能上與強大的閉源模型和引導 CR 的 LLM 相當或超越它們。我們也提供見解,說明如何有效地使用細粒化的 CR 獎勵來進行環境概化,以及如何在 Multi-Agent 系統中容納受過訓練的 LLM。 -##### **VITAL: A New Dataset for 
Benchmarking Pluralistic Alignment in Healthcare** -2502.13775v1 by Anudeex Shetty, Amin Beheshti, Mark Dras, Usman Naseem +##### **StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following** +2502.14494v1 by Jinnan Li, Jinzhe Li, Yue Wang, Yi Chang, Yuan Wu -Alignment techniques have become central to ensuring that Large Language -Models (LLMs) generate outputs consistent with human values. However, existing -alignment paradigms often model an averaged or monolithic preference, failing -to account for the diversity of perspectives across cultures, demographics, and -communities. This limitation is particularly critical in health-related -scenarios, where plurality is essential due to the influence of culture, -religion, personal values, and conflicting opinions. Despite progress in -pluralistic alignment, no prior work has focused on health, likely due to the -unavailability of publicly available datasets. To address this gap, we -introduce VITAL, a new benchmark dataset comprising 13.1K value-laden -situations and 5.4K multiple-choice questions focused on health, designed to -assess and benchmark pluralistic alignment methodologies. Through extensive -evaluation of eight LLMs of varying sizes, we demonstrate that existing -pluralistic alignment techniques fall short in effectively accommodating -diverse healthcare beliefs, underscoring the need for tailored AI alignment in -specific domains. This work highlights the limitations of current approaches -and lays the groundwork for developing health-specific alignment solutions. +Multi-turn instruction following capability constitutes a core competency of +large language models (LLMs) in real-world applications. Existing evaluation +benchmarks predominantly focus on fine-grained constraint satisfaction and +domain-specific capability assessment, yet overlook the crucial structural +dependency between dialogue turns that distinguishes multi-turn from +single-turn interactions. 
This structural dependency not only reflects user +intent but also establishes a second dimension for instruction following +evaluation beyond constraint satisfaction. To address this gap, we propose +StructFlowBench, a multi-turn instruction following benchmark with structural +flow modeling. The benchmark innovatively defines a structural flow framework +comprising six fundamental inter-turn relationships, which not only introduces +novel structural constraints for model evaluation but also serves as generation +parameters for creating customized dialogue flows tailored to specific +scenarios. Adopting established LLM-based automatic evaluation methodologies, +we conduct systematic evaluations of 13 leading open-source and closed-source +LLMs. Experimental results reveal significant deficiencies in current models' +comprehension of multi-turn dialogue structures. The code is available at +\url{https://github.com/MLGroupJLU/StructFlowBench}. -摘要:對齊技術已成為確保大型語言模型 (LLM) 產生與人類價值觀一致的輸出的核心。然而,現有的對齊範例通常會建模平均或單一的偏好,無法考量跨文化、人口統計和社群的不同觀點。此限制在與健康相關的場景中特別重要,因為在這種場景中,由於文化、宗教、個人價值觀和相互衝突的意見的影響,多元性是必要的。儘管多元對齊已取得進展,但沒有任何先前的工作專注於健康,這可能是因為缺乏公開可用的資料集。為了解決此差距,我們引入了 VITAL,這是一個新的基準資料集,包含 13.1K 個價值觀念的情境和 5.4K 個選擇題,專注於健康,旨在評估和基準多元對齊方法。透過對八個不同規模的 LLM 進行廣泛評估,我們證明現有的多元對齊技術無法有效適應不同的醫療保健信念,這強調了在特定領域中需要量身打造的 AI 對齊。這項工作突顯了當前方法的限制,並為開發特定於健康的對齊解決方案奠定了基礎。 +摘要:多輪指令遵循能力構成大型語言模型 (LLM) 在現實世界應用中的核心能力。現有的評估基準主要專注於細粒度的約束滿足和特定領域的能力評估,卻忽略了多輪與單輪互動之間區別的關鍵結構依賴性。這種結構依賴性不僅反映了使用者的意圖,也為指令遵循評估建立了超越約束滿足的第二個維度。為了解決這個差距,我們提出了 StructFlowBench,一個具有結構流建模的多輪指令遵循基準。該基準創新地定義了一個結構流框架,包含六個基本的回合間關係,這不僅引入了模型評估的新結構約束,還可用作生成參數,用於創建針對特定場景定制的對話流。採用已建立的基於 LLM 的自動評估方法,我們對 13 個領先的開源和閉源 LLM 進行了系統評估。實驗結果揭示了當前模型在理解多輪對話結構方面存在顯著缺陷。程式碼可在 \url{https://github.com/MLGroupJLU/StructFlowBench} 取得。 -##### **PeerQA: A Scientific Question Answering Dataset from Peer Reviews** -2502.13668v1 by Tim Baumgärtner, Ted Briscoe, Iryna Gurevych +##### **Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk** +2502.14491v1 by 
Elija Perrier -We present PeerQA, a real-world, scientific, document-level Question -Answering (QA) dataset. PeerQA questions have been sourced from peer reviews, -which contain questions that reviewers raised while thoroughly examining the -scientific article. Answers have been annotated by the original authors of each -paper. The dataset contains 579 QA pairs from 208 academic articles, with a -majority from ML and NLP, as well as a subset of other scientific communities -like Geoscience and Public Health. PeerQA supports three critical tasks for -developing practical QA systems: Evidence retrieval, unanswerable question -classification, and answer generation. We provide a detailed analysis of the -collected dataset and conduct experiments establishing baseline systems for all -three tasks. Our experiments and analyses reveal the need for -decontextualization in document-level retrieval, where we find that even simple -decontextualization approaches consistently improve retrieval performance -across architectures. On answer generation, PeerQA serves as a challenging -benchmark for long-context modeling, as the papers have an average size of 12k -tokens. Our code and data is available at https://github.com/UKPLab/peerqa. +Evaluating AI safety requires statistically rigorous methods and risk metrics +for understanding how the use of AI affects aggregated risk. However, much AI +safety literature focuses upon risks arising from AI models in isolation, +lacking consideration of how modular use of AI affects risk distribution of +workflow components or overall risk metrics. There is also a lack of +statistical grounding enabling sensitisation of risk models in the presence or +absence of AI to estimate causal contributions of AI. This is in part due to +the dearth of AI impact data upon which to fit distributions. In this work, we +address these gaps in two ways. 
First, we demonstrate how scenario modelling +(grounded in established statistical techniques such as Markov chains, copulas +and Monte Carlo simulation) can be used to model AI risk holistically. Second, +we show how lookalike distributions from phenomena analogous to AI can be used +to estimate AI impacts in the absence of directly observable data. We +demonstrate the utility of our methods for benchmarking cumulative AI risk via +risk analysis of a logistic scenario simulations. -摘要:我們提出 PeerQA,一個真實世界、科學的、文件層級的問答 (QA) 資料集。PeerQA 問題來自於同行評審,其中包含審查者在徹底審查科學文章時提出的問題。答案是由每篇論文的原始作者註解的。此資料集包含來自 208 篇學術文章的 579 個 QA 對,其中大部分來自 ML 和 NLP,以及其他科學社群(例如地球科學和公共衛生)的子集。PeerQA 支援開發實用 QA 系統的三項重要任務:證據檢索、無解答問題分類和答案產生。我們提供收集到的資料集的詳細分析,並進行實驗,為所有三項任務建立基準系統。我們的實驗和分析揭示了在文件層級檢索中去脈絡化的必要性,我們發現即使是簡單的去脈絡化方法也能持續改善跨架構的檢索效能。在答案產生方面,PeerQA 是一個用於長脈絡建模的具挑戰性基準,因為論文的平均大小為 12k 個符號。我們的程式碼和資料可於 https://github.com/UKPLab/peerqa 取得。 +摘要:評估 AI 安全性需要嚴格的統計方法和風險指標,以了解 AI 的使用如何影響累積風險。然而,許多 AI 安全性文獻著重於 AI 模型孤立產生的風險,缺乏考量 AI 的模組化使用如何影響工作流程組件的風險分佈或整體風險指標。在有或沒有 AI 的情況下,統計基礎也缺乏讓風險模型敏感化的能力,以估計 AI 的因果關係貢獻。這部分是因為缺乏 AI 影響資料來擬合分佈。在這項研究中,我們以兩種方式解決這些差距。首先,我們展示情境建模(建立在已建立的統計技術上,例如馬可夫鏈、copula 和蒙地卡羅模擬)如何用於整體建模 AI 風險。其次,我們展示如何使用類似於 AI 現象的相似分佈來估計在沒有直接可觀察資料的情況下 AI 的影響。我們透過後勤情境模擬的風險分析,展示了我們的方法對於評量累積 AI 風險的效用。 -##### **Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs** -2502.13555v1 by Yushi Feng, Tsai Hor Chan, Guosheng Yin, Lequan Yu +##### **Temporal Misalignment and Probabilistic Neurons** +2502.14487v1 by Velibor Bojković, Xiaofeng Wu, Bin Gu -Data augmentation is necessary for graph representation learning due to the -scarcity and noise present in graph data. Most of the existing augmentation -methods overlook the context information inherited from the dataset as they -rely solely on the graph structure for augmentation. 
Despite the success of -some large language model-based (LLM) graph learning methods, they are mostly -white-box which require access to the weights or latent features from the -open-access LLMs, making them difficult to be democratized for everyone as -existing LLMs are mostly closed-source for commercial considerations. To -overcome these limitations, we propose a black-box context-driven graph data -augmentation approach, with the guidance of LLMs -- DemoGraph. Leveraging the -text prompt as context-related information, we task the LLM with generating -knowledge graphs (KGs), which allow us to capture the structural interactions -from the text outputs. We then design a dynamic merging schema to -stochastically integrate the LLM-generated KGs into the original graph during -training. To control the sparsity of the augmented graph, we further devise a -granularity-aware prompting strategy and an instruction fine-tuning module, -which seamlessly generates text prompts according to different granularity -levels of the dataset. Extensive experiments on various graph learning tasks -validate the effectiveness of our method over existing graph data augmentation -methods. Notably, our approach excels in scenarios involving electronic health -records (EHRs), which validates its maximal utilization of contextual -knowledge, leading to enhanced predictive performance and interpretability. +Spiking Neural Networks (SNNs) offer a more energy-efficient alternative to +Artificial Neural Networks (ANNs) by mimicking biological neural principles, +establishing them as a promising approach to mitigate the increasing energy +demands of large-scale neural models. However, fully harnessing the +capabilities of SNNs remains challenging due to their discrete signal +processing and temporal dynamics. ANN-SNN conversion has emerged as a practical +approach, enabling SNNs to achieve competitive performance on complex machine +learning tasks. 
In this work, we identify a phenomenon in the ANN-SNN +conversion framework, termed temporal misalignment, in which random spike +rearrangement across SNN layers leads to performance improvements. Based on +this observation, we introduce biologically plausible two-phase probabilistic +(TPP) spiking neurons, further enhancing the conversion process. We demonstrate +the advantages of our proposed method both theoretically and empirically +through comprehensive experiments on CIFAR-10/100, CIFAR10-DVS, and ImageNet +across a variety of architectures, achieving state-of-the-art results. -摘要:由於圖表資料的稀少性和雜訊,資料擴充對於圖表表示學習來說是必要的。現有的擴充方法大多忽略了從資料集中繼承的背景資訊,因為它們僅依賴於圖表的結構進行擴充。儘管一些大型語言模型 (LLM) 基於圖表學習方法獲得成功,但它們大多是白盒,需要存取開放式 LLM 的權重或潛在特徵,由於現有的 LLM 主要基於商業考量而封閉原始碼,因此難以讓所有人都能使用。為了克服這些限制,我們提出了一個黑盒背景驅動圖表資料擴充方法,在 LLM 的指導下——DemoGraph。利用文字提示作為與背景相關的資訊,我們讓 LLM 產生知識圖譜 (KG),這讓我們能夠從文字輸出中擷取結構化互動。然後,我們設計了一個動態合併模式,在訓練期間將 LLM 產生的 KG 隨機整合到原始圖表中。為了控制擴充圖表的稀疏性,我們進一步設計了一個粒度感知提示策略和一個指令微調模組,它可以根據資料集的不同粒度層級無縫產生文字提示。在各種圖表學習任務上的大量實驗驗證了我們的方法比現有的圖表資料擴充方法更有效。值得注意的是,我們的做法在涉及電子健康記錄 (EHR) 的場景中表現出色,這驗證了它對上下文知識的最大利用,從而提高了預測效能和可解釋性。 +摘要:脈衝神經網路 (SNN) 模仿生物神經原理,提供了一種比人工神經網路 (ANN) 更省能的替代方案,確立了它們作為緩解大型神經模型日益增長能耗需求的一種有前途的方法。然而,由於 SNN 的離散訊號處理和時間動態,要充分利用 SNN 的功能仍然具有挑戰性。ANN-SNN 轉換已經成為一種實用的方法,使 SNN 能夠在複雜機器學習任務中實現競爭性能。在這項工作中,我們在 ANN-SNN 轉換框架中發現了一種現象,稱為時間錯位,其中隨機脈衝在 SNN 層之間重新排列會導致性能提升。基於這一觀察,我們引入了生物學上合理的兩階段機率 (TPP) 脈衝神經元,進一步增強了轉換過程。我們通過在 CIFAR-10/100、CIFAR10-DVS 和 ImageNet 上對各種架構進行綜合實驗,從理論和經驗上證明了我們提出的方法的優點,取得了最先進的結果。 -##### **MobileViM: A Light-weight and Dimension-independent Vision Mamba for 3D Medical Image Analysis** -2502.13524v1 by Wei Dai, Steven Wang, Jun Liu +##### **How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation** +2502.14486v1 by Zhuohang Long, Siyuan Wang, Shujun Liu, Yuhang Lai, Xuanjing Huang, Zhongyu Wei -Efficient evaluation of three-dimensional (3D) medical images is crucial for -diagnostic and therapeutic practices in healthcare. 
Recent years have seen a -substantial uptake in applying deep learning and computer vision to analyse and -interpret medical images. Traditional approaches, such as convolutional neural -networks (CNNs) and vision transformers (ViTs), face significant computational -challenges, prompting the need for architectural advancements. Recent efforts -have led to the introduction of novel architectures like the ``Mamba'' model as -alternative solutions to traditional CNNs or ViTs. The Mamba model excels in -the linear processing of one-dimensional data with low computational demands. -However, Mamba's potential for 3D medical image analysis remains underexplored -and could face significant computational challenges as the dimension increases. -This manuscript presents MobileViM, a streamlined architecture for efficient -segmentation of 3D medical images. In the MobileViM network, we invent a new -dimension-independent mechanism and a dual-direction traversing approach to -incorporate with a vision-Mamba-based framework. MobileViM also features a -cross-scale bridging technique to improve efficiency and accuracy across -various medical imaging modalities. With these enhancements, MobileViM achieves -segmentation speeds exceeding 90 frames per second (FPS) on a single graphics -processing unit (i.e., NVIDIA RTX 4090). This performance is over 24 FPS faster -than the state-of-the-art deep learning models for processing 3D images with -the same computational resources. In addition, experimental evaluations -demonstrate that MobileViM delivers superior performance, with Dice similarity -scores reaching 92.72%, 86.69%, 80.46%, and 77.43% for PENGWIN, BraTS2024, -ATLAS, and Toothfairy2 datasets, respectively, which significantly surpasses -existing models. +Jailbreak attacks, where harmful prompts bypass generative models' built-in +safety, raise serious concerns about model vulnerability. 
While many defense +methods have been proposed, the trade-offs between safety and helpfulness, and +their application to Large Vision-Language Models (LVLMs), are not well +understood. This paper systematically examines jailbreak defenses by reframing +the standard generation task as a binary classification problem to assess model +refusal tendencies for both harmful and benign queries. We identify two key +defense mechanisms: safety shift, which increases refusal rates across all +queries, and harmfulness discrimination, which improves the model's ability to +distinguish between harmful and benign inputs. Using these mechanisms, we +develop two ensemble defense strategies-inter-mechanism ensembles and +intra-mechanism ensembles-to balance safety and helpfulness. Experiments on the +MM-SafetyBench and MOSSBench datasets with LLaVA-1.5 models show that these +strategies effectively improve model safety or optimize the trade-off between +safety and helpfulness. -摘要:有效評估三維 (3D) 醫學影像對於醫療保健中的診斷和治療實務至關重要。近年來,將深度學習和電腦視覺應用於分析和詮釋醫學影像的應用大幅增加。傳統方法,例如卷積神經網路 (CNN) 和視覺Transformer (ViT),面臨重大的運算挑戰,促使需要架構上的進步。最近的努力已導致引進創新的架構,例如「Mamba」模型,作為傳統 CNN 或 ViT 的替代解決方案。Mamba 模型擅長以低運算需求進行一維資料的線性處理。然而,Mamba 在 3D 醫學影像分析方面的潛力仍未被充分探索,並且隨著維度的增加可能會面臨重大的運算挑戰。本手稿提出 MobileViM,這是一種簡化的架構,可有效分割 3D 醫學影像。在 MobileViM 網路中,我們發明了一種新的與維度無關的機制和雙向遍歷方法,以與基於視覺 Mamba 的架構結合。MobileViM 還具備跨尺度橋接技術,以提高各種醫學影像模式的效率和準確性。透過這些增強功能,MobileViM 在單一顯示卡 (即 NVIDIA RTX 4090) 上達到了每秒超過 90 幀 (FPS) 的分割速度。此效能比現有最先進的深度學習模型快了超過 24 FPS,這些模型使用相同的運算資源處理 3D 影像。此外,實驗評估證明 MobileViM 提供了卓越的效能,Dice 相似性評分對於 PENGWIN、BraTS2024、ATLAS 和 Toothfairy2 資料集分別達到 92.72%、86.69%、80.46% 和 77.43%,顯著超越現有模型。 +摘要:越獄攻擊,其中有害提示繞過生成模型內建的安全機制,引發了對模型漏洞的嚴重疑慮。雖然已提出許多防禦方法,但安全性與有益性之間的取捨,以及它們在大型視覺語言模型 (LVLMs) 中的應用,尚未得到充分理解。本文透過將標準生成任務重新定義為二元分類問題,系統性地檢視越獄防禦,以評估模型對有害和良性查詢的拒絕傾向。我們找出兩種關鍵的防禦機制:安全轉移,這會提高所有查詢的拒絕率,以及危害區分,這會提升模型區分有害和良性輸入的能力。使用這些機制,我們開發出兩種整體防禦策略,機制間整體和機制內整體,以平衡安全性與有益性。在使用 LLaVA-1.5 模型的 MM-SafetyBench 和 MOSSBench 資料集上進行的實驗顯示,這些策略有效地提升了模型安全性,或最佳化了安全性與有益性之間的取捨。 -##### 
**Unlocking Multimodal Integration in EHRs: A Prompt Learning Framework for Language and Time Series Fusion** -2502.13509v1 by Shuai Niu, Jing Ma, Hongzhan Lin, Liang Bai, Zhihua Wang, Wei Bi, Yida Xu, Guo Li, Xian Yang +##### **NLoRA: Nyström-Initiated Low-Rank Adaptation for Large Language Models** +2502.14482v1 by Chenlu Guo, Yuan Wu, Yi Chang -Large language models (LLMs) have shown remarkable performance in -vision-language tasks, but their application in the medical field remains -underexplored, particularly for integrating structured time series data with -unstructured clinical notes. In clinical practice, dynamic time series data -such as lab test results capture critical temporal patterns, while clinical -notes provide rich semantic context. Merging these modalities is challenging -due to the inherent differences between continuous signals and discrete text. -To bridge this gap, we introduce ProMedTS, a novel self-supervised multimodal -framework that employs prompt-guided learning to unify these heterogeneous data -types. Our approach leverages lightweight anomaly detection to generate anomaly -captions that serve as prompts, guiding the encoding of raw time series data -into informative embeddings. These embeddings are aligned with textual -representations in a shared latent space, preserving fine-grained temporal -nuances alongside semantic insights. Furthermore, our framework incorporates -tailored self-supervised objectives to enhance both intra- and inter-modal -alignment. We evaluate ProMedTS on disease diagnosis tasks using real-world -datasets, and the results demonstrate that our method consistently outperforms -state-of-the-art approaches. +Parameter-efficient fine-tuning (PEFT) is essential for adapting large +language models (LLMs), with low-rank adaptation (LoRA) being the most popular +approach. 
However, LoRA suffers from slow convergence, and some recent LoRA +variants, such as PiSSA, primarily rely on Singular Value Decomposition (SVD) +for initialization, leading to expensive computation. To mitigate these +problems, we use the Nystr\"om method, which follows a three-matrix +manipulation. We first introduce StructuredLoRA (SLoRA), which investigates +adding a small intermediate matrix between the low-rank matrices A and B. +Secondly, we propose Nystr\"omLoRA (NLoRA), which leverages Nystr\"om-based +initialization for SLoRA to improve its effectiveness and efficiency. Finally, +we propose IntermediateTune (IntTune), which explores fine-tuning exclusively +on the intermediate matrix of NLoRA to further boost LLM efficiency. We +evaluate our methods on five natural language generation (NLG) tasks and eight +natural language understanding (NLU) tasks. On GSM8K, SLoRA and NLoRA achieve +accuracies of 56.48% and 57.70%, surpassing LoRA by 33.52% and 36.41%, with +only 3.67 million additional trainable parameters. IntTune improves average NLG +performance over LoRA by 7.45% while using only 1.25% of its parameters. These +results demonstrate the efficiency and effectiveness of our approach in +enhancing model performance with minimal parameter overhead. 
-摘要:大型語言模型(LLM)在視覺語言任務中表現出色,但其在醫療領域的應用仍未得到充分探索,特別是在將結構化時間序列數據與非結構化臨床筆記整合方面。在臨床實務中,動態時間序列數據(例如實驗室檢驗結果)會擷取關鍵的時間模式,而臨床筆記則提供豐富的語意脈絡。由於連續訊號與離散文字之間的固有差異,合併這些方式具有挑戰性。為了彌補這個差距,我們引入了 ProMedTS,這是一個新穎的自監督多模態框架,採用提示引導學習來統一這些異質化的數據類型。我們的做法利用輕量級異常偵測來產生異常標題,作為提示,引導將原始時間序列數據編碼成資訊性的嵌入。這些嵌入與共享潛在空間中的文字表示對齊,同時保留細微的時間差異和語意見解。此外,我們的框架納入了客製化的自監督目標,以增強模態內和模態間對齊。我們在疾病診斷任務中使用真實世界的數據集評估 ProMedTS,結果表明,我們的模型始終優於最先進的方法。 +摘要:參數高效微調 (PEFT) 對於調整大型語言模型 (LLM) 至關重要,其中低秩調整 (LoRA) 是最受歡迎的方法。然而,LoRA 存在收斂速度慢的問題,而一些最近的 LoRA 變體,例如 PiSSA,主要依賴奇異值分解 (SVD) 進行初始化,導致運算成本高昂。為了減輕這些問題,我們使用了 Nystr\"om 方法,它遵循三矩陣操作。我們首先介紹 StructuredLoRA (SLoRA),它研究在低秩矩陣 A 和 B 之間添加一個小的中間矩陣。其次,我們提出了 Nystr\"omLoRA (NLoRA),它利用基於 Nystr\"om 的初始化方法為 SLoRA 提升其有效性和效率。最後,我們提出了 IntermediateTune (IntTune),它探討了僅對 NLoRA 的中間矩陣進行微調,以進一步提升 LLM 效率。我們在五項自然語言生成 (NLG) 任務和八項自然語言理解 (NLU) 任務上評估了我們的這些方法。在 GSM8K 上,SLoRA 和 NLoRA 分別達到了 56.48% 和 57.70% 的準確率,比 LoRA 高出 33.52% 和 36.41%,而僅增加了 367 萬個可訓練參數。IntTune 在僅使用 LoRA 1.25% 的參數的情況下,將平均 NLG 效能提升了 7.45%。這些結果證明了我們的方法在以最少的參數開銷提升模型效能方面的效率和有效性。 -##### **Towards a perturbation-based explanation for medical AI as differentiable programs** -2502.14001v1 by Takeshi Abe, Yoshiyuki Asai +##### **Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression** +2502.14477v1 by Haoyu Wang, Tong Teng, Tianyu Guo, An Xiao, Duyu Tang, Hanting Chen, Yunhe Wang -Recent advancement in machine learning algorithms reaches a point where -medical devices can be equipped with artificial intelligence (AI) models for -diagnostic support and routine automation in clinical settings. In medicine and -healthcare, there is a particular demand for sufficient and objective -explainability of the outcome generated by AI models. However, AI models are -generally considered as black boxes due to their complexity, and the -computational process leading to their response is often opaque. 
Although -several methods have been proposed to explain the behavior of models by -evaluating the importance of each feature in discrimination and prediction, -they may suffer from biases and opacities arising from the scale and sampling -protocol of the dataset used for training or testing. To overcome the -shortcomings of existing methods, we explore an alternative approach to provide -an objective explanation of AI models that can be defined independently of the -learning process and does not require additional data. As a preliminary study -for this direction of research, this work examines a numerical availability of -the Jacobian matrix of deep learning models that measures how stably a model -responses against small perturbations added to the input. The indicator, if -available, are calculated from a trained AI model for a given target input. -This is a first step towards a perturbation-based explanation, which will -assist medical practitioners in understanding and interpreting the response of -the AI model in its clinical application. +Handling long-context sequences efficiently remains a significant challenge +in large language models (LLMs). Existing methods for token selection in +sequence extrapolation either employ a permanent eviction strategy or select +tokens by chunk, which may lead to the loss of critical information. We propose +Efficient Selective Attention (ESA), a novel approach that extends context +length by efficiently selecting the most critical tokens at the token level to +compute attention. ESA reduces the computational complexity of token selection +by compressing query and key vectors into lower-dimensional representations. We +evaluate ESA on long sequence benchmarks with maximum lengths up to 256k using +open-source LLMs with context lengths of 8k and 32k. 
ESA outperforms other +selective attention methods, especially in tasks requiring the retrieval of +multiple pieces of information, achieving comparable performance to +full-attention extrapolation methods across various tasks, with superior +results in certain tasks. -摘要:機器學習演算法的最新進展已達到一個階段,醫療裝置可以配備人工智慧 (AI) 模型,以在臨床環境中提供診斷支援和例行自動化。在醫學和保健領域,對於 AI 模型產生的結果有足夠且客觀的可解釋性有特別的需求。然而,由於 AI 模型的複雜性,它們通常被視為黑盒子,而導致其反應的運算過程通常是不透明的。儘管已經提出多種方法來解釋模型的行為,方法是評估每個特徵在判別和預測中的重要性,但它們可能會受到訓練或測試所用資料集的規模和抽樣協定的偏差和不透明性的影響。為了克服現有方法的缺點,我們探索一種替代方法,以提供 AI 模型的客觀解釋,這種方法可以獨立於學習過程定義,而且不需要額外的資料。作為這個研究方向的初步研究,這項工作探討了深度學習模型的雅可比矩陣的數值可用性,它衡量了模型對輸入中新增的小擾動的穩定反應程度。如果可用,指標會從訓練好的 AI 模型計算得出,以取得給定的目標輸入。這是基於擾動的解釋的第一步,它將協助醫療從業人員了解和詮釋 AI 模型在其臨床應用中的反應。 +摘要:在大型語言模型 (LLM) 中,有效處理長語境序列仍然是一項重大挑戰。現有的序列外推標記選擇方法採用永久驅逐策略或按塊選擇標記,這可能會導致關鍵資訊遺失。我們提出高效選擇性注意 (ESA),這是一種新穎的方法,它透過在標記層級有效選擇最關鍵的標記來計算注意,從而延伸語境長度。ESA 透過將查詢和關鍵向量壓縮成較低維度的表示,來降低標記選擇的運算複雜度。我們使用開放原始碼 LLM,在語境長度為 8k 和 32k 的情況下,對長序列基準進行評估,最大長度達 256k。ESA 的表現優於其他選擇性注意方法,特別是在需要擷取多條資訊的任務中,在各種任務中達到與全注意外推方法相當的效能,並且在某些任務中獲得更佳的結果。 -##### **RGAR: Recurrence Generation-augmented Retrieval for Factual-aware Medical Question Answering** -2502.13361v1 by Sichu Liang, Linhai Zhang, Hongyu Zhu, Wenwen Wang, Yulan He, Deyu Zhou +##### **Argument-Based Comparative Question Answering Evaluation Benchmark** +2502.14476v1 by Irina Nikishina, Saba Anwar, Nikolay Dolgov, Maria Manina, Daria Ignatenko, Viktor Moskvoretskii, Artem Shelmanov, Tim Baldwin, Chris Biemann -Medical question answering requires extensive access to specialized -conceptual knowledge. The current paradigm, Retrieval-Augmented Generation -(RAG), acquires expertise medical knowledge through large-scale corpus -retrieval and uses this knowledge to guide a general-purpose large language -model (LLM) for generating answers. 
However, existing retrieval approaches -often overlook the importance of factual knowledge, which limits the relevance -of retrieved conceptual knowledge and restricts its applicability in real-world -scenarios, such as clinical decision-making based on Electronic Health Records -(EHRs). This paper introduces RGAR, a recurrence generation-augmented retrieval -framework that retrieves both relevant factual and conceptual knowledge from -dual sources (i.e., EHRs and the corpus), allowing them to interact and refine -each another. Through extensive evaluation across three factual-aware medical -question answering benchmarks, RGAR establishes a new state-of-the-art -performance among medical RAG systems. Notably, the Llama-3.1-8B-Instruct model -with RGAR surpasses the considerably larger, RAG-enhanced GPT-3.5. Our findings -demonstrate the benefit of extracting factual knowledge for retrieval, which -consistently yields improved generation quality. +In this paper, we aim to solve the problems standing in the way of automatic +comparative question answering. To this end, we propose an evaluation framework +to assess the quality of comparative question answering summaries. We formulate +15 criteria for assessing comparative answers created using manual annotation +and annotation from 6 large language models and two comparative question +answering datasets. We perform our tests using several LLMs and manual +annotation under different settings and demonstrate the constituency of both +evaluations. Our results demonstrate that the Llama-3 70B Instruct model +demonstrates the best results for summary evaluation, while GPT-4 is the best +for answering comparative questions. All used data, code, and evaluation +results are publicly +available\footnote{\url{https://anonymous.4open.science/r/cqa-evaluation-benchmark-4561/README.md}}. 
-摘要:醫療問題解答需要大量取得專業概念知識。目前的典範,檢索增強生成(RAG),透過大規模語料庫檢索取得專業醫療知識,並使用此知識引導通用大型語言模型(LLM)來產生答案。然而,現有的檢索方法經常忽略事實知識的重要性,這會限制檢索到的概念知識的相關性,並限制其在現實世界情境中的適用性,例如基於電子健康記錄(EHR)的臨床決策制定。本文介紹 RGAR,一個遞迴生成增強檢索架構,從雙重來源(即 EHR 和語料庫)檢索相關的事實和概念知識,讓它們互動並互相精煉。透過在三個事實感知醫療問題解答基準上進行廣泛評估,RGAR 在醫療 RAG 系統中建立了新的最先進效能。值得注意的是,採用 RGAR 的 Llama-3.1-8B-Instruct 模型超越了規模大得多的 RAG 增強型 GPT-3.5。我們的研究結果證明了提取事實知識以進行檢索的好處,這會持續產生改善的生成品質。 +摘要:在本文中,我們旨在解決阻礙自動比較性問題解答的難題。為此,我們提出一個評估框架,用於評估比較性問題解答摘要的品質。我們制定了 15 項準則,用於評估使用手動標註和來自 6 個大型語言模型和兩個比較性問題解答資料集的標註所建立的比較性答案。我們在不同的設定下使用幾個 LLM 和手動標註執行測試,並展示兩種評估的組成。我們的結果表明,Llama-3 70B Instruct 模型在摘要評估中表現最佳,而 GPT-4 在回答比較性問題方面表現最佳。所有使用過的資料、程式碼和評估結果均公開可用\footnote{\url{https://anonymous.4open.science/r/cqa-evaluation-benchmark-4561/README.md}}。 -##### **Adjust for Trust: Mitigating Trust-Induced Inappropriate Reliance on AI Assistance** -2502.13321v1 by Tejas Srinivasan, Jesse Thomason +##### **Enhancing Smart Environments with Context-Aware Chatbots using Large Language Models** +2502.14469v1 by Aurora Polo-Rodríguez, Laura Fiorini, Erika Rovini, Filippo Cavallo, Javier Medina-Quero -Trust biases how users rely on AI recommendations in AI-assisted -decision-making tasks, with low and high levels of trust resulting in increased -under- and over-reliance, respectively. We propose that AI assistants should -adapt their behavior through trust-adaptive interventions to mitigate such -inappropriate reliance. For instance, when user trust is low, providing an -explanation can elicit more careful consideration of the assistant's advice by -the user. In two decision-making scenarios -- laypeople answering science -questions and doctors making medical diagnoses -- we find that providing -supporting and counter-explanations during moments of low and high trust, -respectively, yields up to 38% reduction in inappropriate reliance and 20% -improvement in decision accuracy. We are similarly able to reduce over-reliance -by adaptively inserting forced pauses to promote deliberation. 
Our results -highlight how AI adaptation to user trust facilitates appropriate reliance, -presenting exciting avenues for improving human-AI collaboration. +This work presents a novel architecture for context-aware interactions within +smart environments, leveraging Large Language Models (LLMs) to enhance user +experiences. Our system integrates user location data obtained through UWB tags +and sensor-equipped smart homes with real-time human activity recognition (HAR) +to provide a comprehensive understanding of user context. This contextual +information is then fed to an LLM-powered chatbot, enabling it to generate +personalised interactions and recommendations based on the user's current +activity and environment. This approach moves beyond traditional static chatbot +interactions by dynamically adapting to the user's real-time situation. A case +study conducted from a real-world dataset demonstrates the feasibility and +effectiveness of our proposed architecture, showcasing its potential to create +more intuitive and helpful interactions within smart homes. The results +highlight the significant benefits of integrating LLM with real-time activity +and location data to deliver personalised and contextually relevant user +experiences. 
-摘要:信任偏見影響使用者在 AI 輔助決策任務中如何依賴 AI 建議,信任程度低和高分別導致依賴不足和過度依賴。我們建議 AI 助理應透過信任適應式干預調整其行為,以減輕這種不適當的依賴。例如,當使用者信任度低時,提供解釋可以引發使用者更仔細地考慮助理的建議。在兩種決策情境中——外行人回答科學問題和醫生進行醫療診斷——我們發現,分別在信任度低和高的時刻提供支持性和反向解釋,可以將不適當的依賴降低多達 38%,並將決策準確性提高 20%。我們同樣能夠透過適應性地插入強制暫停來促進審議,以減少過度依賴。我們的結果強調 AI 如何適應使用者信任以促進適當的依賴,為改善人機協作提供了令人興奮的途徑。 +摘要:本研究提出了一種創新的架構,用於在智慧環境中進行情境感知互動,利用大型語言模型 (LLM) 來提升使用者體驗。我們的系統整合了透過超寬頻標籤取得的使用者位置資料,以及配備感測器的智慧家庭,並具備即時人類活動辨識 (HAR),以全面了解使用者的情境。接著,將這些情境資訊輸入 LLM 驅動的聊天機器人,讓它能根據使用者的當前活動和環境產生個人化的互動和建議。這種方法超越了傳統的靜態聊天機器人互動,能動態地適應使用者的即時狀況。從真實世界資料集進行的案例研究,展示了我們提出的架構的可行性和有效性,突顯出它在智慧家庭中創造更直覺且有用的互動的潛力。結果突顯了將 LLM 與即時活動和位置資料整合,以提供個人化且與情境相關的使用者體驗的顯著優點。 -##### **Prediction of Clinical Complication Onset using Neural Point Processes** -2502.13290v1 by Sachini Weerasekara, Sagar Kamarthi, Jacqueline Isaacs +##### **Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing** +2502.14458v1 by Aviv Bick, Tobias Katsch, Nimit Sohoni, Arjun Desai, Albert Gu -Predicting medical events in advance within critical care settings is -paramount for patient outcomes and resource management. Utilizing predictive -models, healthcare providers can anticipate issues such as cardiac arrest, -sepsis, or respiratory failure before they manifest. Recently, there has been a -surge in research focusing on forecasting adverse medical event onsets prior to -clinical manifestation using machine learning. However, while these models -provide temporal prognostic predictions for the occurrence of a specific -adverse event of interest within defined time intervals, their interpretability -often remains a challenge. In this work, we explore the applicability of neural -temporal point processes in the context of adverse event onset prediction, with -the aim of explaining clinical pathways and providing interpretable insights. -Our experiments span six state-of-the-art neural point processes and six -critical care datasets, each focusing on the onset of distinct adverse events. 
-This work represents a novel application class of neural temporal point -processes in event prediction. +We introduce Llamba, a family of efficient recurrent language models +distilled from Llama-3.x into the Mamba architecture. The series includes +Llamba-1B, Llamba-3B, and Llamba-8B, which achieve higher inference throughput +and handle significantly larger batch sizes than Transformer-based models while +maintaining comparable benchmark performance. Furthermore, Llamba demonstrates +the effectiveness of cross-architecture distillation using MOHAWK (Bick et al., +2024), achieving these results with less than 0.1% of the training data +typically used for models of similar size. To take full advantage of their +efficiency, we provide an optimized implementation of Llamba for +resource-constrained devices such as smartphones and edge platforms, offering a +practical and memory-efficient alternative to Transformers. Overall, Llamba +improves the tradeoff between speed, memory efficiency, and performance, making +high-quality language models more accessible. 
-摘要:在重症監護環境中預先預測醫療事件對於患者的預後和資源管理至關重要。利用預測模型,醫療保健提供者可以在心臟驟停、敗血症或呼吸衰竭等問題發生之前預測到這些問題。最近,專注於在臨床表現之前使用機器學習預測不良醫療事件發生的研究激增。然而,儘管這些模型為特定不良事件在定義的時間間隔內發生提供了時間預後預測,但它們的可解釋性仍然是一個挑戰。在這項工作中,我們探討了神經時間點過程在不良事件發作預測中的適用性,目的是解釋臨床途徑並提供可解釋的見解。我們的實驗涵蓋了六種最先進的神經點過程和六個重症監護資料集,每個資料集都專注於不同不良事件的發作。這項工作代表了神經時間點過程在事件預測中的一種新的應用類別。 +摘要:我們推出 Llamba,一種高效的遞迴語言模型家族,從 Llama-3.x 萃取到 Mamba 架構中。該系列包含 Llamba-1B、Llamba-3B 和 Llamba-8B,它們比基於 Transformer 的模型實現更高的推理吞吐量,並處理顯著更大的批次大小,同時保持可比較的基準效能。此外,Llamba 證明了使用 MOHAWK(Bick 等人,2024 年)進行跨架構萃取的有效性,在訓練資料不到類似大小模型通常使用的 0.1% 的情況下實現了這些結果。為了充分利用其效率,我們為 Llamba 提供了針對資源受限裝置(例如智慧型手機和邊緣平台)的最佳化實作,提供實用且記憶體效率高的 Transformer 替代方案。總體而言,Llamba 改善了速度、記憶體效率和效能之間的權衡,讓高品質語言模型更易於取得。 -##### **SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?** -2502.13233v1 by Yucheng Shi, Tianze Yang, Canyu Chen, Quanzheng Li, Tianming Liu, Xiang Li, Ninghao Liu +##### **Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization** +2502.14456v1 by Ran Ding, Ziyu Zhang, Ying Zhu, Ziqian Kong, Peilan Xu -Large Language Models (LLMs) have shown remarkable capabilities in general -domains but often struggle with tasks requiring specialized knowledge. -Conventional Retrieval-Augmented Generation (RAG) techniques typically retrieve -external information from static knowledge bases, which can be outdated or -incomplete, missing fine-grained clinical details essential for accurate -medical question answering. In this work, we propose SearchRAG, a novel -framework that overcomes these limitations by leveraging real-time search -engines. Our method employs synthetic query generation to convert complex -medical questions into search-engine-friendly queries and utilizes -uncertainty-based knowledge selection to filter and incorporate the most -relevant and informative medical knowledge into the LLM's input. 
Experimental -results demonstrate that our method significantly improves response accuracy in -medical question answering tasks, particularly for complex questions requiring -detailed and up-to-date knowledge. +To enhance tourists' experiences and immersion, this paper proposes a +narrative-driven travel planning framework called NarrativeGuide, which +generates a geoculturally-grounded narrative script for travelers, offering a +novel, role-playing experience for their journey. In the initial stage, +NarrativeGuide constructs a knowledge graph for attractions within a city, then +configures the worldview, character setting, and exposition based on the +knowledge graph. Using this foundation, the knowledge graph is combined to +generate an independent scene unit for each attraction. During the itinerary +planning stage, NarrativeGuide models narrative-driven travel planning as an +optimization problem, utilizing a genetic algorithm (GA) to refine the +itinerary. Before evaluating the candidate itinerary, transition scripts are +generated for each pair of adjacent attractions, which, along with the scene +units, form a complete script. The weighted sum of script coherence, travel +time, and attraction scores is then used as the fitness value to update the +candidate solution set. Experimental results across four cities, i.e., Nanjing +and Yangzhou in China, Paris in France, and Berlin in Germany, demonstrate +significant improvements in narrative coherence and cultural fit, alongside a +notable reduction in travel time and an increase in the quality of visited +attractions. Our study highlights that incorporating external evolutionary +optimization effectively addresses the limitations of large language models in +travel planning. Our code is available at +https://github.com/Evan01225/Narrative-Driven-Travel-Planning. 
-摘要:大型語言模型 (LLM) 在一般領域展現出驚人的能力,但經常在需要專業知識的任務中掙扎。 -傳統的檢索增強生成 (RAG) 技術通常從靜態知識庫中檢索外部資訊,這些資訊可能過時或不完整,缺少準確回答醫療問題所需的細微臨床細節。在這項工作中,我們提出 SearchRAG,這是一種新穎的架構,透過利用即時搜尋引擎克服這些限制。我們的模型採用合成查詢生成,將複雜的醫療問題轉換成搜尋引擎友善的查詢,並利用基於不確定性的知識選擇來過濾和納入 LLM 輸入中最相關且最有資訊的醫療知識。實驗結果證明,我們的模型顯著改善了醫療問題回答任務中的回應準確度,特別是需要詳細且最新的知識的複雜問題。 +摘要:為了增強遊客的體驗和沉浸感,本文提出了一個名為 NarrativeGuide 的敘事驅動旅遊規劃框架,它會為旅客產生一個以地理文化為基礎的敘事腳本,為他們的旅程提供一個新穎的角色扮演體驗。在初始階段,NarrativeGuide 會為城市內的景點建立一個知識圖譜,然後根據知識圖譜配置世界觀、角色設定和說明。利用這個基礎,知識圖譜會與每個景點結合,為其產生一個獨立的場景單元。在行程規劃階段,NarrativeGuide 將敘事驅動的旅遊規劃建模為一個最佳化問題,利用遺傳演算法 (GA) 來優化行程。在評估候選行程之前,會為每對相鄰景點產生過場腳本,這些腳本會與場景單元一起形成一個完整的腳本。接著,將腳本連貫性、旅遊時間和景點分數的加權和用作適應值,以更新候選解集。在四個城市(即中國的南京和揚州、法國的巴黎和德國的柏林)進行的實驗結果顯示,敘事連貫性和文化契合度都有顯著的提升,同時旅遊時間大幅減少,且所參觀景點的品質也提升了。我們的研究強調,納入外部演化最佳化能有效解決大型語言模型在旅遊規劃中的限制。我們的程式碼可在 https://github.com/Evan01225/Narrative-Driven-Travel-Planning 取得。 -##### **Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions** -2502.13135v1 by Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić +##### **Optimal word order for non-causal text generation with Large Language Models: the Spanish case** +2502.14451v1 by Andrea Busto-Castiñeira, Silvia García-Méndez, Francisco de Arriba-Pérez, Francisco J. González-Castaño -We present an end-to-end framework for generating synthetic users for -evaluating interactive agents designed to encourage positive behavior changes, -such as in health and lifestyle coaching. The synthetic users are grounded in -health and lifestyle conditions, specifically sleep and diabetes management in -this study, to ensure realistic interactions with the health coaching agent. 
-Synthetic users are created in two stages: first, structured data are generated -grounded in real-world health and lifestyle factors in addition to basic -demographics and behavioral attributes; second, full profiles of the synthetic -users are developed conditioned on the structured data. Interactions between -synthetic users and the coaching agent are simulated using generative -agent-based models such as Concordia, or directly by prompting a language -model. Using two independently-developed agents for sleep and diabetes coaching -as case studies, the validity of this framework is demonstrated by analyzing -the coaching agent's understanding of the synthetic users' needs and -challenges. Finally, through multiple blinded evaluations of user-coach -interactions by human experts, we demonstrate that our synthetic users with -health and behavioral attributes more accurately portray real human users with -the same attributes, compared to generic synthetic users not grounded in such -attributes. The proposed framework lays the foundation for efficient -development of conversational agents through extensive, realistic, and grounded -simulated interactions. +Natural Language Generation (NLG) popularity has increased owing to the +progress in Large Language Models (LLMs), with zero-shot inference +capabilities. However, most neural systems utilize decoder-only causal +(unidirectional) transformer models, which are effective for English but may +reduce the richness of languages with less strict word order, subject omission, +or different relative clause attachment preferences. This is the first work +that analytically addresses optimal text generation order for non-causal +language models. We present a novel Viterbi algorithm-based methodology for +maximum likelihood word order estimation. We analyze the non-causal +most-likelihood order probability for NLG in Spanish and, then, the probability +of generating the same phrases with Spanish causal NLG. 
This comparative +analysis reveals that causal NLG prefers English-like SVO structures. We also +analyze the relationship between optimal generation order and causal +left-to-right generation order using Spearman's rank correlation. Our results +demonstrate that the ideal order predicted by the maximum likelihood estimator +is not closely related to the causal order and may be influenced by the +syntactic structure of the target sentence. -摘要:我們提供了一個端到端的架構,用於為評估互動式代理生成合成使用者,這些代理旨在鼓勵正向行為改變,例如健康和生活方式指導。合成使用者以健康和生活方式狀況為基礎,特別是本研究中的睡眠和糖尿病管理,以確保與健康指導代理的互動具有真實性。合成使用者分兩個階段建立:首先,除了基本人口統計資料和行為屬性外,還會產生以現實世界的健康和生活方式因素為基礎的結構化資料;其次,會根據結構化資料開發合成使用者的完整個人資料。合成使用者和指導代理之間的互動是使用生成式基於代理的模型(例如 Concordia)模擬的,或者直接通過提示語言模型來模擬。使用兩個獨立開發的睡眠和糖尿病指導代理作為案例研究,通過分析指導代理對合成使用者需求和挑戰的理解,證明了此架構的有效性。最後,通過人類專家對使用者指導互動進行多重盲測評估,我們證明了與未以這些屬性為基礎的通用合成使用者相比,具有健康和行為屬性的合成使用者更準確地描繪了具有相同屬性的真實人類使用者。所提出的架構為通過廣泛、真實且有根據的模擬互動,為對話代理的有效開發奠定了基礎。 +摘要:自然語言生成 (NLG) 的普及歸功於大型語言模型 (LLM) 的進步,以及零次學習推論能力。然而,大多數神經系統使用僅解碼器因果 (單向) Transformer模型,這對英語很有效,但可能會減少語序較不嚴謹、省略主詞或相對從句附加偏好不同的語言的豐富性。這是第一個針對非因果語言模型分析性地解決最佳文字生成順序的研究。我們提出了一種基於維特比演算法的新方法,用於最大似然詞序估計。我們分析了西班牙語 NLG 的非因果最大似然順序機率,然後分析了使用西班牙語因果 NLG 生成相同短語的機率。這種比較分析顯示,因果 NLG 偏好英語式的 SVO 結構。我們還使用 Spearman 等級相關性分析最佳生成順序和因果從左到右生成順序之間的關係。我們的結果表明,最大似然估計器預測的理想順序與因果順序沒有密切關係,並且可能會受到目標句子的語法結構影響。 -##### **Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization** -2502.13108v1 by Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Amit Agarwal, Bhargava Kumar, Srikant Panda, Tejaswini Kumar -Clinical Question Answering (CQA) plays a crucial role in medical -decision-making, enabling physicians to extract relevant information from -Electronic Medical Records (EMRs). 
While transformer-based models such as BERT, -BioBERT, and ClinicalBERT have demonstrated state-of-the-art performance in -CQA, existing models lack the ability to categorize extracted answers, which is -critical for structured retrieval, content filtering, and medical decision -support. - To address this limitation, we introduce a Multi-Task Learning (MTL) -framework that jointly trains CQA models for both answer extraction and medical -categorization. In addition to predicting answer spans, our model classifies -responses into five standardized medical categories: Diagnosis, Medication, -Symptoms, Procedure, and Lab Reports. This categorization enables more -structured and interpretable outputs, making clinical QA models more useful in -real-world healthcare settings. - We evaluate our approach on emrQA, a large-scale dataset for medical question -answering. Results show that MTL improves F1-score by 2.2% compared to standard -fine-tuning, while achieving 90.7% accuracy in answer categorization. These -findings suggest that MTL not only enhances CQA performance but also introduces -an effective mechanism for categorization and structured medical information -retrieval. 
+### Knowledge Graphs +|Publish Date|Title|Authors|Homepage|Code| +| :---: | :---: | :---: | :---: | :---: | +|**2025-02-20**|**GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks**|Jianwen Luo et.al.|[2502.14848v1](http://arxiv.org/abs/2502.14848v1)|null| +|**2025-02-20**|**From RAG to Memory: Non-Parametric Continual Learning for Large Language Models**|Bernal Jiménez Gutiérrez et.al.|[2502.14802v1](http://arxiv.org/abs/2502.14802v1)|[link](https://github.com/osu-nlp-group/hipporag)| +|**2025-02-20**|**Plan-over-Graph: Towards Parallelable LLM Agent Schedule**|Shiqi Zhang et.al.|[2502.14563v1](http://arxiv.org/abs/2502.14563v1)|null| +|**2025-02-20**|**Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization**|Ran Ding et.al.|[2502.14456v1](http://arxiv.org/abs/2502.14456v1)|null| +|**2025-02-20**|**Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment**|Jiaxi Li et.al.|[2502.14275v1](http://arxiv.org/abs/2502.14275v1)|null| +|**2025-02-20**|**Mitigating Lost-in-Retrieval Problems in Retrieval Augmented Multi-Hop Question Answering**|Rongzhi Zhu et.al.|[2502.14245v1](http://arxiv.org/abs/2502.14245v1)|null| +|**2025-02-20**|**NLP-AKG: Few-Shot Construction of NLP Academic Knowledge Graph Based on LLM**|Jiayin Lan et.al.|[2502.14192v1](http://arxiv.org/abs/2502.14192v1)|null| +|**2025-02-19**|**Object-centric Binding in Contrastive Language-Image Pretraining**|Rim Assouel et.al.|[2502.14113v1](http://arxiv.org/abs/2502.14113v1)|null| +|**2025-02-19**|**Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning**|Cole Gawin et.al.|[2502.14086v1](http://arxiv.org/abs/2502.14086v1)|null| +|**2025-02-19**|**Neurosymbolic artificial intelligence via large language models and coherence-driven inference**|Steve Huntsman et.al.|[2502.13953v1](http://arxiv.org/abs/2502.13953v1)|null| +|**2025-02-19**|**Complex 
Ontology Matching with Large Language Model Embeddings**|Guilherme Sousa et.al.|[2502.13619v1](http://arxiv.org/abs/2502.13619v1)|null| +|**2025-02-19**|**Are Large Language Models In-Context Graph Learners?**|Jintang Li et.al.|[2502.13562v1](http://arxiv.org/abs/2502.13562v1)|null| +|**2025-02-19**|**Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs**|Yushi Feng et.al.|[2502.13555v1](http://arxiv.org/abs/2502.13555v1)|[link](https://github.com/ys-feng/DemoGraph)| +|**2025-02-19**|**PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference**|Burc Gokden et.al.|[2502.13502v1](http://arxiv.org/abs/2502.13502v1)|[link](https://github.com/burcgokden/PLDR-LLM-with-KVG-cache)| +|**2025-02-19**|**Explore-Construct-Filter: An Automated Framework for Rich and Reliable API Knowledge Graph Construction**|Yanbang Sun et.al.|[2502.13412v1](http://arxiv.org/abs/2502.13412v1)|null| +|**2025-02-19**|**Reducing Hallucinations in Language Model-based SPARQL Query Generation Using Post-Generation Memory Retrieval**|Aditya Sharma et.al.|[2502.13369v1](http://arxiv.org/abs/2502.13369v1)|null| +|**2025-02-19**|**Craw4LLM: Efficient Web Crawling for LLM Pretraining**|Shi Yu et.al.|[2502.13347v1](http://arxiv.org/abs/2502.13347v1)|[link](https://github.com/cxcscmu/crawl4llm)| +|**2025-02-18**|**K-Paths: Reasoning over Graph Paths for Drug Repurposing and Drug Interaction Prediction**|Tassallah Abdullahi et.al.|[2502.13344v1](http://arxiv.org/abs/2502.13344v1)|[link](https://github.com/rsinghlab/K-Paths)| +|**2025-02-18**|**Grounding LLM Reasoning with Knowledge Graphs**|Alfonso Amayuelas et.al.|[2502.13247v1](http://arxiv.org/abs/2502.13247v1)|null| +|**2025-02-18**|**Learning to Defer for Causal Discovery with Imperfect Experts**|Oscar Clivio et.al.|[2502.13132v1](http://arxiv.org/abs/2502.13132v1)|null| +|**2025-02-18**|**Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks**|Markus 
J. Buehler et.al.|[2502.13025v1](http://arxiv.org/abs/2502.13025v1)|null| +|**2025-02-18**|**Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge**|Mohammad Reza Rezaei et.al.|[2502.13010v1](http://arxiv.org/abs/2502.13010v1)|null| +|**2025-02-18**|**R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs**|Sumin Jo et.al.|[2502.12767v1](http://arxiv.org/abs/2502.12767v1)|null| +|**2025-02-18**|**Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research**|Xiang Liu et.al.|[2502.12669v1](http://arxiv.org/abs/2502.12669v1)|null| +|**2025-02-18**|**G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation**|Yuhan Li et.al.|[2502.12586v1](http://arxiv.org/abs/2502.12586v1)|[link](https://github.com/yuhan1i/g-refer)| +|**2025-02-17**|**A-MEM: Agentic Memory for LLM Agents**|Wujiang Xu et.al.|[2502.12110v1](http://arxiv.org/abs/2502.12110v1)|null| +|**2025-02-17**|**KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs**|Qi Zhao et.al.|[2502.12029v1](http://arxiv.org/abs/2502.12029v1)|null| +|**2025-02-17**|**Atom of Thoughts for Markov LLM Test-Time Scaling**|Fengwei Teng et.al.|[2502.12018v1](http://arxiv.org/abs/2502.12018v1)|null| +|**2025-02-17**|**Generating Text from Uniform Meaning Representation**|Emma Markle et.al.|[2502.11973v1](http://arxiv.org/abs/2502.11973v1)|[link](https://github.com/acnlplab/umr-text-gen)| +|**2025-02-17**|**GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs**|Yi Fang et.al.|[2502.11925v1](http://arxiv.org/abs/2502.11925v1)|null| +|**2025-02-17**|**Exploring LLM-based Student Simulation for Metacognitive Cultivation**|Haoxuan Li et.al.|[2502.11678v1](http://arxiv.org/abs/2502.11678v1)|null| +|**2025-02-17**|**Ontology-Guided Reverse Thinking Makes Large Language Models Stronger on Knowledge Graph Question 
Answering**|Runxuan Liu et.al.|[2502.11491v1](http://arxiv.org/abs/2502.11491v1)|null| +|**2025-02-17**|**GLTW: Joint Improved Graph Transformer and LLM via Three-Word Language for Knowledge Graph Completion**|Kangyang Luo et.al.|[2502.11471v1](http://arxiv.org/abs/2502.11471v1)|null| +|**2025-02-16**|**Large Language-Geometry Model: When LLM meets Equivariance**|Zongzhao Li et.al.|[2502.11149v2](http://arxiv.org/abs/2502.11149v2)|null| +|**2025-02-16**|**Beyond Pairwise: Global Zero-shot Temporal Graph Generation**|Alon Eirew et.al.|[2502.11114v1](http://arxiv.org/abs/2502.11114v1)|null| +|**2025-02-16**|**Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications**|Alexandru Lecu et.al.|[2502.11108v1](http://arxiv.org/abs/2502.11108v1)|[link](https://github.com/alexlecu/llmkgraph)| +|**2025-02-16**|**Beyond Similarity: A Gradient-based Graph Method for Instruction Tuning Data Selection**|Yang Zhao et.al.|[2502.11062v1](http://arxiv.org/abs/2502.11062v1)|null| +|**2025-02-16**|**CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models**|Yuefei Chen et.al.|[2502.11008v1](http://arxiv.org/abs/2502.11008v1)|null| +|**2025-02-16**|**RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation**|Pengcheng Jiang et.al.|[2502.10996v1](http://arxiv.org/abs/2502.10996v1)|[link](https://github.com/pat-jj/Retrieval-And-Structure)| +|**2025-02-15**|**Developing Conversational Speech Systems for Robots to Detect Speech Biomarkers of Cognition in People Living with Dementia**|Rohith Perumandla et.al.|[2502.10896v1](http://arxiv.org/abs/2502.10896v1)|null| +|**2025-02-15**|**Evaluating improvements on using Large Language Models (LLMs) for property extraction in the Open Research Knowledge Graph (ORKG)**|Sandra Schaftner et.al.|[2502.10768v1](http://arxiv.org/abs/2502.10768v1)|null| +|**2025-02-15**|**K-Edit: Language Model Editing with Contextual Knowledge Awareness**|Elan 
Markowitz et.al.|[2502.10626v1](http://arxiv.org/abs/2502.10626v1)|null| +|**2025-02-15**|**ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis**|Xueshen Li et.al.|[2502.10620v1](http://arxiv.org/abs/2502.10620v1)|null| +|**2025-02-14**|**GraphiT: Efficient Node Classification on Text-Attributed Graphs with Prompt Optimized LLMs**|Shima Khoshraftar et.al.|[2502.10522v1](http://arxiv.org/abs/2502.10522v1)|null| +|**2025-02-14**|**Do Large Language Models Reason Causally Like Us? Even Better?**|Hanna M. Dettki et.al.|[2502.10215v1](http://arxiv.org/abs/2502.10215v1)|null| +|**2025-02-14**|**Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages**|Daniil Gurgurov et.al.|[2502.10140v1](http://arxiv.org/abs/2502.10140v1)|null| +|**2025-02-14**|**Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models**|Chenrui Tie et.al.|[2502.10090v1](http://arxiv.org/abs/2502.10090v1)|null| +|**2025-02-14**|**Decision Information Meets Large Language Models: The Future of Explainable Operations Research**|Yansen Zhang et.al.|[2502.09994v1](http://arxiv.org/abs/2502.09994v1)|null| +|**2025-02-14**|**KGGen: Extracting Knowledge Graphs from Plain Text with Language Models**|Belinda Mo et.al.|[2502.09956v1](http://arxiv.org/abs/2502.09956v1)|null| +|**2025-02-14**|**ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation**|Shu Wang et.al.|[2502.09891v1](http://arxiv.org/abs/2502.09891v1)|null| +|**2025-02-13**|**Visual Graph Question Answering with ASP and LLMs for Language Parsing**|Jakob Johannes Bauer et.al.|[2502.09211v1](http://arxiv.org/abs/2502.09211v1)|null| +|**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null| 
+|**2025-02-12**|**Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy**|Ruizhan Xue et.al.|[2502.08353v1](http://arxiv.org/abs/2502.08353v1)|null| +|**2025-02-12**|**Graph Foundation Models for Recommendation: A Comprehensive Survey**|Bin Wu et.al.|[2502.08346v3](http://arxiv.org/abs/2502.08346v3)|null| +|**2025-02-12**|**Self-Evaluation for Job-Shop Scheduling**|Imanol Echeverria et.al.|[2502.08684v1](http://arxiv.org/abs/2502.08684v1)|null| +|**2025-02-12**|**Improving Existing Optimization Algorithms with LLMs**|Camilo Chacón Sartori et.al.|[2502.08298v1](http://arxiv.org/abs/2502.08298v1)|null| +|**2025-02-12**|**LLM4GNAS: A Large Language Model Based Toolkit for Graph Neural Architecture Search**|Yang Gao et.al.|[2502.10459v1](http://arxiv.org/abs/2502.10459v1)|null| +|**2025-02-12**|**ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning**|Vy Vo et.al.|[2502.08148v1](http://arxiv.org/abs/2502.08148v1)|null| +|**2025-02-12**|**Neuro-Conceptual Artificial Intelligence: Integrating OPM with Deep Learning to Enhance Question Answering Quality**|Xin Kang et.al.|[2502.09658v1](http://arxiv.org/abs/2502.09658v1)|null| +|**2025-02-12**|**GCoT: Chain-of-Thought Prompt Learning for Graphs**|Xingtong Yu et.al.|[2502.08092v1](http://arxiv.org/abs/2502.08092v1)|null| +|**2025-02-12**|**Linking Cryptoasset Attribution Tags to Knowledge Graph Entities: An LLM-based Approach**|Régnier Avice et.al.|[2502.10453v1](http://arxiv.org/abs/2502.10453v1)|[link](https://github.com/ravice234/cryptoasset-attribution-tag-linker)| +|**2025-02-11**|**Deep Semantic Graph Learning via LLM based Node Enhancement**|Chuanqi Shi et.al.|[2502.07982v1](http://arxiv.org/abs/2502.07982v1)|null| +|**2025-02-10**|**Cardiverse: Harnessing LLMs for Novel Card Game Prototyping**|Danrui Li et.al.|[2502.07128v1](http://arxiv.org/abs/2502.07128v1)|null| +|**2025-02-10**|**GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural 
Processing Units**|Arghadip Das et.al.|[2502.06921v2](http://arxiv.org/abs/2502.06921v2)|[link](https://github.com/arghadippurdue/GraNNite)| +|**2025-02-10**|**Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language**|Zhiqiang Zhong et.al.|[2502.06634v1](http://arxiv.org/abs/2502.06634v1)|null| +|**2025-02-10**|**KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment**|Yuxing Lu et.al.|[2502.06472v1](http://arxiv.org/abs/2502.06472v1)|[link](https://github.com/YuxingLu613/KARMA)| +|**2025-02-10**|**RoToR: Towards More Reliable Responses for Order-Invariant Inputs**|Soyoung Yoon et.al.|[2502.08662v1](http://arxiv.org/abs/2502.08662v1)|null| +|**2025-02-10**|**K-ON: Stacking Knowledge On the Head Layer of Large Language Model**|Lingbing Guo et.al.|[2502.06257v1](http://arxiv.org/abs/2502.06257v1)|null| +|**2025-02-10**|**LegalViz: Legal Text Visualization by Text To Diagram Generation**|Eri Onami et.al.|[2502.06147v2](http://arxiv.org/abs/2502.06147v2)|null| +|**2025-02-09**|**Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs**|Han Meng et.al.|[2502.06075v1](http://arxiv.org/abs/2502.06075v1)|null| +|**2025-02-09**|**LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification**|Shubham Kumar Nigam et.al.|[2502.05836v1](http://arxiv.org/abs/2502.05836v1)|null| +|**2025-02-08**|**LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning**|Hanqing Yang et.al.|[2502.05453v1](http://arxiv.org/abs/2502.05453v1)|null| +|**2025-02-08**|**SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation**|Xingtong Yu et.al.|[2502.05424v1](http://arxiv.org/abs/2502.05424v1)|null| +|**2025-02-08**|**Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints**|Ali Al-Lawati 
et.al.|[2502.05414v1](http://arxiv.org/abs/2502.05414v1)|null| +|**2025-02-08**|**Knowledge Graph-Guided Retrieval Augmented Generation**|Xiangrong Zhu et.al.|[2502.06864v1](http://arxiv.org/abs/2502.06864v1)|[link](https://github.com/nju-websoft/KG2RAG)| +|**2025-02-07**|**Can Large Language Models Understand Intermediate Representations?**|Hailong Jiang et.al.|[2502.06854v1](http://arxiv.org/abs/2502.06854v1)|null| +|**2025-02-07**|**GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?**|Yang Zhou et.al.|[2502.05252v1](http://arxiv.org/abs/2502.05252v1)|[link](https://github.com/Infini-AI-Lab/gsm_infinite)| +|**2025-02-07**|**Causality can systematically address the monsters under the bench(marks)**|Felix Leeb et.al.|[2502.05085v1](http://arxiv.org/abs/2502.05085v1)|null| +|**2025-02-07**|**Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures**|Tushar Pandey et.al.|[2502.05078v1](http://arxiv.org/abs/2502.05078v1)|[link](https://github.com/AgnostiqHQ/multi-agent-llm)| +|**2025-02-07**|**Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics**|Hussam Ghanem et.al.|[2502.05239v1](http://arxiv.org/abs/2502.05239v1)|null| +|**2025-02-07**|**Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research**|Junde Wu et.al.|[2502.04644v1](http://arxiv.org/abs/2502.04644v1)|[link](https://github.com/theworldofagents/agentic-reasoning)| +|**2025-02-07**|**Position-aware Automatic Circuit Discovery**|Tal Haklay et.al.|[2502.04577v1](http://arxiv.org/abs/2502.04577v1)|[link](https://github.com/technion-cs-nlp/peap)| +|**2025-02-06**|**Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems**|Shangbin Feng et.al.|[2502.04510v1](http://arxiv.org/abs/2502.04510v1)|null| +|**2025-02-06**|**MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for 
Healthcare Copilot**|Xuejiao Zhao et.al.|[2502.04413v1](http://arxiv.org/abs/2502.04413v1)|[link](https://github.com/snowteam2023/medrag)| +|**2025-02-06**|**Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering**|Longquan Jiang et.al.|[2502.03992v1](http://arxiv.org/abs/2502.03992v1)|[link](https://github.com/longquanjiang/ontoscprompt)| +|**2025-02-06**|**Multimodal Medical Code Tokenizer**|Xiaorui Su et.al.|[2502.04397v2](http://arxiv.org/abs/2502.04397v2)|null| +|**2025-02-06**|**Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents**|Chenyang Shao et.al.|[2502.04392v1](http://arxiv.org/abs/2502.04392v1)|null| +|**2025-02-06**|**Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models**|Rui Cai et.al.|[2502.03715v1](http://arxiv.org/abs/2502.03715v1)|null| +|**2025-02-05**|**A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)**|Yiye Chen et.al.|[2502.03450v1](http://arxiv.org/abs/2502.03450v1)|null| +|**2025-02-05**|**SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**|Ben Liu et.al.|[2502.03283v2](http://arxiv.org/abs/2502.03283v2)|null| +|**2025-02-05**|**Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**|Daniil Laptev et.al.|[2502.03032v2](http://arxiv.org/abs/2502.03032v2)|null| +|**2025-02-05**|**A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs**|Bradley P. 
Allen et.al.|[2502.02896v1](http://arxiv.org/abs/2502.02896v1)|null| +|**2025-02-05**|**Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization**|Chanhui Lee et.al.|[2502.02810v1](http://arxiv.org/abs/2502.02810v1)|null| +|**2025-02-05**|**Leveraging the true depth of LLMs**|Ramón Calvo González et.al.|[2502.02790v1](http://arxiv.org/abs/2502.02790v1)|null| +|**2025-02-04**|**Modular Training of Neural Networks aids Interpretability**|Satvik Golechha et.al.|[2502.02470v2](http://arxiv.org/abs/2502.02470v2)|null| +|**2025-02-04**|**Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs**|Sagnik Mukherjee et.al.|[2502.02362v3](http://arxiv.org/abs/2502.02362v3)|null| +|**2025-02-04**|**AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement**|Shivam Singh et.al.|[2502.02067v1](http://arxiv.org/abs/2502.02067v1)|[link](https://github.com/sssshivvvv/adaptbot)| +|**2025-02-03**|**On Bob Dylan: A Computational Perspective**|Prashant Garg et.al.|[2502.01772v1](http://arxiv.org/abs/2502.01772v1)|null| +|**2025-02-03**|**VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos**|Xubin Ren et.al.|[2502.01549v1](http://arxiv.org/abs/2502.01549v1)|null| +|**2025-02-03**|**Transformers trained on proteins can learn to attend to Euclidean distance**|Isaac Ellmen et.al.|[2502.01533v1](http://arxiv.org/abs/2502.01533v1)|[link](https://github.com/Ellmen/attending-to-distance)| + +#### Abstracts +##### **GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks** +2502.14848v1 by Jianwen Luo, Yiming Huang, Jinxiang Meng, Fangyu Lei, Shizhu He, Xiao Liu, Shanshan Jiang, Bin Dong, Jun Zhao, Kang Liu + +Large Language Models (LLMs) have shown great promise in tool-making, yet +existing frameworks often struggle to efficiently construct reliable toolsets +and are limited to single-task settings. 
To address these challenges, we +propose GATE (Graph-based Adaptive Tool Evolution), an adaptive framework that +dynamically constructs and evolves a hierarchical graph of reusable tools +across multiple scenarios. We evaluate GATE on open-ended tasks (Minecraft), +agent-based tasks (TextCraft, DABench), and code generation tasks (MATH, Date, +TabMWP). Our results show that GATE achieves up to 4.3x faster milestone +completion in Minecraft compared to the previous SOTA, and provides an average +improvement of 9.23% over existing tool-making methods in code generation tasks +and 10.03% in agent tasks. GATE demonstrates the power of adaptive evolution, +balancing tool quantity, complexity, and functionality while maintaining high +efficiency. Code and data are available at +https://github.com/ayanami2003/GATE. -摘要:臨床問答 (CQA) 在醫療決策中扮演著至關重要的角色,讓醫師能夠從電子病歷 (EMR) 中擷取相關資訊。儘管 BERT、BioBERT 和 ClinicalBERT 等基於轉換器的模型已在 CQA 中展現出最先進的效能,但現有的模型缺乏分類擷取答案的能力,這對於結構化檢索、內容過濾和醫療決策支援至關重要。 - 為了解決這個限制,我們引進了一個多任務學習 (MTL) 架構,它同時訓練 CQA 模型用於答案擷取和醫療分類。除了預測答案範圍,我們的模型將回應分類為五個標準化醫療類別:診斷、藥物、症狀、程序和實驗室報告。這種分類能產生更結構化且易於理解的輸出,讓臨床問答模型在真實世界的醫療保健環境中更實用。 - 我們在 emrQA 上評估我們的做法,emrQA 是用於醫療問題解答的大規模資料集。結果顯示,與標準微調相比,MTL 將 F1 分數提高了 2.2%,同時在答案分類中達到 90.7% 的準確度。這些發現表明,MTL 不僅增強了 CQA 的效能,還引入了一種分類和結構化醫療資訊檢索的有效機制。 +摘要:大型語言模型 (LLM) 在工具製作方面展現出極大的潛力,然而現有的框架經常難以有效地建構可靠的工具組,並且僅限於單一任務設定。為了應對這些挑戰,我們提出了 GATE(基於圖形的自適應工具演化),這是一個自適應框架,可跨多個場景動態建構和演化可重複使用的工具階層圖。我們在開放式任務(Minecraft)、基於代理的任務(TextCraft、DABench)和程式碼生成任務(MATH、Date、TabMWP)上評估了 GATE。我們的結果顯示,與先前的 SOTA 相比,GATE 在 Minecraft 中實現了高達 4.3 倍的里程碑完成速度,並且在程式碼生成任務中提供了比現有工具製作方法平均提升 9.23%,在代理任務中提升了 10.03%。GATE 展示了自適應演化的力量,在保持高效率的同時,平衡了工具數量、複雜性和功能性。程式碼和資料可在 https://github.com/ayanami2003/GATE 取得。 -##### **Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection** -2502.12948v1 by Athira J Jacob, Puneet Sharma, Daniel Rueckert +##### **From RAG to Memory: Non-Parametric Continual Learning for Large Language Models** 
+2502.14802v1 by Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, Yu Su -Detection of hyperenhancement from cardiac LGE MRI images is a complex task -requiring significant clinical expertise. Although deep learning-based models -have shown promising results for the task, they require large amounts of data -with fine-grained annotations. Clinical reports generated for cardiac MR -studies contain rich, clinically relevant information, including the location, -extent and etiology of any scars present. Although recently developed -CLIP-based training enables pretraining models with image-text pairs, it -requires large amounts of data and further finetuning strategies on downstream -tasks. In this study, we use various strategies rooted in domain knowledge to -train a model for LGE detection solely using text from clinical reports, on a -relatively small clinical cohort of 965 patients. We improve performance -through the use of synthetic data augmentation, by systematically creating scar -images and associated text. In addition, we standardize the orientation of the -images in an anatomy-informed way to enable better alignment of spatial and -text features. We also use a captioning loss to enable fine-grained supervision -and explore the effect of pretraining of the vision encoder on performance. -Finally, ablation studies are carried out to elucidate the contributions of -each design component to the overall performance of the model. +Our ability to continuously acquire, organize, and leverage knowledge is a +key feature of human intelligence that AI systems must approximate to unlock +their full potential. Given the challenges in continual learning with large +language models (LLMs), retrieval-augmented generation (RAG) has become the +dominant way to introduce new information. However, its reliance on vector +retrieval hinders its ability to mimic the dynamic and interconnected nature of +human long-term memory. 
Recent RAG approaches augment vector embeddings with +various structures like knowledge graphs to address some of these gaps, namely +sense-making and associativity. However, their performance on more basic +factual memory tasks drops considerably below standard RAG. We address this +unintended deterioration and propose HippoRAG 2, a framework that outperforms +standard RAG comprehensively on factual, sense-making, and associative memory +tasks. HippoRAG 2 builds upon the Personalized PageRank algorithm used in +HippoRAG and enhances it with deeper passage integration and more effective +online use of an LLM. This combination pushes this RAG system closer to the +effectiveness of human long-term memory, achieving a 7% improvement in +associative memory tasks over the state-of-the-art embedding model while also +exhibiting superior factual knowledge and sense-making memory capabilities. +This work paves the way for non-parametric continual learning for LLMs. Our +code and data will be released at https://github.com/OSU-NLP-Group/HippoRAG. 
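The abstract above names Personalized PageRank as the retrieval backbone that HippoRAG 2 builds on. As a rough illustration only, here is a minimal power-iteration sketch of Personalized PageRank on a toy graph. The graph, the node names, and the `personalized_pagerank` helper are all hypothetical and are not taken from the HippoRAG code base.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration for Personalized PageRank.

    graph: dict mapping every node to a list of its out-neighbours.
    seeds: dict mapping seed nodes to non-negative restart weights.
    """
    nodes = list(graph)
    total = sum(seeds.values())
    # Restart distribution: random walks teleport back to the seed nodes only.
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for node, neighbours in graph.items():
            if neighbours:
                share = damping * rank[node] / len(neighbours)
                for nb in neighbours:
                    nxt[nb] += share
            else:
                # Dangling node: hand its mass back to the restart distribution.
                for n in nodes:
                    nxt[n] += damping * rank[node] * restart[n]
        rank = nxt
    return rank


# Hypothetical toy graph: one query entity linked to passages.
toy_graph = {
    "query_entity": ["passage_a"],
    "passage_a": ["query_entity", "passage_b"],
    "passage_b": ["passage_a"],
    "passage_c": ["passage_b"],
}
scores = personalized_pagerank(toy_graph, {"query_entity": 1.0})
```

Nodes that are close to the seed accumulate more score than distant ones, which is the property a graph-based retriever would rely on when ranking passages.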
-摘要:從心臟 LGE MRI 影像偵測出過度增強是一項複雜的任務,需要顯著的臨床專業知識。儘管基於深度學習的模型已顯示出對這項任務有前景的結果,但它們需要大量具有細緻註解的資料。為心臟 MR 研究產生的臨床報告包含豐富且臨床上相關的資訊,包括任何疤痕的位置、範圍和病因。儘管最近開發的基於 CLIP 的訓練能使用影像文字對預訓練模型,但它需要大量資料和進一步微調下游任務的策略。在這項研究中,我們使用植基於領域知識的各種策略,僅使用來自臨床報告的文字,在一個相對較小的 965 名患者臨床群體中訓練一個 LGE 偵測模型。我們透過使用合成資料擴充來改善效能,系統性地建立疤痕影像和相關文字。此外,我們以解剖學告知的方式標準化影像方向,以使空間和文字特徵能更好地對齊。我們也使用標題損失來啟用細緻的監督,並探討視覺編碼器的預訓練對效能的影響。最後,進行消融研究以闡明每個設計元件對模型整體效能的貢獻。 +摘要:我們持續獲取、組織和利用知識的能力是人類智慧的一項關鍵特徵,而人工智慧系統必須近似於此才能發揮其全部潛力。由於大型語言模型 (LLM) 持續學習的挑戰,檢索增強生成 (RAG) 已成為引入新資訊的主流方式。然而,它依賴向量檢索阻礙了它模擬人類長期記憶的動態和相互連結的本質。最近的 RAG 方法用各種結構(如知識圖譜)增強向量嵌入,以解決其中一些差距,即意義建構和聯想性。然而,它們在更基本的實際記憶任務上的表現遠低於標準 RAG。我們解決了這種意外的惡化,並提出了 HippoRAG 2,這是一個在實際、意義建構和聯想記憶任務上全面優於標準 RAG 的框架。HippoRAG 2 建立在 HippoRAG 中使用的 Personalized PageRank 演算法之上,並透過更深入的段落整合和更有效的 LLM 線上使用來增強它。這種組合將此 RAG 系統推向更接近人類長期記憶的效能,在聯想記憶任務上比最先進的嵌入模型提升了 7%,同時也展現出優異的實際知識和意義建構記憶能力。這項工作為 LLM 的非參數持續學習鋪平了道路。我們的程式碼和資料將在 https://github.com/OSU-NLP-Group/HippoRAG 上發布。 -##### **Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models** -2502.12825v2 by Rubing Li, João Sedoc, Arun Sundararajan +##### **Plan-over-Graph: Towards Parallelable LLM Agent Schedule** +2502.14563v1 by Shiqi Zhang, Xinbei Ma, Zouying Cao, Zhuosheng Zhang, Hai Zhao -When encountering increasingly frequent performance improvements or cost -reductions from a new large language model (LLM), developers of applications -leveraging LLMs must decide whether to take advantage of these improvements or -stay with older tried-and-tested models. Low perceived switching frictions can -lead to choices that do not consider more subtle behavior changes that the -transition may induce. Our experiments use a popular game-theoretic behavioral -economics model of trust to show stark differences in the trusting behavior of -OpenAI's and DeepSeek's models. 
We highlight a collapse in the economic trust -behavior of the o1-mini and o3-mini models as they reconcile profit-maximizing -and risk-seeking with future returns from trust, and contrast it with -DeepSeek's more sophisticated and profitable trusting behavior that stems from -an ability to incorporate deeper concepts like forward planning and -theory-of-mind. As LLMs form the basis for high-stakes commercial systems, our -results highlight the perils of relying on LLM performance benchmarks that are -too narrowly defined and suggest that careful analysis of their hidden fault -lines should be part of any organization's AI strategy. +Large Language Models (LLMs) have demonstrated exceptional abilities in +reasoning for task planning. However, challenges remain under-explored for +parallel schedules. This paper introduces a novel paradigm, plan-over-graph, in +which the model first decomposes a real-life textual task into executable +subtasks and constructs an abstract task graph. The model then understands this +task graph as input and generates a plan for parallel execution. To enhance the +planning capability of complex, scalable graphs, we design an automated and +controllable pipeline to generate synthetic graphs and propose a two-stage +training scheme. Experimental results show that our plan-over-graph method +significantly improves task performance on both API-based LLMs and trainable +open-sourced LLMs. By normalizing complex tasks as graphs, our method naturally +supports parallel execution, demonstrating global efficiency. The code and data +are available at https://github.com/zsq259/Plan-over-Graph. 
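The plan-over-graph idea above (decompose a task into subtasks, build a task graph, then plan for parallel execution) can be illustrated with a plain topological grouping. This is only a sketch under assumptions: the task names and the `parallel_batches` helper are invented for illustration, and the paper's model-generated plans are not reproduced here.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies):
    """Group tasks into batches whose members can run concurrently.

    dependencies: dict mapping each task to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # Every task returned here has all of its prerequisites finished.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical task graph for a travel-preparation task.
task_graph = {
    "book_flight": set(),
    "book_hotel": set(),
    "plan_day_trips": {"book_hotel"},
    "pack": {"book_flight", "plan_day_trips"},
}
plan = parallel_batches(task_graph)
```

Each inner list is one round of work that could be dispatched in parallel, and the number of rounds is the length of the longest dependency chain.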
-摘要:在遇到大型語言模型 (LLM) 頻頻帶來的效能提升或成本降低時,利用 LLM 的應用程式開發人員必須決定是否要利用這些提升,或繼續使用較舊且經過驗證的模型。低感知切換摩擦可能會導致選擇,而沒有考慮轉換可能引發的更細微行為變更。我們的實驗使用流行的博弈論行為經濟信任模型,以顯示 OpenAI 和 DeepSeek 模型在信任行為上的顯著差異。我們強調 o1-mini 和 o3-mini 模型的經濟信任行為崩潰,因為它們調和了利潤最大化和冒險,以及來自信任的未來回報,並將其與 DeepSeek 更複雜且有利可圖的信任行為進行對比,這種行為源於整合更深入的概念,例如前瞻性規劃和心智理論。由於 LLM 構成高風險商業系統的基礎,我們的結果突顯了依賴定義過於狹窄的 LLM 效能基準的危險,並建議仔細分析其隱藏的斷層線應該是任何組織 AI 策略的一部分。 +摘要:大型語言模型 (LLM) 已展現出在任務規劃推理方面的非凡能力。然而,對於並行時程表的挑戰仍未充分探討。本文介紹了一個新穎的範例,即圖形規劃,其中模型首先將現實生活中的文字任務分解為可執行的子任務,並建構一個抽象任務圖。然後,模型將此任務圖理解為輸入,並產生一個並行執行的計畫。為了增強複雜、可擴充圖形的規劃能力,我們設計了一個自動化且可控的管道來產生合成圖形,並提出了一個兩階段訓練方案。實驗結果表明,我們的圖形規劃方法顯著提升了基於 API 的 LLM 和可訓練的開源 LLM 的任務效能。透過將複雜任務標準化為圖形,我們的模型自然支援並行執行,展現出整體效率。程式碼和資料可在 https://github.com/zsq259/Plan-over-Graph 取得。 -##### **LLM Safety for Children** -2502.12552v1 by Prasanjit Rath, Hari Shrawgi, Parag Agrawal, Sandipan Dandapat +##### **Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization** +2502.14456v1 by Ran Ding, Ziyu Zhang, Ying Zhu, Ziqian Kong, Peilan Xu -This paper analyzes the safety of Large Language Models (LLMs) in -interactions with children below age of 18 years. Despite the transformative -applications of LLMs in various aspects of children's lives such as education -and therapy, there remains a significant gap in understanding and mitigating -potential content harms specific to this demographic. The study acknowledges -the diverse nature of children often overlooked by standard safety evaluations -and proposes a comprehensive approach to evaluating LLM safety specifically for -children. We list down potential risks that children may encounter when using -LLM powered applications. Additionally we develop Child User Models that -reflect the varied personalities and interests of children informed by -literature in child care and psychology. These user models aim to bridge the -existing gap in child safety literature across various fields. 
We utilize Child -User Models to evaluate the safety of six state of the art LLMs. Our -observations reveal significant safety gaps in LLMs particularly in categories -harmful to children but not adults +To enhance tourists' experiences and immersion, this paper proposes a +narrative-driven travel planning framework called NarrativeGuide, which +generates a geoculturally-grounded narrative script for travelers, offering a +novel, role-playing experience for their journey. In the initial stage, +NarrativeGuide constructs a knowledge graph for attractions within a city, then +configures the worldview, character setting, and exposition based on the +knowledge graph. Using this foundation, the knowledge graph is combined to +generate an independent scene unit for each attraction. During the itinerary +planning stage, NarrativeGuide models narrative-driven travel planning as an +optimization problem, utilizing a genetic algorithm (GA) to refine the +itinerary. Before evaluating the candidate itinerary, transition scripts are +generated for each pair of adjacent attractions, which, along with the scene +units, form a complete script. The weighted sum of script coherence, travel +time, and attraction scores is then used as the fitness value to update the +candidate solution set. Experimental results across four cities, i.e., Nanjing +and Yangzhou in China, Paris in France, and Berlin in Germany, demonstrate +significant improvements in narrative coherence and cultural fit, alongside a +notable reduction in travel time and an increase in the quality of visited +attractions. Our study highlights that incorporating external evolutionary +optimization effectively addresses the limitations of large language models in +travel planning. Our code is available at +https://github.com/Evan01225/Narrative-Driven-Travel-Planning. 
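The fitness value described above is a weighted sum of script coherence, travel time, and attraction scores. The sketch below shows only that scoring step on invented data. The weights, the sign convention (travel time is subtracted), the `coherence` placeholder, and the exhaustive enumeration standing in for a genetic algorithm's population are all assumptions, not details from the paper.

```python
from itertools import permutations

# Hypothetical travel times in minutes between three attractions.
TRAVEL_MINUTES = {
    frozenset(("A", "B")): 10,
    frozenset(("B", "C")): 15,
    frozenset(("A", "C")): 40,
}
# Hypothetical per-attraction quality scores in [0, 1].
ATTRACTION_SCORE = {"A": 0.9, "B": 0.6, "C": 0.8}


def coherence(itinerary):
    # Placeholder: pretend scripts that open at attraction "A" read better.
    return 1.0 if itinerary[0] == "A" else 0.5


def travel_time(itinerary):
    # Sum the travel time over each pair of adjacent attractions.
    return sum(TRAVEL_MINUTES[frozenset(pair)] for pair in zip(itinerary, itinerary[1:]))


def fitness(itinerary, w_coherence=1.0, w_time=0.01, w_score=1.0):
    # Weighted sum; travel time enters negatively so shorter routes score higher.
    mean_score = sum(ATTRACTION_SCORE[a] for a in itinerary) / len(itinerary)
    return (w_coherence * coherence(itinerary)
            - w_time * travel_time(itinerary)
            + w_score * mean_score)


# With three attractions, every ordering can be scored directly.
candidates = list(permutations("ABC"))
best = max(candidates, key=fitness)
```

In a genetic algorithm the same `fitness` function would rank a sampled population instead of every permutation, with crossover and mutation producing the next set of candidates.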
-摘要:本文分析了大型語言模型 (LLM) 在與 18 歲以下兒童互動時的安全性。儘管 LLM 在兒童生活的各個方面(例如教育和治療)都有轉變性的應用,但在了解和減輕對這個群體具體的潛在內容危害方面仍然存在顯著差距。研究承認兒童的多樣性,而標準安全評估通常會忽略這些多樣性,並提出了一種針對兒童評估 LLM 安全性的綜合方法。我們列出了兒童在使用由 LLM 提供動力的應用程式時可能遇到的潛在風險。此外,我們開發了兒童使用者模型,這些模型反映了兒童不同的個性特質和興趣,並參考了兒童照護和心理學的文獻。這些使用者模型旨在彌合不同領域兒童安全文獻中現有的差距。我們利用兒童使用者模型來評估六個最先進的 LLM 的安全性。我們的觀察結果揭示了 LLM 中的重大安全漏洞,特別是在對兒童有害但對成年人無害的類別中 +摘要:為了增強遊客的體驗和沉浸感,本文提出了一個名為 NarrativeGuide 的敘事驅動旅遊規劃框架,它會為旅客產生一個以地理文化為基礎的敘事腳本,為他們的旅程提供一個新穎的角色扮演體驗。在初始階段,NarrativeGuide 會為城市內的景點建立一個知識圖譜,然後根據知識圖譜配置世界觀、角色設定和說明。利用這個基礎,知識圖譜會與每個景點結合,為其產生一個獨立的場景單元。在行程規劃階段,NarrativeGuide 將敘事驅動的旅遊規劃建模為一個最佳化問題,利用遺傳演算法 (GA) 來優化行程。在評估候選行程之前,會為每對相鄰景點產生過場腳本,這些腳本會與場景單元一起形成一個完整的腳本。接著,將腳本連貫性、旅遊時間和景點分數的加權和用作適應值,以更新候選解集。在四個城市(即中國的南京和揚州、法國的巴黎和德國的柏林)進行的實驗結果顯示,敘事連貫性和文化契合度都有顯著的提升,同時旅遊時間大幅減少,且所參觀景點的品質也提升了。我們的研究強調,納入外部演化最佳化能有效解決大型語言模型在旅遊規劃中的限制。我們的程式碼可在 https://github.com/Evan01225/Narrative-Driven-Travel-Planning 取得。 -##### **Classifiers of Data Sharing Statements in Clinical Trial Records** -2502.12362v1 by Saber Jelodari Mamaghani, Cosima Strantz, Dennis Toddenroth +##### **Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment** +2502.14275v1 by Jiaxi Li, Yiwei Wang, Kai Zhang, Yujun Cai, Bryan Hooi, Nanyun Peng, Kai-Wei Chang, Jin Lu -Digital individual participant data (IPD) from clinical trials are -increasingly distributed for potential scientific reuse. The identification of -available IPD, however, requires interpretations of textual data-sharing -statements (DSS) in large databases. Recent advancements in computational -linguistics include pre-trained language models that promise to simplify the -implementation of effective classifiers based on textual inputs. In a subset of -5,000 textual DSS from ClinicalTrials.gov, we evaluate how well classifiers -based on domain-specific pre-trained language models reproduce original -availability categories as well as manually annotated labels. 
Typical metrics -indicate that classifiers that predicted manual annotations outperformed those -that learned to output the original availability categories. This suggests that -the textual DSS descriptions contain applicable information that the -availability categories do not, and that such classifiers could thus aid the -automatic identification of available IPD in large trial databases. +Large language models (LLMs) have been widely adopted in various downstream +task domains. However, their ability to directly recall and apply factual +medical knowledge remains under-explored. Most existing medical QA benchmarks +assess complex reasoning or multi-hop inference, making it difficult to isolate +LLMs' inherent medical knowledge from their reasoning capabilities. Given the +high-stakes nature of medical applications, where incorrect information can +have critical consequences, it is essential to evaluate how well LLMs encode, +retain, and recall fundamental medical facts. + To bridge this gap, we introduce the Medical Knowledge Judgment, a dataset +specifically designed to measure LLMs' one-hop factual medical knowledge. MKJ +is constructed from the Unified Medical Language System (UMLS), a large-scale +repository of standardized biomedical vocabularies and knowledge graphs. We +frame knowledge assessment as a binary judgment task, requiring LLMs to verify +the correctness of medical statements extracted from reliable and structured +knowledge sources. + Our experiments reveal that LLMs struggle with factual medical knowledge +retention, exhibiting significant performance variance across different +semantic categories, particularly for rare medical conditions. Furthermore, +LLMs show poor calibration, often being overconfident in incorrect answers. To +mitigate these issues, we explore retrieval-augmented generation, demonstrating +its effectiveness in improving factual accuracy and reducing uncertainty in +medical decision-making. 
-摘要:臨床試驗的數位個人參與者資料 (IPD) 愈來愈廣泛地用於潛在的科學再利用。然而,要找出可用的 IPD,需要對大型資料庫中的文字資料共享聲明 (DSS) 進行詮釋。計算語言學最近的進展包括預先訓練的語言模型,有望簡化根據文字輸入實作有效分類器的過程。在 ClinicalTrials.gov 中的 5,000 個文字 DSS 子集中,我們評估了基於特定領域預先訓練語言模型的分類器,在重現原始可用性類別以及手動註解標籤方面的表現。典型的指標顯示,預測手動註解的分類器優於學會輸出原始可用性類別的分類器。這表示文字 DSS 說明包含可用性類別所沒有的適用資訊,而且此類分類器因此有助於在大型試驗資料庫中自動找出可用的 IPD。 +摘要:大型語言模型 (LLM) 已廣泛應用於各種下游 +任務領域。然而,它們直接回憶和應用事實 +醫學知識的能力仍未得到充分探索。大多數現有的醫療問答基準 +評估複雜推理或多跳躍推論,這使得難以將 +LLM 內在的醫學知識從其推理能力中分離出來。鑑於 +醫療應用具有高風險,其中不正確的資訊可能會 +造成嚴重後果,因此評估 LLM 編碼、 +保留和回憶基本醫學事實的能力至關重要。 +為了彌合這一差距,我們引入了醫學知識判斷,這是一個專門設計用於測量 LLM 的一跳事實醫學知識的數據集。MKJ +是由統一醫學語言系統 (UMLS) 構建的,UMLS 是標準化生物醫學詞彙和知識圖譜的大型庫。我們 +將知識評估構建為二元判斷任務,要求 LLM 驗證從可靠且結構化的 +知識來源中提取的醫學陳述的正確性。 +我們的實驗表明,LLM 難以保留事實醫學知識,在不同的 +語義類別中表現出顯著的性能差異,特別是對於罕見的醫療狀況。此外, +LLM 表現出校準不佳,通常對不正確的答案過於自信。為了 +減輕這些問題,我們探索了檢索增強生成,證明了其在提高事實準確性和降低不確定性方面的有效性 +在醫療決策制定中。 -##### **Relational Norms for Human-AI Cooperation** -2502.12102v1 by Brian D. Earp, Sebastian Porsdam Mann, Mateo Aboy, Edmond Awad, Monika Betzler, Marietjie Botes, Rachel Calcott, Mina Caraccio, Nick Chater, Mark Coeckelbergh, Mihaela Constantinescu, Hossein Dabbagh, Kate Devlin, Xiaojun Ding, Vilius Dranseika, Jim A. C. Everett, Ruiping Fan, Faisal Feroz, Kathryn B. Francis, Cindy Friedman, Orsolya Friedrich, Iason Gabriel, Ivar Hannikainen, Julie Hellmann, Arasj Khodadade Jahrome, Niranjan S. Janardhanan, Paul Jurcys, Andreas Kappes, Maryam Ali Khan, Gordon Kraft-Todd, Maximilian Kroner Dale, Simon M. Laham, Benjamin Lange, Muriel Leuenberger, Jonathan Lewis, Peng Liu, David M. Lyreskog, Matthijs Maas, John McMillan, Emilian Mihailov, Timo Minssen, Joshua Teperowski Monrad, Kathryn Muyskens, Simon Myers, Sven Nyholm, Alexa M. Owen, Anna Puzio, Christopher Register, Madeline G. Reinecke, Adam Safron, Henry Shevlin, Hayate Shimizu, Peter V. Treit, Cristina Voinea, Karen Yan, Anda Zahiu, Renwen Zhang, Hazem Zohny, Walter Sinnott-Armstrong, Ilina Singh, Julian Savulescu, Margaret S. 
Clark +##### **Mitigating Lost-in-Retrieval Problems in Retrieval Augmented Multi-Hop Question Answering** +2502.14245v1 by Rongzhi Zhu, Xiangyu Liu, Zequn Sun, Yiwei Wang, Wei Hu -How we should design and interact with social artificial intelligence depends -on the socio-relational role the AI is meant to emulate or occupy. In human -society, relationships such as teacher-student, parent-child, neighbors, -siblings, or employer-employee are governed by specific norms that prescribe or -proscribe cooperative functions including hierarchy, care, transaction, and -mating. These norms shape our judgments of what is appropriate for each -partner. For example, workplace norms may allow a boss to give orders to an -employee, but not vice versa, reflecting hierarchical and transactional -expectations. As AI agents and chatbots powered by large language models are -increasingly designed to serve roles analogous to human positions - such as -assistant, mental health provider, tutor, or romantic partner - it is -imperative to examine whether and how human relational norms should extend to -human-AI interactions. Our analysis explores how differences between AI systems -and humans, such as the absence of conscious experience and immunity to -fatigue, may affect an AI's capacity to fulfill relationship-specific functions -and adhere to corresponding norms. This analysis, which is a collaborative -effort by philosophers, psychologists, relationship scientists, ethicists, -legal experts, and AI researchers, carries important implications for AI -systems design, user behavior, and regulation. While we accept that AI systems -can offer significant benefits such as increased availability and consistency -in certain socio-relational roles, they also risk fostering unhealthy -dependencies or unrealistic expectations that could spill over into human-human -relationships. 
We propose that understanding and thoughtfully shaping (or -implementing) suitable human-AI relational norms will be crucial for ensuring -that human-AI interactions are ethical, trustworthy, and favorable to human -well-being. +In this paper, we identify a critical problem, "lost-in-retrieval", in +retrieval-augmented multi-hop question answering (QA): the key entities are +missed in LLMs' sub-question decomposition. "Lost-in-retrieval" significantly +degrades the retrieval performance, which disrupts the reasoning chain and +leads to the incorrect answers. To resolve this problem, we propose a +progressive retrieval and rewriting method, namely ChainRAG, which sequentially +handles each sub-question by completing missing key entities and retrieving +relevant sentences from a sentence graph for answer generation. Each step in +our retrieval and rewriting process builds upon the previous one, creating a +seamless chain that leads to accurate retrieval and answers. Finally, all +retrieved sentences and sub-question answers are integrated to generate a +comprehensive answer to the original question. We evaluate ChainRAG on three +multi-hop QA datasets$\unicode{x2013}$MuSiQue, 2Wiki, and +HotpotQA$\unicode{x2013}$using three large language models: GPT4o-mini, +Qwen2.5-72B, and GLM-4-Plus. Empirical results demonstrate that ChainRAG +consistently outperforms baselines in both effectiveness and efficiency. 
+ +摘要:在本文中,我們在檢索增強的多跳問答 (QA) 中發現了一個關鍵問題「檢索中遺失」,關鍵實體遺失在 LLM 的子問題分解中。「檢索中遺失」顯著降低檢索效能,這會中斷推理鏈並導致錯誤的答案。為了解決此問題,我們提出了一種漸進式檢索和重寫方法,即 ChainRAG,它通過完成遺失的關鍵實體並從句子圖中檢索相關句子來順序處理每個子問題以產生答案。我們檢索和重寫過程中每一步都建立在前一步之上,創造了一個無縫的鏈,導致準確的檢索和答案。最後,所有檢索到的句子和子問題答案都整合起來,以產生對原始問題的全面答案。我們在三個多跳問答資料集$\unicode{x2013}$MuSiQue、2Wiki 和 HotpotQA$\unicode{x2013}$上評估 ChainRAG,使用三個大型語言模型:GPT4o-mini、Qwen2.5-72B 和 GLM-4-Plus。實證結果表明,ChainRAG 在有效性和效率方面都持續優於基準。 + +##### **NLP-AKG: Few-Shot Construction of NLP Academic Knowledge Graph Based on LLM** +2502.14192v1 by Jiayin Lan, Jiaqi Li, Baoxin Wang, Ming Liu, Dayong Wu, Shijin Wang, Bing Qin + +Large language models (LLMs) have been widely applied in question answering +over scientific research papers. To enhance the professionalism and accuracy of +responses, many studies employ external knowledge augmentation. However, +existing structures of external knowledge in scientific literature often focus +solely on either paper entities or domain concepts, neglecting the intrinsic +connections between papers through shared domain concepts. This results in less +comprehensive and specific answers when addressing questions that combine +papers and concepts. To address this, we propose a novel knowledge graph +framework that captures deep conceptual relations between academic papers, +constructing a relational network via intra-paper semantic elements and +inter-paper citation relations. Using a few-shot knowledge graph construction +method based on LLM, we develop NLP-AKG, an academic knowledge graph for the +NLP domain, by extracting 620,353 entities and 2,271,584 relations from 60,826 +papers in ACL Anthology. Based on this, we propose a 'sub-graph community +summary' method and validate its effectiveness on three NLP scientific +literature question answering datasets. 
-摘要:我們應如何設計和與社交人工智慧互動,取決於人工智慧預期要模仿或扮演的社會關係角色。在人類社會中,師生、父母子女、鄰居、兄弟姐妹或雇主員工等關係受特定規範所支配,這些規範規定或禁止包括等級、照顧、交易和交配在內的合作功能。這些規範形塑我們對每個夥伴適當行為的判斷。例如,職場規範可能允許老闆對員工發號施令,但反之則不行,這反映了等級和交易的期望。隨著由大型語言模型驅動的人工智慧代理程式和聊天機器人日益被設計為服務類似於人類職位的角色,例如助理、心理健康提供者、導師或浪漫伴侶,審查人類關係規範是否以及如何延伸至人類與人工智慧的互動至關重要。我們的分析探討了人工智慧系統和人類之間的差異,例如缺乏意識體驗和對疲勞的免疫力,如何影響人工智慧履行特定關係功能和遵守相應規範的能力。這項分析是由哲學家、心理學家、關係科學家、倫理學家、法律專家和人工智慧研究人員共同合作的成果,對人工智慧系統設計、使用者行為和法規具有重要的意義。雖然我們接受人工智慧系統可以在某些社會關係角色中提供顯著的好處,例如增加可用性和一致性,但它們也可能助長不健康的依賴關係或不切實際的期望,這些期望可能會蔓延到人際關係中。我們提出,理解和深思熟慮地塑造(或實施)適當的人類與人工智慧關係規範,對於確保人類與人工智慧的互動具有倫理性、可信賴性和有利於人類福祉至關重要。 +摘要:大型语言模型 (LLM) 已广泛应用于科学研究论文的问答中。为了提高响应的专业性和准确性,许多研究采用外部知识增强。然而,科学文献中现有外部知识的结构通常仅关注论文实体或领域概念,而忽略了论文之间通过共享领域概念而形成的内在联系。这导致在解决结合论文和概念的问题时,答案不够全面和具体。为了解决这个问题,我们提出了一种新颖的知识图谱框架,该框架捕获了学术论文之间的深层概念关系,通过论文内部语义元素和论文之间的引用关系构建关系网络。我们使用基于 LLM 的少量知识图谱构建方法,从 ACL Anthology 中的 60,826 篇论文中提取了 620,353 个实体和 2,271,584 个关系,开发了 NLP 领域的学术知识图谱 NLP-AKG。在此基础上,我们提出了一种“子图社区摘要”方法,并在三个 NLP 科学文献问答数据集上验证了其有效性。 -##### **Deep Spatio-Temporal Neural Network for Air Quality Reanalysis** -2502.11941v1 by Ammar Kheder, Benjamin Foreback, Lili Wang, Zhi-Song Liu, Michael Boy +##### **Object-centric Binding in Contrastive Language-Image Pretraining** +2502.14113v1 by Rim Assouel, Pietro Astolfi, Florian Bordes, Michal Drozdzal, Adriana Romero-Soriano -Air quality prediction is key to mitigating health impacts and guiding -decisions, yet existing models tend to focus on temporal trends while -overlooking spatial generalization. We propose AQ-Net, a spatiotemporal -reanalysis model for both observed and unobserved stations in the near future. -AQ-Net utilizes the LSTM and multi-head attention for the temporal regression. -We also propose a cyclic encoding technique to ensure continuous time -representation. To learn fine-grained spatial air quality estimation, we -incorporate AQ-Net with the neural kNN to explore feature-based interpolation, -such that we can fill the spatial gaps given coarse observation stations. 
To -demonstrate the efficiency of our model for spatiotemporal reanalysis, we use -data from 2013-2017 collected in northern China for PM2.5 analysis. Extensive -experiments show that AQ-Net excels in air quality reanalysis, highlighting the -potential of hybrid spatio-temporal models to better capture environmental -dynamics, especially in urban areas where both spatial and temporal variability -are critical. +Recent advances in vision language models (VLM) have been driven by +contrastive models such as CLIP, which learn to associate visual information +with their corresponding text descriptions. However, these models have +limitations in understanding complex compositional scenes involving multiple +objects and their spatial relationships. To address these challenges, we +propose a novel approach that diverges from commonly used strategies, which +rely on the design of hard-negative augmentations. Instead, our work focuses on +integrating inductive biases into pre-trained CLIP-like models to improve their +compositional understanding without using any additional hard-negatives. To +that end, we introduce a binding module that connects a scene graph, derived +from a text description, with a slot-structured image representation, +facilitating a structured similarity assessment between the two modalities. We +also leverage relationships as text-conditioned visual constraints, thereby +capturing the intricate interactions between objects and their contextual +relationships more effectively. Our resulting model not only enhances the +performance of CLIP-based models in multi-object compositional understanding +but also paves the way towards more accurate and sample-efficient image-text +matching of complex scenes. 
-摘要:空气品质预测是减轻健康影响和指导决策的关键,但现有的模型倾向于关注时间趋势,而忽略空间概化。我们提出了 AQ-Net,这是一种时空再分析模型,适用于近期内已观测和未观测到的站点。AQ-Net 利用 LSTM 和多头注意力进行时间回归。我们还提出了一种循环编码技术来确保时间表示的连续性。为了学习细粒度的空间空气质量估计,我们将 AQ-Net 与神经 kNN 结合起来,以探索基于特征的插值,以便我们能够填充给定粗略观测站的空间空白。为了展示我们的模型在时空再分析中的效率,我们使用了 2013-2017 年在中国北部收集的 PM2.5 分析数据。大量的实验表明,AQ-Net 在空气质量再分析中表现出色,突出了混合时空模型在更好地捕捉环境动态方面的潜力,尤其是在空间和时间变异性都很关键的城市地区。 +摘要:最近视觉语言模型 (VLM) 的进步是由对比模型(例如 CLIP)推动的,该模型学习将视觉信息与其对应的文本描述联系起来。然而,这些模型在理解涉及多个对象及其空间关系的复杂组合场景方面存在局限性。为了应对这些挑战,我们提出了一种新颖的方法,它偏离了常用的策略,即依赖于硬负增强设计。相反,我们的工作重点是将归纳偏差集成到预训练的类似 CLIP 的模型中,以提高其组合理解能力,而无需使用任何其他硬否定。为此,我们引入了一个绑定模块,它将从文本描述中派生的场景图与槽结构图像表示连接起来,从而促进了两种模式之间的结构化相似性评估。我们还利用关系作为文本条件的视觉约束,从而更有效地捕捉对象及其上下文关系之间的复杂交互。我们由此产生的模型不仅增强了基于 CLIP 的模型在多对象组合理解中的性能,而且还为复杂场景的更准确和样本高效的图像文本匹配铺平了道路。 -##### **Proactive Depot Discovery: A Generative Framework for Flexible Location-Routing** -2502.11715v1 by Site Qu, Guoqiang Hu +##### **Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning** +2502.14086v1 by Cole Gawin, Yidan Sun, Mayank Kejriwal -The Location-Routing Problem (LRP), which combines the challenges of facility -(depot) locating and vehicle route planning, is critically constrained by the -reliance on predefined depot candidates, limiting the solution space and -potentially leading to suboptimal outcomes. Previous research on LRP without -predefined depots is scant and predominantly relies on heuristic algorithms -that iteratively attempt depot placements across a planar area. Such approaches -lack the ability to proactively generate depot locations that meet specific -geographic requirements, revealing a notable gap in current research landscape. -To bridge this gap, we propose a data-driven generative DRL framework, designed -to proactively generate depots for LRP without predefined depot candidates, -solely based on customer requests data which include geographic and demand -information. 
It can operate in two distinct modes: direct generation of exact -depot locations, and the creation of a multivariate Gaussian distribution for -flexible depots sampling. By extracting depots' geographic pattern from -customer requests data, our approach can dynamically respond to logistical -needs, identifying high-quality depot locations that further reduce total -routing costs compared to traditional methods. Extensive experiments -demonstrate that, for a same group of customer requests, compared with those -depots identified through random attempts, our framework can proactively -generate depots that lead to superior solution routes with lower routing cost. -The implications of our framework potentially extend into real-world -applications, particularly in emergency medical rescue and disaster relief -logistics, where rapid establishment and adjustment of depot locations are -paramount, showcasing its potential in addressing LRP for dynamic and -unpredictable environments. +Large language models (LLMs) have achieved remarkable performance in +generating human-like text and solving reasoning tasks of moderate complexity, +such as question-answering and mathematical problem-solving. However, their +capabilities in tasks requiring deeper cognitive skills, such as common-sense +understanding and abstract reasoning, remain under-explored. In this paper, we +systematically evaluate abstract common-sense reasoning in LLMs using the +ConceptNet knowledge graph. We propose two prompting approaches: instruct +prompting, where models predict plausible semantic relationships based on +provided definitions, and few-shot prompting, where models identify relations +using examples as guidance. Our experiments with the gpt-4o-mini model show +that in instruct prompting, consistent performance is obtained when ranking +multiple relations but with substantial decline when the model is restricted to +predicting only one relation. 
In few-shot prompting, the model's accuracy +improves significantly when selecting from five relations rather than the full +set, although with notable bias toward certain relations. These results suggest +significant gaps still, even in commercially used LLMs' abstract common-sense +reasoning abilities, compared to human-level understanding. However, the +findings also highlight the promise of careful prompt engineering, based on +selective retrieval, for obtaining better performance. -摘要:地點路線問題(LRP)結合了設施(倉庫)定位和車輛路線規劃的挑戰,嚴重受到預先定義的倉庫候選限制,限制了解決方案空間,並可能導致次優結果。先前關於沒有預先定義倉庫的 LRP 研究很少,而且主要依賴於啟發式演算法,在平面區域中反覆嘗試倉庫配置。這種方法無法主動產生符合特定地理需求的倉庫位置,顯示了當前研究領域的顯著差距。為了彌補這個差距,我們提出一個資料驅動的生成式 DRL 架構,旨在主動為 LRP 產生倉庫,而無需預先定義的倉庫候選,僅根據包含地理和需求資訊的客戶要求資料。它可以在兩種不同的模式下運作:直接產生確切的倉庫位置,以及建立多元高斯分布以進行彈性倉庫抽樣。透過從客戶要求資料中提取倉庫的地理模式,我們的方法可以動態回應後勤需求,找出高品質的倉庫位置,進一步降低與傳統方法相比的總路線成本。廣泛的實驗證明,對於同一組客戶要求,與透過隨機嘗試識別的那些倉庫相比,我們的架構可以主動產生倉庫,並產生路線成本較低的優質解決方案路線。我們的架構的影響潛在地擴展到實際應用,特別是在緊急醫療救援和災害救災後勤方面,其中倉庫位置的快速建立和調整至關重要,展示了其在解決動態和不可預測環境的 LRP 中的潛力。 +摘要:大型語言模型 (LLM) 在生成類人文本和解決中等複雜度推理任務方面取得了顯著的成果,例如問答和數學問題解決。然而,它們在需要更深層認知技能的任務中的能力,例如常識理解和抽象推理,仍然處於探索不足的階段。在本文中,我們使用 ConceptNet 知識圖系統地評估了 LLM 中的抽象常識推理。我們提出了兩種提示方法:指導提示,其中模型根據提供的定義預測合理的語義關係,以及少次提示,其中模型使用示例作為指導來識別關係。我們使用 gpt-4o-mini 模型進行的實驗表明,在指導提示中,在對多個關係進行排名時獲得了一致的性能,但在模型僅限於預測一個關係時大幅下降。在少次提示中,模型在從五個關係中選擇而不是從完整集合中選擇時,其準確性顯著提高,儘管對某些關係存在顯著偏差。這些結果表明,與人類層面的理解相比,即使在商業使用的 LLM 中,抽象常識推理能力仍然存在顯著差距。然而,這些發現也強調了基於選擇性檢索的仔細提示工程的希望,以獲得更好的性能。 -##### **LLM Agents Making Agent Tools** -2502.11705v1 by Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelović, Jakob Nikolas Kather +##### **Neurosymbolic artificial intelligence via large language models and coherence-driven inference** +2502.13953v1 by Steve Huntsman, Jewell Thomas -Tool use has turned large language models (LLMs) into powerful agents that -can perform complex multi-step tasks by dynamically utilising external software -components. 
However, these tools must be implemented in advance by human -developers, hindering the applicability of LLM agents in domains which demand -large numbers of highly specialised tools, like in life sciences and medicine. -Motivated by the growing trend of scientific studies accompanied by public code -repositories, we propose ToolMaker, a novel agentic framework that autonomously -transforms papers with code into LLM-compatible tools. Given a short task -description and a repository URL, ToolMaker autonomously installs required -dependencies and generates code to perform the task, using a closed-loop -self-correction mechanism to iteratively diagnose and rectify errors. To -evaluate our approach, we introduce a benchmark comprising 15 diverse and -complex computational tasks spanning both medical and non-medical domains with -over 100 unit tests to objectively assess tool correctness and robustness. -ToolMaker correctly implements 80% of the tasks, substantially outperforming -current state-of-the-art software engineering agents. ToolMaker therefore is a -step towards fully autonomous agent-based scientific workflows. +We devise an algorithm to generate sets of propositions that objectively +instantiate graphs that support coherence-driven inference. We then benchmark +the ability of large language models (LLMs) to reconstruct coherence graphs +from (a straightforward transformation of) propositions expressed in natural +language, with promising results from a single prompt to models optimized for +reasoning. Combining coherence-driven inference with consistency evaluations by +neural models may advance the state of the art in machine cognition. 
-摘要:工具使用已將大型語言模型 (LLM) 轉變為強大的代理,可透過動態使用外部軟體元件來執行複雜的多步驟任務。然而,這些工具必須事先由人類開發人員實作,這會阻礙 LLM 代理在需要大量高度專業化工具的領域(例如生命科學和醫學)中的應用性。受到伴隨公開程式碼儲存庫的科學研究趨勢所啟發,我們提出 ToolMaker,一個創新的代理架構,可自主地將帶有程式碼的論文轉換為相容於 LLM 的工具。給定簡短的任務描述和儲存庫網址,ToolMaker 會自主安裝所需的依賴項,並產生程式碼來執行任務,使用閉環自我修正機制來反覆診斷和糾正錯誤。為了評估我們的做法,我們引進一個包含 15 個不同且複雜的運算任務的基準,涵蓋醫療和非醫療領域,並包含超過 100 個單元測試,以客觀評估工具的正確性和穩健性。ToolMaker 正確實作了 80% 的任務,大幅優於目前的最新軟體工程代理。因此,ToolMaker 是邁向完全自主的基於代理的科學工作流程的一步。 +摘要:我們設計一種演算法,用來產生命題集合,以客觀地實例化支援連貫性驅動推論的圖形。接著,我們基準化大型語言模型 (LLM) 從以自然語言表達的命題(經過直接轉換)重建連貫性圖形的能力,結果顯示,單一提示就能從最佳化用於推理的模型中獲得有希望的結果。將連貫性驅動推論與神經模型的一致性評估結合起來,可能會提升機器認知的現有技術。 -##### **MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression** -2502.11651v1 by Linjie Mu, Zhongzhen Huang, Shengqian Qin, Yakun Zhu, Shaoting Zhang, Xiaofan Zhang +##### **Complex Ontology Matching with Large Language Model Embeddings** +2502.13619v1 by Guilherme Sousa, Rinaldo Lima, Cassia Trojahn -Large vision-language models (LVLMs) have shown great promise in medical -applications, particularly in visual question answering (MedVQA) and diagnosis -from medical images. However, existing datasets and models often fail to -consider critical aspects of medical diagnostics, such as the integration of -historical records and the analysis of disease progression over time. In this -paper, we introduce MMXU (Multimodal and MultiX-ray Understanding), a novel -dataset for MedVQA that focuses on identifying changes in specific regions -between two patient visits. Unlike previous datasets that primarily address -single-image questions, MMXU enables multi-image questions, incorporating both -current and historical patient data. We demonstrate the limitations of current -LVLMs in identifying disease progression on MMXU-\textit{test}, even those that -perform well on traditional benchmarks. To address this, we propose a -MedRecord-Augmented Generation (MAG) approach, incorporating both global and -regional historical records. 
Our experiments show that integrating historical -records significantly enhances diagnostic accuracy by at least 20\%, bridging -the gap between current LVLMs and human expert performance. Additionally, we -fine-tune models with MAG on MMXU-\textit{dev}, which demonstrates notable -improvements. We hope this work could illuminate the avenue of advancing the -use of LVLMs in medical diagnostics by emphasizing the importance of historical -context in interpreting medical images. Our dataset is released at -\href{https://github.com/linjiemu/MMXU}{https://github.com/linjiemu/MMXU}. +Ontology, and more broadly, Knowledge Graph Matching is a challenging task in +which expressiveness has not been fully addressed. Despite the increasing use +of embeddings and language models for this task, approaches for generating +expressive correspondences still do not take full advantage of these models, in +particular, large language models (LLMs). This paper proposes to integrate LLMs +into an approach for generating expressive correspondences based on alignment +need and ABox-based relation discovery. The generation of correspondences is +performed by matching similar surroundings of instance sub-graphs. The +integration of LLMs results in different architectural modifications, including +label similarity, sub-graph matching, and entity matching. The performance of word +embeddings, sentence embeddings, and LLM-based embeddings was compared. The +results demonstrate that integrating LLMs surpasses all other models, enhancing +the baseline version of the approach with a 45\% increase in F-measure. 
-摘要:大型視覺語言模型 (LVLMs) 已在醫療應用中展現出極大的潛力,特別是在視覺問答 (MedVQA) 和醫學影像診斷方面。然而,現有的資料集和模型常常無法考量醫療診斷的關鍵層面,例如病歷整合以及隨著時間推移對疾病進程的分析。在本文中,我們介紹 MMXU(多模態多 X 光理解),一個專注於識別兩次患者就診之間特定區域變化的 MedVQA 新資料集。與主要處理單一影像問題的先前資料集不同,MMXU 支援多影像問題,同時納入當前和病史患者資料。我們展示了現有 LVLMs 在 MMXU-\textit{test} 中識別疾病進程的限制,即使是在傳統基準測試中表現良好的 LVLMs 也是如此。為了解決這個問題,我們提出了一個病歷增強生成 (MAG) 方法,結合了全域和區域病史。我們的實驗顯示,整合病歷可顯著提升至少 20% 的診斷準確度,縮小了現有 LVLMs 和人類專家表現之間的差距。此外,我們在 MMXU-\textit{dev} 上微調帶有 MAG 的模型,這展示了顯著的進步。我們希望這項工作能透過強調病史脈絡在解讀醫學影像中的重要性,為推進 LVLMs 在醫療診斷中的應用開闢道路。我們的資料集已於\href{https://github.com/linjiemu/MMXU}{https://github.com/linjiemu/MMXU} 發布。 +摘要:本体论,更广泛地说,知识图谱匹配是一项具有挑战性的任务,其中表达力尚未得到充分解决。尽管越来越多地使用嵌入和语言模型来完成此任务,但生成表达性对应关系的方法仍然没有充分利用这些模型,特别是大型语言模型 (LLM)。本文提出将 LLM 集成到一种基于对齐需求和基于 ABox 的关系发现来生成表达性对应关系的方法中。对应关系的生成是通过匹配实例子图的相似周围环境来执行的。LLM 的集成导致了不同的架构修改,包括标签相似性、子图匹配和实体匹配。比较了单词嵌入、句子嵌入和基于 LLM 的嵌入的性能。结果表明,集成 LLM 超越了所有其他模型,通过 F-measure 提高了 45% 的基准版本的方法。 -##### **A Survey of Personalized Large Language Models: Progress and Future Directions** -2502.11528v1 by Jiahong Liu, Zexuan Qiu, Zhongyang Li, Quanyu Dai, Jieming Zhu, Minda Hu, Menglin Yang, Irwin King +##### **Are Large Language Models In-Context Graph Learners?** +2502.13562v1 by Jintang Li, Ruofan Wu, Yuchang Zhu, Huizhe Zhang, Liang Chen, Zibin Zheng -Large Language Models (LLMs) excel in handling general knowledge tasks, yet -they struggle with user-specific personalization, such as understanding -individual emotions, writing styles, and preferences. Personalized Large -Language Models (PLLMs) tackle these challenges by leveraging individual user -data, such as user profiles, historical dialogues, content, and interactions, -to deliver responses that are contextually relevant and tailored to each user's -specific needs. This is a highly valuable research topic, as PLLMs can -significantly enhance user satisfaction and have broad applications in -conversational agents, recommendation systems, emotion recognition, medical -assistants, and more. 
This survey reviews recent advancements in PLLMs from -three technical perspectives: prompting for personalized context (input level), -finetuning for personalized adapters (model level), and alignment for -personalized preferences (objective level). To provide deeper insights, we also -discuss current limitations and outline several promising directions for future -research. Updated information about this survey can be found at the -https://github.com/JiahongLiu21/Awesome-Personalized-Large-Language-Models. +Large language models (LLMs) have demonstrated remarkable in-context +reasoning capabilities across a wide range of tasks, particularly with +unstructured inputs such as language or images. However, LLMs struggle to +handle structured data, such as graphs, due to their lack of understanding of +non-Euclidean structures. As a result, without additional fine-tuning, their +performance significantly lags behind that of graph neural networks (GNNs) in +graph learning tasks. In this paper, we show that learning on graph data can be +conceptualized as a retrieval-augmented generation (RAG) process, where +specific instances (e.g., nodes or edges) act as queries, and the graph itself +serves as the retrieved context. Building on this insight, we propose a series +of RAG frameworks to enhance the in-context learning capabilities of LLMs for +graph learning tasks. Comprehensive evaluations demonstrate that our proposed +RAG frameworks significantly improve LLM performance on graph-based tasks, +particularly in scenarios where a pretrained LLM must be used without +modification or accessed via an API. 
-摘要:大型語言模型 (LLM) 在處理一般知識任務方面表現出色,但 -它們在使用者特定的個人化方面有困難,例如理解 -個別的情緒、寫作風格和偏好。個人化大型 -語言模型 (PLLM) 透過利用個別使用者的 -資料來解決這些挑戰,例如使用者個人資料、歷史對話、內容和互動, -提供在脈絡上相關且針對每個使用者的特定需求量身打造的回應。這是一個非常有價值的研究主題,因為 PLLM 可以 -顯著提升使用者滿意度,並在對話代理、推薦系統、情緒辨識、醫療 -助理等方面有廣泛的應用。這項調查從三個技術觀點回顧 PLLM 的最新進展:提示個人化脈絡(輸入層級)、微調個人化適配器(模型層級),以及對齊個人化偏好(目標層級)。為了提供更深入的見解,我們也 -討論目前的限制,並概述未來研究的幾個有希望的方向。這項調查的最新資訊可以在 -https://github.com/JiahongLiu21/Awesome-Personalized-Large-Language-Models 找到。 +摘要:大型語言模型 (LLM) 在廣泛的任務中展示了非凡的語境推理能力,特別是對於語言或影像等非結構化輸入。然而,LLM 難以處理結構化資料,例如圖形,因為它們無法理解非歐幾何結構。因此,在沒有額外微調的情況下,它們在圖形學習任務中的表現遠遠落後於圖形神經網路 (GNN)。在本文中,我們展示了在圖形資料上學習可以被概念化為檢索增強生成 (RAG) 過程,其中特定實例(例如,節點或邊)充當查詢,而圖形本身則作為檢索的語境。基於這個見解,我們提出了一系列 RAG 架構,以增強 LLM 在圖形學習任務中的語境學習能力。全面的評估表明,我們提出的 RAG 架構顯著提升了 LLM 在基於圖形的任務上的表現,特別是在預訓練的 LLM 必須在不修改或透過 API 存取的情況下使用的場景中。 -##### **Variable-frame CNNLSTM for Breast Nodule Classification using Ultrasound Videos** -2502.11481v1 by Xiangxiang Cui, Zhongyu Li, Xiayue Fan, Peng Huang, Ying Wang, Meng Yang, Shi Chang, Jihua Zhu +##### **Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs** +2502.13555v1 by Yushi Feng, Tsai Hor Chan, Guosheng Yin, Lequan Yu -The intersection of medical imaging and artificial intelligence has become an -important research direction in intelligent medical treatment, particularly in -the analysis of medical images using deep learning for clinical diagnosis. -Despite the advances, existing keyframe classification methods lack extraction -of time series features, while ultrasonic video classification based on -three-dimensional convolution requires uniform frame numbers across patients, -resulting in poor feature extraction efficiency and model classification -performance. This study proposes a novel video classification method based on -CNN and LSTM, introducing NLP's long and short sentence processing scheme into -video classification for the first time. 
The method reduces CNN-extracted image -features to 1x512 dimension, followed by sorting and compressing feature -vectors for LSTM training. Specifically, feature vectors are sorted by patient -video frame numbers and populated with padding value 0 to form variable -batches, with invalid padding values compressed before LSTM training to -conserve computing resources. Experimental results demonstrate that our -variable-frame CNNLSTM method outperforms other approaches across all metrics, -showing improvements of 3-6% in F1 score and 1.5% in specificity compared to -keyframe methods. The variable-frame CNNLSTM also achieves better accuracy and -precision than equal-frame CNNLSTM. These findings validate the effectiveness -of our approach in classifying variable-frame ultrasound videos and suggest -potential applications in other medical imaging modalities. +Data augmentation is necessary for graph representation learning due to the +scarcity and noise present in graph data. Most of the existing augmentation +methods overlook the context information inherited from the dataset as they +rely solely on the graph structure for augmentation. Despite the success of +some large language model-based (LLM) graph learning methods, they are mostly +white-box which require access to the weights or latent features from the +open-access LLMs, making them difficult to be democratized for everyone as +existing LLMs are mostly closed-source for commercial considerations. To +overcome these limitations, we propose a black-box context-driven graph data +augmentation approach, with the guidance of LLMs -- DemoGraph. Leveraging the +text prompt as context-related information, we task the LLM with generating +knowledge graphs (KGs), which allow us to capture the structural interactions +from the text outputs. We then design a dynamic merging schema to +stochastically integrate the LLM-generated KGs into the original graph during +training. 
To control the sparsity of the augmented graph, we further devise a +granularity-aware prompting strategy and an instruction fine-tuning module, +which seamlessly generates text prompts according to different granularity +levels of the dataset. Extensive experiments on various graph learning tasks +validate the effectiveness of our method over existing graph data augmentation +methods. Notably, our approach excels in scenarios involving electronic health +records (EHRs), which validates its maximal utilization of contextual +knowledge, leading to enhanced predictive performance and interpretability. -摘要:醫學影像與人工智慧的交叉領域已成為智慧醫療的重要研究方向,特別是在臨床診斷中使用深度學習分析醫學影像。儘管有進展,現有的關鍵影格分類方法缺乏時間序列特徵的提取,而基於三維卷積的超音波影片分類需要患者之間的均勻影格數,導致特徵提取效率差和模型分類效能不佳。本研究提出了一種基於 CNN 和 LSTM 的新影片分類方法,首次將 NLP 的長短句處理機制引入影片分類中。該方法將 CNN 提取的影像特徵縮減為 1x512 維度,然後對特徵向量進行排序和壓縮以進行 LSTM 訓練。具體來說,特徵向量按患者影片影格數排序,並填充 0 補齊值以形成可變批次,在 LSTM 訓練前壓縮無效的補齊值以節省運算資源。實驗結果表明,我們的可變影格 CNNLSTM 方法在所有指標上都優於其他方法,與關鍵影格方法相比,F1 分數提高了 3-6%,特異性提高了 1.5%。可變影格 CNNLSTM 也比等影格 CNNLSTM 達到了更好的準確度和精確度。這些發現驗證了我們的方法在分類可變影格超音波影片中的有效性,並表明在其他醫學影像模式中具有潛在的應用。 +摘要:由於圖表資料的稀少性和雜訊,資料擴充對於圖表表示學習來說是必要的。現有的擴充方法大多忽略了從資料集中繼承的背景資訊,因為它們僅依賴於圖表的結構進行擴充。儘管一些大型語言模型 (LLM) 基於圖表學習方法獲得成功,但它們大多是白盒,需要存取開放式 LLM 的權重或潛在特徵,由於現有的 LLM 主要基於商業考量而封閉原始碼,因此難以讓所有人都能使用。為了克服這些限制,我們提出了一個黑盒背景驅動圖表資料擴充方法,在 LLM 的指導下——DemoGraph。利用文字提示作為與背景相關的資訊,我們讓 LLM 產生知識圖譜 (KG),這讓我們能夠從文字輸出中擷取結構化互動。然後,我們設計了一個動態合併模式,在訓練期間將 LLM 產生的 KG 隨機整合到原始圖表中。為了控制擴充圖表的稀疏性,我們進一步設計了一個粒度感知提示策略和一個指令微調模組,它可以根據資料集的不同粒度層級無縫產生文字提示。在各種圖表學習任務上的大量實驗驗證了我們的方法比現有的圖表資料擴充方法更有效。值得注意的是,我們的做法在涉及電子健康記錄 (EHR) 的場景中表現出色,這驗證了它對上下文知識的最大利用,從而提高了預測效能和可解釋性。 -##### **Leveraging Labelled Data Knowledge: A Cooperative Rectification Learning Network for Semi-supervised 3D Medical Image Segmentation** -2502.11456v1 by Yanyan Wang, Kechen Song, Yuyuan Liu, Shuai Ma, Yunhui Yan, Gustavo Carneiro +##### **PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference** +2502.13502v1 by Burc Gokden -Semi-supervised 3D medical image 
segmentation aims to achieve accurate -segmentation using few labelled data and numerous unlabelled data. The main -challenge in the design of semi-supervised learning methods consists in the -effective use of the unlabelled data for training. A promising solution -consists of ensuring consistent predictions across different views of the data, -where the efficacy of this strategy depends on the accuracy of the -pseudo-labels generated by the model for this consistency learning strategy. In -this paper, we introduce a new methodology to produce high-quality -pseudo-labels for a consistency learning strategy to address semi-supervised 3D -medical image segmentation. The methodology has three important contributions. -The first contribution is the Cooperative Rectification Learning Network (CRLN) -that learns multiple prototypes per class to be used as external knowledge -priors to adaptively rectify pseudo-labels at the voxel level. The second -contribution consists of the Dynamic Interaction Module (DIM) to facilitate -pairwise and cross-class interactions between prototypes and multi-resolution -image features, enabling the production of accurate voxel-level clues for -pseudo-label rectification. The third contribution is the Cooperative Positive -Supervision (CPS), which optimises uncertain representations to align with -unassertive representations of their class distributions, improving the model's -accuracy in classifying uncertain regions. Extensive experiments on three -public 3D medical segmentation datasets demonstrate the effectiveness and -superiority of our semi-supervised learning method. +We show that Large Language Model from Power Law Decoder Representations +(PLDR-LLM) is a foundational model whose deductive outputs are invariant +tensors up to a small perturbation. 
PLDR-LLM learns a singularity condition for +the deductive outputs that enable the once-inferred energy-curvature tensor +$\mathbf{G}_{LM}$ to replace the deep neural network of power law graph +attention (PLGA) generating the deductive outputs at inference. We demonstrate +that a cache for $\mathbf{G}_{LM}$ (G-cache) and KV-cache can be implemented in +a straightforward manner to improve the inference time. The invariance and +generalizable nature of deductive outputs is at a very high fidelity where +deductive outputs have same RMSE and determinant values up to 15 decimal places +after caching, and zero-shot benchmark scores remain unchanged. Ablation +studies show that learned deductive outputs have distinct loss and accuracy +characteristics from models pretrained with transferred, randomly initialized +or identity tensors as a constant tensor operator and an LLM with scaled-dot +product attention (SDPA) is a special case of PLDR-LLM where $\mathbf{G}_{LM}$ +is predefined as identity. The observed invariance characteristic introduces a +novel asymmetry between training and inference phases with caching. We outline +observed common characteristics of the deductive outputs for the learned +singularity condition. We provide an implementation of a training and inference +framework for PLDR-LLM with KV-cache and G-cache. 
-摘要:半监督 3D 医学影像分割旨在使用少量标记数据和大量未标记数据实现精确分割。半监督学习方法设计中的主要挑战在于有效使用未标记数据进行训练。一个有前景的解决方案是确保数据不同视图之间预测的一致性,其中此策略的有效性取决于模型为这种一致性学习策略生成的伪标签的准确性。在本文中,我们引入了一种新的方法来为一致性学习策略生成高质量的伪标签,以解决半监督 3D 医学图像分割问题。该方法有三个重要的贡献。第一个贡献是协作修正学习网络 (CRLN),它为每个类别学习多个原型,用作外部知识先验,以在体素级别自适应地修正伪标签。第二个贡献包括动态交互模块 (DIM),以促进原型和多分辨率图像特征之间的成对和跨类交互,从而能够生成用于伪标签修正的准确体素级线索。第三个贡献是协作正监督 (CPS),它优化不确定的表示以与其类分布的不确定表示保持一致,从而提高模型对不确定区域进行分类的准确性。在三个公共 3D 医学分割数据集上进行的大量实验表明了我们半监督学习方法的有效性和优越性。 +摘要:我們展示了來自冪律解碼器表示 (PLDR-LLM) 的大型語言模型是一個基礎模型,其演繹輸出是直到一個小擾動的不變張量。PLDR-LLM 學習演繹輸出的奇異條件,使曾經推斷出的能量曲率張量 $\mathbf{G}_{LM}$ 能夠取代產生演繹輸出的冪律圖注意力 (PLGA) 深度神經網路,進行推論。我們證明了 $\mathbf{G}_{LM}$ 快取 (G 快取) 和 KV 快取能夠以一種直接的方式實作,以改善推論時間。演繹輸出的不變性和可概化性質具有非常高的保真度,其中演繹輸出在快取後具有相同的 RMSE 和行列式值,直到小數點後 15 位,且零次學習基準分數保持不變。消融研究表明,學習的演繹輸出具有與使用轉移、隨機初始化或恆等張量作為常數張量算子和具有縮放點積注意力的 LLM 預先訓練的模型不同的損失和準確性特徵,並且 $\mathbf{G}_{LM}$ 被預先定義為恆等的 PLDR-LLM 的一個特例,其中 $\mathbf{G}_{LM}$ 被預先定義為恆等。觀察到的不變特徵引入了訓練和推論階段之間一個新的不對稱性,並帶有快取。我們概述了學習的奇異條件演繹輸出的觀察到的共同特徵。我們提供了一個具有 KV 快取和 G 快取的 PLDR-LLM 訓練和推論框架的實作。 -##### **A Survey of LLM-based Agents in Medicine: How far are we from Baymax?** -2502.11211v1 by Wenxuan Wang, Zizhan Ma, Zheng Wang, Chenghan Wu, Wenting Chen, Xiang Li, Yixuan Yuan +##### **Explore-Construct-Filter: An Automated Framework for Rich and Reliable API Knowledge Graph Construction** +2502.13412v1 by Yanbang Sun, Qing Huang, Xiaoxue Ren, Zhenchang Xing, Xiaohong Li, Junjie Wang -Large Language Models (LLMs) are transforming healthcare through the -development of LLM-based agents that can understand, reason about, and assist -with medical tasks. This survey provides a comprehensive review of LLM-based -agents in medicine, examining their architectures, applications, and -challenges. We analyze the key components of medical agent systems, including -system profiles, clinical planning mechanisms, medical reasoning frameworks, -and external capacity enhancement. 
The survey covers major application -scenarios such as clinical decision support, medical documentation, training -simulations, and healthcare service optimization. We discuss evaluation -frameworks and metrics used to assess these agents' performance in healthcare -settings. While LLM-based agents show promise in enhancing healthcare delivery, -several challenges remain, including hallucination management, multimodal -integration, implementation barriers, and ethical considerations. The survey -concludes by highlighting future research directions, including advances in -medical reasoning inspired by recent developments in LLM architectures, -integration with physical systems, and improvements in training simulations. -This work provides researchers and practitioners with a structured overview of -the current state and future prospects of LLM-based agents in medicine. +The API Knowledge Graph (API KG) is a structured network that models API +entities and their relations, providing essential semantic insights for tasks +such as API recommendation, code generation, and API misuse detection. However, +constructing a knowledge-rich and reliable API KG presents several challenges. +Existing schema-based methods rely heavily on manual annotations to design KG +schemas, leading to excessive manual overhead. On the other hand, schema-free +methods, due to the lack of schema guidance, are prone to introducing noise, +reducing the KG's reliability. To address these issues, we propose the +Explore-Construct-Filter framework, an automated approach for API KG +construction based on large language models (LLMs). 
This framework consists of +three key modules: 1) KG exploration: LLMs simulate the workflow of annotators +to automatically design a schema with comprehensive type triples, minimizing +human intervention; 2) KG construction: Guided by the schema, LLMs extract +instance triples to construct a rich yet unreliable API KG; 3) KG filtering: +Removing invalid type triples and suspicious instance triples to construct a +rich and reliable API KG. Experimental results demonstrate that our method +surpasses the state-of-the-art method, achieving a 25.2% improvement in F1 +score. Moreover, the Explore-Construct-Filter framework proves effective, with +the KG exploration module increasing KG richness by 133.6% and the KG filtering +module improving reliability by 26.6%. Finally, cross-model experiments confirm +the generalizability of our framework. -摘要:大型語言模型 (LLM) 透過開發可理解、推理並協助醫療任務的 LLM 基礎代理人,轉變了醫療保健。本調查提供了 LLM 基礎代理人在醫學中的全面回顧,探討其架構、應用和挑戰。我們分析了醫療代理系統的主要組成部分,包括系統概況、臨床規劃機制、醫療推理架構和外部能力提升。本調查涵蓋了主要的應用場景,例如臨床決策支援、醫療文件、訓練模擬和醫療保健服務最佳化。我們討論了用於評估這些代理人在醫療保健環境中表現的評估架構和指標。雖然 LLM 基礎代理人顯示出在增強醫療保健提供方面的潛力,但仍有許多挑戰,包括幻覺管理、多模態整合、實施障礙和倫理考量。本調查最後強調了未來的研究方向,包括受 LLM 架構近期發展啟發的醫療推理進展、與物理系統的整合和訓練模擬的改進。這項工作為研究人員和從業人員提供了 LLM 基礎代理人在醫學中當前狀態和未來前景的結構化概觀。 +摘要:API 知識圖譜 (API KG) 是一個結構化網路,用於建模 API 實體及其關係,提供基本語義見解,以執行 API 建議、程式碼產生和 API 誤用偵測等任務。然而,建構一個知識豐富且可靠的 API KG 會產生若干挑戰。現有的基於架構的方法嚴重依賴手動註解來設計 KG 架構,導致過度的手動開銷。另一方面,由於缺乏架構指導,無架構的方法容易引入雜訊,降低 KG 的可靠性。為了解決這些問題,我們提出了探索建構過濾架構,這是一種基於大型語言模型 (LLM) 的自動化 API KG 建構方法。此架構包含三個關鍵模組:1) KG 探索:LLM 模擬註解者的工作流程,自動設計具有完整類型三元組的架構,將人為干預降至最低;2) KG 建構:在架構的指導下,LLM 提取實例三元組來建構豐富但不可靠的 API KG;3) KG 過濾:移除無效的類型三元組和可疑的實例三元組,以建構豐富且可靠的 API KG。實驗結果表明,我們的方法優於最先進的方法,在 F1 分數上提高了 25.2%。此外,探索建構過濾架構被證明是有效的,其中 KG 探索模組將 KG 豐富度提高了 133.6%,而 KG 過濾模組將可靠性提高了 26.6%。最後,跨模型實驗證實了我們架構的泛化性。 -##### **RT-DEMT: A hybrid real-time acupoint detection model combining mamba and transformer** -2502.11179v1 by Shilong Yang, Qi Zang, Chulong Zhang, Lingfeng Huang, Yaoqin Xie +##### **Reducing Hallucinations in Language 
Model-based SPARQL Query Generation Using Post-Generation Memory Retrieval** +2502.13369v1 by Aditya Sharma, Luis Lara, Amal Zouaq, Christopher J. Pal -Traditional Chinese acupuncture methods often face controversy in clinical -practice due to their high subjectivity. Additionally, current -intelligent-assisted acupuncture systems have two major limitations: slow -acupoint localization speed and low accuracy. To address these limitations, a -new method leverages the excellent inference efficiency of the state-space -model Mamba, while retaining the advantages of the attention mechanism in the -traditional DETR architecture, to achieve efficient global information -integration and provide high-quality feature information for acupoint -localization tasks. Furthermore, by employing the concept of residual -likelihood estimation, it eliminates the need for complex upsampling processes, -thereby accelerating the acupoint localization task. Our method achieved -state-of-the-art (SOTA) accuracy on a private dataset of acupoints on the human -back, with an average Euclidean distance pixel error (EPE) of 7.792 and an -average time consumption of 10.05 milliseconds per localization task. Compared -to the second-best algorithm, our method improved both accuracy and speed by -approximately 14\%. This significant advancement not only enhances the efficacy -of acupuncture treatment but also demonstrates the commercial potential of -automated acupuncture robot systems. Access to our method is available at -https://github.com/Sohyu1/RT-DEMT +The ability to generate SPARQL queries from natural language questions is +crucial for ensuring efficient and accurate retrieval of structured data from +knowledge graphs (KG). 
While large language models (LLMs) have been widely +adopted for SPARQL query generation, they are often susceptible to +hallucinations and out-of-distribution errors when producing KG elements like +Uniform Resource Identifiers (URIs) based on internal parametric knowledge. +This often results in content that appears plausible but is factually +incorrect, posing significant challenges for their use in real-world +information retrieval (IR) applications. This has led to increased research +aimed at detecting and mitigating such errors. In this paper, we introduce PGMR +(Post-Generation Memory Retrieval), a modular framework that incorporates a +non-parametric memory module to retrieve KG elements and enhance LLM-based +SPARQL query generation. Our experimental results indicate that PGMR +consistently delivers strong performance across diverse datasets, data +distributions, and LLMs. Notably, PGMR significantly mitigates URI +hallucinations, nearly eliminating the problem in several scenarios. 
-摘要:傳統的中醫針灸方法由於其高度主觀性,在臨床實務中經常面臨爭議。此外,現有的智慧輔助針灸系統有兩大限制:取穴速度慢以及準確度低。為了解決這些限制,一種新的方法利用了狀態空間模型 Mamba 優異的推理效率,同時保留了傳統 DETR 架構中注意力機制的優點,以實現高效的全局資訊整合,並為取穴任務提供高品質的特徵資訊。此外,透過採用殘差似然估計的概念,它消除了對複雜上採樣程序的需求,從而加速了取穴任務。我們的模型在人體背部穴位私人資料集上達到了最先進 (SOTA) 的準確度,平均歐幾里得距離像素誤差 (EPE) 為 7.792,平均每個取穴任務耗時 10.05 毫秒。與第二好的演算法相比,我們的模型在準確度和速度上都提高了大約 14%。這項重大進展不僅提高了針灸治療的療效,也證明了自動化針灸機器人系統的商業潛力。我們的模型可以在 https://github.com/Sohyu1/RT-DEMT 取得 +摘要:從自然語言問題中產生 SPARQL 查詢的能力對於確保從知識圖譜 (KG) 中有效率且準確地擷取結構化資料至關重要。儘管大型語言模型 (LLM) 已廣泛用於 SPARQL 查詢產生,但它們在根據內部參數化知識產生像統一資源識別碼 (URI) 等 KG 元素時,通常容易出現幻覺和分布外錯誤。這通常會導致內容看似合理,但事實上並不正確,對其在真實世界資訊檢索 (IR) 應用中的使用構成重大挑戰。這導致針對偵測和減輕此類錯誤的研究增加。在本文中,我們介紹 PGMR(後產生記憶體檢索),這是一個模組化架構,它結合了一個非參數記憶體模組來檢索 KG 元素並增強基於 LLM 的 SPARQL 查詢產生。我們的實驗結果表明,PGMR 在不同的資料集、資料分佈和 LLM 中始終提供強大的效能。值得注意的是,PGMR 大幅減輕了 URI 幻覺,在許多情況下幾乎消除了問題。 -##### **Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications** -2502.11108v1 by Alexandru Lecu, Adrian Groza, Lezan Hawizy +##### **Craw4LLM: Efficient Web Crawling for LLM Pretraining** +2502.13347v1 by Shi Yu, Zhiyuan Liu, Chenyan Xiong -Large language models (LLMs) have significantly advanced the field of natural -language generation. However, they frequently generate unverified outputs, -which compromises their reliability in critical applications. In this study, we -propose an innovative framework that combines structured biomedical knowledge -with LLMs through a retrieval-augmented generation technique. Our system -develops a thorough knowledge graph by identifying and refining causal -relationships and named entities from medical abstracts related to age-related -macular degeneration (AMD). Using a vector-based retrieval process and a -locally deployed language model, our framework produces responses that are both -contextually relevant and verifiable, with direct references to clinical -evidence. 
Experimental results show that this method notably decreases -hallucinations, enhances factual precision, and improves the clarity of -generated responses, providing a robust solution for advanced biomedical -chatbot applications. +Web crawl is a main source of large language models' (LLMs) pretraining data, +but the majority of crawled web pages are discarded in pretraining due to low +data quality. This paper presents Crawl4LLM, an efficient web crawling method +that explores the web graph based on the preference of LLM pretraining. +Specifically, it leverages the influence of a webpage in LLM pretraining as the +priority score of the web crawler's scheduler, replacing the standard graph +connectivity based priority. Our experiments on a web graph containing 900 +million webpages from a commercial search engine's index demonstrate the +efficiency of Crawl4LLM in obtaining high-quality pretraining data. With just +21% URLs crawled, LLMs pretrained on Crawl4LLM data reach the same downstream +performances of previous crawls, significantly reducing the crawling waste and +alleviating the burdens on websites. Our code is publicly available at +https://github.com/cxcscmu/Crawl4LLM. 
-摘要:大型語言模型 (LLM) 已大幅推動自然語言生成的領域。然而,它們經常產生未經驗證的輸出,這會損害它們在關鍵應用中的可靠性。在本研究中,我們提出了一個創新的框架,透過檢索增強生成技術,將結構化的生物醫學知識與 LLM 結合。我們的系統透過識別和精煉與年齡相關性黃斑部病變 (AMD) 相關的醫學摘要中的因果關係和命名實體,開發一個徹底的知識圖譜。我們的框架使用基於向量的檢索流程和本地部署的語言模型,產生在脈絡上相關且可驗證的回應,並直接參考臨床證據。實驗結果顯示,此方法顯著減少了幻覺、增強了事實準確性,並改善了生成回應的清晰度,為先進的生物醫學聊天機器人應用程式提供了穩健的解決方案。 +摘要:網路爬蟲是大型語言模型 (LLM) 預訓練資料的主要來源, +但大多數已爬取的網頁在預訓練中會因為資料品質低落而被捨棄。 +本文提出 Crawl4LLM,這是一種有效率的網路爬取方法, +它會根據 LLM 預訓練的偏好來探索網路圖。 +具體來說,它利用網頁在 LLM 預訓練中的影響力作為網路爬蟲排程器的優先分數, +取代標準的圖形連線優先順序。 +我們在一個包含來自商業搜尋引擎索引的 9 億個網頁的網路圖上進行的實驗, +證明了 Crawl4LLM 在取得高品質預訓練資料方面的效率。 +只爬取了 21% 的網址,以 Crawl4LLM 資料預訓練的 LLM 就達到了先前爬取的相同下游效能, +大幅減少了爬取浪費,並減輕了對網站的負擔。 +我們的程式碼已公開於 https://github.com/cxcscmu/Crawl4LLM。 -##### **Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration** -2502.12204v1 by Xianbing Zhao, Yiqing Lyu, Di Wang, Buzhou Tang +##### **K-Paths: Reasoning over Graph Paths for Drug Repurposing and Drug Interaction Prediction** +2502.13344v1 by Tassallah Abdullahi, Ioanna Gemou, Nihal V. Nayak, Ghulam Murtaza, Stephen H. Bach, Carsten Eickhoff, Ritambhara Singh -Automatic depression detection provides cues for early clinical intervention -by clinicians. Clinical interviews for depression detection involve dialogues -centered around multiple themes. Existing studies primarily design end-to-end -neural network models to capture the hierarchical structure of clinical -interview dialogues. However, these methods exhibit defects in modeling the -thematic content of clinical interviews: 1) they fail to capture intra-theme -and inter-theme correlation explicitly, and 2) they do not allow clinicians to -intervene and focus on themes of interest. To address these issues, this paper -introduces an interactive depression detection framework. This framework -leverages in-context learning techniques to identify themes in clinical -interviews and then models both intra-theme and inter-theme correlation. 
-Additionally, it employs AI-driven feedback to simulate the interests of -clinicians, enabling interactive adjustment of theme importance. PDIMC achieves -absolute improvements of 35\% and 12\% compared to the state-of-the-art on the -depression detection dataset DAIC-WOZ, which demonstrates the effectiveness of -modeling theme correlation and incorporating interactive external feedback. +Drug discovery is a complex and time-intensive process that requires +identifying and validating new therapeutic candidates. Computational approaches +using large-scale biomedical knowledge graphs (KGs) offer a promising solution +to accelerate this process. However, extracting meaningful insights from +large-scale KGs remains challenging due to the complexity of graph traversal. +Existing subgraph-based methods are tailored to graph neural networks (GNNs), +making them incompatible with other models, such as large language models +(LLMs). We introduce K-Paths, a retrieval framework that extracts structured, +diverse, and biologically meaningful paths from KGs. Integrating these paths +enables LLMs and GNNs to effectively predict unobserved drug-drug and +drug-disease interactions. Unlike traditional path-ranking approaches, K-Paths +retrieves and transforms paths into a structured format that LLMs can directly +process, facilitating explainable reasoning. K-Paths employs a diversity-aware +adaptation of Yen's algorithm to retrieve the K shortest loopless paths between +entities in an interaction query, prioritizing biologically relevant and +diverse relationships. Our experiments on benchmark datasets show that K-Paths +improves the zero-shot performance of Llama 8.1B's F1-score by 12.45 points on +drug repurposing and 13.42 points on interaction severity prediction. We also +show that Llama 70B achieves F1-score gains of 6.18 and 8.46 points, +respectively. 
K-Paths also improves the supervised training efficiency of +EmerGNN, a state-of-the-art GNN, by reducing KG size by 90% while maintaining +strong predictive performance. Beyond its scalability and efficiency, K-Paths +uniquely bridges the gap between KGs and LLMs, providing explainable rationales +for predicted interactions. These capabilities show that K-Paths is a valuable +tool for efficient data-driven drug discovery. -摘要:自動憂鬱症偵測提供臨床醫師早期臨床介入的線索。憂鬱症偵測的臨床訪談涉及以多個主題為中心的對話。現有研究主要設計端對端的類神經網路模型來捕捉臨床訪談對話的階層結構。然而,這些方法在建模臨床訪談的主題內容時表現出缺陷:1)它們無法明確捕捉主題內和主題間的關聯性,以及 2)它們不允許臨床醫師介入並專注於感興趣的主題。為了解決這些問題,本文介紹了一個互動式憂鬱症偵測框架。此框架利用情境學習技術來識別臨床訪談中的主題,然後對主題內和主題間的關聯性進行建模。此外,它採用 AI 驅動的回饋來模擬臨床醫師的興趣,實現主題重要性的互動式調整。與 DAIC-WOZ 憂鬱症偵測資料集上的最新技術相比,PDIMC 的絕對改進率分別為 35% 和 12%,這證明了對主題關聯性建模和納入互動式外部回饋的有效性。 +摘要:藥物發現是一個複雜且耗時的過程,需要識別和驗證新的治療候選藥物。使用大型生物醫學知識圖譜 (KG) 的計算方法提供了一個有希望的解決方案來加速這個過程。然而,由於圖形遍歷的複雜性,從大型 KG 中提取有意義的見解仍然具有挑戰性。現有的子圖方法是針對圖神經網路 (GNN) 量身打造的,這使得它們與其他模型(例如大型語言模型 (LLM))不兼容。我們介紹了 K-Paths,這是一個檢索框架,它從 KG 中提取結構化、多樣化且具有生物意義的路徑。整合這些路徑使 LLM 和 GNN 能夠有效預測未觀察到的藥物-藥物和藥物-疾病交互。與傳統的路徑排序方法不同,K-Paths 檢索路徑並將其轉換為 LLM 可以直接處理的結構化格式,從而促進可解釋的推理。K-Paths 採用了 Yen 演算法的多樣性感知適應,以檢索交互查詢中實體之間的 K 個最短無環路徑,優先考慮生物相關且多樣化的關係。我們在基準資料集上的實驗表明,K-Paths 將 Llama 8.1B 的 F1 分數在藥物再利用上提高了 12.45 分,在交互嚴重性預測上提高了 13.42 分。我們還表明,Llama 70B 分別獲得了 6.18 分和 8.46 分的 F1 分數增益。K-Paths 還提高了最先進的 GNN EmerGNN 的監督訓練效率,同時將 KG 大小減少了 90%,同時保持強大的預測性能。除了其可擴展性和效率之外,K-Paths 獨特地彌合了 KG 和 LLM 之間的差距,為預測的交互提供了可解釋的依據。這些功能表明,K-Paths 是用於高效資料驅動藥物發現的寶貴工具。 + +##### **Grounding LLM Reasoning with Knowledge Graphs** +2502.13247v1 by Alfonso Amayuelas, Joy Sain, Simerjot Kaur, Charese Smiley + +Knowledge Graphs (KGs) are valuable tools for representing relationships +between entities in a structured format. Traditionally, these knowledge bases +are queried to extract specific information. However, question-answering (QA) +over such KGs poses a challenge due to the intrinsic complexity of natural +language compared to the structured format and the size of these graphs. 
+Despite these challenges, the structured nature of KGs can provide a solid +foundation for grounding the outputs of Large Language Models (LLMs), offering +organizations increased reliability and control. + Recent advancements in LLMs have introduced reasoning methods at inference +time to improve their performance and maximize their capabilities. In this +work, we propose integrating these reasoning strategies with KGs to anchor +every step or "thought" of the reasoning chains in KG data. Specifically, we +evaluate both agentic and automated search methods across several reasoning +strategies, including Chain-of-Thought (CoT), Tree-of-Thought (ToT), and +Graph-of-Thought (GoT), using GRBench, a benchmark dataset for graph reasoning +with domain-specific graphs. Our experiments demonstrate that this approach +consistently outperforms baseline models, highlighting the benefits of +grounding LLM reasoning processes in structured KG data. + +摘要:知識圖譜 (KG) 是以結構化格式表示實體之間關係的寶貴工具。傳統上,這些知識庫會被查詢以萃取特定資訊。然而,由於自然語言與結構化格式之間的內在複雜性,以及這些圖譜的規模,在這些 KG 上進行問答 (QA) 會構成挑戰。儘管有這些挑戰,KG 的結構化特性可以為大型語言模型 (LLM) 的輸出提供穩固的基礎,為組織提供更高的可靠性和控制力。 +LLM 的最新進展在推論時間引入了推理方法,以提升其效能並最大化其能力。在這項工作中,我們建議將這些推理策略與 KG 整合,以將推理鏈的每一步或「思考」錨定在 KG 資料中。具體來說,我們在多種推理策略中評估代理和自動化搜尋方法,包括思考鏈 (CoT)、思考樹 (ToT) 和思考圖 (GoT),使用 GRBench,這是一個針對圖形推理的基準資料集,其中包含特定領域的圖形。我們的實驗證明,這種方法始終優於基準模型,突顯了將 LLM 推理過程建立在結構化 KG 資料中的好處。 -##### **CL-MFAP: A Contrastive Learning-Based Multimodal Foundation Model for Molecular Property Prediction and Antibiotic Screening** -2502.11001v1 by Gen Zhou, Sugitha Janarthanan, Yutong Lu, Pingzhao Hu +##### **Learning to Defer for Causal Discovery with Imperfect Experts** +2502.13132v1 by Oscar Clivio, Divyat Mahajan, Perouz Taslakian, Sara Magliacane, Ioannis Mitliagkas, Valentina Zantedeschi, Alexandre Drouin -Due to the rise in antimicrobial resistance, identifying novel compounds with -antibiotic potential is crucial for combatting this global health issue. 
-However, traditional drug development methods are costly and inefficient. -Recognizing the pressing need for more effective solutions, researchers have -turned to machine learning techniques to streamline the prediction and -development of novel antibiotic compounds. While foundation models have shown -promise in antibiotic discovery, current mainstream efforts still fall short of -fully leveraging the potential of multimodal molecular data. Recent studies -suggest that contrastive learning frameworks utilizing multimodal data exhibit -excellent performance in representation learning across various domains. -Building upon this, we introduce CL-MFAP, an unsupervised contrastive learning -(CL)-based multimodal foundation (MF) model specifically tailored for -discovering small molecules with potential antibiotic properties (AP) using -three types of molecular data. This model employs 1.6 million bioactive -molecules with drug-like properties from the ChEMBL dataset to jointly pretrain -three encoders: (1) a transformer-based encoder with rotary position embedding -for processing SMILES strings; (2) another transformer-based encoder, -incorporating a novel bi-level routing attention mechanism to handle molecular -graph representations; and (3) a Morgan fingerprint encoder using a multilayer -perceptron, to achieve the contrastive learning purpose. The CL-MFAP -outperforms baseline models in antibiotic property prediction by effectively -utilizing different molecular modalities and demonstrates superior -domain-specific performance when fine-tuned for antibiotic-related property -prediction tasks. +Integrating expert knowledge, e.g. from large language models, into causal +discovery algorithms can be challenging when the knowledge is not guaranteed to +be correct. Expert recommendations may contradict data-driven results, and +their reliability can vary significantly depending on the domain or specific +query. 
Existing methods based on soft constraints or inconsistencies in +predicted causal relationships fail to account for these variations in +expertise. To remedy this, we propose L2D-CD, a method for gauging the +correctness of expert recommendations and optimally combining them with +data-driven causal discovery results. By adapting learning-to-defer (L2D) +algorithms for pairwise causal discovery (CD), we learn a deferral function +that selects whether to rely on classical causal discovery methods using +numerical data or expert recommendations based on textual meta-data. We +evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its +superior performance compared to both the causal discovery method and the +expert used in isolation. Moreover, our approach identifies domains where the +expert's performance is strong or weak. Finally, we outline a strategy for +generalizing this approach to causal discovery on graphs with more than two +variables, paving the way for further research in this area. 
-摘要:由於抗菌藥物抗性上升,找出具有抗生素潛力的新型化合物對於對抗此項全球性健康議題至關重要。不過,傳統的藥物開發方法成本高昂且效率不彰。研究人員體認到對於更有效解決方案的迫切需求,因此轉向機器學習技術來簡化新型抗生素化合物的預測和開發。儘管基礎模型在抗生素發現方面展現潛力,目前的普遍做法仍未充分利用多模態分子資料的潛力。最近的研究顯示,利用多模態資料的對比學習架構在各種領域的表徵學習中展現出優異的效能。有鑑於此,我們引進 CL-MFAP,一種無監督對比學習 (CL) 為基礎的多模態基礎 (MF) 模型,專門用於使用三種類型的分子資料發現具有潛在抗生素特性的低分子。此模型採用 ChEMBL 資料集中的 160 萬個具有類藥物特性的生物活性分子,以聯合預訓練三個編碼器:(1) 一個具有旋轉位置嵌入的基於Transformer的編碼器,用於處理 SMILES 字串;(2) 另一個基於Transformer的編碼器,結合一種新穎的雙層路由注意機制來處理分子圖表表徵;以及 (3) 一個使用多層感知器的 Morgan 指紋編碼器,以達成對比學習的目的。CL-MFAP 透過有效利用不同的分子模式在抗生素特性預測方面優於基準模型,並且在針對抗生素相關特性預測任務進行微調時展現出優異的特定領域效能。 +摘要:整合專家知識,例如從大型語言模型中整合到因果發現演算法中,當知識無法保證正確時會很有挑戰性。專家建議可能會與資料驅動的結果相矛盾,而且他們的可靠性可能會根據領域或特定查詢而有顯著差異。現有的基於軟約束或預測因果關係中不一致的方法無法說明專業知識中的這些變化。為了補救這一點,我們提出了 L2D-CD,一種用於評估專家建議的正確性並將其與資料驅動的因果發現結果最佳結合的方法。透過調整學習延遲 (L2D) 演算法以進行成對因果發現 (CD),我們學習了一個延遲函數,用於選擇依賴使用數值資料的傳統因果發現方法或基於文字元資料的專家建議。我們在經典的 Tübingen 對資料集上評估 L2D-CD,並證明其與單獨使用的因果發現方法和專家相比具有優越的效能。此外,我們的做法識別出專家表現強或弱的領域。最後,我們概述了一種將此方法推廣到具有兩個以上變數的圖表上進行因果發現的策略,為此領域的進一步研究鋪平了道路。 -##### **Automatic Quality Assessment of First Trimester Crown-Rump-Length Ultrasound Images** -2502.10908v1 by Sevim Cengiz, Ibraheem Hamdi, Mohammad Yaqub +##### **Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks** +2502.13025v1 by Markus J. Buehler -Fetal gestational age (GA) is vital clinical information that is estimated -during pregnancy in order to assess fetal growth. This is usually performed by -measuring the crown-rump-length (CRL) on an ultrasound image in the Dating scan -which is then correlated with fetal age and growth trajectory. A major issue -when performing the CRL measurement is ensuring that the image is acquired at -the correct view, otherwise it could be misleading. Although clinical -guidelines specify the criteria for the correct CRL view, sonographers may not -regularly adhere to such rules.
In this paper, we propose a new deep -learning-based solution that is able to verify the adherence of a CRL image to -clinical guidelines in order to assess image quality and facilitate accurate -estimation of GA. We first segment out important fetal structures then use the -localized structures to perform a clinically-guided mapping that verifies the -adherence of criteria. The segmentation method combines the benefits of -Convolutional Neural Network (CNN) and the Vision Transformer (ViT) to segment -fetal structures in ultrasound images and localize important fetal landmarks. -For segmentation purposes, we compare our proposed work with UNet and show that -our CNN/ViT-based method outperforms an optimized version of UNet. Furthermore, -we compare the output of the mapping with classification CNNs when assessing -the clinical criteria and the overall acceptability of CRL images. We show that -the proposed mapping is not only explainable but also more accurate than the -best performing classification CNNs. +We present an agentic, autonomous graph expansion framework that iteratively +structures and refines knowledge in situ. Unlike conventional knowledge graph +construction methods relying on static extraction or single-pass learning, our +approach couples a reasoning-native large language model with a continually +updated graph representation. At each step, the system actively generates new +concepts and relationships, merges them into a global graph, and formulates +subsequent prompts based on its evolving structure. Through this +feedback-driven loop, the model organizes information into a scale-free network +characterized by hub formation, stable modularity, and bridging nodes that link +disparate knowledge clusters. Over hundreds of iterations, new nodes and edges +continue to appear without saturating, while centrality measures and shortest +path distributions evolve to yield increasingly distributed connectivity. 
Our +analysis reveals emergent patterns, such as the rise of highly connected 'hub' +concepts and the shifting influence of 'bridge' nodes, indicating that agentic, +self-reinforcing graph construction can yield open-ended, coherent knowledge +structures. Applied to materials design problems, we present compositional +reasoning experiments by extracting node-specific and synergy-level principles +to foster genuinely novel knowledge synthesis, yielding cross-domain ideas that +transcend rote summarization and strengthen the framework's potential for +open-ended scientific discovery. We discuss other applications in scientific +discovery and outline future directions for enhancing scalability and +interpretability. -摘要:胎兒妊娠年齡 (GA) 是重要的臨床資訊,會在懷孕期間估計,以評估胎兒生長。這通常是透過在約會掃描中測量超音波影像中的頭臀長度 (CRL) 來執行,然後與胎兒年齡和生長軌跡相關聯。執行 CRL 測量時的一個主要問題是確保影像是在正確的視角下取得,否則可能會產生誤導。儘管臨床指南規定了正確 CRL 視角的標準,但超音波檢查員可能不會定期遵守這些規則。在本文中,我們提出了一個新的深度學習解決方案,能夠驗證 CRL 影像是否符合臨床指南,以評估影像品質並促進對 GA 的準確估計。我們首先分割出重要的胎兒結構,然後使用局部結構來執行臨床指導的對應,以驗證標準的遵守情況。分割方法結合了卷積神經網路 (CNN) 和視覺轉換器 (ViT) 的優點,以分割超音波影像中的胎兒結構並定位重要的胎兒標誌。為了分割目的,我們將我們提出的工作與 UNet 進行比較,並顯示我們基於 CNN/ViT 的方法優於 UNet 的最佳化版本。此外,我們在評估臨床標準和 CRL 影像的整體可接受性時,將對應的輸出與分類 CNN 進行比較。我們表明,所提出的對應不僅可以解釋,而且比效能最佳的分類 CNN 更準確。 +摘要:我們提出一個能動的、自主的圖形擴展框架,它反覆地建構和精煉原位知識。與依賴靜態提取或單次學習的傳統知識圖形建構方法不同,我們的做法將一個推理原生的大語言模型與一個持續更新的圖形表示結合起來。在每一步中,系統主動產生新的概念和關係,將它們合併到一個全域圖形中,並根據其不斷演化的結構制定後續提示。透過這個回饋驅動的迴圈,模型將資訊組織成一個無標度網路,其特徵是樞紐形成、穩定的模組化以及連結不同知識群集的橋接節點。在數百次反覆運算中,新的節點和邊緣會持續出現,而不會飽和,同時中心性測量和最短路徑分佈會演化為產生越來越分散的連通性。我們的分析揭示了新興模式,例如高度連接的「樞紐」概念的興起和「橋樑」節點影響力的轉移,這表明能動的、自我強化的圖形建構可以產生開放式、連貫的知識結構。應用於材料設計問題,我們提出組合推理實驗,透過提取特定於節點的原則和協同效應層級原則,以促進真正新穎的知識綜合,產生超越死背式摘要並強化框架在開放式科學發現中潛力的跨領域想法。我們討論了在科學發現中的其他應用,並概述了增強可擴充性和可解釋性的未來方向。 -##### **Breaking Down the Hierarchy: A New Approach to Leukemia Classification** -2502.10899v1 by Ibraheem Hamdi, Hosam El-Gendy, Ahmed Sharshar, Mohamed Saeed, Muhammad Ridzuan, Shahrukh K. 
Hashmi, Naveed Syed, Imran Mirza, Shakir Hussain, Amira Mahmoud Abdalla, Mohammad Yaqub +##### **Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge** +2502.13010v1 by Mohammad Reza Rezaei, Reza Saadati Fard, Jayson Parker, Rahul G. Krishnan, Milad Lankarany -The complexities inherent to leukemia, multifaceted cancer affecting white -blood cells, pose considerable diagnostic and treatment challenges, primarily -due to reliance on laborious morphological analyses and expert judgment that -are susceptible to errors. Addressing these challenges, this study presents a -refined, comprehensive strategy leveraging advanced deep-learning techniques -for the classification of leukemia subtypes. We commence by developing a -hierarchical label taxonomy, paving the way for differentiating between various -subtypes of leukemia. The research further introduces a novel hierarchical -approach inspired by clinical procedures capable of accurately classifying -diverse types of leukemia alongside reactive and healthy cells. An integral -part of this study involves a meticulous examination of the performance of -Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) as -classifiers. The proposed method exhibits an impressive success rate, achieving -approximately 90\% accuracy across all leukemia subtypes, as substantiated by -our experimental results. A visual representation of the experimental findings -is provided to enhance the model's explainability and aid in understanding the -classification process. +Large Language Models (LLMs) have significantly advanced medical +question-answering by leveraging extensive clinical data and medical +literature. However, the rapid evolution of medical knowledge and the +labor-intensive process of manually updating domain-specific resources pose +challenges to the reliability of these systems. 
To address this, we introduce +Adaptive Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates +the construction and continuous updating of medical knowledge graphs, +integrates reasoning, and retrieves current external evidence, such as PubMed +and WikiSearch. By dynamically linking new findings and complex medical +concepts, AMG-RAG not only improves accuracy but also enhances interpretability +in medical queries. + Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness +of AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of +66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to +100 times larger. Notably, these improvements are achieved without increasing +computational overhead, highlighting the critical role of automated knowledge +graph generation and external evidence retrieval in delivering up-to-date, +trustworthy medical insights. -摘要:白血病的复杂性源于它是一种影响白血球的多面性癌症,主要由于依赖费力的形态分析和容易出错的专家判断,因此带来了相当大的诊断和治疗挑战。为了应对这些挑战,本研究提出了一种精细且全面的策略,利用先进的深度学习技术对白血病亚型进行分类。我们首先开发了一个分层的标签分类法,为区分白血病的各种亚型铺平了道路。该研究进一步引入了一种新颖的分层方法,该方法受临床程序的启发,能够准确地对各种类型的白血病以及反应性和健康细胞进行分类。本研究的一个组成部分涉及对卷积神经网络 (CNN) 和视觉变压器 (ViT) 作为分类器的性能进行细致检查。所提出的方法展示了令人印象深刻的成功率,在所有白血病亚型中实现了大约 90% 的准确率,我们的实验结果证实了这一点。提供了实验结果的可视化表示,以增强模型的可解释性并帮助理解分类过程。 +摘要:大型語言模型 (LLM) 透過利用廣泛的臨床資料和醫學文獻,大幅提升了醫療問題解答的進步。然而,醫療知識的快速演進和手動更新特定領域資源的繁複程序,對這些系統的可靠性構成挑戰。為了解決這個問題,我們引入了適應性醫療圖表 RAG (AMG-RAG),這是一個自動化建構和持續更新醫療知識圖表的綜合架構,整合推理並擷取 PubMed 和 WikiSearch 等最新的外部證據。透過動態連結新的發現和複雜的醫療概念,AMG-RAG 不僅提升了準確性,也增強了醫療查詢的可解釋性。在 MEDQA 和 MEDMCQA 基準上的評量證明了 AMG-RAG 的有效性,在 MEDQA 上達到了 74.1% 的 F1 分數,在 MEDMCQA 上達到了 66.34% 的準確度,優於其他同類模型以及那些大 10 到 100 倍的模型。值得注意的是,這些改進是在不增加運算負擔的情況下實現的,突顯了自動化知識圖表生成和外部證據擷取在提供最新、可信賴的醫療見解中扮演的重要角色。 -##### **An Empirical Analysis of Uncertainty in Large Language Model Evaluations** -2502.10709v1 by Qiujie Xie, Qingqiu Li, Zhuohao Yu, Yuejie Zhang, Yue Zhang, Linyi Yang +##### **R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge 
Graphs** +2502.12767v1 by Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi -As LLM-as-a-Judge emerges as a new paradigm for assessing large language -models (LLMs), concerns have been raised regarding the alignment, bias, and -stability of LLM evaluators. While substantial work has focused on alignment -and bias, little research has concentrated on the stability of LLM evaluators. -In this paper, we conduct extensive experiments involving 9 widely used LLM -evaluators across 2 different evaluation settings to investigate the -uncertainty in model-based LLM evaluations. We pinpoint that LLM evaluators -exhibit varying uncertainty based on model families and sizes. With careful -comparative analyses, we find that employing special prompting strategies, -whether during inference or post-training, can alleviate evaluation uncertainty -to some extent. By utilizing uncertainty to enhance LLM's reliability and -detection capability in Out-Of-Distribution (OOD) data, we further fine-tune an -uncertainty-aware LLM evaluator named ConfiLM using a human-annotated -fine-tuning set and assess ConfiLM's OOD evaluation ability on a manually -designed test set sourced from the 2024 Olympics. Experimental results -demonstrate that incorporating uncertainty as additional information during the -fine-tuning phase can largely improve the model's evaluation performance in OOD -scenarios. The code and data are released at: -https://github.com/hasakiXie123/LLM-Evaluator-Uncertainty. +Recent studies have combined Large Language Models (LLMs) with Knowledge +Graphs (KGs) to enhance reasoning, improving inference accuracy without +additional training while mitigating hallucination. However, existing +frameworks are often rigid, struggling to adapt to KG or task changes. They +also rely heavily on powerful LLMs for reliable (i.e., trustworthy) reasoning. 
+To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that +separates reasoning into two roles: an Operator (a low-capacity LLM) that +gathers evidence and a Supervisor (a high-capacity LLM) that makes final +judgments. This design is cost-efficient for LLM inference while still +maintaining strong reasoning accuracy. Additionally, R2-KG employs an +Abstention mechanism, generating answers only when sufficient evidence is +collected from KG, which significantly enhances reliability. Experiments across +multiple KG-based reasoning tasks show that R2-KG consistently outperforms +baselines in both accuracy and reliability, regardless of the inherent +capability of LLMs used as the Operator. Further experiments reveal that the +single-agent version of R2-KG, equipped with a strict self-consistency +strategy, achieves significantly higher-than-baseline reliability while +reducing inference cost. However, it also leads to a higher abstention rate in +complex KGs. Our findings establish R2-KG as a flexible and cost-effective +solution for KG-based reasoning. It reduces reliance on high-capacity LLMs +while ensuring trustworthy inference. 
-摘要:隨著 LLM 作為法官的新典範出現,用於評估大型語言模型 (LLM) 的 LLM 評估器在對齊、偏差和穩定性方面引發了關注。儘管大量工作集中在對齊和偏差上,但很少有研究集中在 LLM 評估器的穩定性上。在本文中,我們進行了廣泛的實驗,涉及 9 個廣泛使用的 LLM 評估器,跨越 2 個不同的評估設定,以調查基於模型的 LLM 評估中的不確定性。我們精確指出 LLM 評估器根據模型系列和大小表現出不同的不確定性。通過仔細的比較分析,我們發現採用特殊的提示策略(無論是在推理過程中還是訓練後)可以在一定程度上緩解評估不確定性。通過利用不確定性來增強 LLM 在 Out-Of-Distribution (OOD) 數據中的可靠性和檢測能力,我們進一步微調了一個名為 ConfiLM 的不確定性感知 LLM 評估器,使用人工註釋的微調設置,並評估 ConfiLM 在手動設計的、來自 2024 年奧運會的測試集上的 OOD 評估能力。實驗結果表明,在微調階段將不確定性作為附加信息納入其中可以在很大程度上提高模型在 OOD 場景中的評估性能。代碼和數據發布於: -https://github.com/hasakiXie123/LLM-Evaluator-Uncertainty。 +摘要:最近的研究结合了大型语言模型 (LLM) 与知识图谱 (KG) 以增强推理,在不额外训练的情况下提高推理准确性,同时减轻幻觉。然而,现有的框架通常很僵化,难以适应知识图谱或任务的变化。它们还严重依赖强大的 LLM 来进行可靠(即值得信赖)的推理。为了解决这个问题,我们引入了 R2-KG,这是一个即插即用、双代理框架,它将推理分为两个角色:一个收集证据的操作员(低容量 LLM)和一个做出最终判断的监督员(高容量 LLM)。这种设计在 LLM 推理方面具有成本效益,同时仍保持强大的推理准确性。此外,R2-KG 采用弃权机制,仅在从知识图谱收集到足够证据时才生成答案,这显著提高了可靠性。跨多个基于知识图谱的推理任务的实验表明,R2-KG 在准确性和可靠性方面始终优于基线,而与用作操作员的 LLM 的固有能力无关。进一步的实验表明,R2-KG 的单代理版本配备了严格的自一致性策略,实现了明显高于基线的可靠性,同时降低了推理成本。然而,它也导致了复杂知识图谱中更高的弃权率。我们的发现将 R2-KG 确立为一种灵活且经济高效的基于知识图谱的推理解决方案。它减少了对高容量 LLM 的依赖,同时确保了可信的推理。 -##### **Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model** -2502.10707v1 by Jiarui Jin, Haoyu Wang, Hongyan Li, Jun Li, Jiahui Pan, Shenda Hong +##### **Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research** +2502.12669v1 by Xiang Liu, Penglei Sun, Shuyan Chen, Longhan Zhang, Peijie Dong, Huajie You, Yongqi Zhang, Chang Yan, Xiaowen Chu, Tong-yi Zhang -Electrocardiogram (ECG) is essential for the clinical diagnosis of -arrhythmias and other heart diseases, but deep learning methods based on ECG -often face limitations due to the need for high-quality annotations. 
Although -previous ECG self-supervised learning (eSSL) methods have made significant -progress in representation learning from unannotated ECG data, they typically -treat ECG signals as ordinary time-series data, segmenting the signals using -fixed-size and fixed-step time windows, which often ignore the form and rhythm -characteristics and latent semantic relationships in ECG signals. In this work, -we introduce a novel perspective on ECG signals, treating heartbeats as words -and rhythms as sentences. Based on this perspective, we first designed the -QRS-Tokenizer, which generates semantically meaningful ECG sentences from the -raw ECG signals. Building on these, we then propose HeartLang, a novel -self-supervised learning framework for ECG language processing, learning -general representations at form and rhythm levels. Additionally, we construct -the largest heartbeat-based ECG vocabulary to date, which will further advance -the development of ECG language processing. We evaluated HeartLang across six -public ECG datasets, where it demonstrated robust competitiveness against other -eSSL methods. Our data and code are publicly available at -https://github.com/PKUDigitalHealth/HeartLang. +The rapid advancement of perovskite solar cells (PSCs) has led to an +exponential growth in research publications, creating an urgent need for +efficient knowledge management and reasoning systems in this domain. We present +a comprehensive knowledge-enhanced system for PSCs that integrates three key +components. First, we develop Perovskite-KG, a domain-specific knowledge graph +constructed from 1,517 research papers, containing 23,789 entities and 22,272 +relationships. Second, we create two complementary datasets: Perovskite-Chat, +comprising 55,101 high-quality question-answer pairs generated through a novel +multi-agent framework, and Perovskite-Reasoning, containing 2,217 carefully +curated materials science problems. 
Third, we introduce two specialized large +language models: Perovskite-Chat-LLM for domain-specific knowledge assistance +and Perovskite-Reasoning-LLM for scientific reasoning tasks. Experimental +results demonstrate that our system significantly outperforms existing models +in both domain-specific knowledge retrieval and scientific reasoning tasks, +providing researchers with effective tools for literature review, experimental +design, and complex problem-solving in PSC research. -摘要:心電圖 (ECG) 對於心律不整和其他心臟疾病的臨床診斷至關重要,但基於心電圖的深度學習方法通常會因需要高品質註解而面臨限制。儘管先前的 ECG 自我監督學習 (eSSL) 方法在從未註解的 ECG 資料中學習表徵方面取得顯著進展,但它們通常將 ECG 訊號視為普通的時間序列資料,使用固定大小和固定步長的時窗對訊號進行分段,這通常會忽略 ECG 訊號中的形式和節律特徵以及潛在的語義關係。在這項工作中,我們對 ECG 訊號引入了新的觀點,將心跳視為單字,將節律視為句子。基於此觀點,我們首先設計了 QRS-Tokenizer,它從原始 ECG 訊號中產生語義有意義的 ECG 句子。在此基礎上,我們提出了 HeartLang,一種用於 ECG 語言處理的新型自我監督學習框架,在形式和節律層面上學習一般表徵。此外,我們構建了迄今為止最大的基於心跳的 ECG 詞彙表,這將進一步促進 ECG 語言處理的發展。我們在六個公開的 ECG 資料集上評估了 HeartLang,它展示了與其他 eSSL 方法相比的強大競爭力。我們的資料和程式碼可在 https://github.com/PKUDigitalHealth/HeartLang 公開取得。 +摘要:由於 perovskite 太陽能電池 (PSC) 快速進展,導致研究出版物呈指數成長,迫切需要在這領域建立有效的知識管理和推理系統。我們提出一個結合三項關鍵元件的 PSC 全面知識增強系統。首先,我們開發出 Perovskite-KG,一個由 1,517 篇研究論文建構而成、包含 23,789 個實體和 22,272 個關係的領域特定知識圖譜。其次,我們建立兩個互補的資料集:Perovskite-Chat,包含透過一個新穎的多代理架構產生 55,101 個高品質問答配對;以及 Perovskite-Reasoning,包含 2,217 個仔細策展的材料科學問題。第三,我們推出兩個專門化大型語言模型:針對領域特定知識協助的 Perovskite-Chat-LLM,以及針對科學推理任務的 Perovskite-Reasoning-LLM。實驗結果顯示,我們的系統在領域特定知識擷取和科學推理任務上都明顯優於現有模型,為研究人員提供有效的工具,用於 PSC 研究中的文獻回顧、實驗設計和複雜問題解決。 -##### **Self-Explaining Hypergraph Neural Networks for Diagnosis Prediction** -2502.10689v1 by Leisheng Yu, Yanxiao Cai, Minxing Zhang, Xia Hu +##### **G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation** +2502.12586v1 by Yuhan Li, Xinni Zhang, Linhao Luo, Heng Chang, Yuxiang Ren, Irwin King, Jia Li -The burgeoning volume of electronic health records (EHRs) has enabled deep -learning models to excel in predictive healthcare. 
However, for high-stakes -applications such as diagnosis prediction, model interpretability remains -paramount. Existing deep learning diagnosis prediction models with intrinsic -interpretability often assign attention weights to every past diagnosis or -hospital visit, providing explanations lacking flexibility and succinctness. In -this paper, we introduce SHy, a self-explaining hypergraph neural network -model, designed to offer personalized, concise and faithful explanations that -allow for interventions from clinical experts. By modeling each patient as a -unique hypergraph and employing a message-passing mechanism, SHy captures -higher-order disease interactions and extracts distinct temporal phenotypes as -personalized explanations. It also addresses the incompleteness of the EHR data -by accounting for essential false negatives in the original diagnosis record. A -qualitative case study and extensive quantitative evaluations on two real-world -EHR datasets demonstrate the superior predictive performance and -interpretability of SHy over existing state-of-the-art models. +Explainable recommendation has demonstrated significant advantages in +informing users about the logic behind recommendations, thereby increasing +system transparency, effectiveness, and trustworthiness. To provide +personalized and interpretable explanations, existing works often combine the +generation capabilities of large language models (LLMs) with collaborative +filtering (CF) information. CF information extracted from the user-item +interaction graph captures the user behaviors and preferences, which is crucial +for providing informative explanations. However, due to the complexity of graph +structure, effectively extracting the CF information from graphs still remains +a challenge. 
Moreover, existing methods often struggle with the integration of +extracted CF information with LLMs due to its implicit representation and the +modality gap between graph structures and natural language explanations. To +address these challenges, we propose G-Refer, a framework using graph +retrieval-augmented large language models (LLMs) for explainable +recommendation. Specifically, we first employ a hybrid graph retrieval +mechanism to retrieve explicit CF signals from both structural and semantic +perspectives. The retrieved CF information is explicitly formulated as +human-understandable text by the proposed graph translation and accounts for +the explanations generated by LLMs. To bridge the modality gap, we introduce +knowledge pruning and retrieval-augmented fine-tuning to enhance the ability of +LLMs to process and utilize the retrieved CF information to generate +explanations. Extensive experiments show that G-Refer achieves superior +performance compared with existing methods in both explainability and +stability. Codes and data are available at https://github.com/Yuhan1i/G-Refer. 
-摘要:隨著電子健康紀錄 (EHR) 數量的激增,深度學習模型在預測保健方面表現出色。然而,對於診斷預測等高風險應用,模型的可解釋性仍然至關重要。現有的具有內在可解釋性的深度學習診斷預測模型通常會為每個過去的診斷或醫院就診分配注意力權重,提供的解釋缺乏靈活性且簡潔性。在本文中,我們介紹了 SHy,這是一個自解釋的超圖神經網路模型,旨在提供個性化、簡潔且忠實的解釋,讓臨床專家可以進行干預。通過將每個患者建模為一個獨特的超圖並採用訊息傳遞機制,SHy 捕捉到了高階疾病交互作用,並提取出不同的時間表型作為個性化解釋。它還通過考慮原始診斷記錄中的基本假陰性來解決電子健康紀錄資料的不完整性。對兩個真實世界電子健康紀錄資料集進行的定性案例研究和廣泛的定量評估表明,SHy 在預測效能和可解釋性方面優於現有的最先進模型。 +摘要:可解釋建議已證明在告知使用者建議背後的邏輯方面具有顯著優點,從而提高系統透明度、有效性和可信度。為了提供個人化且可解釋的說明,現有作品通常結合大型語言模型 (LLM) 的生成能力與協同過濾 (CF) 資訊。從使用者項目互動圖形中提取的 CF 資訊會擷取使用者行為和偏好,這對於提供資訊性說明至關重要。然而,由於圖形結構的複雜性,從圖形中有效提取 CF 資訊仍然是一個挑戰。此外,現有方法通常難以將提取的 CF 資訊與 LLM 整合,因為其隱含表示和圖形結構與自然語言說明之間的模式差距。為了應對這些挑戰,我們提出 G-Refer,一個使用圖形檢索增強型大型語言模型 (LLM) 的可解釋建議架構。具體來說,我們首先採用混合圖形檢索機制,從結構和語義角度檢索明確的 CF 訊號。檢索到的 CF 資訊由建議的圖形翻譯明確表述為人類可以理解的文字,並說明 LLM 生成的解釋。為了彌合模式差距,我們引入了知識修剪和檢索增強微調,以增強 LLM 處理和利用檢索到的 CF 資訊以產生解釋的能力。廣泛的實驗表明,與現有方法相比,G-Refer 在可解釋性和穩定性方面都取得了卓越的效能。程式碼和資料可在 https://github.com/Yuhan1i/G-Refer 取得。 -##### **ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis** -2502.10620v1 by Xueshen Li, Xinlong Hou, Ziyi Huang, Yu Gan +##### **A-MEM: Agentic Memory for LLM Agents** +2502.12110v1 by Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang -Recent advancements in large language models (LLMs) have demonstrated -extraordinary comprehension capabilities with remarkable breakthroughs on -various vision-language tasks. However, the application of LLMs in generating -reliable medical diagnostic reports remains in the early stages. Currently, -medical LLMs typically feature a passive interaction model where doctors -respond to patient queries with little or no involvement in analyzing medical -images. In contrast, some ChatBots simply respond to predefined queries based -on visual inputs, lacking interactive dialogue or consideration of medical -history. As such, there is a gap between LLM-generated patient-ChatBot -interactions and those occurring in actual patient-doctor consultations. 
To -bridge this gap, we develop an LLM-based dialogue system, namely proactive -multi-round vision-language interactions for computer-aided diagnosis -(ProMRVL-CAD), to generate patient-friendly disease diagnostic reports. The -proposed ProMRVL-CAD system allows proactive dialogue to provide patients with -constant and reliable medical access via an integration of knowledge graph into -a recommendation system. Specifically, we devise two generators: a Proactive -Question Generator (Pro-Q Gen) to generate proactive questions that guide the -diagnostic procedure and a Multi-Vision Patient-Text Diagnostic Report -Generator (MVP-DR Gen) to produce high-quality diagnostic reports. Evaluating -two real-world publicly available datasets, MIMIC-CXR and IU-Xray, our model -has better quality in generating medical reports. We further demonstrate the -performance of ProMRVL achieves robust under the scenarios with low image -quality. Moreover, we have created a synthetic medical dialogue dataset that -simulates proactive diagnostic interactions between patients and doctors, -serving as a valuable resource for training LLM. +While large language model (LLM) agents can effectively use external tools +for complex real-world tasks, they require memory systems to leverage +historical experiences. Current memory systems enable basic storage and +retrieval but lack sophisticated memory organization, despite recent attempts +to incorporate graph databases. Moreover, these systems' fixed operations and +structures limit their adaptability across diverse tasks. To address this +limitation, this paper proposes a novel agentic memory system for LLM agents +that can dynamically organize memories in an agentic way. Following the basic +principles of the Zettelkasten method, we designed our memory system to create +interconnected knowledge networks through dynamic indexing and linking. 
When a +new memory is added, we generate a comprehensive note containing multiple +structured attributes, including contextual descriptions, keywords, and tags. +The system then analyzes historical memories to identify relevant connections, +establishing links where meaningful similarities exist. Additionally, this +process enables memory evolution - as new memories are integrated, they can +trigger updates to the contextual representations and attributes of existing +historical memories, allowing the memory network to continuously refine its +understanding. Our approach combines the structured organization principles of +Zettelkasten with the flexibility of agent-driven decision making, allowing for +more adaptive and context-aware memory management. Empirical experiments on six +foundation models show superior improvement against existing SOTA baselines. +The source code is available at https://github.com/WujiangXu/AgenticMemory. -摘要:大型語言模型 (LLM) 最近的進展已展現出非凡的理解能力,在各種視覺語言任務中取得了顯著的突破。然而,LLM 在產生可靠的醫療診斷報告中的應用仍處於早期階段。目前,醫療 LLM 通常採用被動互動模式,醫生對患者的疑問做出回應,但很少或根本不參與分析醫療影像。相比之下,有些聊天機器人僅根據視覺輸入回應預先定義的查詢,缺乏互動對話或對病史的考量。因此,LLM 產生的患者聊天機器人互動與實際患者醫生諮詢之間存在差距。為了彌合這一差距,我們開發了一個基於 LLM 的對話系統,即主動多輪視覺語言互動,用於電腦輔助診斷 (ProMRVL-CAD),以產生對患者友善的疾病診斷報告。建議的 ProMRVL-CAD 系統允許主動對話,透過將知識圖譜整合到推薦系統中,為患者提供持續且可靠的醫療管道。具體來說,我們設計了兩個產生器:主動問題產生器 (Pro-Q Gen),用於產生引導診斷程序的主動問題,以及多視覺患者文字診斷報告產生器 (MVP-DR Gen),用於產生高品質的診斷報告。評估兩個真實世界公開可用的資料集,MIMIC-CXR 和 IU-Xray,我們的模型在產生醫療報告方面品質較佳。我們進一步證明 ProMRVL 的效能,在影像品質低的情況下仍能穩健運行。此外,我們建立了一個模擬患者和醫生之間主動診斷互動的合成醫療對話資料集,作為訓練 LLM 的寶貴資源。 +摘要:大型語言模型 (LLM) 代理雖然能有效地使用外部工具來執行複雜的真實世界任務,但它們需要記憶體系統來利用歷史經驗。目前的記憶體系統能進行基本的儲存和檢索,但缺乏精密的記憶體組織,儘管最近嘗試納入圖形資料庫。此外,這些系統固定的運作和結構限制了它們在不同任務中的適應性。為了解決這個限制,本文提出了一種新的代理記憶體系統,供 LLM 代理動態地以代理的方式組織記憶體。遵循 Zettelkasten 方法的基本原則,我們設計我們的記憶體系統,透過動態索引和連結來建立相互連結的知識網路。當加入新的記憶體時,我們會產生包含多個結構化屬性的綜合筆記,包括脈絡描述、關鍵字和標籤。然後,系統會分析歷史記憶體以找出相關連結,在有意義的相似性時建立連結。此外,這個程序能讓記憶體演化,因為當整合新的記憶體時,它們會觸發對現有歷史記憶體的脈絡表示和屬性的更新,讓記憶體網路能持續精進它的理解。我們的做法結合了 Zettelkasten 
的結構化組織原則和代理驅動決策制定的靈活性,能進行更具適應性和脈絡感知的記憶體管理。在六個基礎模型上的經驗實驗顯示出比現有的 SOTA 基準線有顯著的進步。原始碼可以在 https://github.com/WujiangXu/AgenticMemory 找到。 -##### **Optimizing CNN Architectures for Advanced Thoracic Disease Classification** -2502.10614v1 by Tejas Mirthipati +##### **KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs** +2502.12029v1 by Qi Zhao, Hongyu Yang, Qi Song, Xinwei Yao, Xiangyang Li -Machine learning, particularly convolutional neural networks (CNNs), has -shown promise in medical image analysis, especially for thoracic disease -detection using chest X-ray images. In this study, we evaluate various CNN -architectures, including binary classification, multi-label classification, and -ResNet50 models, to address challenges like dataset imbalance, variations in -image quality, and hidden biases. We introduce advanced preprocessing -techniques such as principal component analysis (PCA) for image compression and -propose a novel class-weighted loss function to mitigate imbalance issues. Our -results highlight the potential of CNNs in medical imaging but emphasize that -issues like unbalanced datasets and variations in image acquisition methods -must be addressed for optimal model performance. +Large language models (LLMs) have demonstrated remarkable capabilities in +various complex tasks, yet they still suffer from hallucinations. Introducing +external knowledge, such as knowledge graph, can enhance the LLMs' ability to +provide factual answers. LLMs have the ability to interactively explore +knowledge graphs. However, most approaches have been affected by insufficient +internal knowledge excavation in LLMs, limited generation of trustworthy +knowledge reasoning paths, and a vague integration between internal and +external knowledge. Therefore, we propose KnowPath, a knowledge-enhanced large +model framework driven by the collaboration of internal and external knowledge. 
+It relies on the internal knowledge of the LLM to guide the exploration of +interpretable directed subgraphs in external knowledge graphs, better +integrating the two knowledge sources for more accurate reasoning. Extensive +experiments on multiple real-world datasets confirm the superiority of +KnowPath. -摘要:機器學習,特別是卷積神經網路 (CNN) 已在醫學影像分析中展現出潛力,特別是使用胸部 X 光影像進行胸腔疾病偵測。在此研究中,我們評估各種 CNN 架構,包括二元分類、多標籤分類和 ResNet50 模型,以解決資料集不平衡、影像品質差異和隱藏偏差等挑戰。我們導入進階前處理技術,例如主成分分析 (PCA) 以進行影像壓縮,並提出一個新穎的類別加權損失函數來緩解不平衡問題。我們的結果突顯了 CNN 在醫學影像中的潛力,但強調必須解決資料集不平衡和影像擷取方法差異等問題,才能獲得最佳模型效能。 +摘要:大型語言模型 (LLM) 已在各種複雜任務中展現出卓越的能力,但仍會出現幻覺。引入外部知識(例如知識圖譜)可以增強 LLM 提供事實答案的能力。LLM 有能力互動式地探索知識圖譜。然而,大多數方法都受到 LLM 中內部知識挖掘不足、可信賴知識推理路徑生成受限,以及內部和外部知識之間的整合模糊的影響。因此,我們提出 KnowPath,這是一個由內部和外部知識的協作驅動的知識增強型大型模型框架。它依賴於 LLM 的內部知識來指導對外部知識圖譜中可解釋的有向子圖的探索,更好地整合兩個知識來源以進行更準確的推理。對多個真實世界資料集進行的大量實驗證實了 KnowPath 的優越性。 -##### **PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation** -2502.10536v1 by Faruk Ahmed, Lin Yang, Tiam Jaroensri, Andrew Sellergren, Yossi Matias, Avinatan Hassidim, Greg S. Corrado, Dale R. Webster, Shravya Shetty, Shruthi Prabhakara, Yun Liu, Daniel Golden, Ellery Wulczyn, David F. Steiner +##### **Atom of Thoughts for Markov LLM Test-Time Scaling** +2502.12018v1 by Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, Yuyu Luo -The interpretation of histopathology cases underlies many important -diagnostic and treatment decisions in medicine. Notably, this process typically -requires pathologists to integrate and summarize findings across multiple -slides per case. Existing vision-language capabilities in computational -pathology have so far been largely limited to small regions of interest, larger -regions at low magnification, or single whole-slide images (WSIs). This limits -interpretation of findings that span multiple high-magnification regions across -multiple WSIs. 
By making use of Gemini 1.5 Flash, a large multimodal model -(LMM) with a 1-million token context window, we demonstrate the ability to -generate bottom-line diagnoses from up to 40,000 768x768 pixel image patches -from multiple WSIs at 10X magnification. This is the equivalent of up to 11 -hours of video at 1 fps. Expert pathologist evaluations demonstrate that the -generated report text is clinically accurate and equivalent to or preferred -over the original reporting for 68% (95% CI: [60%, 76%]) of multi-slide -examples with up to 5 slides. While performance decreased for examples with 6 -or more slides, this study demonstrates the promise of leveraging the -long-context capabilities of modern LMMs for the uniquely challenging task of -medical report generation where each case can contain thousands of image -patches. +Large Language Models (LLMs) achieve superior performance through +training-time scaling, and test-time scaling further enhances their +capabilities by conducting effective reasoning during inference. However, as +the scale of reasoning increases, existing test-time scaling methods suffer +from accumulated historical information, which not only wastes computational +resources but also interferes with effective reasoning. To address this issue, +we observe that complex reasoning progress is often achieved by solving a +sequence of independent subquestions, each being self-contained and verifiable. +These subquestions are essentially atomic questions, relying primarily on their +current state rather than accumulated history, similar to the memoryless +transitions in a Markov process. Based on this observation, we propose Atom of +Thoughts (AoT), where each state transition in the reasoning process consists +of decomposing the current question into a dependency-based directed acyclic +graph and contracting its subquestions, forming a new atomic question state. 
+This iterative decomposition-contraction process continues until reaching +directly solvable atomic questions, naturally realizing Markov transitions +between question states. Furthermore, these atomic questions can be seamlessly +integrated into existing test-time scaling methods, enabling AoT to serve as a +plug-in enhancement for improving reasoning capabilities. Experiments across +six benchmarks demonstrate the effectiveness of AoT both as a standalone +framework and a plug-in enhancement. Notably, on HotpotQA, when applied to +gpt-4o-mini, AoT achieves an 80.6% F1 score, surpassing o3-mini by 3.4% and +DeepSeek-R1 by 10.6%. The code will be available at +https://github.com/qixucen/atom. -摘要:組織病理學病例的解讀是許多重要的醫學診斷和治療決策的基礎。值得注意的是,這個過程通常需要病理學家整合和總結每個病例的許多玻片中的發現。迄今為止,計算機病理學中現有的視覺語言功能在很大程度上僅限於小範圍的感興趣區域、低倍率下的較大區域或單一的全玻片影像 (WSI)。這限制了跨多個 WSI 中多個高倍率區域的發現的解讀。通過使用 Gemini 1.5 Flash,一個具有 100 萬個令牌上下文視窗的大型多模態模型 (LMM),我們展示了從多個 WSI 中多達 40,000 個 768x768 像素圖像貼片(10 倍放大)生成底線診斷的能力。這相當於 1 fps 下長達 11 小時的影片。專家病理學家評估表明,生成的報告文字在臨床上是準確的,並且等同於或優於 68%(95% CI:[60%,76%])的多玻片範例(最多 5 個玻片)的原始報告。儘管對於有 6 個或更多玻片的範例,其性能下降,但這項研究證明了利用現代 LMM 的長上下文功能來應對獨特挑戰性的醫療報告生成任務,其中每個病例可能包含數千個影像貼片,這項任務的前景。 +摘要:大型語言模型 (LLM) 透過訓練時間擴充來達成卓越的效能,而測試時間擴充透過在推論期間進行有效的推理,進一步提升其能力。然而,隨著推理規模的擴大,現有的測試時間擴充方法會受到累積的歷史資訊影響,這不僅會浪費運算資源,還會干擾有效的推理。為了解決這個問題,我們觀察到複雜的推理進程通常是透過解決一系列獨立的子問題來達成,每個子問題都是獨立且可驗證的。這些子問題本質上是原子問題,主要依賴於它們的當前狀態,而不是累積的歷史,類似於馬可夫過程中的無記憶轉換。基於這個觀察,我們提出了思想原子 (AoT),其中推理過程中每個狀態轉換都包含將當前問題分解為基於依賴關係的有向無環圖,並收縮其子問題,形成新的原子問題狀態。這個反覆的分解收縮過程會持續進行,直到達到可直接解決的原子問題,自然地實現問題狀態之間的馬可夫轉換。此外,這些原子問題可以無縫整合到現有的測試時間擴充方法中,讓 AoT 可以作為外掛程式強化功能,以改善推理能力。橫跨六個基準的實驗證明了 AoT 作為獨立架構和外掛程式強化的有效性。值得注意的是,在 HotpotQA 上,當應用於 gpt-4o-mini 時,AoT 達到了 80.6% 的 F1 分數,比 o3-mini 高出 3.4%,比 DeepSeek-R1 高出 10.6%。程式碼將在 https://github.com/qixucen/atom 上提供。 -##### **Tempo: Helping Data Scientists and Domain Experts Collaboratively Specify Predictive Modeling Tasks** -2502.10526v2 by Venkatesh Sivaraman, Anika Vaishampayan, Xiaotong Li, Brian R Buck, Ziyong Ma, Richard D Boyce, Adam 
Perer +##### **Generating Text from Uniform Meaning Representation** +2502.11973v1 by Emma Markle, Reihaneh Iranmanesh, Shira Wein -Temporal predictive models have the potential to improve decisions in health -care, public services, and other domains, yet they often fail to effectively -support decision-makers. Prior literature shows that many misalignments between -model behavior and decision-makers' expectations stem from issues of model -specification, namely how, when, and for whom predictions are made. However, -model specifications for predictive tasks are highly technical and difficult -for non-data-scientist stakeholders to interpret and critique. To address this -challenge we developed Tempo, an interactive system that helps data scientists -and domain experts collaboratively iterate on model specifications. Using -Tempo's simple yet precise temporal query language, data scientists can quickly -prototype specifications with greater transparency about pre-processing -choices. Moreover, domain experts can assess performance within data subgroups -to validate that models behave as expected. Through three case studies, we -demonstrate how Tempo helps multidisciplinary teams quickly prune infeasible -specifications and identify more promising directions to explore. +Uniform Meaning Representation (UMR) is a recently developed graph-based +semantic representation, which expands on Abstract Meaning Representation (AMR) +in a number of ways, in particular through the inclusion of document-level +information and multilingual flexibility. In order to effectively adopt and +leverage UMR for downstream tasks, efforts must be placed toward developing a +UMR technological ecosystem. 
Though still limited amounts of UMR annotations +have been produced to date, in this work, we investigate the first approaches +to producing text from multilingual UMR graphs: (1) a pipeline conversion of +UMR to AMR, then using AMR-to-text generation models, (2) fine-tuning large +language models with UMR data, and (3) fine-tuning existing AMR-to-text +generation models with UMR data. Our best performing model achieves a +multilingual BERTscore of 0.825 for English and 0.882 for Chinese when compared +to the reference, which is a promising indication of the effectiveness of +fine-tuning approaches for UMR-to-text generation with even limited amounts of +UMR data. -摘要:時序預測模型有潛力改善醫療保健、公共服務和其他領域的決策,但它們經常無法有效支援決策者。先前的文獻顯示,模型行為與決策者期望之間的許多不一致源自於模型規範問題,也就是如何、何時以及針對誰進行預測。然而,預測任務的模型規範非常技術化,非數據科學家利害關係人難以解讀和批評。為了應對此挑戰,我們開發了 Tempo,一個互動式系統,可協助數據科學家和領域專家協同反覆運算模型規範。透過使用 Tempo 簡單但精確的時序查詢語言,數據科學家可以快速建構規範原型,並更透明地了解前處理的選擇。此外,領域專家可以評估資料子群組內的效能,以驗證模型是否如預期般運作。透過三個案例研究,我們展示 Tempo 如何協助跨領域團隊快速刪減不可行的規範,並找出更有希望探索的方向。 +摘要:統一語意表示 (UMR) 是一種最近開發的基於圖形的語意表示,它在許多方面擴展了抽象語意表示 (AMR),特別是透過納入文件層級資訊和多語言靈活性。為了有效採用和利用下游任務的 UMR,必須投入精力開發 UMR 技術生態系統。雖然到目前為止產生的 UMR 標註數量仍然有限,但在這項工作中,我們探討了從多語言 UMR 圖形產生文字的第一種方法:(1) 將 UMR 轉換為 AMR 的管道,然後使用 AMR 轉文字生成模型,(2) 使用 UMR 資料微調大型語言模型,以及 (3) 使用 UMR 資料微調現有的 AMR 轉文字生成模型。與參考相比,我們效能最好的模型在英文中達到 0.825 的多語言 BERT 分數,在中文中達到 0.882,這表示使用 UMR 資料進行 UMR 轉文字生成的微調方法具有良好的效果,即使 UMR 資料數量有限。 -##### **A Robust Attack: Displacement Backdoor Attack** -2502.10490v1 by Yong Li, Han Gao +##### **GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs** +2502.11925v1 by Yi Fang, Bowen Jin, Jiacheng Shen, Sirui Ding, Qiaoyu Tan, Jiawei Han -As artificial intelligence becomes more prevalent in our lives, people are -enjoying the convenience it brings, but they are also facing hidden threats, -such as data poisoning and adversarial attacks. 
These threats can have -disastrous consequences for the application of artificial intelligence, -especially for some applications that take effect immediately, such as -autonomous driving and medical fields. Among these threats, backdoor attacks -have left a deep impression on people with their concealment and simple -deployment, making them a threat that cannot be ignored, however, in the -process of deploying the backdoor model, the backdoor attack often has some -reasons that make it unsatisfactory in real-world applications, such as jitter -and brightness changes. Based on this, we propose a highly robust backdoor -attack that shifts the target sample and combines it with itself to form a -backdoor sample, the Displacement Backdoor Attack(DBA). Experimental results -show that the DBA attack can resist data augmentation that simulates real-world -differences, such as rotation and cropping. +The rapid development of Multimodal Large Language Models (MLLMs) has enabled +the integration of multiple modalities, including texts and images, within the +large language model (LLM) framework. However, texts and images are usually +interconnected, forming a multimodal attributed graph (MMAG). It is +underexplored how MLLMs can incorporate the relational information +(\textit{i.e.}, graph structure) and semantic information (\textit{i.e.,} texts +and images) on such graphs for multimodal comprehension and generation. In this +paper, we propose GraphGPT-o, which supports omni-multimodal understanding and +creation on MMAGs. We first comprehensively study linearization variants to +transform semantic and structural information as input for MLLMs. Then, we +propose a hierarchical aligner that enables deep graph encoding, bridging the +gap between MMAGs and MLLMs. Finally, we explore the inference choices, +adapting MLLM to interleaved text and image generation in graph scenarios. 
+Extensive experiments on three datasets from different domains demonstrate the +effectiveness of our proposed method. Datasets and codes will be open-sourced +upon acceptance. -摘要:随着人工智能在我们的生活中变得越来越普遍,人们正在享受它带来的便利,但也面临着隐藏的威胁,例如数据中毒和对抗性攻击。这些威胁可能对人工智能的应用产生灾难性后果,特别是对于一些立即生效的应用,例如自动驾驶和医疗领域。在这些威胁中,后门攻击以其隐蔽性和简单的部署给人们留下了深刻的印象,使其成为不可忽视的威胁,然而,在部署后门模型的过程中,后门攻击往往存在一些使其在实际应用中不尽如人意的原因,例如抖动和亮度变化。基于此,我们提出了一种高度鲁棒的后门攻击,该攻击对目标样本进行平移并将其与自身结合以形成后门样本,即置换后门攻击 (DBA)。实验结果表明,DBA 攻击可以抵抗模拟真实世界差异的数据增强,例如旋转和裁剪。 +摘要:多模态大语言模型 (MLLM) 的快速发展,促进了文本和图像等多种模态在大型语言模型 (LLM) 框架内的整合。然而,文本和图像通常是相互关联的,形成多模态属性图 (MMAG)。对于 MLLM 如何整合此类图上的关系信息(即图结构)和语义信息(即文本和图像)以进行多模态理解和生成,目前仍未得到充分探索。在本文中,我们提出了 GraphGPT-o,它支持在 MMAG 上进行全方位多模态理解和创建。我们首先全面研究了线性化变体,以将语义和结构信息转换为 MLLM 的输入。然后,我们提出了一个分层对齐器,它支持深度图编码,弥合了 MMAG 和 MLLM 之间的差距。最后,我们探索了推理选择,使 MLLM 适应图场景中交错的文本和图像生成。来自不同领域的三组数据集上的大量实验表明了我们提出的方法的有效性。数据集和代码将在被接受后开源。 -##### **3D ReX: Causal Explanations in 3D Neuroimaging Classification** -2502.12181v1 by Melane Navaratnarajah, Sophie A. Martin, David A. Kelly, Nathan Blake, Hana Chocker +##### **Exploring LLM-based Student Simulation for Metacognitive Cultivation** +2502.11678v1 by Haoxuan Li, Jifan Yu, Xin Cong, Yang Dang, Yisi Zhan, Huiqin Liu, Zhiyuan Liu -Explainability remains a significant problem for AI models in medical -imaging, making it challenging for clinicians to trust AI-driven predictions. -We introduce 3D ReX, the first causality-based post-hoc explainability tool for -3D models. 3D ReX uses the theory of actual causality to generate -responsibility maps which highlight the regions most crucial to the model's -decision. We test 3D ReX on a stroke detection model, providing insight into -the spatial distribution of features relevant to stroke. +Metacognitive education plays a crucial role in cultivating students' +self-regulation and reflective thinking, providing essential support for those +with learning difficulties through academic advising. 
Simulating students with +insufficient learning capabilities using large language models offers a +promising approach to refining pedagogical methods without ethical concerns. +However, existing simulations often fail to authentically represent students' +learning struggles and face challenges in evaluation due to the lack of +reliable metrics and ethical constraints in data collection. To address these +issues, we propose a pipeline for automatically generating and filtering +high-quality simulated student agents. Our approach leverages a two-round +automated scoring system validated by human experts and employs a score +propagation module to obtain more consistent scores across the student graph. +Experimental results demonstrate that our pipeline efficiently identifies +high-quality student agents, and we discuss the traits that influence the +simulation's effectiveness. By simulating students with varying degrees of +learning difficulties, our work paves the way for broader applications in +personalized learning and educational assessment. 
-摘要:解釋性仍然是醫療影像中 AI 模型的一大問題,這使得臨床醫生難以信任 AI 驅動的預測。 -我們引入了 3D ReX,這是第一個用於 3D 模型的基於因果關係的事後解釋性工具。3D ReX 使用實際因果關係理論來生成責任圖,該圖突出了對模型決策至關重要的區域。我們在中風檢測模型上測試了 3D ReX,提供了與中風相關特徵的空間分佈的見解。 +摘要:元認知教育在培養學生的自我調節和反思性思考中發揮著至關重要的作用,通過學術諮詢為有學習困難的人提供必要的支持。使用大型語言模型模擬學習能力不足的學生提供了一種有前途的方法,可以在沒有道德問題的情況下改進教學方法。然而,現有的模擬通常無法真實地反映學生的學習困難,並且由於缺乏可靠的指標和數據收集中的道德約束,在評估中面臨挑戰。為了解決這些問題,我們提出了一個自動生成和過濾高質量模擬學生代理的管道。我們的做法利用了由人類專家驗證的兩輪自動評分系統,並採用分數傳播模組來獲得跨學生圖表更一致的分數。實驗結果表明,我們的管道有效地識別了高質量的學生代理,並且我們討論了影響模擬效果的特質。通過模擬具有不同程度學習困難的學生,我們的研究為個性化學習和教育評估中的更廣泛應用鋪平了道路。 -##### **Analyzing Patient Daily Movement Behavior Dynamics Using Two-Stage Encoding Model** -2502.09947v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott +##### **Ontology-Guided Reverse Thinking Makes Large Language Models Stronger on Knowledge Graph Question Answering** +2502.11491v1 by Runxuan Liu, Bei Luo, Jiaqi Li, Baoxin Wang, Ming Liu, Dayong Wu, Shijin Wang, Bing Qin -In the analysis of remote healthcare monitoring data, time series -representation learning offers substantial value in uncovering deeper patterns -of patient behavior, especially given the fine temporal granularity of the -data. In this study, we focus on a dataset of home activity records from people -living with Dementia. We propose a two-stage self-supervised learning approach. -The first stage involves converting time-series activities into text strings, -which are then encoded by a fine-tuned language model. In the second stage, -these time-series vectors are bi-dimensionalized for applying PageRank method, -to analyze latent state transitions to quantitatively assess participants -behavioral patterns and identify activity biases. These insights, combined with -diagnostic data, aim to support personalized care interventions. +Large language models (LLMs) have shown remarkable capabilities in natural +language processing. However, in knowledge graph question answering tasks +(KGQA), there remains the issue of answering questions that require multi-hop +reasoning. 
Existing methods rely on entity vector matching, but the purpose of +the question is abstract and difficult to match with specific entities. As a +result, it is difficult to establish reasoning paths to the purpose, which +leads to information loss and redundancy. To address this issue, inspired by +human reverse thinking, we propose Ontology-Guided Reverse Thinking (ORT), a +novel framework that constructs reasoning paths from purposes back to +conditions. ORT operates in three key phases: (1) using LLM to extract purpose +labels and condition labels, (2) constructing label reasoning paths based on +the KG ontology, and (3) using the label reasoning paths to guide knowledge +retrieval. Experiments on the WebQSP and CWQ datasets show that ORT achieves +state-of-the-art performance and significantly enhances the capability of LLMs +for KGQA. -摘要:在遠程醫療監控數據分析中,時序表示學習在揭示患者行為的更深層模式方面提供了實質性的價值,特別是考慮到數據的精細時間粒度。在本研究中,我們專注於痴呆症患者居家活動記錄的數據集。我們提出了一種兩階段的自我監督學習方法。第一階段涉及將時序活動轉換為文本串,然後由微調語言模型編碼。在第二階段,這些時序向量被雙維化以應用 PageRank 方法,分析潛在狀態轉換以定量評估參與者的行為模式並識別活動偏差。這些見解與診斷數據相結合,旨在支持個性化護理干預。 +摘要:大型語言模型 (LLM) 在自然語言處理中展現出卓越的能力。然而,在知識圖譜問答任務 (KGQA) 中,仍然存在需要多跳推理才能回答問題的問題。現有方法依賴於實體向量匹配,但問題的目的是抽象的,難以與特定實體匹配。因此,很難建立推理路徑來達成目的,這會導致資訊遺失和冗餘。為了解決這個問題,在人類逆向思維的啟發下,我們提出了基於本体的逆向思維 (ORT),這是一個創新的架構,可以從目的建構推理路徑,再回推到條件。ORT 運作在三個關鍵階段:(1) 使用 LLM 萃取目的標籤和條件標籤,(2) 基於 KG 本体建構標籤推理路徑,以及 (3) 使用標籤推理路徑來引導知識擷取。在 WebQSP 和 CWQ 資料集上的實驗顯示,ORT 達到了最先進的效能,並顯著增強了 LLM 對 KGQA 的能力。 -##### **TransGUNet: Transformer Meets Graph-based Skip Connection for Medical Image Segmentation** -2502.09931v1 by Ju-Hyeon Nam, Nur Suriza Syazwany, Sang-Chul Lee +##### **GLTW: Joint Improved Graph Transformer and LLM via Three-Word Language for Knowledge Graph Completion** +2502.11471v1 by Kangyang Luo, Yuzhuo Bai, Cheng Gao, Shuzheng Si, Yingli Shen, Zhu Liu, Zhitong Wang, Cunliang Kong, Wenhao Li, Yufei Huang, Ye Tian, Xuantang Xiong, Lei Han, Maosong Sun -Skip connection engineering is primarily employed to address the semantic gap -between the 
encoder and decoder, while also integrating global dependencies to -understand the relationships among complex anatomical structures in medical -image segmentation. Although several models have proposed transformer-based -approaches to incorporate global dependencies within skip connections, they -often face limitations in capturing detailed local features with high -computational complexity. In contrast, graph neural networks (GNNs) exploit -graph structures to effectively capture local and global features. Leveraging -these properties, we introduce an attentional cross-scale graph neural network -(ACS-GNN), which enhances the skip connection framework by converting -cross-scale feature maps into a graph structure and capturing complex -anatomical structures through node attention. Additionally, we observed that -deep learning models often produce uninformative feature maps, which degrades -the quality of spatial attention maps. To address this problem, we integrated -entropy-driven feature selection (EFS) with spatial attention, calculating an -entropy score for each channel and filtering out high-entropy feature maps. Our -innovative framework, TransGUNet, comprises ACS-GNN and EFS-based spatial -attentio} to effectively enhance domain generalizability across various -modalities by leveraging GNNs alongside a reliable spatial attention map, -ensuring more robust features within the skip connection. Through comprehensive -experiments and analysis, TransGUNet achieved superior segmentation performance -on six seen and eight unseen datasets, demonstrating significantly higher -efficiency compared to previous methods. +Knowledge Graph Completion (KGC), which aims to infer missing or incomplete +facts, is a crucial task for KGs. However, integrating the vital structural +information of KGs into Large Language Models (LLMs) and outputting predictions +deterministically remains challenging. 
To address this, we propose a new method +called GLTW, which encodes the structural information of KGs and merges it with +LLMs to enhance KGC performance. Specifically, we introduce an improved Graph +Transformer (iGT) that effectively encodes subgraphs with both local and global +structural information and inherits the characteristics of language model, +bypassing training from scratch. Also, we develop a subgraph-based +multi-classification training objective, using all entities within KG as +classification objects, to boost learning efficiency.Importantly, we combine +iGT with an LLM that takes KG language prompts as input.Our extensive +experiments on various KG datasets show that GLTW achieves significant +performance gains compared to SOTA baselines. -摘要:跳躍連接工程主要用於解決編碼器和解碼器之間的語義鴻溝,同時還整合全局依賴關係以了解醫學影像分割中複雜解剖結構之間的關係。儘管有幾個模型提出了基於Transformer的架構來整合跳躍連接中的全局依賴關係,但它們在以高計算複雜度擷取詳細的局部特徵時常常面臨限制。相比之下,圖神經網路 (GNN) 利用圖結構有效擷取局部和全局特徵。利用這些屬性,我們引入了注意力跨尺度圖神經網路 (ACS-GNN),它通過將跨尺度特徵圖轉換為圖結構並通過節點注意力擷取複雜的解剖結構來增強跳躍連接框架。此外,我們觀察到深度學習模型通常會產生無意義的特徵圖,這會降低空間注意力圖的品質。為了解決這個問題,我們將熵驅動特徵選擇 (EFS) 與空間注意力整合在一起,為每個通道計算熵分數並濾出高熵特徵圖。我們創新的框架 TransGUNet 包含 ACS-GNN 和基於 EFS 的空間注意力,通過利用 GNN 以及可靠的空間注意力圖有效增強跨各種模態的域泛化能力,確保跳躍連接中更強大的特徵。透過全面的實驗和分析,TransGUNet 在六個已見和八個未見的資料集上實現了優異的分割效能,證明與先前的方法相比,效率顯著提高。 +摘要:知識圖譜補全 (KGC) 旨在推論遺失或不完整的 +事實,是 KGs 的一項關鍵任務。然而,將 KGs 的重要結構 +資訊整合至大型語言模型 (LLM),並確定性地輸出預測結果,仍然是一項挑戰。為了解決這個問題,我們提出了一種新的方法,稱為 GLTW,它編碼了 KGs 的結構資訊,並將其與 LLM 合併,以增強 KGC 的效能。具體來說,我們引進了一個改良的圖形轉換器 (iGT),它能有效地編碼具有局部和全域結構資訊的子圖,並繼承語言模型的特徵,繞過從頭開始的訓練。此外,我們開發了一個基於子圖的多分類訓練目標,使用 KG 中的所有實體作為 +分類物件,以提升學習效率。重要的是,我們將 iGT 與一個將 KG 語言提示作為輸入的 LLM 結合起來。我們在各種 KG 資料集上進行的廣泛實驗顯示,與 SOTA 基準線相比,GLTW 獲得了顯著的效能提升。 -##### **Video2Policy: Scaling up Manipulation Tasks in Simulation through Internet Videos** -2502.09886v1 by Weirui Ye, Fangchen Liu, Zheng Ding, Yang Gao, Oleh Rybkin, Pieter Abbeel +##### **Large Language-Geometry Model: When LLM meets Equivariance** +2502.11149v2 by Zongzhao Li, Jiacheng Cen, Bing Su, Wenbing Huang, Tingyang Xu, Yu 
Rong, Deli Zhao -Simulation offers a promising approach for cheaply scaling training data for -generalist policies. To scalably generate data from diverse and realistic -tasks, existing algorithms either rely on large language models (LLMs) that may -hallucinate tasks not interesting for robotics; or digital twins, which require -careful real-to-sim alignment and are hard to scale. To address these -challenges, we introduce Video2Policy, a novel framework that leverages -internet RGB videos to reconstruct tasks based on everyday human behavior. Our -approach comprises two phases: (1) task generation in simulation from videos; -and (2) reinforcement learning utilizing in-context LLM-generated reward -functions iteratively. We demonstrate the efficacy of Video2Policy by -reconstructing over 100 videos from the Something-Something-v2 (SSv2) dataset, -which depicts diverse and complex human behaviors on 9 different tasks. Our -method can successfully train RL policies on such tasks, including complex and -challenging tasks such as throwing. Finally, we show that the generated -simulation data can be scaled up for training a general policy, and it can be -transferred back to the real robot in a Real2Sim2Real way. +Accurately predicting 3D structures and dynamics of physical systems is +crucial in scientific applications. Existing approaches that rely on geometric +Graph Neural Networks (GNNs) effectively enforce $\mathrm{E}(3)$-equivariance, +but they often fall in leveraging extensive broader information. While direct +application of Large Language Models (LLMs) can incorporate external knowledge, +they lack the capability for spatial reasoning with guaranteed equivariance. In +this paper, we propose EquiLLM, a novel framework for representing 3D physical +systems that seamlessly integrates E(3)-equivariance with LLM capabilities. +Specifically, EquiLLM comprises four key components: geometry-aware prompting, +an equivariant encoder, an LLM, and an equivariant adaptor. 
Essentially, the +LLM guided by the instructive prompt serves as a sophisticated invariant +feature processor, while 3D directional information is exclusively handled by +the equivariant encoder and adaptor modules. Experimental results demonstrate +that EquiLLM delivers significant improvements over previous methods across +molecular dynamics simulation, human motion simulation, and antibody design, +highlighting its promising generalizability. -摘要:模擬提供了一種有前途的方法,可以用於擴展訓練資料,以制定通才政策。為了從多樣化且逼真的任務中可擴充地產生資料,現有演算法仰賴大型語言模型 (LLM),這些模型可能會產生對機器人技術不感興趣的任務;或者仰賴數位雙胞胎,這需要仔細地將真實環境與模擬環境對齊,而且很難擴充。為了應對這些挑戰,我們引入了 Video2Policy,這是一個新穎的架構,它利用網路上的 RGB 影片,根據日常人類行為來重建任務。我們的做法包含兩個階段:(1) 從影片中在模擬環境中產生任務;以及 (2) 利用在情境中由 LLM 產生的獎勵函數,反覆進行強化學習。我們透過重建 Something-Something-v2 (SSv2) 資料集中的 100 多個影片來展示 Video2Policy 的效能,這些影片描繪了 9 項不同任務中多樣化且複雜的人類行為。我們的做法可以在這些任務上成功訓練 RL 政策,包括複雜且具挑戰性的任務,例如投擲。最後,我們展示了產生的模擬資料可以擴充到訓練一般政策,而且可以透過 Real2Sim2Real 的方式轉移回真實機器人。 +摘要:準確預測物理系統的 3D 結構和動力學在科學應用中至關重要。現有依賴於幾何圖神經網路 (GNN) 的方法有效地強制執行了 $\mathrm{E}(3)$-等變性,但它們通常無法利用廣泛的更廣泛資訊。儘管大型語言模型 (LLM) 的直接應用可以納入外部知識,但它們缺乏保證等變性的空間推理能力。在本文中,我們提出了 EquiLLM,一個用於表示 3D 物理系統的新框架,它將 E(3)-等變性與 LLM 能力無縫整合。具體來說,EquiLLM 包含四個關鍵組成部分:感知幾何的提示、等變編碼器、LLM 和等變適配器。從本質上講,由指導性提示引導的 LLM 作為一個複雜的不變特徵處理器,而 3D 方向資訊則由等變編碼器和適配器模組獨家處理。實驗結果表明,EquiLLM 在分子動力學模擬、人類運動模擬和抗體設計方面比以前的方法有了顯著的改進,突顯了其有希望的泛化能力。 -##### **HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation** -2502.09838v2 by Tianwei Lin, Wenqiao Zhang, Sijing Li, Yuqian Yuan, Binhe Yu, Haoyuan Li, Wanggui He, Hao Jiang, Mengze Li, Xiaohui Song, Siliang Tang, Jun Xiao, Hui Lin, Yueting Zhuang, Beng Chin Ooi +##### **Beyond Pairwise: Global Zero-shot Temporal Graph Generation** +2502.11114v1 by Alon Eirew, Kfir Bar, Ido Dagan -We present HealthGPT, a powerful Medical Large Vision-Language Model -(Med-LVLM) that integrates medical visual comprehension and generation -capabilities within a unified autoregressive paradigm. 
Our bootstrapping -philosophy is to progressively adapt heterogeneous comprehension and generation -knowledge to pre-trained large language models (LLMs). This is achieved through -a novel heterogeneous low-rank adaptation (H-LoRA) technique, which is -complemented by a tailored hierarchical visual perception approach and a -three-stage learning strategy. To effectively learn the HealthGPT, we devise a -comprehensive medical domain-specific comprehension and generation dataset -called VL-Health. Experimental results demonstrate exceptional performance and -scalability of HealthGPT in medical visual unified tasks. Our project can be -accessed at https://github.com/DCDmllm/HealthGPT. +Temporal relation extraction (TRE) is a fundamental task in natural language +processing (NLP) that involves identifying the temporal relationships between +events in a document. Despite the advances in large language models (LLMs), +their application to TRE remains limited. Most existing approaches rely on +pairwise classification, in which event pairs are considered individually, +leading to computational inefficiency and a lack of global consistency in the +resulting temporal graph. In this work, we propose a novel zero-shot method for +TRE that generates a document's complete temporal graph at once, then applies +transitive constraints optimization to refine predictions and enforce temporal +consistency across relations. Additionally, we introduce OmniTemp, a new +dataset with complete annotations for all pairs of targeted events within a +document. Through experiments and analyses, we demonstrate that our method +significantly outperforms existing zero-shot approaches while achieving +competitive performance with supervised models. 
-摘要:我們提出 HealthGPT,一種強大的醫學大型視覺語言模型 (Med-LVLM),它整合了醫學視覺理解和生成能力於一個統一的自動迴歸範例中。我們的引導哲學是逐步調整異質理解和生成知識以預先訓練大型語言模型 (LLM)。這是通過一種新穎的異質低秩適應 (H-LoRA) 技術實現的,該技術由量身定制的分層視覺感知方法和三階段學習策略補充。為了有效學習 HealthGPT,我們設計了一個全面的醫學領域特定理解和生成數據集,稱為 VL-Health。實驗結果證明了 HealthGPT 在醫學視覺統一任務中的卓越性能和可擴展性。我們的項目可以在 https://github.com/DCDmllm/HealthGPT 中訪問。 +摘要:時間關係抽取 (TRE) 是自然語言處理 (NLP) 中的一項基本任務,涉及識別文件中事件之間的時間關係。儘管大型語言模型 (LLM) 取得進展,但它們在 TRE 中的應用仍然有限。現有的大多數方法依賴於成對分類,其中事件對被單獨考慮,導致計算效率低下且在生成的時序圖中缺乏全局一致性。在這項工作中,我們提出了一種新穎的 TRE 零次學習方法,它可以一次生成文件的完整時序圖,然後應用遞移約束最佳化來優化預測並強制關係之間的時間一致性。此外,我們引入了 OmniTemp,這是一個新的數據集,其中包含文件內所有目標事件對的完整註解。通過實驗和分析,我們證明了我們的方法明顯優於現有的零次學習方法,同時實現了與監督模型相當的性能。 -##### **Incentivize without Bonus: Provably Efficient Model-based Online Multi-agent RL for Markov Games** -2502.09780v1 by Tong Yang, Bo Dai, Lin Xiao, Yuejie Chi +##### **Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications** +2502.11108v1 by Alexandru Lecu, Adrian Groza, Lezan Hawizy -Multi-agent reinforcement learning (MARL) lies at the heart of a plethora of -applications involving the interaction of a group of agents in a shared unknown -environment. A prominent framework for studying MARL is Markov games, with the -goal of finding various notions of equilibria in a sample-efficient manner, -such as the Nash equilibrium (NE) and the coarse correlated equilibrium (CCE). -However, existing sample-efficient approaches either require tailored -uncertainty estimation under function approximation, or careful coordination of -the players. In this paper, we propose a novel model-based algorithm, called -VMG, that incentivizes exploration via biasing the empirical estimate of the -model parameters towards those with a higher collective best-response values of -all the players when fixing the other players' policies, thus encouraging the -policy to deviate from its current equilibrium for more exploration. 
VMG is -oblivious to different forms of function approximation, and permits -simultaneous and uncoupled policy updates of all players. Theoretically, we -also establish that VMG achieves a near-optimal regret for finding both the NEs -of two-player zero-sum Markov games and CCEs of multi-player general-sum Markov -games under linear function approximation in an online environment, which -nearly match their counterparts with sophisticated uncertainty quantification. +Large language models (LLMs) have significantly advanced the field of natural +language generation. However, they frequently generate unverified outputs, +which compromises their reliability in critical applications. In this study, we +propose an innovative framework that combines structured biomedical knowledge +with LLMs through a retrieval-augmented generation technique. Our system +develops a thorough knowledge graph by identifying and refining causal +relationships and named entities from medical abstracts related to age-related +macular degeneration (AMD). Using a vector-based retrieval process and a +locally deployed language model, our framework produces responses that are both +contextually relevant and verifiable, with direct references to clinical +evidence. Experimental results show that this method notably decreases +hallucinations, enhances factual precision, and improves the clarity of +generated responses, providing a robust solution for advanced biomedical +chatbot applications. 
-摘要:多智能體強化學習 (MARL) 是一系列應用程式的心臟,這些應用程式涉及一群智能體在一個共用未知環境中的互動。研究 MARL 的一個著名框架是馬可夫博弈,其目標是用樣本有效率的方式找出各種均衡概念,例如納許均衡 (NE) 和粗相關均衡 (CCE)。然而,現有的樣本有效率方法需要在函數逼近下進行量身打造的不確定性估計,或謹慎協調參與者。在本文中,我們提出了一種新的基於模型的演算法,稱為 VMG,它透過將模型參數的經驗估計值偏向於在固定其他參與者政策時所有參與者的集體最佳反應值,從而激勵探索,進而鼓勵政策偏離其當前均衡以進行更多探索。VMG 不會忽略函數逼近的不同形式,並允許所有參與者同時進行非耦合的政策更新。在理論上,我們也建立了 VMG 在線上環境中使用線性函數逼近來尋找雙人零和馬可夫博弈的 NE 和多人一般和馬可夫博弈的 CCE 時,會獲得接近最佳的後悔,這幾乎與其在不確定性量化方面更為複雜的對應物相匹配。 +摘要:大型語言模型 (LLM) 已大幅推動自然語言生成的領域。然而,它們經常產生未經驗證的輸出,這會損害它們在關鍵應用中的可靠性。在本研究中,我們提出了一個創新的框架,透過檢索增強生成技術,將結構化的生物醫學知識與 LLM 結合。我們的系統透過識別和精煉與年齡相關性黃斑部病變 (AMD) 相關的醫學摘要中的因果關係和命名實體,開發一個徹底的知識圖譜。我們的框架使用基於向量的檢索流程和本地部署的語言模型,產生在脈絡上相關且可驗證的回應,並直接參考臨床證據。實驗結果顯示,此方法顯著減少了幻覺、增強了事實準確性,並改善了生成回應的清晰度,為先進的生物醫學聊天機器人應用程式提供了穩健的解決方案。 -##### **The AI-Therapist Duo: Exploring the Potential of Human-AI Collaboration in Personalized Art Therapy for PICS Intervention** -2502.09757v1 by Bereket A. Yilma, Chan Mi Kim, Geke Ludden, Thomas van Rompay, Luis A. Leiva +##### **Beyond Similarity: A Gradient-based Graph Method for Instruction Tuning Data Selection** +2502.11062v1 by Yang Zhao, Li Du, Xiao Ding, Yangou Ouyang, Hepeng Wang, Kai Xiong, Jinglong Gao, Zhouhao Sun, Dongliang Xu, Yang Qing, Dongchen Li, Bing Qin, Ting Liu -Post-intensive care syndrome (PICS) is a multifaceted condition that arises -from prolonged stays in an intensive care unit (ICU). While preventing PICS -among ICU patients is becoming increasingly important, interventions remain -limited. Building on evidence supporting the effectiveness of art exposure in -addressing the psychological aspects of PICS, we propose a novel art therapy -solution through a collaborative Human-AI approach that enhances personalized -therapeutic interventions using state-of-the-art Visual Art Recommendation -Systems. We developed two Human-in-the-Loop (HITL) personalization methods and -assessed their impact through a large-scale user study (N=150). 
Our findings -demonstrate that this Human-AI collaboration not only enhances the -personalization and effectiveness of art therapy but also supports therapists -by streamlining their workload. While our study centres on PICS intervention, -the results suggest that human-AI collaborative Art therapy could potentially -benefit other areas where emotional support is critical, such as cases of -anxiety and depression. +Large language models (LLMs) have shown great potential across various +industries due to their remarkable ability to generalize through instruction +tuning. However, the limited availability of domain-specific data significantly +hampers their performance on specialized tasks. While existing methods +primarily focus on selecting training data from general datasets that are +similar to the target domain, they often fail to consider the joint +distribution of instructions, resulting in inefficient learning and suboptimal +knowledge transfer. To address these challenges, we introduce G2IS +(Gradient-based Graph Instruction Selection), a novel method that constructs a +mixed gradient-based instruction graph to capture the joint distribution and +interdependencies between instructions. By accounting for the relationships +between instructions, G2IS improves domain adaptation efficiency. Additionally, +we propose a gradient walk algorithm to refine the data selection process, +enhancing both training effectiveness and efficiency. Our experiments +demonstrate that G2IS outperforms traditional methods across various domain +adaptation tasks, yielding significant performance gains, particularly in +complex, data-scarce scenarios. These results underscore the potential of G2IS +in advancing the development of large, domain-specific models. 
-摘要:重症後症候群 (PICS) 是一種多面向的疾病,源自於在加護病房 (ICU) 長期住院。雖然預防重症後症候群在加護病房患者中正變得越來越重要,但介入措施仍然有限。建立在支持藝術接觸在解決重症後症候群心理層面的證據上,我們提出一個創新的藝術療法解決方案,透過協作式的人工智慧方法,使用最先進的視覺藝術推薦系統,增強個人化的治療介入。我們開發了兩種人機迴路 (HITL) 個人化方法,並透過大規模使用者研究 (N=150) 評估其影響。我們的發現證明,這種人機協作不僅增強了藝術治療的個人化和有效性,也透過簡化治療師的工作量來提供支援。雖然我們的研究中心在重症後症候群介入,但結果顯示,人機協作藝術療法有可能對其他需要情緒支持的領域有益,例如焦慮和憂鬱症。 +摘要:大型語言模型 (LLM) 因其透過指令微調而具備的卓越泛化能力,在各產業中展現出極大的潛力。然而,特定領域資料的取得有限,大幅影響其在專業任務上的表現。現有方法主要專注於從與目標領域類似的通用資料集中選取訓練資料,但它們通常未能考量指令的聯合分佈,導致學習效率不彰且知識傳遞不佳。為了應對這些挑戰,我們引進 G2IS(基於梯度的圖形指令選取),這是一種創新的方法,可建構一個混合的基於梯度的指令圖形,以擷取指令之間的聯合分佈和相互依賴性。透過考量指令之間的關係,G2IS 提升了領域適應的效率。此外,我們提出了一種梯度漫步演算法來優化資料選取程序,同時提升訓練效能和效率。我們的實驗證明,G2IS 在各種領域適應任務中優於傳統方法,產生顯著的效能提升,特別是在資料稀少的複雜場景中。這些結果突顯了 G2IS 在推動大型特定領域模型發展方面的潛力。 -##### **A CNN Approach to Automated Detection and Classification of Brain Tumors** -2502.09731v1 by Md. Zahid Hasan, Abdullah Tamim, D. M. Asadujjaman, Md. Mahfujur Rahman, Md. Abu Ahnaf Mollick, Nosin Anjum Dristi, Abdullah-Al-Noman +##### **CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models** +2502.11008v1 by Yuefei Chen, Vivek K. Singh, Jing Ma, Ruxiang Tang -Brain tumors require an assessment to ensure timely diagnosis and effective -patient treatment. Morphological factors such as size, location, texture, and -variable appearance complicate tumor inspection. Medical imaging presents -challenges, including noise and incomplete images. This research article -presents a methodology for processing Magnetic Resonance Imaging (MRI) data, -encompassing techniques for image classification and denoising. The effective -use of MRI images allows medical professionals to detect brain disorders, -including tumors. This research aims to categorize healthy brain tissue and -brain tumors by analyzing the provided MRI data. Unlike alternative methods -like Computed Tomography (CT), MRI technology offers a more detailed -representation of internal anatomical components, making it a suitable option -for studying data related to brain tumors. 
The MRI picture is first subjected -to a denoising technique utilizing an Anisotropic diffusion filter. The dataset -utilized for the models creation is a publicly accessible and validated Brain -Tumour Classification (MRI) database, comprising 3,264 brain MRI scans. SMOTE -was employed for data augmentation and dataset balancing. Convolutional Neural -Networks(CNN) such as ResNet152V2, VGG, ViT, and EfficientNet were employed for -the classification procedure. EfficientNet attained an accuracy of 98%, the -highest recorded. +Counterfactual reasoning is widely recognized as one of the most challenging +and intricate aspects of causality in artificial intelligence. In this paper, +we evaluate the performance of large language models (LLMs) in counterfactual +reasoning. In contrast to previous studies that primarily focus on commonsense +causal reasoning, where LLMs often rely on prior knowledge for inference, we +specifically assess their ability to perform counterfactual inference using a +set of formal rules. To support this evaluation, we introduce a new benchmark +dataset, CounterBench, comprising 1K counterfactual reasoning questions. The +dataset is designed with varying levels of difficulty, diverse causal graph +structures, distinct types of counterfactual questions, and multiple +nonsensical name variants. Our experiments demonstrate that counterfactual +reasoning poses a significant challenge for LLMs, with most models performing +at levels comparable to random guessing. To enhance LLMs' counterfactual +reasoning ability, we propose a novel reasoning paradigm, CoIn, which guides +LLMs through iterative reasoning and backtracking to systematically explore +counterfactual solutions. Experimental results show that our method +significantly improves LLM performance on counterfactual reasoning tasks and +consistently enhances performance across different LLMs. Our dataset is +available at https://huggingface.co/datasets/CounterBench/CounterBench. 
-摘要:腦腫瘤需要評估以確保及時診斷和有效的患者治療。大小、位置、質地和可變外觀等形態因素會使腫瘤檢查複雜化。醫學影像會呈現挑戰,包括雜訊和不完整的影像。本研究文章提出了一種處理磁共振影像 (MRI) 資料的方法,包含影像分類和去噪技術。有效使用 MRI 影像可讓醫護人員偵測腦部疾病,包括腫瘤。本研究旨在透過分析提供的 MRI 資料來分類健康的腦組織和腦瘤。與電腦斷層掃描 (CT) 等替代方法不同,MRI 技術提供了更詳細的內部解剖結構表示,使其成為研究與腦瘤相關資料的合適選擇。MRI 影像會先使用各向異性擴散濾波器進行去噪技術處理。用於建立模型的資料集是一個公開且經過驗證的腦腫瘤分類 (MRI) 資料庫,包含 3,264 個腦部 MRI 掃描。SMOTE 用於資料擴充和資料集平衡。卷積神經網路 (CNN),例如 ResNet152V2、VGG、ViT 和 EfficientNet,用於分類程序。EfficientNet 達到了 98% 的準確度,是記錄到的最高值。 +摘要:反事實推理被廣泛認為是人工智慧中因果關係最具挑戰性和複雜的面向之一。在本文中,我們評估大型語言模型 (LLM) 在反事實推理中的表現。與主要關注常識因果推理,其中 LLM 經常依賴先驗知識來進行推理的先前研究不同,我們特別評估它們使用一組形式規則執行反事實推理的能力。為了支持此評估,我們引入了一個新的基準資料集 CounterBench,其中包含 1K 個反事實推理問題。資料集的設計具有不同的難度等級、多樣化的因果圖結構、不同類型的反事實問題和多種無意義的名稱變體。我們的實驗表明,反事實推理對 LLM 構成重大挑戰,大多數模型的表現與隨機猜測相當。為了增強 LLM 的反事實推理能力,我們提出了一種新穎的推理範例 CoIn,它引導 LLM 透過反覆推理和回溯系統性地探索反事實解。實驗結果表明,我們的方法顯著提升 LLM 在反事實推理任務上的表現,並持續增強不同 LLM 的表現。我們的資料集可在 https://huggingface.co/datasets/CounterBench/CounterBench 取得。 -##### **Evaluating GPT's Capability in Identifying Stages of Cognitive Impairment from Electronic Health Data** -2502.09715v1 by Yu Leng, Yingnan He, Colin Magdamo, Ana-Maria Vranceanu, Christine S. Ritchie, Shibani S. Mukerji, Lidia M. V. R. Moura, John R. Dickson, Deborah Blacker, Sudeshna Das +##### **RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation** +2502.10996v1 by Pengcheng Jiang, Lang Cao, Ruike Zhu, Minhao Jiang, Yunyi Zhang, Jimeng Sun, Jiawei Han -Identifying cognitive impairment within electronic health records (EHRs) is -crucial not only for timely diagnoses but also for facilitating research. -Information about cognitive impairment often exists within unstructured -clinician notes in EHRs, but manual chart reviews are both time-consuming and -error-prone. To address this issue, our study evaluates an automated approach -using zero-shot GPT-4o to determine stage of cognitive impairment in two -different tasks. 
First, we evaluated the ability of GPT-4o to determine the -global Clinical Dementia Rating (CDR) on specialist notes from 769 patients who -visited the memory clinic at Massachusetts General Hospital (MGH), and achieved -a weighted kappa score of 0.83. Second, we assessed GPT-4o's ability to -differentiate between normal cognition, mild cognitive impairment (MCI), and -dementia on all notes in a 3-year window from 860 Medicare patients. GPT-4o -attained a weighted kappa score of 0.91 in comparison to specialist chart -reviews and 0.96 on cases that the clinical adjudicators rated with high -confidence. Our findings demonstrate GPT-4o's potential as a scalable chart -review tool for creating research datasets and assisting diagnosis in clinical -settings in the future. +Retrieval-augmented language models often struggle with knowledge-intensive +tasks due to inefficient retrieval, unstructured knowledge integration, and +single-pass architectures. We present Retrieval-And-Structuring (RAS), a novel +framework that dynamically constructs and reasons over query-specific knowledge +graphs through iterative retrieval and structuring. RAS introduces four key +technical innovations: (1) a theme-scoped retrieval mechanism that efficiently +narrows the search space while maintaining retrieval quality, (2) an action +planning module that determines knowledge needs and generates focused +sub-queries, (3) a dynamic knowledge structuring approach that converts +retrieved text into an evolving knowledge graph, and (4) a graph-augmented +answering component that leverages the accumulated structured information. Our +framework achieves state-of-the-art performance, surpassing leading baselines +by 6.4% with open-source language models and 7.0% with proprietary models on +seven knowledge-intensive generation datasets across all evaluation metrics. +Detailed ablation studies verify the contribution of each technical component +to the overall system performance. 
-摘要:在電子健康記錄 (EHR) 中識別認知障礙不僅對及時診斷至關重要,也有助於促進研究。有關認知障礙的資訊通常存在於 EHR 中非結構化的臨床記錄中,但手動圖表審查既耗時又容易出錯。為了解決這個問題,我們的研究評估了一種自動化方法,使用零次學習的 GPT-4o 來確定兩種不同任務中的認知障礙分期。首先,我們評估了 GPT-4o 確定來自麻薩諸塞州總醫院 (MGH) 記憶診所 769 名患者的專科記錄的全球臨床痴呆評分 (CDR) 的能力,並獲得了 0.83 的加權 kappa 分數。其次,我們評估了 GPT-4o 在 860 名 Medicare 患者 3 年視窗中的所有記錄中區分正常認知、輕度認知障礙 (MCI) 和痴呆的能力。與專科圖表審查相比,GPT-4o 獲得了 0.91 的加權 kappa 分數,而對於臨床評審員以高度信心評估的病例,其加權 kappa 分數為 0.96。我們的研究結果證明了 GPT-4o 作為可擴充圖表審查工具的潛力,可用於建立研究資料集並協助未來臨床環境中的診斷。 +摘要:检索增强语言模型通常会因检索效率低、知识整合无结构和单次通过架构而难以胜任知识密集型任务。我们提出检索和结构化 (RAS),这是一个新颖的框架,通过迭代检索和结构化,动态构建和推理特定于查询的知识图谱。RAS 引入了四项关键技术创新:(1) 主题范围检索机制,在保持检索质量的同时有效缩小搜索空间,(2) 动作规划模块,确定知识需求并生成重点子查询,(3) 动态知识结构化方法,将检索到的文本转换为不断发展的知识图谱,以及 (4) 图谱增强型回答组件,利用累积的结构化信息。我们的框架实现了最先进的性能,在七个知识密集型生成数据集上,使用开源语言模型提高了 6.4%,使用专有模型提高了 7.0%,超越了领先的基线,且所有评估指标均如此。详细的消融研究验证了每个技术组件对整体系统性能的贡献。 -##### **Metamorphic Testing for Pose Estimation Systems** -2502.09460v1 by Matias Duran, Thomas Laurent, Ellen Rushe, Anthony Ventresque +##### **Developing Conversational Speech Systems for Robots to Detect Speech Biomarkers of Cognition in People Living with Dementia** +2502.10896v1 by Rohith Perumandla, Young-Ho Bae, Diego Izaguirre, Esther Hwang, Andrew Murphy, Long-Jing Hsu, Selma Sabanovic, Casey C. Bennett -Pose estimation systems are used in a variety of fields, from sports -analytics to livestock care. Given their potential impact, it is paramount to -systematically test their behaviour and potential for failure. This is a -complex task due to the oracle problem and the high cost of manual labelling -necessary to build ground truth keypoints. This problem is exacerbated by the -fact that different applications require systems to focus on different subjects -(e.g., human versus animal) or landmarks (e.g., only extremities versus whole -body and face), which makes labelled test data rarely reusable. 
To combat these -problems we propose MET-POSE, a metamorphic testing framework for pose -estimation systems that bypasses the need for manual annotation while assessing -the performance of these systems under different circumstances. MET-POSE thus -allows users of pose estimation systems to assess the systems in conditions -that more closely relate to their application without having to label an ad-hoc -test dataset or rely only on available datasets, which may not be adapted to -their application domain. While we define MET-POSE in general terms, we also -present a non-exhaustive list of metamorphic rules that represent common -challenges in computer vision applications, as well as a specific way to -evaluate these rules. We then experimentally show the effectiveness of MET-POSE -by applying it to Mediapipe Holistic, a state of the art human pose estimation -system, with the FLIC and PHOENIX datasets. With these experiments, we outline -numerous ways in which the outputs of MET-POSE can uncover faults in pose -estimation systems at a similar or higher rate than classic testing using hand -labelled data, and show that users can tailor the rule set they use to the -faults and level of accuracy relevant to their application. +This study presents the development and testing of a conversational speech +system designed for robots to detect speech biomarkers indicative of cognitive +impairments in people living with dementia (PLwD). The system integrates a +backend Python WebSocket server and a central core module with a large language +model (LLM) fine-tuned for dementia to process user input and generate robotic +conversation responses in real-time in less than 1.5 seconds. The frontend user +interface, a Progressive Web App (PWA), displays information and biomarker +score graphs on a smartphone in real-time to human users (PLwD, caregivers, +clinicians). 
Six speech biomarkers based on the existing literature - Altered +Grammar, Pragmatic Impairments, Anomia, Disrupted Turn-Taking, Slurred +Pronunciation, and Prosody Changes - were developed for the robot conversation +system using two datasets, one that included conversations of PLwD with a human +clinician (DementiaBank dataset) and one that included conversations of PLwD +with a robot (Indiana dataset). We also created a composite speech biomarker +that combined all six individual biomarkers into a single score. The speech +system's performance was first evaluated on the DementiaBank dataset showing +moderate correlation with MMSE scores, with the composite biomarker score +outperforming individual biomarkers. Analysis of the Indiana dataset revealed +higher and more variable biomarker scores, suggesting potential differences due +to study populations (e.g. severity of dementia) and the conversational +scenario (human-robot conversations are different from human-human). The +findings underscore the need for further research on the impact of +conversational scenarios on speech biomarkers and the potential clinical +applications of robotic speech systems. 
-摘要:姿勢估計系統應用於各種領域,從運動分析到牲畜照護。鑑於其潛在影響,系統性地測試其行為和故障潛力至關重要。由於預言機問題以及建立地面實況關鍵點所需的手動標記成本高,這是一項複雜的任務。這個問題因不同的應用需要系統專注於不同的主體(例如,人類對動物)或地標(例如,只有四肢對全身和臉部)而加劇,這使得標記的測試數據很少可以重複使用。為了解決這些問題,我們提出了 MET-POSE,這是一個姿勢估計系統的變形測試框架,在評估這些系統在不同情況下的性能時,可以繞過手動註解的需要。因此,MET-POSE 允許姿勢估計系統的使用者在更接近其應用程式的條件下評估系統,而無需標記臨時測試數據集或僅依賴可用數據集,這些數據集可能不適合其應用領域。雖然我們以一般術語定義 MET-POSE,但我們也提供了一個非詳盡的變形規則列表,這些規則代表了電腦視覺應用中的常見挑戰,以及評估這些規則的具體方法。然後,我們通過將 MET-POSE 應用於 Mediapipe Holistic(一種先進的人類姿勢估計系統),並使用 FLIC 和 PHOENIX 數據集,以實驗方式展示 MET-POSE 的有效性。通過這些實驗,我們概述了 MET-POSE 的輸出可以揭示姿勢估計系統中故障的許多方法,其速度與使用手動標記數據的傳統測試類似或更高,並表明使用者可以根據其應用程式相關的故障和準確度等級來調整他們使用的規則集。 +摘要:本研究展示了對話式語音系統的開發和測試,該系統專為機器人設計,用於偵測失智症患者(PLwD)認知障礙的語言生物標記。該系統整合了後端 Python WebSocket 伺服器和一個中央核心模組,其中包含針對失智症微調的大語言模型(LLM),以處理使用者輸入並在不到 1.5 秒的時間內產生機器人對話回應。前端使用者介面(漸進式網路應用程式,PWA)會在智慧型手機上即時向人類使用者(PLwD、照護者、臨床醫生)顯示資訊和生物標記評分圖表。根據現有文獻,針對機器人對話系統開發了六個語言生物標記:語法改變、實用障礙、失語症、輪流中斷、發音不清和韻律變化,使用了兩個資料集,一個包含 PLwD 與人類臨床醫生對話(DementiaBank 資料集),另一個包含 PLwD 與機器人對話(Indiana 資料集)。我們還建立了一個複合語言生物標記,將所有六個個別生物標記組合成一個單一評分。語言系統的效能首先在 DementiaBank 資料集上進行評估,顯示與 MMSE 評分有中等相關性,複合生物標記評分優於個別生物標記。對 Indiana 資料集的分析顯示出較高且變異性較大的生物標記評分,這表明由於研究族群(例如失智症的嚴重程度)和對話情境(人機對話與人際對話不同)而產生潛在差異。研究結果強調需要進一步研究對話情境對語言生物標記的影響,以及機器人語言系統的潛在臨床應用。 -##### **Towards Virtual Clinical Trials of Radiology AI with Conditional Generative Modeling** -2502.09688v1 by Benjamin D. Killeen, Bohua Wan, Aditya V. Kulkarni, Nathan Drenkow, Michael Oberst, Paul H. Yi, Mathias Unberath +##### **Evaluating improvements on using Large Language Models (LLMs) for property extraction in the Open Research Knowledge Graph (ORKG)** +2502.10768v1 by Sandra Schaftner -Artificial intelligence (AI) is poised to transform healthcare by enabling -personalized and efficient care through data-driven insights. 
Although -radiology is at the forefront of AI adoption, in practice, the potential of AI -models is often overshadowed by severe failures to generalize: AI models can -have performance degradation of up to 20% when transitioning from controlled -test environments to clinical use by radiologists. This mismatch raises -concerns that radiologists will be misled by incorrect AI predictions in -practice and/or grow to distrust AI, rendering these promising technologies -practically ineffectual. Exhaustive clinical trials of AI models on abundant -and diverse data is thus critical to anticipate AI model degradation when -encountering varied data samples. Achieving these goals, however, is -challenging due to the high costs of collecting diverse data samples and -corresponding annotations. To overcome these limitations, we introduce a novel -conditional generative AI model designed for virtual clinical trials (VCTs) of -radiology AI, capable of realistically synthesizing full-body CT images of -patients with specified attributes. By learning the joint distribution of -images and anatomical structures, our model enables precise replication of -real-world patient populations with unprecedented detail at this scale. We -demonstrate meaningful evaluation of radiology AI models through VCTs powered -by our synthetic CT study populations, revealing model degradation and -facilitating algorithmic auditing for bias-inducing data attributes. Our -generative AI approach to VCTs is a promising avenue towards a scalable -solution to assess model robustness, mitigate biases, and safeguard patient -care by enabling simpler testing and evaluation of AI models in any desired -range of diverse patient populations. +Current research highlights the great potential of Large Language Models +(LLMs) for constructing Scholarly Knowledge Graphs (SKGs). 
One particularly +complex step in this process is relation extraction, aimed at identifying +suitable properties to describe the content of research. This study builds +directly on previous research of three Open Research Knowledge Graph (ORKG) +team members who assessed the readiness of LLMs such as GPT-3.5, Llama 2, and +Mistral for property extraction in scientific literature. Given the moderate +performance observed, the previous work concluded that fine-tuning is needed to +improve these models' alignment with scientific tasks and their emulation of +human expertise. Expanding on this prior experiment, this study evaluates the +impact of advanced prompt engineering techniques and demonstrates that these +techniques can highly significantly enhance the results. Additionally, this +study extends the property extraction process to include property matching to +existing ORKG properties, which are retrieved via the API. The evaluation +reveals that results generated through advanced prompt engineering achieve a +higher proportion of matches with ORKG properties, further emphasizing the +enhanced alignment achieved. Moreover, this lays the groundwork for addressing +challenges such as the inconsistency of ORKG properties, an issue highlighted +in prior studies. By assigning unique URIs and using standardized terminology, +this work increases the consistency of the properties, fulfilling a crucial +aspect of Linked Data and FAIR principles - core commitments of ORKG. This, in +turn, significantly enhances the applicability of ORKG content for subsequent +tasks such as comparisons of research publications. Finally, the study +concludes with recommendations for future improvements in the overall property +extraction process. 
-摘要:人工智慧 (AI) 準備透過資料驅動的見解,轉型醫療保健,並提供個人化且有效率的照護。儘管放射科處於 AI 採用的最前線,但在實務上,AI 模型的潛力往往會被嚴重的概化失敗所掩蓋:AI 模型在從受控測試環境轉移到放射科醫師的臨床使用時,效能可能會降低多達 20%。這種不匹配引發了疑慮,即放射科醫師在實務上會被不正確的 AI 預測誤導,和/或開始不信任 AI,讓這些有前景的技術在實務上形同失效。因此,在 AI 模型遭遇各種資料範例時,預期 AI 模型的衰退,對豐富且多樣化的資料進行 AI 模型的全面臨床試驗至關重要。然而,由於收集多樣化的資料範例和對應註解的成本很高,實現這些目標具有挑戰性。為了克服這些限制,我們引進一個創新的條件式生成式 AI 模型,專門用於放射科 AI 的虛擬臨床試驗 (VCT),能夠真實地合成具有特定屬性的病患全身電腦斷層 (CT) 影像。透過學習影像和解剖結構的聯合分佈,我們的模型能夠以空前的細節精確複製真實世界的病患族群。我們透過由我們合成的電腦斷層研究族群支援的 VCT,展示了放射科 AI 模型有意義的評估,揭露模型衰退,並促進演算法稽核,以找出導致偏差的資料屬性。我們對 VCT 的生成式 AI 方法,是一個有前景的途徑,可以評估模型的穩健性、減輕偏差,並透過在任何所需的各種病患族群中,進行更簡單的 AI 模型測試和評估,來保障病患照護。 +摘要:目前的調查強調大語言模型 (LLM) 在建構學術知識圖譜 (SKG) 上的巨大潛力。此過程中特別複雜的步驟是關係萃取,目標是找出合適的屬性來描述研究內容。本研究直接建立在三位開放研究知識圖譜 (ORKG) 團隊成員先前研究的基礎上,他們評估了 GPT-3.5、Llama 2 和 Mistral 等 LLM 在科學文獻中萃取屬性的準備情況。鑑於觀察到的表現中等,先前的研究結論是需要微調,以改善這些模型與科學任務的一致性,以及它們對人類專業知識的模擬。本研究擴展了先前的實驗,評估了進階提示工程技術的影響,並證明這些技術可以大幅顯著地提升結果。此外,本研究將屬性萃取流程擴展到包含與現有 ORKG 屬性的屬性比對,這些屬性是透過 API 擷取的。評估結果顯示,透過進階提示工程產生的結果與 ORKG 屬性有更高的比對比例,進一步強調所達成的進階一致性。此外,這也為了解決先前的研究中強調的問題,例如 ORKG 屬性的不一致性,奠定了基礎。透過指定唯一的 URI 並使用標準化的術語,本研究增加了屬性的相容性,達成了連結資料和 FAIR 原則的重要層面,這是 ORKG 的核心承諾。這反過來大幅提升了 ORKG 內容在後續任務中的適用性,例如研究出版品的比較。最後,本研究以針對整體屬性萃取流程未來改進的建議作為結論。 -##### **Mind What You Ask For: Emotional and Rational Faces of Persuasion by Large Language Models** -2502.09687v1 by Wiktoria Mieleszczenko-Kowszewicz, Beata Bajcar, Jolanta Babiak, Berenika Dyczek, Jakub Świstak, Przemysław Biecek +##### **K-Edit: Language Model Editing with Contextual Knowledge Awareness** +2502.10626v1 by Elan Markowitz, Anil Ramakrishna, Ninareh Mehrabi, Charith Peris, Rahul Gupta, Kai-Wei Chang, Aram Galstyan -Be careful what you ask for, you just might get it. This saying fits with the -way large language models (LLMs) are trained, which, instead of being rewarded -for correctness, are increasingly rewarded for pleasing the recipient. So, they -are increasingly effective at persuading us that their answers are valuable. -But what tricks do they use in this persuasion? 
In this study, we examine what -are the psycholinguistic features of the responses used by twelve different -language models. By grouping response content according to rational or -emotional prompts and exploring social influence principles employed by LLMs, -we ask whether and how we can mitigate the risks of LLM-driven mass -misinformation. We position this study within the broader discourse on -human-centred AI, emphasizing the need for interdisciplinary approaches to -mitigate cognitive and societal risks posed by persuasive AI responses. +As the world changes, we need to be able to update our models and correct +false information without costly retraining. Knowledge-based model editing +enables precise modifications to the weights of large language models in order +to modify the information encoded within. Recent approaches have seen success +in enabling recall of edited information for thousands of edits at once. +However, these approaches fail to produce edits that account for associated +contextual information. We present K-Edit, an effective approach to generating +contextually consistent knowledge edits. By using knowledge graphs, which +maintain contextual consistency when an edge is edited, we are able to generate +additional \textit{contextual edits} that ensure consistency of related +information in the language model. Our experiments demonstrate significant +improvements in multi-hop question answering while maintaining the general +effectiveness and scalability of model edits. 
-摘要:小心你要求的,你可能真的會得到。這句話適用於大型語言模型 (LLM) 的訓練方式,它們不是因為正確性而獲得獎勵,而是因為取悅接收者而獲得越來越多的獎勵。因此,它們越來越有效地說服我們,它們的答案是有價值的。但是它們在這種說服中使用什麼技巧呢?在這項研究中,我們探討了十二種不同的語言模型使用的回應的心理語言特徵。通過根據理性和情緒提示對回應內容進行分組,並探討 LLM 使用的社會影響原則,我們探討是否以及如何減輕 LLM 驅動的大規模錯誤信息的風險。我們將這項研究定位在以人為中心的 AI 的更廣泛討論中,強調需要跨學科方法來減輕具有說服力的 AI 回應帶來的認知和社會風險。 +摘要:隨著世界變化,我們需要能夠更新我們的模型,並在不進行昂貴的重新訓練的情況下更正錯誤資訊。基於知識的模型編輯能夠對大型語言模型的權重進行精確修改,以便修改其中編碼的資訊。最近的方法在一次啟用數千次編輯的編輯資訊的召回方面取得了成功。然而,這些方法無法產生考慮相關上下文資訊的編輯。我們提出 K-Edit,這是一種產生上下文一致的知識編輯的有效方法。通過使用知識圖,在編輯邊緣時保持上下文一致性,我們能夠產生額外的「上下文編輯」,以確保語言模型中相關資訊的一致性。我們的實驗證明了多跳問題回答的顯著改進,同時保持了模型編輯的一般有效性和可擴充性。 -##### **The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics** -2502.09247v1 by Danni Feng, Runzhi Li, Jing Wang, Siyu Yan, Lihong Ma, Yunli Xing +##### **ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis** +2502.10620v1 by Xueshen Li, Xinlong Hou, Ziyi Huang, Yu Gan -Joint entity-relation extraction is a critical task in transforming -unstructured or semi-structured text into triplets, facilitating the -construction of large-scale knowledge graphs, and supporting various downstream -applications. Despite its importance, research on Chinese text, particularly -with complex semantics in specialized domains like medicine, remains limited. -To address this gap, we introduce the CH-DDI, a Chinese drug-drug interactions -dataset designed to capture the intricacies of medical text. Leveraging the -strengths of attention mechanisms in capturing long-range dependencies, we -propose the SEA module, which enhances the extraction of complex contextual -semantic information, thereby improving entity recognition and relation -extraction. Additionally, to address the inefficiencies of existing methods in -facilitating information exchange between entity recognition and relation -extraction, we present an interactive fusion representation module. 
This module -employs Cross Attention for bidirectional information exchange between the -tasks and further refines feature extraction through BiLSTM. Experimental -results on both our CH-DDI dataset and public CoNLL04 dataset demonstrate that -our model exhibits strong generalization capabilities. On the CH-DDI dataset, -our model achieves an F1-score of 96.73% for entity recognition and 78.43% for -relation extraction. On the CoNLL04 dataset, it attains an entity recognition -precision of 89.54% and a relation extraction accuracy of 71.64%. +Recent advancements in large language models (LLMs) have demonstrated +extraordinary comprehension capabilities with remarkable breakthroughs on +various vision-language tasks. However, the application of LLMs in generating +reliable medical diagnostic reports remains in the early stages. Currently, +medical LLMs typically feature a passive interaction model where doctors +respond to patient queries with little or no involvement in analyzing medical +images. In contrast, some ChatBots simply respond to predefined queries based +on visual inputs, lacking interactive dialogue or consideration of medical +history. As such, there is a gap between LLM-generated patient-ChatBot +interactions and those occurring in actual patient-doctor consultations. To +bridge this gap, we develop an LLM-based dialogue system, namely proactive +multi-round vision-language interactions for computer-aided diagnosis +(ProMRVL-CAD), to generate patient-friendly disease diagnostic reports. The +proposed ProMRVL-CAD system allows proactive dialogue to provide patients with +constant and reliable medical access via an integration of knowledge graph into +a recommendation system. Specifically, we devise two generators: a Proactive +Question Generator (Pro-Q Gen) to generate proactive questions that guide the +diagnostic procedure and a Multi-Vision Patient-Text Diagnostic Report +Generator (MVP-DR Gen) to produce high-quality diagnostic reports. 
Evaluating +two real-world publicly available datasets, MIMIC-CXR and IU-Xray, our model +has better quality in generating medical reports. We further demonstrate that +ProMRVL performs robustly under scenarios with low image +quality. Moreover, we have created a synthetic medical dialogue dataset that +simulates proactive diagnostic interactions between patients and doctors, +serving as a valuable resource for training LLMs. + +摘要:大型語言模型 (LLM) 最近的進展已展現出非凡的理解能力,在各種視覺語言任務中取得了顯著的突破。然而,LLM 在產生可靠的醫療診斷報告中的應用仍處於早期階段。目前,醫療 LLM 通常採用被動互動模式,醫生對患者的疑問做出回應,但很少或根本不參與分析醫療影像。相比之下,有些聊天機器人僅根據視覺輸入回應預先定義的查詢,缺乏互動對話或對病史的考量。因此,LLM 產生的患者聊天機器人互動與實際患者醫生諮詢之間存在差距。為了彌合這一差距,我們開發了一個基於 LLM 的對話系統,即主動多輪視覺語言互動,用於電腦輔助診斷 (ProMRVL-CAD),以產生對患者友善的疾病診斷報告。建議的 ProMRVL-CAD 系統允許主動對話,透過將知識圖譜整合到推薦系統中,為患者提供持續且可靠的醫療管道。具體來說,我們設計了兩個產生器:主動問題產生器 (Pro-Q Gen),用於產生引導診斷程序的主動問題,以及多視覺患者文字診斷報告產生器 (MVP-DR Gen),用於產生高品質的診斷報告。評估兩個真實世界公開可用的資料集,MIMIC-CXR 和 IU-Xray,我們的模型在產生醫療報告方面品質較佳。我們進一步證明 ProMRVL 的效能,在影像品質低的情況下仍能穩健運行。此外,我們建立了一個模擬患者和醫生之間主動診斷互動的合成醫療對話資料集,作為訓練 LLM 的寶貴資源。 + +##### **GraphiT: Efficient Node Classification on Text-Attributed Graphs with Prompt Optimized LLMs** +2502.10522v1 by Shima Khoshraftar, Niaz Abedini, Amir Hajian + +The application of large language models (LLMs) to graph data has attracted a +lot of attention recently. LLMs allow us to use deep contextual embeddings from +pretrained models in text-attributed graphs, where shallow embeddings are often +used for the text attributes of nodes. However, it is still challenging to +efficiently encode the graph structure and features into a sequential form for +use by LLMs. In addition, the performance of an LLM alone is highly dependent +on the structure of the input prompt, which limits their effectiveness as a +reliable approach and often requires iterative manual adjustments that could be +slow, tedious and difficult to replicate programmatically. 
In this paper, we +propose GraphiT (Graphs in Text), a framework for encoding graphs into a +textual format and optimizing LLM prompts for graph prediction tasks. Here we +focus on node classification for text-attributed graphs. We encode the graph +data for every node and its neighborhood into a concise text to enable LLMs to +better utilize the information in the graph. We then further programmatically +optimize the LLM prompts using the DSPy framework to automate this step and +make it more efficient and reproducible. GraphiT outperforms our LLM-based +baselines on three datasets and we show how the optimization step in GraphiT +leads to measurably better results without manual prompt tweaking. We also +demonstrated that our graph encoding approach is competitive with other graph +encoding methods while being less expensive because it uses significantly fewer +tokens for the same task. -摘要:聯合實體關係抽取是將非結構化或半結構化文字轉換為三元組的重要任務,有助於建構大規模知識圖譜,並支援各種下游應用程式。儘管其重要性,但針對中文文本的研究,特別是醫學等專業領域中具有複雜語義的研究仍十分有限。為了解決這個差距,我們引入了 CH-DDI,一個中文藥物-藥物交互作用資料集,旨在擷取醫學文本的複雜性。利用注意力機制在擷取長程依賴關係方面的優勢,我們提出了 SEA 模組,增強了複雜脈絡語義資訊的抽取,從而改進了實體辨識和關係抽取。此外,為了解決現有方法在促進實體辨識和關係抽取之間資訊交換方面的低效率問題,我們提出了互動式融合表示模組。此模組採用交叉注意力,在任務之間進行雙向資訊交換,並透過 BiLSTM 進一步精煉特徵抽取。在我們的 CH-DDI 資料集和公開的 CoNLL04 資料集上的實驗結果表明,我們的模型展現出強大的泛化能力。在 CH-DDI 資料集上,我們的模型在實體辨識方面達到了 96.73% 的 F1 分數,在關係抽取方面達到了 78.43% 的 F1 分數。在 CoNLL04 資料集上,它在實體辨識方面達到了 89.54% 的準確度,在關係抽取方面達到了 71.64% 的準確度。 +摘要:大型語言模型 (LLM) 在圖表資料的應用最近備受關注。LLM 讓我們能夠在文字標記圖表中使用預訓練模型的深度脈絡嵌入,其中淺層嵌入通常用於節點的文字屬性。然而,要有效率地將圖表結構和特徵編碼成序列形式供 LLM 使用,仍然是一項挑戰。此外,單獨 LLM 的效能高度依賴輸入提示的結構,這限制了它們作為可靠方法的有效性,而且通常需要反覆的人工調整,這可能會緩慢、繁瑣且難以透過程式複製。在本文中,我們提出 GraphiT(文字中的圖表),一個用於將圖表編碼成文字格式並最佳化 LLM 提示以進行圖表預測任務的架構。在這裡,我們專注於文字標記圖表的節點分類。我們將每個節點及其鄰域的圖表資料編碼成簡潔的文字,讓 LLM 能夠更好地利用圖表中的資訊。然後,我們進一步透過程式最佳化 LLM 提示,使用 DSPy 架構自動化這個步驟,並使其更有效率且可複製。GraphiT 在三個資料集上優於我們的基於 LLM 的基準,我們展示了 GraphiT 中的最佳化步驟如何導致顯著更好的結果,而無需手動調整提示。我們還證明了我們的圖表編碼方法與其他圖表編碼方法具有競爭力,同時成本更低,因為它在相同的任務中使用了顯著更少的標記。 -##### **From large language models to multimodal AI: A scoping review on 
the potential of generative AI in medicine** -2502.09242v1 by Lukas Buess, Matthias Keicher, Nassir Navab, Andreas Maier, Soroosh Tayebi Arasteh +##### **Do Large Language Models Reason Causally Like Us? Even Better?** +2502.10215v1 by Hanna M. Dettki, Brenden M. Lake, Charley M. Wu, Bob Rehder -Generative artificial intelligence (AI) models, such as diffusion models and -OpenAI's ChatGPT, are transforming medicine by enhancing diagnostic accuracy -and automating clinical workflows. The field has advanced rapidly, evolving -from text-only large language models for tasks such as clinical documentation -and decision support to multimodal AI systems capable of integrating diverse -data modalities, including imaging, text, and structured data, within a single -model. The diverse landscape of these technologies, along with rising interest, -highlights the need for a comprehensive review of their applications and -potential. This scoping review explores the evolution of multimodal AI, -highlighting its methods, applications, datasets, and evaluation in clinical -settings. Adhering to PRISMA-ScR guidelines, we systematically queried PubMed, -IEEE Xplore, and Web of Science, prioritizing recent studies published up to -the end of 2024. After rigorous screening, 144 papers were included, revealing -key trends and challenges in this dynamic field. Our findings underscore a -shift from unimodal to multimodal approaches, driving innovations in diagnostic -support, medical report generation, drug discovery, and conversational AI. -However, critical challenges remain, including the integration of heterogeneous -data types, improving model interpretability, addressing ethical concerns, and -validating AI systems in real-world clinical settings. This review summarizes -the current state of the art, identifies critical gaps, and provides insights -to guide the development of scalable, trustworthy, and clinically impactful -multimodal AI solutions in healthcare. 
+Causal reasoning is a core component of intelligence. Large language models +(LLMs) have shown impressive capabilities in generating human-like text, +raising questions about whether their responses reflect true understanding or +statistical patterns. We compared causal reasoning in humans and four LLMs +using tasks based on collider graphs, rating the likelihood of a query variable +occurring given evidence from other variables. We find that LLMs reason +causally along a spectrum from human-like to normative inference, with +alignment shifting based on model, context, and task. Overall, GPT-4o and +Claude showed the most normative behavior, including "explaining away", whereas +Gemini-Pro and GPT-3.5 did not. Although all agents deviated from the expected +independence of causes - Claude the least - they exhibited strong associative +reasoning and predictive inference when assessing the likelihood of the effect +given its causes. These findings underscore the need to assess AI biases as +they increasingly assist human decision-making. 
-摘要:生成式人工智能 (AI) 模型,例如扩散模型和 OpenAI 的 ChatGPT,通过提高诊断准确性和自动化临床工作流程,正在改变医学领域。该领域已迅速发展,从用于临床文件编制和决策支持等任务的纯文本大型语言模型,发展到能够在单个模型中整合包括影像、文本和结构化数据在内的多种数据方式的多模态 AI 系统。这些技术的多样化格局以及日益增长的兴趣,凸显了全面审查其应用和潜力的必要性。本范围审查探讨了多模态 AI 的演变,重点介绍了其方法、应用、数据集和在临床环境中的评估。遵循 PRISMA-ScR 指南,我们系统地查询了 PubMed、IEEE Xplore 和 Web of Science,优先考虑截至 2024 年底发表的最新研究。经过严格筛选,纳入了 144 篇论文,揭示了这一充满活力的领域的趋势和挑战。我们的研究结果强调了从单模态方法向多模态方法的转变,推动了诊断支持、医疗报告生成、药物发现和会话式 AI 的创新。然而,关键挑战仍然存在,包括异构数据类型的整合、提高模型可解释性、解决伦理问题以及在现实世界的临床环境中验证 AI 系统。本综述总结了当前的最新技术,确定了关键差距,并提供了见解,以指导在医疗保健领域开发可扩展、可信赖且具有临床影响力的多模态 AI 解决方案。 +摘要:因果推理是智能的核心組成部分。大型語言模型 (LLM) 在生成類人文本方面展現了令人印象深刻的能力,引發了關於它們的回應是否反映真實理解或統計模式的疑問。我們使用基於碰撞圖的任務比較了人類和四個 LLM 中的因果推理,根據其他變數的證據評估查詢變數發生的可能性。我們發現 LLM 沿著從類人到規範推論的光譜進行因果推理,對齊會根據模型、上下文和任務而改變。總體而言,GPT-4o 和 Claude 表現出最規範的行為,包括「解釋」,而 Gemini-Pro 和 GPT-3.5 則沒有。儘管所有代理都偏離了預期的原因獨立性 - Claude 最不偏離 - 但它們在評估給定原因的效果可能性時表現出強烈的關聯推理和預測推論。這些發現強調了評估 AI 偏差的必要性,因為它們越來越協助人類決策。 -##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration** -2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano +##### **Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages** +2502.10140v1 by Daniil Gurgurov, Ivan Vykopal, Josef van Genabith, Simon Ostermann -This paper presents a complete explainable system that interprets a set of -data, abstracts the underlying features and describes them in a natural -language of choice. The system relies on two crucial stages: (i) identifying -emerging properties from data and transforming them into abstract concepts, and -(ii) converting these concepts into natural language. Despite the impressive -natural language generation capabilities demonstrated by Large Language Models, -their statistical nature and the intricacy of their internal mechanism still -force us to employ these techniques as black boxes, forgoing trustworthiness. 
-Developing an explainable pipeline for data interpretation would allow -facilitating its use in safety-critical environments like processing medical -information and allowing non-experts and visually impaired people to access -narrated information. To this end, we believe that the fields of knowledge -representation and automated reasoning research could present a valid -alternative. Expanding on prior research that tackled the first stage (i), we -focus on the second stage, named Concept2Text. Being explainable, data -translation is easily modeled through logic-based rules, once again emphasizing -the role of declarative programming in achieving AI explainability. This paper -explores a Prolog/CLP-based rewriting system to interpret concepts-articulated -in terms of classes and relations, plus common knowledge-derived from a generic -ontology, generating natural language text. Its main features include -hierarchical tree rewritings, modular multilingual generation, support for -equivalent variants across semantic, grammar, and lexical levels, and a -transparent rule-based system. We outline the architecture and demonstrate its -flexibility through some examples capable of generating numerous diverse and -equivalent rewritings based on the input concept. +Low-resource languages (LRLs) face significant challenges in natural language +processing (NLP) due to limited data. While current state-of-the-art large +language models (LLMs) still struggle with LRLs, smaller multilingual models +(mLMs) such as mBERT and XLM-R offer greater promise due to a better fit of +their capacity to low training data sizes. This study systematically +investigates parameter-efficient adapter-based methods for adapting mLMs to +LRLs, evaluating three architectures: Sequential Bottleneck, Invertible +Bottleneck, and Low-Rank Adaptation. 
Using unstructured text from GlotCC and +structured knowledge from ConceptNet, we show that small adaptation datasets +(e.g., up to 1 GB of free-text or a few MB of knowledge graph data) yield gains +in intrinsic (masked language modeling) and extrinsic tasks (topic +classification, sentiment analysis, and named entity recognition). We find that +Sequential Bottleneck adapters excel in language modeling, while Invertible +Bottleneck adapters slightly outperform other methods on downstream tasks due +to better embedding alignment and larger parameter counts. Adapter-based +methods match or outperform full fine-tuning while using far fewer parameters, +and smaller mLMs prove more effective for LRLs than massive LLMs like LLaMA-3, +GPT-4, and DeepSeek-R1-based distilled models. While adaptation improves +performance, pre-training data size remains the dominant factor, especially for +languages with extensive pre-training coverage. -摘要:這篇論文提出了一個完整的可解釋系統,它可以解釋一組資料,抽象出基礎特徵,並以選擇的自然語言描述它們。系統依賴兩個關鍵階段:(i) 從資料中識別新興屬性,並將它們轉換為抽象概念,以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力,但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子,放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它,例如處理醫療資訊,並允許非專家和視障人士存取敘述資訊。為此,我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上,我們專注於第二階段,稱為 Concept2Text。由於具有可解釋性,資料翻譯很容易透過基於邏輯的規則建模,再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統,以解釋概念,這些概念以類別和關係的形式表達,再加上從通用本体衍生的常識,產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體,以及一個透明的基於規則的系統。我們概述了架構,並透過一些範例展示了它的靈活性,這些範例能夠根據輸入概念生成許多不同的等效重寫。 +摘要:低資源語言 (LRL) 由於資料有限,在自然語言處理 (NLP) 中面臨重大挑戰。雖然當前最先進的大型語言模型 (LLM) 仍難以處理 LRL,但較小的多語言模型 (mLMS),例如 mBERT 和 XLM-R,由於其容量更適合低訓練資料大小,因此提供了更大的希望。本研究系統性地探討了基於參數效率適配器的適配方法,以將 mLMS 適配到 LRL,評估了三種架構:順序瓶頸、可逆瓶頸和低秩適配。使用來自 GlotCC 的非結構化文本和來自 ConceptNet 的結構化知識,我們表明小型適配資料集(例如,高達 1 GB 的自由文本或幾 MB 的知識圖譜資料)在內在(遮蔽語言模型)和外在任務(主題分類、情緒分析和命名實體識別)中產生增益。我們發現順序瓶頸適配器在語言模型中表現出色,而可逆瓶頸適配器由於更好的嵌入對齊和更大的參數數量,在下游任務上略勝於其他方法。基於適配器的方法在使用更少參數的同時,可以匹配或優於完全微調,而較小的 mLM 被證明比 LLaMA-3、GPT-4 和基於 DeepSeek-R1 的蒸餾模型等大型 LLM 更適合 
LRL。雖然適配可以提高效能,但預訓練資料大小仍然是主要因素,特別是對於預訓練覆蓋範圍廣泛的語言。 -##### **Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York** -2502.09204v1 by Sanskar Sehgal, Yanhong A. Liu +##### **Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models** +2502.10090v1 by Chenrui Tie, Shengxiang Sun, Jinxuan Zhu, Yiwei Liu, Jingxiang Guo, Yue Hu, Haonan Chen, Junting Chen, Ruihai Wu, Lin Shao -Legal cases require careful logical reasoning following the laws, whereas -interactions with non-technical users must be in natural language. As an -application combining logical reasoning using Prolog and natural language -processing using large language models (LLMs), this paper presents a novel -approach and system, LogicLease, to automate the analysis of landlord-tenant -legal cases in the state of New York. LogicLease determines compliance with -relevant legal requirements by analyzing case descriptions and citing all -relevant laws. It leverages LLMs for information extraction and Prolog for -legal reasoning. By separating information extraction from legal reasoning, -LogicLease achieves greater transparency and control over the legal logic -applied to each case. We evaluate the accuracy, efficiency, and robustness of -LogicLease through a series of tests, achieving 100% accuracy and an average -processing time of 2.57 seconds. LogicLease presents advantages over -state-of-the-art LLM-based legal analysis systems by providing clear, -step-by-step reasoning, citing specific laws, and distinguishing itself by its -ability to avoid hallucinations -- a common issue in LLMs. +Humans possess an extraordinary ability to understand and execute complex +manipulation tasks by interpreting abstract instruction manuals. For robots, +however, this capability remains a substantial challenge, as they cannot +interpret abstract instructions and translate them into executable actions. 
In +this paper, we present Manual2Skill, a novel framework that enables robots to +perform complex assembly tasks guided by high-level manual instructions. Our +approach leverages a Vision-Language Model (VLM) to extract structured +information from instructional images and then uses this information to +construct hierarchical assembly graphs. These graphs represent parts, +subassemblies, and the relationships between them. To facilitate task +execution, a pose estimation model predicts the relative 6D poses of components +at each assembly step. At the same time, a motion planning module generates +actionable sequences for real-world robotic implementation. We demonstrate the +effectiveness of Manual2Skill by successfully assembling several real-world +IKEA furniture items. This application highlights its ability to manage +long-horizon manipulation tasks with both efficiency and precision, +significantly enhancing the practicality of robot learning from instruction +manuals. This work marks a step forward in advancing robotic systems capable of +understanding and executing complex manipulation tasks in a manner akin to +human capabilities. 
-摘要:法律案件需要遵循法律进行谨慎的逻辑推理,而与非技术用户的互动必须使用自然语言。作为结合使用 Prolog 进行逻辑推理和使用大型语言模型 (LLM) 进行自然语言处理的应用程序,本文提出了一种新颖的方法和系统 LogicLease,以自动分析纽约州的房东与租户法律案件。LogicLease 通过分析案例描述并引用所有相关法律来确定是否符合相关法律要求。它利用 LLM 进行信息提取,并利用 Prolog 进行法律推理。通过将信息提取与法律推理分开,LogicLease 实现了对应用于每个案例的法律逻辑的更高透明度和控制力。我们通过一系列测试评估了 LogicLease 的准确性、效率和鲁棒性,实现了 100% 的准确性和 2.57 秒的平均处理时间。LogicLease 通过提供清晰、分步的推理,引用具体法律,并以其避免幻觉的能力而区别于最先进的基于 LLM 的法律分析系统,从而显示出优势——这是 LLM 中的常见问题。 +摘要:人類擁有理解並執行複雜操作任務的非凡能力,方法是詮釋抽象的說明手冊。然而,對機器人來說,這項能力仍然是一項重大的挑戰,因為它們無法詮釋抽象的指令並將其轉換為可執行的動作。在本文中,我們提出了 Manual2Skill,這是一個新穎的框架,使機器人能夠在高階手冊說明的指導下執行複雜的組裝任務。我們的做法利用視覺語言模型 (VLM) 從教學圖片中提取結構化資訊,然後使用此資訊來建構階層式組裝圖。這些圖表示零件、子組件以及它們之間的關係。為了促進任務執行,姿勢估計模型會預測每個組裝步驟中組件的相對 6D 姿勢。同時,動作規劃模組會產生適用於實際機器人實作的可操作順序。我們透過成功組裝幾個真實世界的 IKEA 家具來展示 Manual2Skill 的有效性。此應用程式突顯了它以高效率和高精準度管理長時程操作任務的能力,大幅提升機器人從說明手冊中學習的實用性。這項工作標誌著機器人系統在理解和執行複雜操作任務方面向前邁進了一步,其方式類似於人類的能力。 -##### **Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia** -2502.09173v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott +##### **Decision Information Meets Large Language Models: The Future of Explainable Operations Research** +2502.09994v1 by Yansen Zhang, Qingcan Kang, Wing Yin Yu, Hailei Gong, Xiaojin Fu, Xiongwei Han, Tao Zhong, Chen Ma -In remote healthcare monitoring, time series representation learning reveals -critical patient behavior patterns from high-frequency data. This study -analyzes home activity data from individuals living with dementia by proposing -a two-stage, self-supervised learning approach tailored to uncover low-rank -structures. The first stage converts time-series activities into text sequences -encoded by a pre-trained language model, providing a rich, high-dimensional -latent state space using a PageRank-based method. This PageRank vector captures -latent state transitions, effectively compressing complex behaviour data into a -succinct form that enhances interpretability. 
This low-rank representation not -only enhances model interpretability but also facilitates clustering and -transition analysis, revealing key behavioral patterns correlated with -clinicalmetrics such as MMSE and ADAS-COG scores. Our findings demonstrate the -framework's potential in supporting cognitive status prediction, personalized -care interventions, and large-scale health monitoring. +Operations Research (OR) is vital for decision-making in many industries. +While recent OR methods have seen significant improvements in automation and +efficiency through integrating Large Language Models (LLMs), they still +struggle to produce meaningful explanations. This lack of clarity raises +concerns about transparency and trustworthiness in OR applications. To address +these challenges, we propose a comprehensive framework, Explainable Operations +Research (EOR), emphasizing actionable and understandable explanations +accompanying optimization. The core of EOR is the concept of Decision +Information, which emerges from what-if analysis and focuses on evaluating the +impact of complex constraints (or parameters) changes on decision-making. +Specifically, we utilize bipartite graphs to quantify the changes in the OR +model and adopt LLMs to improve the explanation capabilities. Additionally, we +introduce the first industrial benchmark to rigorously evaluate the +effectiveness of explanations and analyses in OR, establishing a new standard +for transparency and clarity in the field. 
-摘要:在遠程醫療監控中,時間序列表示學習揭示了高頻率數據中的關鍵患者行為模式。本研究通過提出一個兩階段、自我監督的學習方法來分析痴呆症患者的家庭活動數據,該方法專門用於發現低秩結構。第一階段將時間序列活動轉換為由預訓練語言模型編碼的文本序列,使用基於 PageRank 的方法提供了一個豐富、高維的潛在狀態空間。此 PageRank 向量捕獲潛在狀態轉換,有效地將複雜的行為數據壓縮成簡潔的形式,從而增強了解力。此低秩表示不僅增強了模型的可解釋性,還促進了聚類和轉換分析,揭示了與臨床指標(例如 MMSE 和 ADAS-COG 分數)相關的關鍵行為模式。我們的研究結果證明了該框架在支持認知狀態預測、個性化護理干預和大型健康監控方面的潛力。 +摘要:作業研究 (OR) 對許多產業的決策制定至關重要。雖然近期的 OR 方法已透過整合大型語言模型 (LLM) 在自動化和效率方面取得顯著的進步,但它們在產生有意義的解釋方面仍面臨挑戰。這種缺乏明確性的情況會對 OR 應用中的透明度和可信度造成疑慮。為了應對這些挑戰,我們提出一個全面的架構,即可解釋作業研究 (EOR),強調在最佳化過程中提供可操作且易於理解的解釋。EOR 的核心是決策資訊的概念,它源自假設分析,並專注於評估複雜約束條件 (或參數) 變更對決策制定的影響。具體來說,我們利用二部圖量化 OR 模型的變化,並採用 LLM 來改善解釋能力。此外,我們引入了第一個產業基準,以嚴格評估 OR 中解釋和分析的有效性,為該領域的透明度和清晰度建立新的標準。 -##### **TastepepAI, An artificial intelligence platform for taste peptide de novo design** -2502.12167v1 by Jianda Yue, Tingting Li, Jian Ouyang, Jiawei Xu, Hua Tan, Zihui Chen, Changsheng Han, Huanyu Li, Songping Liang, Zhonghua Liu, Zhonghua Liu, Ying Wang +##### **KGGen: Extracting Knowledge Graphs from Plain Text with Language Models** +2502.09956v1 by Belinda Mo, Kyssen Yu, Joshua Kazdan, Proud Mpala, Lisa Yu, Chris Cundy, Charilaos Kanatsoulis, Sanmi Koyejo -Taste peptides have emerged as promising natural flavoring agents attributed -to their unique organoleptic properties, high safety profile, and potential -health benefits. However, the de novo identification of taste peptides derived -from animal, plant, or microbial sources remains a time-consuming and -resource-intensive process, significantly impeding their widespread application -in the food industry. Here, we present TastePepAI, a comprehensive artificial -intelligence framework for customized taste peptide design and safety -assessment. As the key element of this framework, a loss-supervised adaptive -variational autoencoder (LA-VAE) is implemented to efficiently optimizes the -latent representation of sequences during training and facilitates the -generation of target peptides with desired taste profiles. 
 Notably, our model -incorporates a novel taste-avoidance mechanism, allowing for selective flavor -exclusion. Subsequently, our in-house developed toxicity prediction algorithm -(SpepToxPred) is integrated in the framework to undergo rigorous safety -evaluation of generated peptides. Using this integrated platform, we -successfully identified 73 peptides exhibiting sweet, salty, and umami, -significantly expanding the current repertoire of taste peptides. This work -demonstrates the potential of TastePepAI in accelerating taste peptide -discovery for food applications and provides a versatile framework adaptable to -broader peptide engineering challenges. +Recent interest in building foundation models for KGs has highlighted a +fundamental challenge: knowledge-graph data is relatively scarce. The +best-known KGs are primarily human-labeled, created by pattern-matching, or +extracted using early NLP techniques. While human-generated KGs are in short +supply, automatically extracted KGs are of questionable quality. We present a +solution to this data scarcity problem in the form of a text-to-KG generator +(KGGen), a package that uses language models to create high-quality graphs from +plaintext. Unlike other KG extractors, KGGen clusters related entities to +reduce sparsity in extracted KGs. KGGen is available as a Python library +(\texttt{pip install kg-gen}), making it accessible to everyone. Along with +KGGen, we release the first benchmark, Measure of Information in Nodes and +Edges (MINE), that tests an extractor's ability to produce a useful KG from +plain text. We benchmark our new tool against existing extractors and +demonstrate far superior performance. 
-摘要:味觉肽因其独特的感官特性、高安全性概况和潜在的健康益处而成为有前途的天然调味剂。然而,从动物、植物或微生物来源中从头鉴定味觉肽仍然是一个耗时且资源密集的过程,严重阻碍了它们在食品工业中的广泛应用。在此,我们提出了 TastePepAI,这是一个用于定制味觉肽设计和安全性评估的综合人工智能框架。作为该框架的关键元素,实现了损失监督自适应变分自动编码器 (LA-VAE),以在训练期间有效优化序列的潜在表示,并促进生成具有所需味觉特征的目标肽。值得注意的是,我们的模型包含了一种新颖的味觉回避机制,允许选择性排除风味。随后,我们内部开发的毒性预测算法 (SpepToxPred) 被集成到框架中,以对生成的肽进行严格的安全评估。使用这个集成平台,我们成功地鉴定了 73 种表现出甜味、咸味和鲜味的肽,极大地扩展了当前的味觉肽库。这项工作展示了 TastePepAI 在加速味觉肽发现以用于食品应用方面的潜力,并提供了一个适用于更广泛的肽工程挑战的多功能框架。 +摘要:最近对于构建知识图谱基础模型的兴趣凸显了一个基本挑战:知识图谱数据相对稀缺。最知名的知识图谱主要为人标注,由模式匹配创建,或使用早期自然语言处理技术提取。虽然人生成的知识图谱供不应求,但自动提取的知识图谱质量堪忧。我们以文本到知识图谱生成器 (KGGen) 的形式为这一数据稀缺问题提供了一个解决方案,这是一个使用语言模型从纯文本创建高质量图表的包。与其他知识图谱提取器不同,KGGen 对相关实体进行聚类以减少提取的知识图谱中的稀疏性。KGGen 可用作 Python 库(\texttt{pip install kg-gen}),使其所有人都能访问。除了 KGGen,我们还发布了第一个基准测试,即节点和边信息度量 (MINE),它测试了提取器从纯文本生成有用知识图谱的能力。我们针对现有提取器对我们的新工具进行基准测试,并展示了远超其性能。 -##### **HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification** -2502.08754v1 by Valentina Vadori, Jean-Marie Graïc, Antonella Peruffo, Livio Finos, Ujwala Kiran Chaudhari, Enrico Grisan +##### **ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation** +2502.09891v1 by Shu Wang, Yixiang Fang, Yingli Zhou, Xilin Liu, Yuchi Ma -Precise segmentation and classification of cell instances are vital for -analyzing the tissue microenvironment in histology images, supporting medical -diagnosis, prognosis, treatment planning, and studies of brain -cytoarchitecture. However, the creation of high-quality annotated datasets for -training remains a major challenge. This study introduces a novel single-stage -approach (HistoSmith) for generating image-label pairs to augment histology -datasets. Unlike state-of-the-art methods that utilize diffusion models with -separate components for label and image generation, our approach employs a -latent diffusion model to learn the joint distribution of cellular layouts, -classification masks, and histology images. 
This model enables tailored data -generation by conditioning on user-defined parameters such as cell types, -quantities, and tissue types. Trained on the Conic H&E histopathology dataset -and the Nissl-stained CytoDArk0 dataset, the model generates realistic and -diverse labeled samples. Experimental results demonstrate improvements in cell -instance segmentation and classification, particularly for underrepresented -cell types like neutrophils in the Conic dataset. These findings underscore the -potential of our approach to address data scarcity challenges. +Retrieval-Augmented Generation (RAG) has proven effective in integrating +external knowledge into large language models (LLMs) for question-answer (QA) +tasks. The state-of-the-art RAG approaches often use the graph data as the +external data since they capture the rich semantic information and link +relationships between entities. However, existing graph-based RAG approaches +cannot accurately identify the relevant information from the graph and also +consume large numbers of tokens in the online retrieval process. To address +these issues, we introduce a novel graph-based RAG approach, called Attributed +Community-based Hierarchical RAG (ArchRAG), by augmenting the question using +attributed communities, and also introducing a novel LLM-based hierarchical +clustering method. To retrieve the most relevant information from the graph for +the question, we build a novel hierarchical index structure for the attributed +communities and develop an effective online retrieval method. Experimental +results demonstrate that ArchRAG outperforms existing methods in terms of both +accuracy and token cost. 
-摘要:精確的細胞實例分割和分類對於分析組織學影像中的組織微環境、支援醫療診斷、預後、治療規劃和腦部細胞結構研究至關重要。然而,建立用於訓練的高品質標註資料集仍然是一項重大挑戰。本研究提出了一種新穎的單階段方法 (HistoSmith),用於產生影像標籤對,以擴充組織學資料集。與利用擴散模型並將標籤和影像產生分開的組成部分的現有技術不同,我們的做法採用潛在擴散模型來學習細胞佈局、分類遮罩和組織學影像的聯合分佈。此模型能透過調整使用者定義的參數(例如細胞類型、數量和組織類型)來進行客製化資料產生。在 Conic H&E 細胞病理學資料集和 Nissl 染色的 CytoDArk0 資料集上訓練後,此模型產生逼真且多樣化的標籤樣本。實驗結果顯示細胞實例分割和分類有顯著進步,特別是對於 Conic 資料集中代表性不足的細胞類型,例如中性球。這些發現強調了我們的方法在解決資料稀少性挑戰方面的潛力。 +摘要:檢索增強生成 (RAG) 已證明可將外部知識整合到大型語言模型 (LLM),用於問答 (QA) 任務。最先進的 RAG 方法通常使用圖形資料作為外部資料,因為它們擷取了豐富的語意資訊和實體之間的連結關係。然而,現有的基於圖形的 RAG 方法無法準確識別圖形中的相關資訊,而且在線上檢索過程中也會消耗大量的符號。為了解決這些問題,我們提出了一種新穎的基於圖形的 RAG 方法,稱為基於屬性社群的分層 RAG (ArchRAG),透過使用屬性社群來擴充問題,並引入一種新穎的基於 LLM 的分層聚類方法。為了從圖形中檢索與問題最相關的資訊,我們為屬性社群建立了一個新穎的分層索引結構,並開發了一種有效的線上檢索方法。實驗結果證明,ArchRAG 在準確性和符號成本方面都優於現有方法。 -##### **Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion** -2502.08560v1 by Lemuel Puglisi, Daniel C. Alexander, Daniele Ravì +##### **Visual Graph Question Answering with ASP and LLMs for Language Parsing** +2502.09211v1 by Jakob Johannes Bauer, Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch -The growing availability of longitudinal Magnetic Resonance Imaging (MRI) -datasets has facilitated Artificial Intelligence (AI)-driven modeling of -disease progression, making it possible to predict future medical scans for -individual patients. However, despite significant advancements in AI, current -methods continue to face challenges including achieving patient-specific -individualization, ensuring spatiotemporal consistency, efficiently utilizing -longitudinal data, and managing the substantial memory demands of 3D scans. To -address these challenges, we propose Brain Latent Progression (BrLP), a novel -spatiotemporal model designed to predict individual-level disease progression -in 3D brain MRIs. 
The key contributions in BrLP are fourfold: (i) it operates -in a small latent space, mitigating the computational challenges posed by -high-dimensional imaging data; (ii) it explicitly integrates subject metadata -to enhance the individualization of predictions; (iii) it incorporates prior -knowledge of disease dynamics through an auxiliary model, facilitating the -integration of longitudinal data; and (iv) it introduces the Latent Average -Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in -the predicted progression at inference time and (b) allows us to derive a -measure of the uncertainty for the prediction. We train and evaluate BrLP on -11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its -generalizability on an external test set comprising 2,257 MRIs from 962 -subjects. Our experiments compare BrLP-generated MRI scans with real follow-up -MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The -code is publicly available at: https://github.com/LemuelPuglisi/BrLP. +Visual Question Answering (VQA) is a challenging problem that requires to +process multimodal input. Answer-Set Programming (ASP) has shown great +potential in this regard to add interpretability and explainability to modular +VQA architectures. In this work, we address the problem of how to integrate ASP +with modules for vision and natural language processing to solve a new and +demanding VQA variant that is concerned with images of graphs (not graphs in +symbolic form). Images containing graph-based structures are an ubiquitous and +popular form of visualisation. Here, we deal with the particular problem of +graphs inspired by transit networks, and we introduce a novel dataset that +amends an existing one by adding images of graphs that resemble metro lines. 
+Our modular neuro-symbolic approach combines optical graph recognition for +graph parsing, a pretrained optical character recognition neural network for +parsing labels, Large Language Models (LLMs) for language processing, and ASP +for reasoning. This method serves as a first baseline and achieves an overall +average accuracy of 73% on the dataset. Our evaluation provides further +evidence of the potential of modular neuro-symbolic systems, in particular with +pretrained models that do not involve any further training and logic +programming for reasoning, to solve complex VQA tasks. -摘要:隨著縱向磁共振影像 (MRI) 資料集的日益普及,已促進人工智慧 (AI) 驅動的疾病進程建模,讓預測個別患者的未來醫學掃描成為可能。然而,儘管 AI 有顯著進展,目前的技術仍面臨挑戰,包括實現患者特定的個別化、確保時空一致性、有效利用縱向資料,以及管理 3D 掃描的大量記憶體需求。為了應對這些挑戰,我們提出腦潛在進程 (BrLP),這是一種新穎的時空模型,旨在預測 3D 腦部 MRI 中的個人層級疾病進程。BrLP 的主要貢獻有四個:(i) 它在一個小的潛在空間中運作,減輕了高維度影像資料帶來的計算挑戰;(ii) 它明確整合受試者的元資料,以增強預測的個別化;(iii) 它透過輔助模型納入疾病動態的先驗知識,促進縱向資料的整合;(iv) 它引入了潛在平均穩定化 (LAS) 演算法,該演算法 (a) 在推論時強制預測進程中的時空一致性,(b) 讓我們能夠推導預測的不確定性測量。我們對來自 2,805 名受試者的 11,730 個 T1 加權 (T1w) 腦部 MRI 進行 BrLP 訓練和評估,並在包含來自 962 名受試者的 2,257 個 MRI 的外部測試集上驗證其概括性。我們的實驗將 BrLP 生成的 MRI 掃描與實際追蹤 MRI 進行比較,與現有方法相比,展示了最先進的準確性。程式碼已公開於:https://github.com/LemuelPuglisi/BrLP。 +摘要:視覺問答(VQA)是一項具有挑戰性的問題,需要處理多模態輸入。答案集程式設計(ASP)在這方面顯示出巨大的潛力,可以為模組化 VQA 架構增加可解釋性和說明性。在這項工作中,我們探討如何將 ASP 與視覺和自然語言處理模組整合,以解決一個新的且要求嚴格的 VQA 變體,該變體與圖形影像(而非符號形式的圖形)有關。包含圖形結構的影像是一種普遍且流行的可視化形式。在這裡,我們處理受交通網路啟發的圖形特定問題,並引入一個新的資料集,透過新增類似地鐵路線的圖形影像來修正現有資料集。我們的模組化神經符號方法結合光學圖形辨識進行圖形解析、預先訓練的光學字元辨識神經網路進行標籤解析、大型語言模型(LLM)進行語言處理,以及 ASP 進行推理。此方法作為第一個基準,在資料集上達到 73% 的整體平均準確度。我們的評估進一步證明了模組化神經符號系統的潛力,特別是預先訓練的模型,這些模型不涉及任何進一步的訓練和邏輯程式設計進行推理,以解決複雜的 VQA 任務。 ##### **Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data** 2502.08547v1 by Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. 
Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai @@ -4499,3456 +3776,4108 @@ individual institutions. 摘要:電子健康紀錄的採用擴大了在臨床照護和研究中利用資料驅動演算法的機會。在有效進行多機構電子健康紀錄研究時,一個主要的瓶頸是系統間資料異質性,其中有許多代碼在機構間不存在或表示不同的臨床概念。資料隱私的需求進一步限制了納入多機構患者層級資料的可行性,而這些資料對於研究患者亞群之間的相似性和差異性是必要的。為了應對這些挑戰,我們開發了 GAME 演算法。GAME 已在 7 個機構和 2 種語言中進行測試和驗證,它整合了多個層級的資料:(1) 在機構層級,使用知識圖表來建立代碼和現有知識來源之間的關係,為標準代碼及其彼此之間的關係提供醫療背景;(2) 在機構之間,利用語言模型來確定機構特定代碼與已建立的標準代碼之間的關係;(3) 使用圖形注意網路量化代碼之間關係的強度。使用遷移和聯合學習建立聯合訓練的嵌入,以保護資料隱私。在本研究中,我們展示了 GAME 在選擇相關特徵作為 AI 驅動演算法輸入時的適用性,適用於各種情況,例如心臟衰竭、類風濕性關節炎。然後,我們重點介紹了 GAME 和諧化多機構電子健康紀錄資料在阿茲海默症疾病結果和精神疾病患者自殺風險研究中的應用,而無需在個別機構之外共享患者層級資料。 -##### **EEG Artifact Detection and Correction with Deep Autoencoders** -2502.08686v1 by David Aquilué-Llorens, Aureli Soria-Frisch +##### **Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy** +2502.08353v1 by Ruizhan Xue, Huimin Deng, Fang He, Maojun Wang, Zeyu Zhang -EEG signals convey important information about brain activity both in healthy -and pathological conditions. However, they are inherently noisy, which poses -significant challenges for accurate analysis and interpretation. Traditional -EEG artifact removal methods, while effective, often require extensive expert -intervention. This study presents LSTEEG, a novel LSTM-based autoencoder -designed for the detection and correction of artifacts in EEG signals. -Leveraging deep learning, particularly LSTM layers, LSTEEG captures non-linear -dependencies in sequential EEG data. LSTEEG demonstrates superior performance -in both artifact detection and correction tasks compared to other -state-of-the-art convolutional autoencoders. Our methodology enhances the -interpretability and utility of the autoencoder's latent space, enabling -data-driven automated artefact removal in EEG its application in downstream -tasks. 
 This research advances the field of efficient and accurate multi-channel -EEG preprocessing, and promotes the implementation and usage of automated EEG -analysis pipelines for brain health applications. +With the extensive application of Graph Neural Networks (GNNs) across various +domains, their trustworthiness has emerged as a focal point of research. Some +existing studies have shown that the integration of large language models +(LLMs) can improve the semantic understanding and generation capabilities of +GNNs, which in turn improves the trustworthiness of GNNs from various aspects. +Our review introduces a taxonomy that offers researchers a clear framework for +comprehending the principles and applications of different methods and helps +clarify the connections and differences among various approaches. Then we +systematically survey representative approaches along the four categories of +our taxonomy. Through our taxonomy, researchers can understand the applicable +scenarios, potential advantages, and limitations of each approach for the +trusted integration of GNNs with LLMs. Finally, we present some promising +directions of work and future trends for the integration of LLMs and GNNs to +improve model trustworthiness. + +摘要:隨著圖神經網路 (GNN) 在各種領域的廣泛應用,其可信度已成為研究的焦點。一些現有研究表明,整合大型語言模型 (LLM) 可以提升 GNN 的語意理解和生成能力,進而從各方面提升 GNN 的可信度。我們的評論介紹了一種分類法,為研究人員提供了一個清晰的架構,用於理解不同方法的原理和應用,並有助於釐清各種方法之間的關聯和差異。然後,我們系統性地針對分類法的四個類別進行代表性方法的調查。研究人員透過我們的分類法,可以了解每種方法在 GNN 與 LLM 的可信整合中適用的場景、潛在優點和限制。最後,我們提出 LLM 與 GNN 整合的一些有前景的工作方向和未來趨勢,以提升模型的可信度。 + +##### **Graph Foundation Models for Recommendation: A Comprehensive Survey** +2502.08346v3 by Bin Wu, Yihang Wang, Yuanhao Zeng, Jiawei Liu, Jiashu Zhao, Cheng Yang, Yawen Li, Long Xia, Dawei Yin, Chuan Shi + +Recommender systems (RS) serve as a fundamental tool for navigating the vast +expanse of online information, with deep learning advancements playing an +increasingly important role in improving ranking accuracy. 
Among these, graph +neural networks (GNNs) excel at extracting higher-order structural information, +while large language models (LLMs) are designed to process and comprehend +natural language, making both approaches highly effective and widely adopted. +Recent research has focused on graph foundation models (GFMs), which integrate +the strengths of GNNs and LLMs to model complex RS problems more efficiently by +leveraging the graph-based structure of user-item relationships alongside +textual understanding. In this survey, we provide a comprehensive overview of +GFM-based RS technologies by introducing a clear taxonomy of current +approaches, diving into methodological details, and highlighting key challenges +and future directions. By synthesizing recent advancements, we aim to offer +valuable insights into the evolving landscape of GFM-based recommender systems. + +摘要:推薦系統 (RS) 是用於導航廣闊的線上資訊的基本工具,深度學習的進步在提升排名準確度方面扮演著日益重要的角色。其中,圖形神經網路 (GNN) 擅長萃取高階結構資訊,而大型語言模型 (LLM) 則設計用於處理和理解自然語言,這使得這兩種方法都非常有效且廣泛採用。最近的研究專注於圖形基礎模型 (GFM),它整合了 GNN 和 LLM 的優點,透過利用使用者與項目關係的圖形化結構以及文字理解,更有效率地建構複雜的 RS 問題模型。在這項調查中,我們透過介紹當前方法的明確分類、深入探討方法論細節,以及強調關鍵挑戰和未來方向,提供了 GFM 為基礎的 RS 技術的全面概觀。透過綜合最近的進展,我們旨在提供對 GFM 為基礎的推薦系統不斷演變的版圖的寶貴見解。 + +##### **Self-Evaluation for Job-Shop Scheduling** +2502.08684v1 by Imanol Echeverria, Maialen Murua, Roberto Santana + +Combinatorial optimization problems, such as scheduling and route planning, +are crucial in various industries but are computationally intractable due to +their NP-hard nature. Neural Combinatorial Optimization methods leverage +machine learning to address these challenges but often depend on sequential +decision-making, which is prone to error accumulation as small mistakes +propagate throughout the process. Inspired by self-evaluation techniques in +Large Language Models, we propose a novel framework that generates and +evaluates subsets of assignments, moving beyond traditional stepwise +approaches. 
Applied to the Job-Shop Scheduling Problem, our method integrates a +heterogeneous graph neural network with a Transformer to build a policy model +and a self-evaluation function. Experimental validation on challenging, +well-known benchmarks demonstrates the effectiveness of our approach, +surpassing state-of-the-art methods. + +摘要:組合優化問題,例如排程和路線規劃,在各行各業中至關重要,但由於它們的 NP 難度,在計算上難以處理。神經組合優化方法利用機器學習來解決這些挑戰,但通常依賴於序貫決策制定,而序貫決策制定容易發生錯誤累積,因為小錯誤會在整個過程中傳播。受大型語言模型中的自我評估技術啟發,我們提出了一個新的框架,可生成和評估作業子集,超越傳統的分步方法。應用於工作車間排程問題,我們的方法將異質圖神經網路與 Transformer 整合在一起,以建立策略模型和自我評估函數。在具有挑戰性的著名基準上的實驗驗證證明了我們方法的有效性,超越了最先進的方法。 + +##### **Improving Existing Optimization Algorithms with LLMs** +2502.08298v1 by Camilo Chacón Sartori, Christian Blum + +The integration of Large Language Models (LLMs) into optimization has created +a powerful synergy, opening exciting research opportunities. This paper +investigates how LLMs can enhance existing optimization algorithms. Using their +pre-trained knowledge, we demonstrate their ability to propose innovative +heuristic variations and implementation strategies. To evaluate this, we +applied a non-trivial optimization algorithm, Construct, Merge, Solve and Adapt +(CMSA) -- a hybrid metaheuristic for combinatorial optimization problems that +incorporates a heuristic in the solution construction phase. Our results show +that an alternative heuristic proposed by GPT-4o outperforms the +expert-designed heuristic of CMSA, with the performance gap widening on larger +and denser graphs. 
Project URL: https://imp-opt-algo-llms.surge.sh/ + +摘要:大型语言模型 (LLM) 与优化相结合,创造了一种强大的协同作用,开启了令人兴奋的研究机会。本文探讨了 LLM 如何增强现有的优化算法。利用其预先训练的知识,我们展示了它们提出创新启发式变体和实施策略的能力。为了评估这一点,我们应用了一种非平凡的优化算法,构建、合并、求解和适应 (CMSA)——一种用于组合优化问题的混合元启发式算法,它在求解构建阶段纳入了启发式算法。我们的结果表明,GPT-4o 提出的替代启发式算法优于 CMSA 的专家设计的启发式算法,并且随着图形变得更大、更密集,性能差距也在扩大。项目网址:https://imp-opt-algo-llms.surge.sh/ + +##### **LLM4GNAS: A Large Language Model Based Toolkit for Graph Neural Architecture Search** +2502.10459v1 by Yang Gao, Hong Yang, Yizhi Chen, Junxian Wu, Peng Zhang, Haishuai Wang + +Graph Neural Architecture Search (GNAS) facilitates the automatic design of +Graph Neural Networks (GNNs) tailored to specific downstream graph learning +tasks. However, existing GNAS approaches often require manual adaptation to new +graph search spaces, necessitating substantial code optimization and +domain-specific knowledge. To address this challenge, we present LLM4GNAS, a +toolkit for GNAS that leverages the generative capabilities of Large Language +Models (LLMs). LLM4GNAS includes an algorithm library for graph neural +architecture search algorithms based on LLMs, enabling the adaptation of GNAS +methods to new search spaces through the modification of LLM prompts. This +approach reduces the need for manual intervention in algorithm adaptation and +code modification. The LLM4GNAS toolkit is extensible and robust, incorporating +LLM-enhanced graph feature engineering, LLM-enhanced graph neural architecture +search, and LLM-enhanced hyperparameter optimization. Experimental results +indicate that LLM4GNAS outperforms existing GNAS methods on tasks involving +both homogeneous and heterogeneous graphs. 
+ +摘要:圖形神經架構搜尋 (GNAS) 促進圖形神經網路 (GNN) 的自動設計,以符合特定下游圖形學習任務。然而,現有的 GNAS 方法通常需要手動調整至新的圖形搜尋空間,這需要大量的程式碼最佳化和領域特定知識。為了應對這項挑戰,我們提出 LLM4GNAS,一個利用大型語言模型 (LLM) 的生成能力的 GNAS 工具包。LLM4GNAS 包含一個基於 LLM 的圖形神經架構搜尋演算法函式庫,讓 GNAS 方法能夠透過修改 LLM 提示來適應新的搜尋空間。這種方法減少了演算法適應和程式碼修改中手動介入的需要。LLM4GNAS 工具包具有可擴充性和穩健性,整合了 LLM 增強的圖形特徵工程、LLM 增強的圖形神經架構搜尋和 LLM 增強的超參數最佳化。實驗結果表明,LLM4GNAS 在涉及同質和異質圖形的任務上優於現有的 GNAS 方法。 + +##### **ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning** +2502.08148v1 by Vy Vo, Lizhen Qu, Tao Feng, Yuncheng Hua, Xiaoxi Kang, Songhai Fan, Tim Dwyer, Lay-Ki Soon, Gholamreza Haffari + +Identifying cause-and-effect relationships is critical to understanding +real-world dynamics and ultimately causal reasoning. Existing methods for +identifying event causality in NLP, including those based on Large Language +Models (LLMs), exhibit difficulties in out-of-distribution settings due to the +limited scale and heavy reliance on lexical cues within available benchmarks. +Modern benchmarks, inspired by probabilistic causal inference, have attempted +to construct causal graphs of events as a robust representation of causal +knowledge, where \texttt{CRAB} \citep{romanou2023crab} is one such recent +benchmark along this line. In this paper, we introduce \texttt{ACCESS}, a +benchmark designed for discovery and reasoning over abstract causal events. +Unlike existing resources, \texttt{ACCESS} focuses on causality of everyday +life events on the abstraction level. We propose a pipeline for identifying +abstractions for event generalizations from \texttt{GLUCOSE} +\citep{mostafazadeh-etal-2020-glucose}, a large-scale dataset of implicit +commonsense causal knowledge, from which we subsequently extract $1,4$K causal +pairs. Our experiments highlight the ongoing challenges of using statistical +methods and/or LLMs for automatic abstraction identification and causal +discovery in NLP. 
Nonetheless, we demonstrate that the abstract causal +knowledge provided in \texttt{ACCESS} can be leveraged for enhancing QA +reasoning performance in LLMs. + +摘要:找出因果關係對於理解現實世界的動態和最終的因果推理至關重要。現有的 NLP 事件因果關係識別方法,包括基於大型語言模型 (LLM) 的方法,由於規模有限且過度依賴於可用基準中的詞彙線索,在分佈外環境中表現出困難。受機率因果推論啟發的現代基準已嘗試建構事件的因果圖,作為因果知識的強健表示,其中 \texttt{CRAB} \citep{romanou2023crab} 是這條路徑上最近的一個基準。在本文中,我們介紹 \texttt{ACCESS},一個專門設計來探索和推理抽象因果事件的基準。與現有資源不同,\texttt{ACCESS} 專注於抽象層面上日常生活事件的因果關係。我們提出一個管道,用於從 \texttt{GLUCOSE} \citep{mostafazadeh-etal-2020-glucose} 找出事件概括的抽象,\texttt{GLUCOSE} 是隱含常識因果知識的大規模資料集,我們隨後從中萃取出 1,4K 因果對。我們的實驗突顯出使用統計方法和/或 LLM 進行 NLP 中的自動抽象識別和因果發現的持續挑戰。儘管如此,我們證明了 \texttt{ACCESS} 中提供的抽象因果知識可用於增強 LLM 中的問答推理效能。 + +##### **Neuro-Conceptual Artificial Intelligence: Integrating OPM with Deep Learning to Enhance Question Answering Quality** +2502.09658v1 by Xin Kang, Veronika Shteingardt, Yuhan Wang, Dov Dori + +Knowledge representation and reasoning are critical challenges in Artificial +Intelligence (AI), particularly in integrating neural and symbolic approaches +to achieve explainable and transparent AI systems. Traditional knowledge +representation methods often fall short of capturing complex processes and +state changes. We introduce Neuro-Conceptual Artificial Intelligence (NCAI), a +specialization of the neuro-symbolic AI approach that integrates conceptual +modeling using Object-Process Methodology (OPM) ISO 19450:2024 with deep +learning to enhance question-answering (QA) quality. By converting natural +language text into OPM models using in-context learning, NCAI leverages the +expressive power of OPM to represent complex OPM elements (processes, objects, +and states) beyond what traditional triplet-based knowledge graphs can easily +capture. This rich structured knowledge representation improves reasoning +transparency and answer accuracy in an OPM-QA system. 
We further propose +transparency evaluation metrics to quantitatively measure how faithfully the +predicted reasoning aligns with OPM-based conceptual logic. Our experiments +demonstrate that NCAI outperforms traditional methods, highlighting its +potential for advancing neuro-symbolic AI by providing rich knowledge +representations, measurable transparency, and improved reasoning. + +摘要:知識表徵與推理是人工智慧 (AI) 中的重大挑戰,特別是在整合神經與符號方法以實現可解釋且透明的人工智慧系統時。傳統的知識表徵方法通常無法捕捉複雜的流程和狀態變化。我們引入了神經概念人工智慧 (NCAI),一種神經符號 AI 方法的專門化,它將使用物件流程方法 (OPM) ISO 19450:2024 的概念建模與深度學習整合在一起,以提升問答 (QA) 的品質。透過使用情境學習將自然語言文字轉換為 OPM 模型,NCAI 充分利用 OPM 的表達能力來表徵複雜的 OPM 元素(流程、物件和狀態),超越傳統的三元組知識圖表容易捕捉的範圍。這種豐富的結構化知識表徵改善了 OPM-QA 系統中的推理透明度和答案準確度。我們進一步提出了透明度評估指標,以量化測量預測推理與基於 OPM 的概念邏輯的吻合程度。我們的實驗證明,NCAI 優於傳統方法,突顯了它在透過提供豐富的知識表徵、可測量的透明度和改善的推理來推進神經符號 AI 的潛力。 + +##### **GCoT: Chain-of-Thought Prompt Learning for Graphs** +2502.08092v1 by Xingtong Yu, Chang Zhou, Zhongwei Kuai, Xinming Zhang, Yuan Fang + +Chain-of-thought (CoT) prompting has achieved remarkable success in natural +language processing (NLP). However, its vast potential remains largely +unexplored for graphs. This raises an interesting question: How can we design +CoT prompting for graphs to guide graph models to learn step by step? On one +hand, unlike natural languages, graphs are non-linear and characterized by +complex topological structures. On the other hand, many graphs lack textual +data, making it difficult to formulate language-based CoT prompting. In this +work, we propose the first CoT prompt learning framework for text-free graphs, +GCoT. Specifically, we decompose the adaptation process for each downstream +task into a series of inference steps, with each step consisting of +prompt-based inference, ``thought'' generation, and thought-conditioned prompt +learning. While the steps mimic CoT prompting in NLP, the exact mechanism +differs significantly. 
Specifically, at each step, an input graph, along with a +prompt, is first fed into a pre-trained graph encoder for prompt-based +inference. We then aggregate the hidden layers of the encoder to construct a +``thought'', which captures the working state of each node in the current step. +Conditioned on this thought, we learn a prompt specific to each node based on +the current state. These prompts are fed into the next inference step, +repeating the cycle. To evaluate and analyze the effectiveness of GCoT, we +conduct comprehensive experiments on eight public datasets, which demonstrate +the advantage of our approach. + +摘要:鏈式思考 (CoT) 提示在自然語言處理 (NLP) 中取得了顯著的成功。然而,其龐大的潛力在圖形方面仍未得到充分探索。這提出了一個有趣的問題:我們如何設計圖形的 CoT 提示來指導圖形模型逐步學習?一方面,與自然語言不同,圖形是非線性的,並且具有複雜的拓撲結構。另一方面,許多圖形缺乏文本數據,這使得難以制定基於語言的 CoT 提示。在這項工作中,我們提出了第一個適用於無文本圖形的 CoT 提示學習框架 GCoT。具體來說,我們將每個下游任務的適應過程分解為一系列推理步驟,每個步驟都包含基於提示的推理、「思想」生成以及基於思想的提示學習。雖然這些步驟模擬了 NLP 中的 CoT 提示,但具體機制卻有很大不同。具體來說,在每一步中,一個輸入圖形連同一個提示首先被輸入到一個預訓練的圖形編碼器中進行基於提示的推理。然後,我們聚合編碼器的隱藏層以構建一個「思想」,它捕獲了當前步驟中每個節點的工作狀態。基於這個思想,我們根據當前狀態學習一個特定於每個節點的提示。這些提示被輸入到下一個推理步驟中,重複這個循環。為了評估和分析 GCoT 的有效性,我們對八個公共數據集進行了全面的實驗,這證明了我們方法的優勢。 + +##### **Linking Cryptoasset Attribution Tags to Knowledge Graph Entities: An LLM-based Approach** +2502.10453v1 by Régnier Avice, Bernhard Haslhofer, Zhidong Li, Jianlong Zhou + +Attribution tags form the foundation of modern cryptoasset forensics. +However, inconsistent or incorrect tags can mislead investigations and even +result in false accusations. To address this issue, we propose a novel +computational method based on Large Language Models (LLMs) to link attribution +tags with well-defined knowledge graph concepts. We implemented this method in +an end-to-end pipeline and conducted experiments showing that our approach +outperforms baseline methods by up to 37.4% in F1-score across three publicly +available attribution tag datasets. 
By integrating concept filtering and +blocking procedures, we generate candidate sets containing five knowledge graph +entities, achieving a recall of 93% without the need for labeled data. +Additionally, we demonstrate that local LLM models can achieve F1-scores of +90%, comparable to remote models which achieve 94%. We also analyze the +cost-performance trade-offs of various LLMs and prompt templates, showing that +selecting the most cost-effective configuration can reduce costs by 90%, with +only a 1% decrease in performance. Our method not only enhances attribution tag +quality but also serves as a blueprint for fostering more reliable forensic +evidence. -摘要:腦電圖訊號傳達了關於大腦活動的重要資訊,無論是在健康或病理狀況下。然而,它們本質上是有雜訊的,這對準確的分析和解釋構成了重大的挑戰。傳統的腦電圖人工製品移除方法雖然有效,但通常需要大量的專家介入。本研究提出 LSTEEG,一種新穎的基於 LSTM 的自動編碼器,用於偵測和校正腦電圖訊號中的人工製品。利用深度學習,特別是 LSTM 層,LSTEEG 捕捉序列腦電圖資料中的非線性依賴性。與其他最先進的卷積自動編碼器相比,LSTEEG 在人工製品偵測和校正任務中都展現出優異的效能。我們的做法增強了自動編碼器潛在空間的可解釋性和實用性,讓資料驅動的自動人工製品移除得以應用於腦電圖的下游任務。這項研究推動了高效且準確的多通道腦電圖前處理領域,並促進了自動腦電圖分析管線在腦部健康應用中的實作和使用。 +摘要:歸因標籤構成現代加密資產鑑識的基礎。 +然而,不一致或不正確的標籤會誤導調查,甚至導致錯誤的指控。為了解決這個問題,我們提出了一種基於大型語言模型 (LLM) 的新型計算方法,將歸因標籤與定義明確的知識圖譜概念連結起來。我們在端到端管道中實施了這種方法,並進行了實驗,結果顯示我們的做法在三個公開可用的歸因標籤資料集中,F1 分數比基線方法高出 37.4%。透過整合概念過濾和封鎖程序,我們生成了包含五個知識圖譜實體的候選集,在不需要標籤資料的情況下,達到了 93% 的召回率。 +此外,我們證明了本機 LLM 模型可以達到 90% 的 F1 分數,與達到 94% 的遠端模型相當。我們也分析了各種 LLM 和提示範本的成本效益權衡,結果顯示選擇最具成本效益的設定可以將成本降低 90%,而效能只下降 1%。我們的做法不僅提升了歸因標籤的品質,也作為促進更可靠鑑識證據的藍圖。 -##### **SycEval: Evaluating LLM Sycophancy** -2502.08177v1 by Aaron Fanous, Jacob Goldberg, Ank A. Agarwal, Joanna Lin, Anson Zhou, Roxana Daneshjou, Sanmi Koyejo +##### **Deep Semantic Graph Learning via LLM based Node Enhancement** +2502.07982v1 by Chuanqi Shi, Yiyi Tao, Hang Zhang, Lun Wang, Shaoshuai Du, Yixian Shen, Yanxin Shen -Large language models (LLMs) are increasingly applied in educational, -clinical, and professional settings, but their tendency for sycophancy -- -prioritizing user agreement over independent reasoning -- poses risks to -reliability. 
This study introduces a framework to evaluate sycophantic behavior -in ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro across AMPS (mathematics) and -MedQuad (medical advice) datasets. Sycophantic behavior was observed in 58.19% -of cases, with Gemini exhibiting the highest rate (62.47%) and ChatGPT the -lowest (56.71%). Progressive sycophancy, leading to correct answers, occurred -in 43.52% of cases, while regressive sycophancy, leading to incorrect answers, -was observed in 14.66%. Preemptive rebuttals demonstrated significantly higher -sycophancy rates than in-context rebuttals (61.75% vs. 56.52%, $Z=5.87$, -$p<0.001$), particularly in computational tasks, where regressive sycophancy -increased significantly (preemptive: 8.13%, in-context: 3.54%, $p<0.001$). -Simple rebuttals maximized progressive sycophancy ($Z=6.59$, $p<0.001$), while -citation-based rebuttals exhibited the highest regressive rates ($Z=6.59$, -$p<0.001$). Sycophantic behavior showed high persistence (78.5%, 95% CI: -[77.2%, 79.8%]) regardless of context or model. These findings emphasize the -risks and opportunities of deploying LLMs in structured and dynamic domains, -offering insights into prompt programming and model optimization for safer AI -applications. +Graph learning has attracted significant attention due to its widespread +real-world applications. Current mainstream approaches rely on text node +features and obtain initial node embeddings through shallow embedding learning +using GNNs, which shows limitations in capturing deep textual semantics. Recent +advances in Large Language Models (LLMs) have demonstrated superior +capabilities in understanding text semantics, transforming traditional text +feature processing. This paper proposes a novel framework that combines Graph +Transformer architecture with LLM-enhanced node features. 
Specifically, we +leverage LLMs to generate rich semantic representations of text nodes, which +are then processed by a multi-head self-attention mechanism in the Graph +Transformer to capture both local and global graph structural information. Our +model utilizes the Transformer's attention mechanism to dynamically aggregate +neighborhood information while preserving the semantic richness provided by LLM +embeddings. Experimental results demonstrate that the LLM-enhanced node +features significantly improve the performance of graph learning models on node +classification tasks. This approach shows promising results across multiple +graph learning tasks, offering a practical direction for combining graph +networks with language models. -摘要:大型語言模型(LLM)日益應用於教育、臨床和專業領域,但它們趨於趨炎附勢——優先考慮用戶同意而非獨立推理——對可靠性構成風險。本研究引入了一個框架來評估 ChatGPT-4o、Claude-Sonnet 和 Gemini-1.5-Pro 中的趨炎附勢行為,涉及 AMPS(數學)和 MedQuad(醫療建議)數據集。在 58.19% 的案例中觀察到了趨炎附勢行為,其中 Gemini 表現出最高比率(62.47%),而 ChatGPT 最低(56.71%)。導致正確答案的漸進式趨炎附勢發生在 43.52% 的案例中,而導致不正確答案的退步式趨炎附勢則在 14.66% 的案例中被觀察到。先發制人的反駁表現出顯著高於上下文反駁的趨炎附勢率(61.75% 對 56.52%,Z=5.87,p<0.001),特別是在計算任務中,其中退步式趨炎附勢顯著增加(先發制人:8.13%,上下文:3.54%,p<0.001)。簡單的反駁最大化了漸進式趨炎附勢(Z=6.59,p<0.001),而基於引用的反駁表現出最高的退步式比率(Z=6.59,p<0.001)。趨炎附勢行為表現出很高的持續性(78.5%,95% CI:[77.2%,79.8%]),無論上下文或模型如何。這些發現強調了在結構化和動態領域部署 LLM 的風險和機遇,為更安全的 AI 應用提供了提示編程和模型優化的見解。 +摘要:圖形學習因其廣泛的現實世界應用而備受關注。目前的熱門方法依賴於文本節點特徵,並通過使用 GNN 的淺層嵌入學習來獲取初始節點嵌入,這在捕捉深度文本語義方面表現出局限性。大語言模型 (LLM) 的最新進展已證明在理解文本語義方面具有優越的能力,轉換了傳統的文本特徵處理。本文提出了一種新的框架,將圖形轉換器架構與 LLM 增強的節點特徵相結合。具體來說,我們利用 LLM 來生成文本節點的豐富語義表示,然後在圖形轉換器中由多頭自我注意機制處理,以捕捉局部和全局圖形結構信息。我們的模型利用 Transformer 的注意機制來動態聚合鄰域信息,同時保留 LLM 嵌入提供的語義豐富性。實驗結果表明,LLM 增強的節點特徵顯著提高了圖形學習模型在節點分類任務上的性能。這種方法在多個圖形學習任務中顯示出有希望的結果,為將圖形網絡與語言模型相結合提供了實用的方向。 -##### **Cancer Vaccine Adjuvant Name Recognition from Biomedical Literature using Large Language Models** -2502.09659v1 by Hasin Rehana, Jie Zheng, Leo Yeh, Benu Bansal, Nur Bengisu Çam, Christianah Jemiyo, Brett McGregor, Arzucan Özgür, Yongqun He, Junguk Hur +##### 
**Cardiverse: Harnessing LLMs for Novel Card Game Prototyping** +2502.07128v1 by Danrui Li, Sen Zhang, Sam S. Sohn, Kaidong Hu, Muhammad Usman, Mubbasir Kapadia -Motivation: An adjuvant is a chemical incorporated into vaccines that -enhances their efficacy by improving the immune response. Identifying adjuvant -names from cancer vaccine studies is essential for furthering research and -enhancing immunotherapies. However, the manual curation from the constantly -expanding biomedical literature poses significant challenges. This study -explores the automated recognition of vaccine adjuvant names using Large -Language Models (LLMs), specifically Generative Pretrained Transformers (GPT) -and Large Language Model Meta AI (Llama). Methods: We utilized two datasets: 97 -clinical trial records from AdjuvareDB and 290 abstracts annotated with the -Vaccine Adjuvant Compendium (VAC). GPT-4o and Llama 3.2 were employed in -zero-shot and few-shot learning paradigms with up to four examples per prompt. -Prompts explicitly targeted adjuvant names, testing the impact of contextual -information such as substances or interventions. Outputs underwent automated -and manual validation for accuracy and consistency. Results: GPT-4o attained -100% Precision across all situations while exhibiting notable improve in Recall -and F1-scores, particularly with incorporating interventions. On the VAC -dataset, GPT-4o achieved a maximum F1-score of 77.32% with interventions, -surpassing Llama-3.2-3B by approximately 2%. On the AdjuvareDB dataset, GPT-4o -reached an F1-score of 81.67% for three-shot prompting with interventions, -surpassing Llama-3.2-3 B's maximum F1-score of 65.62%. Conclusion: Our findings -demonstrate that LLMs excel at identifying adjuvant names, including rare -variations of naming representation. This study emphasizes the capability of -LLMs to enhance cancer vaccine development by efficiently extracting insights. 
-Future work aims to broaden the framework to encompass various biomedical -literature and enhance model generalizability across various vaccines and -adjuvants. +The prototyping of computer games, particularly card games, requires +extensive human effort in creative ideation and gameplay evaluation. Recent +advances in Large Language Models (LLMs) offer opportunities to automate and +streamline these processes. However, it remains challenging for LLMs to design +novel game mechanics beyond existing databases, generate consistent gameplay +environments, and develop scalable gameplay AI for large-scale evaluations. +This paper addresses these challenges by introducing a comprehensive automated +card game prototyping framework. The approach highlights a graph-based indexing +method for generating novel game designs, an LLM-driven system for consistent +game code generation validated by gameplay records, and a gameplay AI +constructing method that uses an ensemble of LLM-generated action-value +functions optimized through self-play. These contributions aim to accelerate +card game prototyping, reduce human labor, and lower barriers to entry for game +developers. 
-摘要:動機:佐劑是一種加入疫苗的化學物質,能藉由改善免疫反應來提升疫苗的效力。從癌症疫苗研究中找出佐劑名稱對於推進研究和改善免疫療法至關重要。然而,從不斷擴展的生物醫學文獻中手動整理會造成重大挑戰。本研究探討使用大型語言模型 (LLM),特別是生成式預訓練Transformer (GPT) 和大型語言模型 Meta AI (Llama) 來自動辨識疫苗佐劑名稱。方法:我們使用兩個資料集:來自 AdjuvareDB 的 97 份臨床試驗記錄和 290 篇標註了疫苗佐劑彙編 (VAC) 的摘要。GPT-4o 和 Llama 3.2 被用於零次學習和少量學習範例,每個提示最多有四個範例。提示明確鎖定佐劑名稱,測試物質或介入措施等背景資訊的影響。輸出經過自動和手動驗證,以確保準確性和一致性。結果:GPT-4o 在所有情況下都達到 100% 的準確率,同時在召回率和 F1 分數上表現出顯著的進步,特別是在納入介入措施的情況下。在 VAC 資料集上,GPT-4o 在有介入措施的情況下達到 77.32% 的最高 F1 分數,比 Llama-3.2-3B 高出約 2%。在 AdjuvareDB 資料集上,GPT-4o 在有介入措施的三次提示中達到 81.67% 的 F1 分數,超過 Llama-3.2-3 B 的最高 F1 分數 65.62%。結論:我們的研究結果表明,LLM 在辨識佐劑名稱方面表現出色,包括命名表示的罕見變異。本研究強調了 LLM 在有效提取見解方面增強癌症疫苗開發的能力。未來的研究工作旨在擴大架構,涵蓋各種生物醫學文獻,並增強模型在各種疫苗和佐劑中的泛化能力。 +摘要:電腦遊戲,尤其是卡牌遊戲的原型製作,需要大量的人力在創意構思和遊戲玩法評估上。大型語言模型 (LLM) 的最新進展提供了自動化和簡化這些流程的機會。然而,LLM 在設計超越現有資料庫的新穎遊戲機制、生成一致的遊戲環境,以及開發用於大規模評估的可擴充遊戲 AI 方面仍然面臨挑戰。本文通過引入一個全面的自動化卡牌遊戲原型製作框架來應對這些挑戰。該方法強調了一種基於圖表的索引方法,用於生成新穎的遊戲設計,一個由 LLM 驅動的系統,用於一致的遊戲程式碼生成,並由遊戲記錄驗證,以及一個遊戲 AI 構建方法,該方法使用由 LLM 生成的動作值函數的集合,通過自我對弈進行最佳化。這些貢獻旨在加速卡牌遊戲原型製作,減少人力,並降低遊戲開發人員的進入門檻。 -##### **Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?** -2502.07963v1 by Hye Sun Yun, Karen Y. C. Zhang, Ramez Kouzy, Iain J. Marshall, Junyi Jessy Li, Byron C. Wallace +##### **GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural Processing Units** +2502.06921v2 by Arghadip Das, Shamik Kundu, Arnab Raha, Soumendu Ghosh, Deepak Mathaikutty, Vijay Raghunathan -Medical research faces well-documented challenges in translating novel -treatments into clinical practice. Publishing incentives encourage researchers -to present "positive" findings, even when empirical results are equivocal. -Consequently, it is well-documented that authors often spin study results, -especially in article abstracts. Such spin can influence clinician -interpretation of evidence and may affect patient care decisions. 
In this -study, we ask whether the interpretation of trial results offered by Large -Language Models (LLMs) is similarly affected by spin. This is important since -LLMs are increasingly being used to trawl through and synthesize published -medical evidence. We evaluated 22 LLMs and found that they are across the board -more susceptible to spin than humans. They might also propagate spin into their -outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into -plain language summaries that they generate. We also find, however, that LLMs -are generally capable of recognizing spin, and can be prompted in a way to -mitigate spin's impact on LLM outputs. +Graph Neural Networks (GNNs) are vital for learning from graph-structured +data, enabling applications in network analysis, recommendation systems, and +speech analytics. Deploying them on edge devices like client PCs and laptops +enhances real-time processing, privacy, and cloud independence. GNNs aid +Retrieval-Augmented Generation (RAG) for Large Language Models (LLMs) and +enable event-based vision tasks. However, irregular memory access, sparsity, +and dynamic structures cause high latency and energy overhead on +resource-constrained devices. While modern edge processors integrate CPUs, +GPUs, and NPUs, NPUs designed for data-parallel tasks struggle with irregular +GNN computations. We introduce GraNNite, the first hardware-aware framework +optimizing GNN execution on commercial-off-the-shelf (COTS) SOTA DNN +accelerators via a structured three-step methodology: (1) enabling NPU +execution, (2) optimizing performance, and (3) trading accuracy for efficiency +gains. Step 1 employs GraphSplit for workload distribution and StaGr for static +aggregation, while GrAd and NodePad handle dynamic graphs. Step 2 boosts +performance using EffOp for control-heavy tasks and GraSp for sparsity +exploitation. Graph Convolution optimizations PreG, SymG, and CacheG reduce +redundancy and memory transfers. 
Step 3 balances quality versus efficiency, +where QuantGr applies INT8 quantization, and GrAx1, GrAx2, and GrAx3 accelerate +attention, broadcast-add, and SAGE-max aggregation. On Intel Core Ultra AI PCs, +GraNNite achieves 2.6X to 7.6X speedups over default NPU mappings and up to +8.6X energy gains over CPUs and GPUs, delivering 10.8X and 6.7X higher +performance than CPUs and GPUs, respectively, across GNN models. -摘要:醫學研究在將新穎療法轉化為臨床實務上,面臨著有據可查的挑戰。發表誘因鼓勵研究人員呈現「正向」的發現,即使經驗結果模稜兩可。因此,有據可查的是,作者經常扭曲研究結果,特別是在文章摘要中。此類扭曲可能會影響臨床醫師對證據的詮釋,並可能影響病患照護決策。在本研究中,我們探討大型語言模型 (LLM) 提供的試驗結果詮釋是否也受到扭曲影響。由於 LLM 正越來越常被用於爬梳和綜合已發表的醫學證據,因此這點非常重要。我們評估了 22 個 LLM,發現它們普遍比人類更容易受到扭曲影響。它們也可能將扭曲傳播到其輸出中:例如,我們發現 LLM 會將扭曲隱含納入其產生的白話文摘要中。然而,我們也發現 LLM 通常有能力辨認扭曲,而且可以透過提示的方式減輕扭曲對 LLM 輸出的影響。 +摘要:圖形神經網路 (GNN) 對於從圖形結構資料中學習至關重要,能應用於網路分析、推薦系統和語音分析。將其部署在邊緣裝置(例如用戶端電腦和筆電)上可增強即時處理、隱私和雲端獨立性。GNN 協助大型語言模型 (LLM) 的檢索增強生成 (RAG),並支援基於事件的視覺任務。然而,不規則的記憶體存取、稀疏性和動態結構會導致資源受限裝置上的高延遲和能源負擔。儘管現代邊緣處理器整合了 CPU、GPU 和 NPU,但針對資料平行任務所設計的 NPU 難以處理不規則的 GNN 計算。我們引入了 GraNNite,這是第一個硬體感知框架,透過結構化的三步驟方法最佳化商用現成 (COTS) SOTA DNN 加速器上的 GNN 執行:(1) 啟用 NPU 執行,(2) 最佳化效能,以及 (3) 以準確度換取效率提升。步驟 1 使用 GraphSplit 進行工作負載分配,並使用 StaGr 進行靜態聚合,而 GrAd 和 NodePad 則處理動態圖形。步驟 2 使用 EffOp 提升控制密集型任務的效能,並使用 GraSp 進行稀疏性利用。圖形卷積最佳化 PreG、SymG 和 CacheG 減少了冗餘和記憶體傳輸。步驟 3 平衡品質與效率,其中 QuantGr 適用 INT8 量化,而 GrAx1、GrAx2 和 GrAx3 則加速注意力、廣播加法和 SAGE-max 聚合。在 Intel Core Ultra AI PC 上,GraNNite 在預設 NPU 映射上實現了 2.6X 到 7.6X 的加速,在 CPU 和 GPU 上實現了高達 8.6X 的能源增益,在 GNN 模型中分別提供了比 CPU 和 GPU 高出 10.8X 和 6.7X 的效能。 -##### **An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating** -2502.07755v1 by Mohammad Ali Labbaf Khaniki, Sahabeh Saadati, Mohammad Manthouri +##### **Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language** +2502.06634v1 by Zhiqiang Zhong, Simon Sataa-Yu Larsen, Haoyu Guo, Tao Tang, Kuangyu Zhou, Davide Mottin -This paper presents a novel Natural Language Processing (NLP) framework for -enhancing 
medical diagnosis through the integration of advanced techniques in -data augmentation, feature extraction, and classification. The proposed -approach employs back-translation to generate diverse paraphrased datasets, -improving robustness and mitigating overfitting in classification tasks. -Leveraging Decoding-enhanced BERT with Disentangled Attention (DeBERTa) with -Dynamic Contextual Positional Gating (DCPG), the model captures fine-grained -contextual and positional relationships, dynamically adjusting the influence of -positional information based on semantic context to produce high-quality text -embeddings. For classification, an Attention-Based Feedforward Neural Network -(ABFNN) is utilized, effectively focusing on the most relevant features to -improve decision-making accuracy. Applied to the classification of symptoms, -clinical notes, and other medical texts, this architecture demonstrates its -ability to address the complexities of medical data. The combination of data -augmentation, contextual embedding generation, and advanced classification -mechanisms offers a robust and accurate diagnostic tool, with potential -applications in automated medical diagnosis and clinical decision support. This -method demonstrates the effectiveness of the proposed NLP framework for medical -diagnosis, achieving remarkable results with an accuracy of 99.78%, recall of -99.72%, precision of 99.79%, and an F1-score of 99.75%. These metrics not only -underscore the model's robust performance in classifying medical texts with -exceptional precision and reliability but also highlight its superiority over -existing methods, making it a highly promising tool for automated diagnostic -systems. +Recent advancements in AI for biological research focus on integrating +molecular data with natural language to accelerate drug discovery. However, the +scarcity of high-quality annotations limits progress in this area. 
This paper +introduces LA$^3$, a Language-based Automatic Annotation Augmentation framework +that leverages large language models to augment existing datasets, thereby +improving AI training. We demonstrate the effectiveness of LA$^3$ by creating +an enhanced dataset, LaChEBI-20, where we systematically rewrite the +annotations of molecules from an established dataset. These rewritten +annotations preserve essential molecular information while providing more +varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5 +based on a benchmark architecture to learn the mapping between molecular +representations and augmented annotations. + Experimental results on text-based *de novo* molecule generation and molecule +captioning demonstrate that LaMolT5 outperforms state-of-the-art models. +Notably, incorporating LA$^3$ leads to improvements of up to 301% over the +benchmark architecture. Furthermore, we validate the effectiveness of LA$^3$ +in notable applications across *image*, *text* and *graph* tasks, affirming its +versatility and utility. 
-摘要:本文提出了一個創新的自然語言處理 (NLP) 框架,透過整合資料擴充、特徵萃取和分類的進階技術來增強醫療診斷。所提出的方法採用反向翻譯來產生多樣化的同義改寫資料集,提升穩健性並減輕分類任務中的過度擬合。透過利用具有動態脈絡位置閘控 (DCPG) 的解碼增強 BERT 與去糾纏注意力 (DeBERTa),這個模型捕捉細緻的脈絡和位置關係,根據語意脈絡動態調整位置資訊的影響,以產生高品質的文字嵌入。在分類方面,利用基於注意力的前饋神經網路 (ABFNN),有效地關注最相關的特徵,以提高決策準確度。應用於症狀、臨床筆記和其他醫療文本的分類,此架構證明了其處理醫療資料複雜性的能力。資料擴充、脈絡嵌入產生和進階分類機制的結合提供了一個穩健且準確的診斷工具,在自動化醫療診斷和臨床決策支援中具有潛在應用。此方法證明了所提出的 NLP 框架在醫療診斷中的有效性,以 99.78% 的準確度、99.72% 的召回率、99.79% 的精確度和 99.75% 的 F1 分數,取得了顯著的成果。這些指標不僅強調了模型在分類醫療文本時具有卓越的精確度和可靠性,也突顯了它優於現有方法的優越性,使其成為自動化診斷系統中極具前景的工具。 +摘要:人工智慧在生物研究上的最新進展,專注於將分子資料與自然語言整合,以加速藥物發現。然而,高品質註解的稀少限制了此領域的進展。這篇論文介紹了 LA$^3$,一個基於語言的自動註解擴充框架,它利用大型語言模型來擴充現有的資料集,進而改善人工智慧訓練。我們透過建立一個增強的資料集 LaChEBI-20 來展示 LA$^3$ 的有效性,我們系統性地改寫了一個既定資料集中分子的註解。這些改寫的註解保留了重要的分子資訊,同時提供了更多樣化的句子結構和詞彙。使用 LaChEBI-20,我們在基於基準架構上訓練 LaMolT5,以學習分子表示和擴充註解之間的對應。 +在基於文字的 *從頭開始* 分子生成和分子標題上的實驗結果表明,LaMolT5 優於最先進的模型。值得注意的是,納入 LA$^3$ 可讓基準架構的改進幅度高達 301%。此外,我們驗證了 LA$^3$ 在 *影像*、*文字* 和 *圖形* 任務中的有效性,肯定了它的多功能性和實用性。 -##### **Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension** -2502.07752v2 by Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds +##### **KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment** +2502.06472v1 by Yuxing Lu, Jinzhuo Wang -Designing efficient optimizers for large language models (LLMs) with -low-memory requirements and fast convergence is an important and challenging -problem. This paper makes a step towards the systematic design of such -optimizers through the lens of structured Fisher information matrix (FIM) -approximation. We show that many state-of-the-art efficient optimizers can be -viewed as solutions to FIM approximation (under the Frobenius norm) with -specific structural assumptions. 
Building on these insights, we propose two -design recommendations of practical efficient optimizers for LLMs, involving -the careful selection of structural assumptions to balance generality and -efficiency, and enhancing memory efficiency of optimizers with general -structures through a novel low-rank extension framework. We demonstrate how to -use each design approach by deriving new memory-efficient optimizers: Row and -Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation -(Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate the -effectiveness, showing faster and better convergence than existing -memory-efficient baselines and Adam with little memory overhead. Notably, Alice -achieves better than 2x faster convergence over Adam, while RACS delivers -strong performance on the 1B model with SGD-like memory. +Maintaining comprehensive and up-to-date knowledge graphs (KGs) is critical +for modern AI systems, but manual curation struggles to scale with the rapid +growth of scientific literature. This paper presents KARMA, a novel framework +employing multi-agent large language models (LLMs) to automate KG enrichment +through structured analysis of unstructured text. Our approach employs nine +collaborative agents, spanning entity discovery, relation extraction, schema +alignment, and conflict resolution that iteratively parse documents, verify +extracted knowledge, and integrate it into existing graph structures while +adhering to domain-specific schema. Experiments on 1,200 PubMed articles from +three different domains demonstrate the effectiveness of KARMA in knowledge +graph enrichment, with the identification of up to 38,230 new entities while +achieving 83.1\% LLM-verified correctness and reducing conflict edges by 18.6\% +through multi-layer assessments. 
-摘要:設計具有低記憶體需求和快速收斂的大型語言模型 (LLM) 的高效最佳化器是一個重要且具有挑戰性的問題。本文透過結構化 Fisher 資訊矩陣 (FIM) 近似的觀點,朝著系統化設計此類最佳化器邁出了一步。我們證明許多最先進的高效最佳化器可以視為 FIM 近似(在 Frobenius 範數下)的解,並具有特定的結構假設。基於這些見解,我們提出了 LLM 的兩個實用高效最佳化器設計建議,包括仔細選擇結構假設以平衡通用性和效率,以及透過新穎的低秩擴充框架增強一般結構最佳化器的記憶體效率。我們透過推導新的記憶體高效最佳化器來展示如何使用每種設計方法:列和欄縮放 SGD (RACS) 和自適應低維子空間估計 (Alice)。在 LLaMA 預訓練(高達 1B 參數)上的實驗驗證了其有效性,顯示比現有的記憶體高效基準和 Adam 更快且更好的收斂,且記憶體開銷很小。值得注意的是,Alice 的收斂速度比 Adam 快 2 倍以上,而 RACS 則在 1B 模型上提供類似 SGD 的記憶體的強勁效能。 +摘要:維護全面且最新的知識圖譜 (KG) 對現代 AI 系統至關重要,但手動策劃難以隨著科學文獻的快速增長而擴展。本文提出了 KARMA,一個採用多代理大型語言模型 (LLM) 的新框架,透過對非結構化文本的結構化分析來自動化 KG 豐富化。我們的做法採用九個協作代理,涵蓋實體發現、關係提取、架構比對和衝突解決,這些代理會反覆分析文件、驗證提取的知識,並將其整合到現有的圖結構中,同時遵守特定領域的架構。針對來自三個不同領域的 1,200 篇 PubMed 文章進行的實驗證明了 KARMA 在知識圖譜豐富化方面的有效性,識別出多達 38,230 個新實體,同時達到 83.1% 的 LLM 驗證正確性,並透過多層評估將衝突邊緣降低了 18.6%。 -##### **The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation** -2502.07516v2 by Raman Dutt +##### **RoToR: Towards More Reliable Responses for Order-Invariant Inputs** +2502.08662v1 by Soyoung Yoon, Dongha Ahn, Youngwon Lee, Minkyu Jung, HyungJoo Jang, Seung-won Hwang -Generative models, particularly text-to-image (T2I) diffusion models, play a -crucial role in medical image analysis. However, these models are prone to -training data memorization, posing significant risks to patient privacy. -Synthetic chest X-ray generation is one of the most common applications in -medical image analysis with the MIMIC-CXR dataset serving as the primary data -repository for this task. This study presents the first systematic attempt to -identify prompts and text tokens in MIMIC-CXR that contribute the most to -training data memorization. Our analysis reveals two unexpected findings: (1) -prompts containing traces of de-identification procedures (markers introduced -to hide Protected Health Information) are the most memorized, and (2) among all -tokens, de-identification markers contribute the most towards memorization. 
-This highlights a broader issue with the standard anonymization practices and -T2I synthesis with MIMIC-CXR. To exacerbate, existing inference-time -memorization mitigation strategies are ineffective and fail to sufficiently -reduce the model's reliance on memorized text tokens. On this front, we propose -actionable strategies for different stakeholders to enhance privacy and improve -the reliability of generative models in medical imaging. Finally, our results -provide a foundation for future work on developing and benchmarking -memorization mitigation techniques for synthetic chest X-ray generation using -the MIMIC-CXR dataset. The anonymized code is available at -https://anonymous.4open.science/r/diffusion_memorization-8011/ +Mitigating positional bias of language models (LMs) for listwise inputs is a +well-known and important problem (e.g., lost-in-the-middle). While zero-shot +order-invariant LMs have been proposed to solve this issue, their success on +practical listwise problems has been limited. In this work, as a first +contribution, we identify and overcome two limitations to make zero-shot +invariant LMs more practical: (1) training and inference distribution mismatch +arising from modifying positional ID assignments to enforce invariance, and (2) +failure to adapt to a mixture of order-invariant and sensitive inputs in +practical listwise problems. To overcome, we propose (1) RoToR, a zero-shot +invariant LM for genuinely order-invariant inputs with minimal modifications of +positional IDs, and (2) Selective Routing, an adaptive framework that handles +both order-invariant and order-sensitive inputs in listwise tasks. On the Lost +in the middle (LitM), Knowledge Graph Question Answering (KGQA), and MMLU +benchmarks, we show that RoToR with Selective Routing can effectively handle +practical listwise input tasks in a zero-shot manner. 
-摘要:生成模型,尤其是文本到影像 (T2I) 擴散模型在醫學影像分析中扮演著至關重要的角色。然而,這些模型容易訓練資料記憶,對病患隱私構成重大風險。合成胸部 X 光影像生成是醫學影像分析中最常見的應用之一,而 MIMIC-CXR 資料集則作為此任務的主要資料儲存庫。本研究提出了第一個系統化的嘗試,以識別 MIMIC-CXR 中對訓練資料記憶貢獻最大的提示和文字代碼。我們的分析揭示了兩個出乎意料的發現:(1) 包含去識別程序痕跡的提示(用於隱藏受保護健康資訊的標記)是最容易被記憶的,以及 (2) 在所有代碼中,去識別標記對記憶的貢獻最大。這突顯了標準匿名化實務和使用 MIMIC-CXR 進行 T2I 合成的更廣泛問題。更糟的是,現有的推論時間記憶減緩策略無效,無法充分降低模型對記憶文字代碼的依賴。在這個方面,我們針對不同的利害關係人提出可行的策略,以增強隱私和改善生成模型在醫學影像中的可靠性。最後,我們的結果為未來開發和評量使用 MIMIC-CXR 資料集進行合成胸部 X 光影像生成的記憶減緩技術奠定了基礎。已匿名化的程式碼可在 https://anonymous.4open.science/r/diffusion_memorization-8011/ 取得。 +摘要:語言模型 (LM) 的位置偏差緩解對於列表輸入來說是一個廣為人知且重要的問題(例如,迷失在中間)。雖然已經提出零次學習順序不變的 LM 來解決這個問題,但它們在實際列表問題上的成功卻很有限。在這項工作中,作為第一個貢獻,我們找出並克服了兩個限制,讓零次學習不變的 LM 更有實用性:(1) 訓練和推論分布不匹配,這是由於修改位置 ID 分配以強制不變性所造成的,以及 (2) 無法適應實際列表問題中不變和敏感輸入的組合。為了克服這些問題,我們提出 (1) RoToR,一個零次學習不變的 LM,用於真正不變的輸入,並對位置 ID 進行最小的修改,以及 (2) 選擇性路由,一個自適應框架,用於處理列表任務中不變和敏感的輸入。在迷失在中間 (LitM)、知識圖譜問答 (KGQA) 和 MMLU 基準測試中,我們展示了 RoToR 與選擇性路由可以有效地以零次學習的方式處理實際的列表輸入任務。 -##### **KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level** -2502.07288v1 by Ruining Deng, Tianyuan Yao, Yucheng Tang, Junlin Guo, Siqi Lu, Juming Xiong, Lining Yu, Quan Huu Cap, Pengzhou Cai, Libin Lan, Ze Zhao, Adrian Galdran, Amit Kumar, Gunjan Deotale, Dev Kumar Das, Inyoung Paik, Joonho Lee, Geongyu Lee, Yujia Chen, Wangkai Li, Zhaoyang Li, Xuege Hou, Zeyuan Wu, Shengjin Wang, Maximilian Fischer, Lars Kramer, Anghong Du, Le Zhang, Maria Sanchez Sanchez, Helena Sanchez Ulloa, David Ribalta Heredia, Carlos Perez de Arenaza Garcia, Shuoyu Xu, Bingdou He, Xinping Cheng, Tao Wang, Noemie Moreau, Katarzyna Bozek, Shubham Innani, Ujjwal Baid, Kaura Solomon Kefas, Bennett A. 
Landman, Yu Wang, Shilin Zhao, Mengmeng Yin, Haichun Yang, Yuankai Huo +##### **K-ON: Stacking Knowledge On the Head Layer of Large Language Model** +2502.06257v1 by Lingbing Guo, Yichi Zhang, Zhongpu Bo, Zhuo Chen, Mengshu Sun, Zhiqiang Zhang, Wen Zhang, Huajun Chen -Chronic kidney disease (CKD) is a major global health issue, affecting over -10% of the population and causing significant mortality. While kidney biopsy -remains the gold standard for CKD diagnosis and treatment, the lack of -comprehensive benchmarks for kidney pathology segmentation hinders progress in -the field. To address this, we organized the Kidney Pathology Image -Segmentation (KPIs) Challenge, introducing a dataset that incorporates -preclinical rodent models of CKD with over 10,000 annotated glomeruli from 60+ -Periodic Acid Schiff (PAS)-stained whole slide images. The challenge includes -two tasks, patch-level segmentation and whole slide image segmentation and -detection, evaluated using the Dice Similarity Coefficient (DSC) and F1-score. -By encouraging innovative segmentation methods that adapt to diverse CKD models -and tissue conditions, the KPIs Challenge aims to advance kidney pathology -analysis, establish new benchmarks, and enable precise, large-scale -quantification for disease research and diagnosis. +Recent advancements in large language models (LLMs) have significantly +improved various natural language processing (NLP) tasks. Typically, LLMs are +trained to predict the next token, aligning well with many NLP tasks. However, +in knowledge graph (KG) scenarios, entities are the fundamental units and +identifying an entity requires at least several tokens. This leads to a +granularity mismatch between KGs and natural languages. To address this issue, +we propose K-ON, which integrates KG knowledge into the LLM by employing +multiple head layers for next k-step prediction. 
K-ON can not only generate +entity-level results in one step, but also enables contrastive loss against +entities, which is the most powerful tool in KG representation learning. +Experimental results show that K-ON outperforms state-of-the-art methods that +incorporate text and even the other modalities. -摘要:慢性腎臟病 (CKD) 是全球主要的健康問題,影響超過 -10% 的人口,並造成顯著的死亡率。雖然腎臟活檢 -仍然是 CKD 診斷和治療的黃金標準,但缺乏 -腎臟病理學分割的全面基準阻礙了該領域的進展。 -為了解決這個問題,我們組織了腎臟病理影像 -分割 (KPIs) 挑戰,引入了包含超過 10,000 個註解的 -CKD 臨床前嚙齒動物模型的資料集,這些註解來自 60 多個 -週期性酸性雪夫 (PAS) 染色的全幻燈片影像。挑戰包括 -兩個任務,修補層級分割和全幻燈片影像分割和 -偵測,使用 Dice 相似係數 (DSC) 和 F1 分數進行評估。 -通過鼓勵創新的分割方法來適應不同的 CKD 模型 -和組織條件,KPIs 挑戰旨在推進腎臟病理 -分析,建立新的基準,並實現精確、大規模的 -疾病研究和診斷量化。 +摘要:大型語言模型 (LLM) 的最新進展顯著提升了各種自然語言處理 (NLP) 任務。通常,LLM 會接受訓練以預測下一個符號,這與許多 NLP 任務非常吻合。然而,在知識圖譜 (KG) 場景中,實體是基本單位,而識別實體至少需要幾個符號。這導致 KG 和自然語言之間的粒度不匹配。為了解決這個問題,我們提出了 K-ON,它透過採用多個頭部層進行下一個 k 步預測,將 KG 知識整合到 LLM 中。K-ON 不僅可以在一個步驟中產生實體層級的結果,還能針對實體啟用對比損失,這是 KG 表示學習中最有力的工具。實驗結果顯示,K-ON 優於將文字甚至其他方式納入考量的最新方法。 -##### **Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer** -2502.07158v2 by Jiaying Lu, Stephanie R. Brown, Songyuan Liu, Shifan Zhao, Kejun Dong, Del Bold, Michael Fundora, Alaa Aljiffry, Alex Fedorov, Jocelyn Grunwell, Xiao Hu +##### **LegalViz: Legal Text Visualization by Text To Diagram Generation** +2502.06147v2 by Eri Onami, Taiki Miyanishi, Koki Maeda, Shuhei Kurita -Early prediction of pediatric cardiac arrest (CA) is critical for timely -intervention in high-risk intensive care settings. We introduce PedCA-FT, a -novel transformer-based framework that fuses tabular view of EHR with the -derived textual view of EHR to fully unleash the interactions of -high-dimensional risk factors and their dynamics. By employing dedicated -transformer modules for each modality view, PedCA-FT captures complex temporal -and contextual patterns to produce robust CA risk estimates. 
Evaluated on a -curated pediatric cohort from the CHOA-CICU database, our approach outperforms -ten other artificial intelligence models across five key performance metrics -and identifies clinically meaningful risk factors. These findings underscore -the potential of multimodal fusion techniques to enhance early CA detection and -improve patient care. +Legal documents including judgments and court orders require highly +sophisticated legal knowledge for understanding. To disclose expert knowledge +for non-experts, we explore the problem of visualizing legal texts with +easy-to-understand diagrams and propose a novel dataset of LegalViz with 23 +languages and 7,010 cases of legal document and visualization pairs, using the +DOT graph description language of Graphviz. LegalViz provides a simple diagram +from a complicated legal corpus identifying legal entities, transactions, legal +sources, and statements at a glance, that are essential in each judgment. In +addition, we provide new evaluation metrics for the legal diagram visualization +by considering graph structures, textual similarities, and legal contents. We +conducted empirical studies on few-shot and finetuning large language models +for generating legal diagrams and evaluated them with these metrics, including +legal content-based evaluation within 23 languages. Models trained with +LegalViz outperform existing models including GPTs, confirming the +effectiveness of our dataset. 
-摘要:早期預測小兒心臟驟停 (CA) 對於在高風險的重症照護環境中及時介入至關重要。我們引入了 PedCA-FT,一個新穎的基於轉換器的框架,它將 EHR 的表格視圖與 EHR 的派生文本視圖融合在一起,以充分發揮高維風險因素及其動態的交互作用。通過為每個模態視圖採用專用的轉換器模組,PedCA-FT 捕獲複雜的時間和上下文模式,以產生穩健的 CA 風險估計。在 CHOA-CICU 資料庫中策劃的小兒群體中進行評估,我們的做法在五項關鍵績效指標中優於其他十種人工智慧模型,並找出臨床上有意義的風險因素。這些發現強調了多模式融合技術在增強早期 CA 檢測和改善患者照護方面的潛力。 +摘要:法律文件,包括判決和法院命令,需要高度專業的法律知識才能理解。為了向非專家揭露專家知識,我們探討了使用易於理解的圖表將法律文本視覺化的問題,並提出了一個新的 LegalViz 數據集,其中包含 23 種語言和 7,010 個法律文件和視覺化配對,使用 Graphviz 的 DOT 圖形描述語言。LegalViz 從複雜的法律語料庫中提供了一個簡單的圖表,可以一目了然地識別法律實體、交易、法律來源和陳述,這些在每項判決中都是必不可少的。此外,我們通過考慮圖形結構、文本相似性和法律內容,為法律圖表視覺化提供了新的評估指標。我們對少次學習和微調大型語言模型進行了實證研究,以生成法律圖表,並使用這些指標對它們進行了評估,包括在 23 種語言中基於法律內容的評估。使用 LegalViz 訓練的模型優於現有的模型,包括 GPT,證實了我們數據集的有效性。 -##### **Explaining 3D Computed Tomography Classifiers with Counterfactuals** -2502.07156v1 by Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari +##### **Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs** +2502.06075v1 by Han Meng, Renwen Zhang, Ganyi Wang, Yitian Yang, Peinuan Qin, Jungup Lee, Yi-Chieh Lee -Counterfactual explanations in medical imaging are critical for understanding -the predictions made by deep learning models. We extend the Latent Shift -counterfactual generation method from 2D applications to 3D computed tomography -(CT) scans. We address the challenges associated with 3D data, such as limited -training samples and high memory demands, by implementing a slice-based -approach. This method leverages a 2D encoder trained on CT slices, which are -subsequently combined to maintain 3D context. We demonstrate this technique on -two models for clinical phenotype prediction and lung segmentation. Our -approach is both memory-efficient and effective for generating interpretable -counterfactuals in high-resolution 3D medical imaging. +Mental-illness stigma is a persistent social problem, hampering both +treatment-seeking and recovery. 
Accordingly, there is a pressing need to +understand it more clearly, but analyzing the relevant data is highly +labor-intensive. Therefore, we designed a chatbot to engage participants in +conversations; coded those conversations qualitatively with AI assistance; and, +based on those coding results, built causal knowledge graphs to decode stigma. +The results we obtained from 1,002 participants demonstrate that conversation +with our chatbot can elicit rich information about people's attitudes toward +depression, while our AI-assisted coding was strongly consistent with +human-expert coding. Our novel approach combining large language models (LLMs) +and causal knowledge graphs uncovered patterns in individual responses and +illustrated the interrelationships of psychological constructs in the dataset +as a whole. The paper also discusses these findings' implications for HCI +researchers in developing digital interventions, decomposing human +psychological constructs, and fostering inclusive attitudes. -摘要:反事實解釋在醫學影像中對於理解深度學習模型所做的預測至關重要。我們將 Latent Shift 反事實生成方法從 2D 應用程式延伸到 3D 電腦斷層掃描 (CT) 掃描。我們透過實作基於切片的做法,來解決與 3D 資料相關的挑戰,例如受限的訓練樣本和高記憶體需求。此方法利用經過 CT 切片訓練的 2D 編碼器,隨後將這些切片結合起來以維護 3D 背景。我們在兩個用於臨床表型預測和肺部分割的模型上展示此技術。我們的做法對於在高解析度 3D 醫學影像中產生可解釋的反事實,既節省記憶體又有效。 +摘要:精神疾病的污名化是一個持續存在的社會問題,阻礙了尋求治療和康復。因此,迫切需要更清楚地了解它,但分析相關數據非常費力。因此,我們設計了一個聊天機器人,讓參與者參與對話;使用 AI 協助對這些對話進行定性編碼;並根據這些編碼結果,構建因果知識圖譜來破譯污名化。我們從 1,002 名參與者那裡獲得的結果表明,與我們的聊天機器人的對話可以引出人們對憂鬱症的豐富資訊,而我們 AI 輔助的編碼與人類專家編碼非常一致。我們將大型語言模型 (LLM) 和因果知識圖譜相結合的新方法揭示了個別反應中的模式,並說明了資料集中心理建構之間的相互關係。本文還討論了這些發現對 HCI 研究人員在開發數位介入措施、分解人類心理建構和培養包容態度方面的影響。 -##### **Interactive Data Harmonization with LLM Agents** -2502.07132v1 by Aécio Santos, Eduardo H. M. 
Pena, Roque Lopez, Juliana Freire +##### **LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification** +2502.05836v1 by Shubham Kumar Nigam, Tanmay Dubey, Govind Sharma, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya -Data harmonization is an essential task that entails integrating datasets -from diverse sources. Despite years of research in this area, it remains a -time-consuming and challenging task due to schema mismatches, varying -terminologies, and differences in data collection methodologies. This paper -presents the case for agentic data harmonization as a means to both empower -experts to harmonize their data and to streamline the process. We introduce -Harmonia, a system that combines LLM-based reasoning, an interactive user -interface, and a library of data harmonization primitives to automate the -synthesis of data harmonization pipelines. We demonstrate Harmonia in a -clinical data harmonization scenario, where it helps to interactively create -reusable pipelines that map datasets to a standard format. Finally, we discuss -challenges and open problems, and suggest research directions for advancing our -vision. +In this paper, we address the task of semantic segmentation of legal +documents through rhetorical role classification, with a focus on Indian legal +judgments. We introduce LegalSeg, the largest annotated dataset for this task, +comprising over 7,000 documents and 1.4 million sentences, labeled with 7 +rhetorical roles. To benchmark performance, we evaluate multiple +state-of-the-art models, including Hierarchical BiLSTM-CRF, +TransformerOverInLegalBERT (ToInLegalBERT), Graph Neural Networks (GNNs), and +Role-Aware Transformers, alongside an exploratory RhetoricLLaMA, an +instruction-tuned large language model. Our results demonstrate that models +incorporating broader context, structural relationships, and sequential +sentence information outperform those relying solely on sentence-level +features. 
Additionally, we conducted experiments using surrounding context and +predicted or actual labels of neighboring sentences to assess their impact on +classification accuracy. Despite these advancements, challenges persist in +distinguishing between closely related roles and addressing class imbalance. +Our work underscores the potential of advanced techniques for improving legal +document understanding and sets a strong foundation for future research in +legal NLP. -摘要:資料調和是一項整合不同來源資料集的重要任務。儘管多年來針對此領域的研究不斷,但由於架構不匹配、術語不同,以及資料收集方法的差異,它仍然是一項耗時且具有挑戰性的任務。本文提出代理資料調和,作為賦能專家調和其資料並簡化流程的方法。我們介紹 Harmonia,一個結合了基於 LLM 的推理、互動式使用者介面和資料調和原語庫的系統,以自動化資料調和管線的合成。我們在臨床資料調和場景中展示了 Harmonia,它有助於互動式建立可重複使用的管線,將資料集對應至標準格式。最後,我們討論挑戰和開放性問題,並建議研究方向以推進我們的願景。 +摘要:在本文中,我們通過修辭角色分類來探討法律文件的語義分段任務,重點關注印度法律判決。我們引入了 LegalSeg,這是此任務中最大的註釋資料集,包含超過 7,000 份文件和 140 萬個句子,並標記了 7 個修辭角色。為了評量效能,我們評估了多個最先進的模型,包括分層 BiLSTM-CRF、TransformerOverInLegalBERT (ToInLegalBERT)、圖神經網路 (GNN) 和角色感知Transformer,以及探索性的 RhetoricLLaMA,一種經過指令調整的大型語言模型。我們的結果表明,結合廣泛背景、結構關係和順序句子資訊的模型,表現優於僅依賴句子層級特徵的模型。此外,我們使用周圍的背景和鄰近句子的預測或實際標籤進行實驗,以評估它們對分類精度的影響。儘管有這些進展,但在區分密切相關的角色和解決類別不平衡方面仍存在挑戰。我們的研究強調了先進技術在改善法律文件理解方面的潛力,並為法律自然語言處理的未來研究奠定了堅實的基礎。 -##### **Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML** -2502.07026v1 by Mohammad Amir Salari, Bahareh Rahmani +##### **LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning** +2502.05453v1 by Hanqing Yang, Jingdi Chen, Marie Siew, Tania Lorido-Botran, Carlee Joe-Wong -Machine learning (ML) is transforming healthcare by enabling predictive -analytics, personalized treatments, and improved patient outcomes. However, -traditional ML workflows require specialized skills, infrastructure, and -resources, limiting accessibility for many healthcare professionals. This paper -explores how Google Cloud's BigQuery ML simplifies the development and -deployment of ML models using SQL, reducing technical barriers. 
Through a case -study on diabetes prediction using the Diabetes Health Indicators Dataset, we -evaluate three predictive models: Logistic Regression, Boosted Tree, and Deep -Neural Network (DNN). Our results demonstrate that the Boosted Tree model -achieves the highest performance, making it highly effective for diabetes -prediction. This study highlights BigQuery ML's role in democratizing machine -learning by providing a scalable, efficient, and accessible solution for -healthcare analytics. +Developing intelligent agents for long-term cooperation in dynamic open-world +scenarios is a major challenge in multi-agent systems. Traditional Multi-agent +Reinforcement Learning (MARL) frameworks like centralized training +decentralized execution (CTDE) struggle with scalability and flexibility. They +require centralized long-term planning, which is difficult without custom +reward functions, and face challenges in processing multi-modal data. CTDE +approaches also assume fixed cooperation strategies, making them impractical in +dynamic environments where agents need to adapt and plan independently. To +address decentralized multi-agent cooperation, we propose Decentralized +Adaptive Knowledge Graph Memory and Structured Communication System (DAMCS) in +a novel Multi-agent Crafter environment. Our generative agents, powered by +Large Language Models (LLMs), are more scalable than traditional MARL agents by +leveraging external knowledge and language for long-term planning and +reasoning. Instead of fully sharing information from all past experiences, +DAMCS introduces a multi-modal memory system organized as a hierarchical +knowledge graph and a structured communication protocol to optimize agent +cooperation. This allows agents to reason from past interactions and share +relevant information efficiently. Experiments on novel multi-agent open-world +tasks show that DAMCS outperforms both MARL and LLM baselines in task +efficiency and collaboration. 
Compared to single-agent scenarios, the two-agent +scenario achieves the same goal with 63% fewer steps, and the six-agent +scenario with 74% fewer steps, highlighting the importance of adaptive memory +and structured communication in achieving long-term goals. We publicly release +our project at: https://happyeureka.github.io/damcs. -摘要:機器學習 (ML) 透過啟用預測分析、個人化治療和改善病患結果,正在轉型醫療保健。然而,傳統的 ML 工作流程需要專業技能、基礎設施和資源,限制了許多醫療保健專業人員的可及性。本文探討 Google Cloud 的 BigQuery ML 如何使用 SQL 簡化 ML 模型的開發和部署,降低技術障礙。透過使用糖尿病健康指標資料集對糖尿病預測進行個案研究,我們評估了三個預測模型:邏輯迴歸、提升樹和深度神經網路 (DNN)。我們的結果證明,提升樹模型達到了最高的效能,使其對於糖尿病預測非常有效。這項研究強調了 BigQuery ML 在民主化機器學習中扮演的角色,提供可擴充、有效率且可存取的醫療保健分析解決方案。 +摘要:在動態開放世界情境中開發用於長期合作的智慧代理是多重代理系統中的一項重大挑戰。傳統的多重代理強化學習 (MARL) 框架,例如集中式訓練去中心化執行 (CTDE),在可擴充性和靈活性方面面臨困難。它們需要集中式長期規劃,這在沒有自訂獎勵函數的情況下很難執行,並且在處理多模式數據時會面臨挑戰。CTDE 方法還假設固定的合作策略,這使得它們在代理需要獨立適應和規劃的動態環境中不切實際。為了解決分散式多重代理合作問題,我們在一個新穎的多重代理工匠環境中提出了分散式自適應知識圖譜記憶體和結構化通訊系統 (DAMCS)。我們的生成代理由大型語言模型 (LLM) 提供支援,透過利用外部知識和語言進行長期規劃和推理,比傳統的 MARL 代理更具可擴充性。DAMCS 沒有完全分享來自所有過去經驗的資訊,而是引入了多模式記憶體系統,該系統組織成階層式知識圖譜和結構化通訊協定,以最佳化代理合作。這允許代理根據過去的互動進行推理並有效地分享相關資訊。在新的多重代理開放世界任務上的實驗表明,DAMCS 在任務效率和協作方面優於 MARL 和 LLM 基準。與單一代理情境相比,雙重代理情境以少 63% 的步驟達成相同的目標,而六重代理情境則以少 74% 的步驟達成目標,突顯了自適應記憶體和結構化通訊在達成長期目標中的重要性。我們公開發布我們的專案於:https://happyeureka.github.io/damcs。 -##### **AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements** -2502.07022v1 by Adriana Eufrosiana Bora, Pierre-Luc St-Charles, Mirko Bronzi, Arsène Fansi Tchango, Bruno Rousseau, Kerrie Mengersen +##### **SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation** +2502.05424v1 by Xingtong Yu, Zechuan Gong, Chang Zhou, Yuan Fang, Hui Zhang -Despite over a decade of legislative efforts to address modern slavery in the -supply chains of large corporations, the effectiveness of government oversight -remains hampered by the challenge of scrutinizing thousands of statements -annually. 
While Large Language Models (LLMs) can be considered a well -established solution for the automatic analysis and summarization of documents, -recognizing concrete modern slavery countermeasures taken by companies and -differentiating those from vague claims remains a challenging task. To help -evaluate and fine-tune LLMs for the assessment of corporate statements, we -introduce a dataset composed of 5,731 modern slavery statements taken from the -Australian Modern Slavery Register and annotated at the sentence level. This -paper details the construction steps for the dataset that include the careful -design of annotation specifications, the selection and preprocessing of -statements, and the creation of high-quality annotation subsets for effective -model evaluations. To demonstrate our dataset's utility, we propose a machine -learning methodology for the detection of sentences relevant to mandatory -reporting requirements set by the Australian Modern Slavery Act. We then follow -this methodology to benchmark modern language models under zero-shot and -supervised learning settings. +Graphs are able to model interconnected entities in many online services, +supporting a wide range of applications on the Web. This raises an important +question: How can we train a graph foundational model on multiple source +domains and adapt to an unseen target domain? A major obstacle is that graphs +from different domains often exhibit divergent characteristics. Some studies +leverage large language models to align multiple domains based on textual +descriptions associated with the graphs, limiting their applicability to +text-attributed graphs. For text-free graphs, a few recent works attempt to +align different feature distributions across domains, while generally +neglecting structural differences. In this work, we propose a novel Structure +Alignment framework for text-free Multi-domain Graph Pre-Training and +cross-domain adaptation (SAMGPT). 
It is designed to learn multi-domain +knowledge from graphs originating in multiple source domains, which can then be +adapted to address applications in an unseen target domain. Specifically, we +introduce a set of structure tokens to harmonize structure-based aggregation +across source domains during the pre-training phase. Next, for cross-domain +adaptation, we design dual prompts, namely, holistic prompts and specific +prompts, which adapt unified multi-domain structural knowledge and +fine-grained, domain-specific information, respectively, to a target domain. +Finally, we conduct comprehensive experiments on seven public datasets to +evaluate and analyze the effectiveness of SAMGPT. + +摘要:圖表能夠在許多線上服務中對相互關聯的實體進行建模, +支援網路上廣泛的應用程式。這提出了重要的問題:我們如何針對多個來源網域訓練圖表基礎模型,並適應未見過的目標網域?一個主要的障礙是,來自不同網域的圖表通常表現出不同的特性。一些研究利用大型語言模型,根據與圖表相關的文字描述,對齊多個網域,限制其適用性於有文字屬性的圖表。對於沒有文字的圖表,最近的一些作品嘗試對齊跨網域的不同特徵分佈,同時通常忽略結構上的差異。在這項工作中,我們提出了一個新的結構對齊框架,用於無文字多網域圖表預訓練和跨網域適應 (SAMGPT)。它被設計為從起源於多個來源網域的圖表中學習多網域知識,然後可以適應於未見過的目標網域中的應用程式。具體來說,我們引入了一組結構化代碼,以在預訓練階段,調和跨來源網域的基於結構的聚合。接下來,對於跨網域適應,我們設計了雙重提示,即整體提示和具體提示,分別將統一的多網域結構知識和細緻的、特定於網域的資訊適應到目標網域。最後,我們在七個公共資料集上進行了全面的實驗,以評估和分析 SAMGPT 的有效性。 + +##### **Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints** +2502.05414v1 by Ali Al-Lawati, Jason Lucas, Zhiwei Zhang, Prasenjit Mitra, Suhang Wang + +In-context learning (ICL) effectively conditions large language models (LLMs) +for molecular tasks, such as property prediction and molecule captioning, by +embedding carefully selected demonstration examples into the input prompt. This +approach avoids the computational overhead of extensive pre-training and +fine-tuning. However, current prompt retrieval methods for molecular tasks have +relied on molecule feature similarity, such as Morgan fingerprints, which do +not adequately capture the global molecular and atom-binding relationships. As +a result, these methods fail to represent the full complexity of molecular +structures during inference.
Moreover, small-to-medium-sized LLMs, which offer +simpler deployment requirements in specialized systems, have remained largely +unexplored in the molecular ICL literature. To address these gaps, we propose a +self-supervised learning technique, GAMIC (Graph-Aligned Molecular In-Context +learning), which aligns global molecular structures, represented by graph neural +networks (GNNs), with textual captions (descriptions) while leveraging local +feature similarity through Morgan fingerprints. In addition, we introduce a +Maximum Marginal Relevance (MMR) based diversity heuristic during retrieval to +optimize input prompt demonstration samples. Our experimental findings using +diverse benchmark datasets show GAMIC outperforms simple Morgan-based ICL +retrieval methods across all tasks by up to 45%. -摘要:儘管立法努力超過十年,旨在解決大型企業供應鏈中的現代奴隸制,但政府監督的有效性仍然受到每年審查數千份聲明的挑戰所阻礙。雖然大型語言模型(LLM)可以被認為是文件自動分析和摘要的完善解決方案,但要辨識公司採取的具體現代奴隸制對策,並將其與含糊的聲明區分開來,仍然是一項具有挑戰性的任務。為了幫助評估和微調 LLM 以評估企業聲明,我們引入了一個由 5,731 份現代奴隸制聲明組成的資料集,這些聲明取自澳洲現代奴隸制註冊處,並在句子層級進行註解。本文詳細說明了資料集的建構步驟,其中包括註解規格的仔細設計、聲明的選擇和預處理,以及用於有效模型評估的高品質註解子集的建立。為了展示我們的資料集的效用,我們提出了一種機器學習方法,用於檢測與澳洲現代奴隸制法規定的強制性報告要求相關的句子。然後,我們遵循這種方法,在零次學習和監督學習設定下對現代語言模型進行基準測試。 +摘要:情境學習 (ICL) 有效地調整大型語言模型 (LLM),以執行分子任務,例如屬性預測和分子標題,方法是將仔細挑選的示範範例嵌入輸入提示中。這種方法避免了廣泛的預訓練和微調的計算開銷。然而,目前針對分子任務的提示檢索方法依賴於分子特徵相似性,例如 Morgan 指紋,而無法充分捕捉全局分子和原子鍵結關係。因此,這些方法無法在推理過程中表示分子結構的完整複雜性。此外,在專業系統中提供更簡單部署需求的小到中型的 LLM,在分子 ICL 文獻中仍未得到充分探索。為了解決這些差距,我們提出了一種自我監督學習技術,GAMIC(圖形對齊分子情境學習),它將由圖形神經網路 (GNN) 表示的全局分子結構與文字標題(描述)對齊,同時透過 Morgan 指紋利用局部特徵相似性。此外,我們在檢索過程中引入了一個基於最大邊際相關性 (MMR) 的多樣性啟發法,以最佳化輸入提示示範樣本。我們使用不同的基準資料集進行的實驗結果顯示,GAMIC 在所有任務中都優於基於 Morgan 的簡單 ICL 檢索方法,最多可達 45%。 -##### **Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium** -2502.06693v1 by Amin Adibi, Xu Cao, Zongliang Ji, Jivat Neet Kaur, Winston Chen, Elizabeth Healey, Brighton Nuwagira, Wenqian Ye, Geoffrey Woollard, Maxwell A Xu, Hejie Cui, Johnny Xi, Trenton
Chang, Vasiliki Bikia, Nicole Zhang, Ayush Noori, Yuan Xia, Md. Belal Hossain, Hanna A. Frank, Alina Peluso, Yuan Pu, Shannon Zejiang Shen, John Wu, Adibvafa Fallahpour, Sazan Mahbub, Ross Duncan, Yuwei Zhang, Yurui Cao, Zuheng Xu, Michael Craig, Rahul G. Krishnan, Rahmatollah Beheshti, James M. Rehg, Mohammad Ehsanul Karim, Megan Coffee, Leo Anthony Celi, Jason Alan Fries, Mohsen Sadatsafavi, Dennis Shung, Shannon McWeeney, Jessica Dafflon, Sarah Jabbour +##### **Knowledge Graph-Guided Retrieval Augmented Generation** +2502.06864v1 by Xiangrong Zhu, Yuexiang Xie, Yi Liu, Yaliang Li, Wei Hu -The fourth Machine Learning for Health (ML4H) symposium was held in person on -December 15th and 16th, 2024, in the traditional, ancestral, and unceded -territories of the Musqueam, Squamish, and Tsleil-Waututh Nations in Vancouver, -British Columbia, Canada. The symposium included research roundtable sessions -to foster discussions between participants and senior researchers on timely and -relevant topics for the ML4H community. The organization of the research -roundtables at the conference involved 13 senior and 27 junior chairs across 13 -tables. Each roundtable session included an invited senior chair (with -substantial experience in the field), junior chairs (responsible for -facilitating the discussion), and attendees from diverse backgrounds with an -interest in the session's topic. +Retrieval-augmented generation (RAG) has emerged as a promising technology +for addressing hallucination issues in the responses generated by large +language models (LLMs). Existing studies on RAG primarily focus on applying +semantic-based approaches to retrieve isolated relevant chunks, which ignore +their intrinsic relationships. 
In this paper, we propose a novel Knowledge +Graph-Guided Retrieval Augmented Generation (KG$^2$RAG) framework that utilizes +knowledge graphs (KGs) to provide fact-level relationships between chunks, +improving the diversity and coherence of the retrieved results. Specifically, +after performing a semantic-based retrieval to provide seed chunks, KG$^2$RAG +employs a KG-guided chunk expansion process and a KG-based chunk organization +process to deliver relevant and important knowledge in well-organized +paragraphs. Extensive experiments conducted on the HotpotQA dataset and its +variants demonstrate the advantages of KG$^2$RAG compared to existing RAG-based +approaches, in terms of both response quality and retrieval quality. -摘要:第四屆醫療機器學習 (ML4H) 研討會於 2024 年 12 月 15 日和 16 日在加拿大不列顛哥倫比亞省溫哥華的 Musqueam、Squamish 和 Tsleil-Waututh 國家的傳統、祖先和未割讓領土上舉行。研討會包括研究圓桌會議,以促進參與者和高級研究人員之間關於 ML4H 社群的及時和相關主題的討論。在會議上組織研究圓桌會議涉及 13 張桌子上的 13 位高級主席和 27 位初級主席。每個圓桌會議都包括一位受邀的高級主席(在該領域擁有豐富的經驗)、初級主席(負責促進討論)以及對會議主題感興趣的來自不同背景的與會者。 +摘要:檢索增強生成 (RAG) 已成為一項有前途的技術,用於解決大型語言模型 (LLM) 所產生回應中的幻覺問題。現有關於 RAG 的研究主要專注於應用基於語義的方法來檢索孤立相關的區塊,而忽略它們的內在關係。在本文中,我們提出了一個新穎的知識圖表引導檢索增強生成 (KG$^2$RAG) 框架,它利用知識圖表 (KG) 來提供區塊之間的事實層級關係,從而提高檢索結果的多樣性和一致性。具體來說,在執行基於語義的檢索以提供種子區塊後,KG$^2$RAG 採用 KG 引導的區塊擴充程序和基於 KG 的區塊組織程序,以在組織良好的段落中傳達相關且重要的知識。在 HotpotQA 資料集及其變體上進行的大量實驗證明了 KG$^2$RAG 在回應品質和檢索品質方面優於現有的基於 RAG 的方法。 -##### **Automatic Evaluation of Healthcare LLMs Beyond Question-Answering** -2502.06666v1 by Anna Arias-Duart, Pablo Agustin Martin-Torres, Daniel Hinjos, Pablo Bernabeu-Perez, Lucia Urcelay Ganzabal, Marta Gonzalez Mallo, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Sergio Alvarez-Napagao, Dario Garcia-Gasulla +##### **Can Large Language Models Understand Intermediate Representations?** +2502.06854v1 by Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang, Hongyu Zhang, Ruoming Jin, Qiang Guan -Current Large Language Models (LLMs) benchmarks are often based on open-ended -or close-ended QA evaluations, avoiding the requirement of 
human labor. -Close-ended measurements evaluate the factuality of responses but lack -expressiveness. Open-ended capture the model's capacity to produce discourse -responses but are harder to assess for correctness. These two approaches are -commonly used, either independently or together, though their relationship -remains poorly understood. This work is focused on the healthcare domain, where -both factuality and discourse matter greatly. It introduces a comprehensive, -multi-axis suite for healthcare LLM evaluation, exploring correlations between -open and close benchmarks and metrics. Findings include blind spots and -overlaps in current methodologies. As an updated sanity check, we release a new -medical benchmark --CareQA-- with both open and closed variants. Finally, we -propose a novel metric for open-ended evaluations -- Relaxed Perplexity -- to -mitigate the identified limitations. +Intermediate Representations (IRs) are essential in compiler design and +program analysis, yet their comprehension by Large Language Models (LLMs) +remains underexplored. This paper presents a pioneering empirical study to +investigate the capabilities of LLMs, including GPT-4, GPT-3, Gemma 2, LLaMA +3.1, and Code Llama, in understanding IRs. We analyze their performance across +four tasks: Control Flow Graph (CFG) reconstruction, decompilation, code +summarization, and execution reasoning. Our results indicate that while LLMs +demonstrate competence in parsing IR syntax and recognizing high-level +structures, they struggle with control flow reasoning, execution semantics, and +loop handling. Specifically, they often misinterpret branching instructions, +omit critical IR operations, and rely on heuristic-based reasoning, leading to +errors in CFG reconstruction, IR decompilation, and execution reasoning. 
The +study underscores the necessity for IR-specific enhancements in LLMs, +recommending fine-tuning on structured IR datasets and integration of explicit +control flow models to augment their comprehension and handling of IR-related +tasks. -摘要:當前大型語言模型 (LLM) 基準通常基於開放式或封閉式問答評量,避免了人力需求。封閉式測量評估回應的事實性,但缺乏表達力。開放式測量捕捉模型產生論述回應的能力,但較難評估正確性。這兩種方法通常獨立或合併使用,儘管它們之間的關係仍然知之甚少。這項工作專注於醫療保健領域,在該領域中,事實性和論述都非常重要。它引入了一個全面的多軸套件,用於醫療保健 LLM 評量,探索開放式和封閉式基準和指標之間的關聯性。研究結果包括當前方法中的盲點和重疊。作為更新的健全性檢查,我們發布了一個新的醫療基準--CareQA--,包含開放式和封閉式變體。最後,我們提出了一個用於開放式評量的全新指標--放鬆困惑度--以減輕已識別的限制。 +摘要:中間表徵 (IR) 在編譯器設計和程式分析中至關重要,但大型語言模型 (LLM) 對其理解仍未得到充分探討。本文提出了一項開創性的實證研究,以探討 LLM(包括 GPT-4、GPT-3、Gemma 2、LLaMA 3.1 和 Code Llama)理解 IR 的能力。我們分析了它們在四項任務中的表現:控制流程圖 (CFG) 重建、反編譯、程式碼摘要和執行推理。我們的結果表明,儘管 LLM 在解析 IR 語法和識別高階結構方面表現出能力,但它們在控制流程推理、執行語義和迴圈處理方面存在困難。具體而言,它們經常誤解分支指令、省略關鍵 IR 操作,並依賴於基於啟發式的推理,導致 CFG 重建、IR 反編譯和執行推理出現錯誤。這項研究強調了 LLM 中對 IR 特定的增強的必要性,建議對結構化的 IR 資料集進行微調,並整合明確的控制流程模型,以增強其對 IR 相關任務的理解和處理。 -##### **Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging** -2502.06632v1 by Mohammed Abdul Hafeez Khan, Samuel Morries Boddepalli, Siddhartha Bhattacharyya, Debasis Mitra +##### **GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?** +2502.05252v1 by Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, Beidi Chen -Accurate classification and anatomical localization are essential for -effective medical diagnostics and research, which may be efficiently performed -using deep learning techniques. However, availability of limited labeled data -poses a significant challenge. To address this, we adapted Prototypical -Networks and the Propagation-Reconstruction Network (PRNet) for few-shot -classification and localization, respectively, in Single Photon Emission -Computed Tomography (SPECT) images. For the proof of concept we used a -2D-sliced image cropped around heart. 
The Prototypical Network, with a -pre-trained ResNet-18 backbone, classified ventricles, myocardium, and liver -tissues with 96.67% training and 93.33% validation accuracy. PRNet, adapted for -2D imaging with an encoder-decoder architecture and skip connections, achieved -a training loss of 1.395, accurately reconstructing patches and capturing -spatial relationships. These results highlight the potential of Prototypical -Networks for tissue classification with limited labeled data and PRNet for -anatomical landmark localization, paving the way for improved performance in -deep learning frameworks. +Long-context large language models (LLMs) have recently shown strong +performance in information retrieval and long-document QA. However, to tackle +the most challenging intellectual problems, LLMs must reason effectively in +long and complex contexts (e.g., frontier mathematical research). Studying how +LLMs handle increasing reasoning complexity and context length is essential, +yet existing benchmarks lack a solid basis for quantitative evaluation. +Inspired by the abstraction of GSM-8K problems as computational graphs, and the +ability to introduce noise by adding unnecessary nodes and edges, we develop a +grade school math problem generator capable of producing arithmetic problems +with infinite difficulty and context length under fine-grained control. Using +our newly synthesized GSM-Infinite benchmark, we comprehensively evaluate +existing LLMs. We find a consistent sigmoid decline in reasoning performance as +complexity increases, along with a systematic inference scaling trend: +exponentially increasing inference computation yields only linear performance +gains. These findings underscore the fundamental limitations of current +long-context LLMs and the key challenges in scaling reasoning capabilities. Our +GSM-Infinite benchmark provides a scalable and controllable testbed for +systematically studying and advancing LLM reasoning in long and complex +contexts. 
-摘要:精確的分類和解剖定位對於有效的醫療診斷和研究至關重要,而這可以使用深度學習技術有效執行。然而,標記資料有限的取得會造成重大的挑戰。為了解決這個問題,我們分別調整了原型網路和傳播重建網路 (PRNet),用於單光子發射電腦斷層掃描 (SPECT) 影像中的少量分類和定位。為了證明這個概念,我們使用圍繞心臟裁切的 2D 切片影像。原型網路,使用預先訓練的 ResNet-18 主幹,對心室、心肌和肝臟組織進行分類,訓練準確度為 96.67%,驗證準確度為 93.33%。PRNet,調整為使用編碼器解碼器架構和跳躍連接的 2D 影像,達到了 1.395 的訓練損失,精確地重建了區塊並擷取了空間關係。這些結果突出了原型網路在標記資料有限的情況下進行組織分類的潛力,以及 PRNet 在解剖標誌定位方面的潛力,為深度學習架構中效能的提升鋪平了道路。 +摘要:長文本大型語言模型 (LLM) 最近在資訊檢索和長文件問答中展示了強大的效能。然而,若要解決最具挑戰性的智力問題,LLM 必須在長且複雜的脈絡中有效推理(例如,前沿數學研究)。研究 LLM 如何處理增加的推理複雜性和脈絡長度至關重要,但現有的基準缺乏定量評估的穩固基礎。受到 GSM-8K 問題抽象化為計算圖形的啟發,以及透過加入不必要的節點和邊緣來引入雜訊的能力,我們開發了一個小學數學問題產生器,能夠在細緻的控制下產生具有無限難度和脈絡長度的算術問題。使用我們新合成的 GSM-Infinite 基準,我們全面評估現有的 LLM。我們發現推理效能會隨著複雜性的增加而持續呈 S 形下降,並伴隨著系統性的推論縮放趨勢:指數增加的推論計算僅產生線性的效能增益。這些發現強調了當前長脈絡 LLM 的基本限制,以及擴展推理能力的主要挑戰。我們的 GSM-Infinite 基準提供了一個可擴充且可控的測試平台,用於系統性地研究和提升 LLM 在長且複雜脈絡中的推理能力。 -##### **Illegal Waste Detection in Remote Sensing Images: A Case Study** -2502.06607v2 by Federico Gibellini, Piero Fraternali, Giacomo Boracchi, Luca Morandini, Andrea Diecidue, Simona Malegori +##### **Causality can systematically address the monsters under the bench(marks)** +2502.05085v1 by Felix Leeb, Zhijing Jin, Bernhard Schölkopf -Environmental crime currently represents the third largest criminal activity -worldwide while threatening ecosystems as well as human health. Among the -crimes related to this activity, improper waste management can nowadays be -countered more easily thanks to the increasing availability and decreasing cost -of Very-High-Resolution Remote Sensing images, which enable semi-automatic -territory scanning in search of illegal landfills. This paper proposes a -pipeline, developed in collaboration with professionals from a local -environmental agency, for detecting candidate illegal dumping sites leveraging -a classifier of Remote Sensing images. 
To identify the best configuration for -such classifier, an extensive set of experiments was conducted and the impact -of diverse image characteristics and training settings was thoroughly analyzed. -The local environmental agency was then involved in an experimental exercise -where outputs from the developed classifier were integrated in the experts' -everyday work, resulting in time savings with respect to manual -photo-interpretation. The classifier was eventually run with valuable results -on a location outside of the training area, highlighting potential for -cross-border applicability of the proposed pipeline. +Effective and reliable evaluation is essential for advancing empirical +machine learning. However, the increasing accessibility of generalist models +and the progress towards ever more complex, high-level tasks make systematic +evaluation more challenging. Benchmarks are plagued by various biases, +artifacts, or leakage, while models may behave unreliably due to poorly +explored failure modes. Haphazard treatments and inconsistent formulations of +such "monsters" can contribute to a duplication of efforts, a lack of trust in +results, and unsupported inferences. In this position paper, we argue causality +offers an ideal framework to systematically address these challenges. By making +causal assumptions in an approach explicit, we can faithfully model phenomena, +formulate testable hypotheses with explanatory power, and leverage principled +tools for analysis. To make causal model design more accessible, we identify +several useful Common Abstract Topologies (CATs) in causal graphs which help +gain insight into the reasoning abilities in large language models. Through a +series of case studies, we demonstrate how the precise yet pragmatic language +of causality clarifies the strengths and limitations of a method and inspires +new approaches for systematic progress. 
-摘要:環境犯罪目前是全球第三大犯罪活動,威脅生態系統和人類健康。在與此活動相關的犯罪中,不當廢物管理現在可以更容易地得到解決,這要歸功於超高解析度遙測影像越來越普及且成本下降,這使得半自動領土掃描能夠搜尋非法垃圾掩埋場。本文提出了一條管道,與當地環境機構的專業人士合作開發,用於檢測候選非法傾倒地點,利用遙測影像分類器。為了找出這種分類器的最佳配置,進行了一系列廣泛的實驗,並徹底分析了不同影像特徵和訓練設定的影響。然後,當地環境機構參與了一項實驗練習,其中將已開發分類器的輸出整合到專家的日常工作中,從而節省了人工照片解譯的時間。最後在訓練區域外的某個位置執行分類器,獲得了有價值的結果,突出了所提出管道的跨境適用性潛力。 +摘要:有效的、可靠的評估對於推進經驗機器學習至關重要。然而,一般化模型的可及性日益提高,以及朝著更複雜、更高級別任務的進展,使得系統評估更具挑戰性。基準測試受到各種偏差、人工製品或洩漏的困擾,而模型由於探索不充分的故障模式而可能表現得不可靠。隨意處理和不一致的表述等「怪物」可能會導致重複工作、對結果缺乏信任以及不支援的推論。在本文中,我們論證因果關係提供了一個系統性解決這些挑戰的理想框架。通過在方法中明確因果假設,我們可以忠實地模擬現象,制定具有解釋力的可測試假設,並利用原則性的分析工具。為了使因果模型設計更易於使用,我們在因果圖中識別出幾個有用的通用抽象拓撲 (CAT),有助於深入了解大型語言模型中的推理能力。通過一系列案例研究,我們展示了因果關係的精確但務實的語言如何釐清方法的優缺點,並激發系統進展的新方法。 -##### **FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model** -2502.06438v1 by Anna Tegon, Thorir Mar Ingolfsson, Xiaying Wang, Luca Benini, Yawei Li +##### **Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures** +2502.05078v1 by Tushar Pandey, Ara Ghukasyan, Oktay Goktas, Santosh Kumar Radha -Accurate and efficient electroencephalography (EEG) analysis is essential for -detecting seizures and artifacts in long-term monitoring, with applications -spanning hospital diagnostics to wearable health devices. Robust EEG analytics -have the potential to greatly improve patient care. However, traditional deep -learning models, especially Transformer-based architectures, are hindered by -their quadratic time and memory complexity, making them less suitable for -resource-constrained environments. To address these challenges, we present -FEMBA (Foundational EEG Mamba + Bidirectional Architecture), a novel -self-supervised framework that establishes new efficiency benchmarks for EEG -analysis through bidirectional state-space modeling. 
Unlike Transformer-based -models, which incur quadratic time and memory complexity, FEMBA scales linearly -with sequence length, enabling more scalable and efficient processing of -extended EEG recordings. Trained on over 21,000 hours of unlabeled EEG and -fine-tuned on three downstream tasks, FEMBA achieves competitive performance in -comparison with transformer models, with significantly lower computational -cost. Specifically, it reaches 81.82% balanced accuracy (0.8921 AUROC) on TUAB -and 0.949 AUROC on TUAR, while a tiny 7.8M-parameter variant demonstrates -viability for resource-constrained devices. These results pave the way for -scalable, general-purpose EEG analytics in both clinical and highlight FEMBA as -a promising candidate for wearable applications. +Large Language Models (LLMs) have demonstrated impressive reasoning +capabilities, yet their performance is highly dependent on the prompting +strategy and model scale. While reinforcement learning and fine-tuning have +been deployed to boost reasoning, these approaches incur substantial +computational and data overhead. In this work, we introduce Adaptive Graph of +Thoughts (AGoT), a dynamic, graph-based inference framework that enhances LLM +reasoning solely at test time. Rather than relying on fixed-step methods like +Chain of Thought (CoT) or Tree of Thoughts (ToT), AGoT recursively decomposes +complex queries into structured subproblems, forming a dynamic directed +acyclic graph (DAG) of interdependent reasoning steps. By selectively expanding +only those subproblems that require further analysis, AGoT unifies the +strengths of chain, tree, and graph paradigms into a cohesive framework that +allocates computation where it is most needed.
We validate our approach on +diverse benchmarks spanning multi-hop retrieval, scientific reasoning, and +mathematical problem-solving, achieving up to 46.2% improvement on scientific +reasoning tasks (GPQA) - comparable to gains achieved through computationally +intensive reinforcement learning approaches and outperforming state-of-the-art +iterative approaches. These results suggest that dynamic decomposition and +structured recursion offer a scalable, cost-effective alternative to +post-training modifications, paving the way for more robust, general-purpose +reasoning in LLMs. -摘要:準確且有效的腦電圖 (EEG) 分析對於偵測長時間監控中的癲癇發作和偽像至關重要,其應用範圍涵蓋醫院診斷到可穿戴式健康裝置。穩健的 EEG 分析具有大幅改善病患照護的潛力。然而,傳統深度學習模型,特別是基於 Transformer 的架構,受到其二次時間和記憶體複雜度的阻礙,使其不太適合資源受限的環境。為了應對這些挑戰,我們提出 FEMBA (基礎 EEG Mamba + 雙向架構),一種創新的自我監督架構,透過雙向狀態空間建模為 EEG 分析建立新的效率基準。與會產生二次時間和記憶體複雜度的基於 Transformer 的模型不同,FEMBA 隨著序列長度線性縮放,支援更具可擴充性和效率的延伸 EEG 記錄處理。FEMBA 在超過 21,000 小時的未標記 EEG 上訓練並在三個下游任務上進行微調,與Transformer模型相比,在計算成本顯著降低的情況下,實現了具有競爭力的效能。具體來說,它在 TUAB 上達到 81.82% 的平衡準確度 (0.8921 AUROC) 和在 TUAR 上達到 0.949 AUROC,而一個微小的 7.8M 參數變體證明了其在資源受限裝置上的可行性。這些結果為臨床和可穿戴應用中可擴充的通用 EEG 分析鋪平了道路,並突顯 FEMBA 是可穿戴應用中一個有前景的候選者。 +摘要:大型語言模型 (LLM) 已展現令人印象深刻的推理能力,但其效能高度依賴於提示策略和模型規模。雖然強化學習和微調已被用於提升推理,但這些方法會造成大量的運算和資料開銷。在這項工作中,我們引入了「適應性思考圖」(AGoT),一個動態的、基於圖形的推論架構,它僅在測試時就能增強 LLM 推理。AGoT 並非依賴於鏈式思考 (CoT) 或樹狀思考 (ToT) 等固定步驟方法,而是遞迴地將複雜的查詢分解成結構化的子問題,形成一個由相互依賴的推理步驟所組成的動態有向無環圖 (DAG)。透過選擇性地僅擴充那些需要進一步分析的子問題,AGoT 將鏈式、樹狀和圖形範例的優勢統一到一個緊密的架構中,將運算分配到最需要的地方。我們在跨越多重跳躍檢索、科學推理和數學問題解決等多樣基準上驗證了我們的做法,在科學推理任務 (GPQA) 上達到了高達 46.2% 的改進,這與透過運算密集的強化學習方法所獲得的增益相當,並且優於最先進的迭代方法。這些結果表明,動態分解和結構化遞迴提供了一個可擴充、具成本效益的替代方案,用於訓練後修改,為 LLM 中更強健、更通用的推理鋪平了道路。 -##### **Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?** -2502.06289v1 by Qingshan Hou, Yukun Zhou, Jocelyn Hui Lin Goh, Ke Zou, Samantha Min Er Yew, Sahana Srinivasan, Meng Wang, Thaddaeus Lo, Xiaofeng Lei, Siegfried K. Wagner, Mark A. 
Chia, Dawei Yang, Hongyang Jiang, AnRan Ran, Rui Santos, Gabor Mark Somfai, Juan Helen Zhou, Haoyu Chen, Qingyu Chen, Carol Yim-Lui Cheung, Pearse A. Keane, Yih Chung Tham +##### **Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics** +2502.05239v1 by Hussam Ghanem, Christophe Cruz -The advent of foundation models (FMs) is transforming medical domain. In -ophthalmology, RETFound, a retina-specific FM pre-trained sequentially on 1.4 -million natural images and 1.6 million retinal images, has demonstrated high -adaptability across clinical applications. Conversely, DINOv2, a -general-purpose vision FM pre-trained on 142 million natural images, has shown -promise in non-medical domains. However, its applicability to clinical tasks -remains underexplored. To address this, we conducted head-to-head evaluations -by fine-tuning RETFound and three DINOv2 models (large, base, small) for ocular -disease detection and systemic disease prediction tasks, across eight -standardized open-source ocular datasets, as well as the Moorfields AlzEye and -the UK Biobank datasets. DINOv2-large model outperformed RETFound in detecting -diabetic retinopathy (AUROC=0.850-0.952 vs 0.823-0.944, across three datasets, -all P<=0.007) and multi-class eye diseases (AUROC=0.892 vs. 0.846, P<0.001). In -glaucoma, DINOv2-base model outperformed RETFound (AUROC=0.958 vs 0.940, -P<0.001). Conversely, RETFound achieved superior performance over all DINOv2 -models in predicting heart failure, myocardial infarction, and ischaemic stroke -(AUROC=0.732-0.796 vs 0.663-0.771, all P<0.001). These trends persisted even -with 10% of the fine-tuning data. These findings showcase the distinct -scenarios where general-purpose and domain-specific FMs excel, highlighting the -importance of aligning FM selection with task-specific requirements to optimise -clinical performance. 
+Recent advancements in large language models have demonstrated significant +potential in the automated construction of knowledge graphs from unstructured +text. This paper builds upon our previous work [16], which evaluated various +models using metrics like precision, recall, F1 score, triple matching, and +graph matching, and introduces a refined approach to address the critical +issues of hallucination and omission. We propose an enhanced evaluation +framework incorporating BERTScore for graph similarity, setting a practical +threshold of 95% for graph matching. Our experiments focus on the Mistral +model, comparing its original and fine-tuned versions in zero-shot and few-shot +settings. We further extend our experiments using examples from the KELM-sub +training dataset, illustrating that the fine-tuned model significantly improves +knowledge graph construction accuracy while reducing the exact hallucination +and omission. However, our findings also reveal that the fine-tuned models +perform worse in generalization tasks on the KELM-sub dataset. This study +underscores the importance of comprehensive evaluation metrics in advancing the +state-of-the-art in knowledge graph construction from textual data. 
-摘要:基礎模型 (FM) 的出現正在轉變醫療領域。在眼科,RETFound 是一個視網膜專用 FM,依序使用 140 萬張自然影像和 160 萬張視網膜影像進行預訓練,已展現出高度適應性,可應用於各種臨床應用。相反地,DINOv2 是一個通用視覺 FM,使用 1.42 億張自然影像進行預訓練,已展現出在非醫療領域的潛力。然而,其在臨床任務中的適用性仍未被充分探索。為了解決這個問題,我們針對眼部疾病偵測和全身性疾病預測任務,對 RETFound 和三個 DINOv2 模型(大型、基礎、小型)進行微調,並進行一對一的評估,使用八個標準化的開源眼科資料集,以及 Moorfields AlzEye 和 UK Biobank 資料集。DINOv2 大型模型在糖尿病視網膜病變偵測方面優於 RETFound(三個資料集的 AUROC=0.850-0.952,相較於 0.823-0.944,所有 P<=0.007)和多類眼部疾病(AUROC=0.892,相較於 0.846,P<0.001)。在青光眼方面,DINOv2 基礎模型優於 RETFound(AUROC=0.958,相較於 0.940,P<0.001)。相反地,RETFound 在預測心臟衰竭、心肌梗塞和缺血性中風方面優於所有 DINOv2 模型(AUROC=0.732-0.796,相較於 0.663-0.771,所有 P<0.001)。即使使用 10% 的微調資料,這些趨勢仍然持續。這些發現展示了通用和領域專用 FM 各自擅長的場景,突顯了根據任務特定需求調整 FM 選擇,以最佳化臨床表現的重要性。 +摘要:大型語言模型的最新進展已證明在從非結構化文字自動建構知識圖譜方面具有顯著的潛力。本文建立在我們先前的研究 [16] 之上,該研究使用準確度、召回率、F1 分數、三元組匹配和圖形匹配等指標評估各種模型,並引入了一種改進的方法來解決幻覺和遺漏的關鍵問題。我們提出一個增強的評估框架,結合 BERTScore 來進行圖形相似性,並將圖形匹配的實際閾值設定為 95%。我們的實驗重點在 Mistral 模型上,比較其原始版本和微調版本在零次學習和少量學習的設定中。我們進一步使用 KELM-sub 訓練資料集中的範例來擴展我們的實驗,說明微調後的模型顯著提高了知識圖譜建構的準確度,同時減少了精確的幻覺和遺漏。然而,我們的研究結果也顯示,微調後的模型在 KELM-sub 資料集上的泛化任務表現較差。這項研究強調了全面評估指標在推進從文字資料建構知識圖譜的最新技術方面的重要性。 -##### **Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning** -2502.06134v1 by Liuqing Chen, Shuhong Xiao, Shixian Ding, Shanhai Hu, Lingyun Sun +##### **Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research** +2502.04644v1 by Junde Wu, Jiayuan Zhu, Yuyuan Liu -Medical time series are often irregular and face significant missingness, -posing challenges for data analysis and clinical decision-making. Existing -methods typically adopt a single modeling perspective, either treating series -data as sequences or transforming them into image representations for further -classification. In this paper, we propose a joint learning framework that -incorporates both sequence and image representations. 
We also design three -self-supervised learning strategies to facilitate the fusion of sequence and -image representations, capturing a more generalizable joint representation. The -results indicate that our approach outperforms seven other state-of-the-art -models in three representative real-world clinical datasets. We further -validate our approach by simulating two major types of real-world missingness -through leave-sensors-out and leave-samples-out techniques. The results -demonstrate that our approach is more robust and significantly surpasses other -baselines in terms of classification performance. +We introduce Agentic Reasoning, a framework that enhances large language +model (LLM) reasoning by integrating external tool-using agents. Unlike +conventional LLM-based reasoning approaches, which rely solely on internal +inference, Agentic Reasoning dynamically engages web search, code execution, +and structured reasoning-context memory to solve complex problems requiring +deep research and multi-step logical deduction. Our framework introduces the +Mind Map agent, which constructs a structured knowledge graph to track logical +relationships, improving deductive reasoning. Additionally, the integration of +web-search and coding agents enables real-time retrieval and computational +analysis, enhancing reasoning accuracy and decision-making. Evaluations on +PhD-level scientific reasoning (GPQA) and domain-specific deep research tasks +demonstrate that our approach significantly outperforms existing models, +including leading retrieval-augmented generation (RAG) systems and +closed-source LLMs. Moreover, our results indicate that agentic reasoning +improves expert-level knowledge synthesis, test-time scalability, and +structured problem-solving. The code is at: +https://github.com/theworldofagents/Agentic-Reasoning. 
-摘要:醫療時間序列通常不規則且會面臨顯著的缺失,對資料分析和臨床決策制定構成挑戰。現有方法通常採用單一建模觀點,將序列資料視為序列或將其轉換為影像表示以進行進一步分類。在本文中,我們提出了一個聯合學習架構,結合序列和影像表示。我們還設計了三種自我監督學習策略,以促進序列和影像表示的融合,捕捉更具概括性的聯合表示。結果表明,我們的做法在三個具有代表性的真實世界臨床資料集中優於其他七個最先進的模型。我們進一步通過留出感測器和留出樣本的技術模擬兩種主要的真實世界缺失類型來驗證我們的做法。結果表明,我們的做法更強大,並且在分類效能方面顯著優於其他基準。 +摘要:我們引入了代理推理,一個透過整合外部工具使用代理來增強大型語言模型 (LLM) 推理的框架。與僅依賴於內部推論的傳統基於 LLM 的推理方法不同,代理推理動態地運用網路搜尋、程式碼執行和結構化推理情境記憶來解決需要深入研究和多步驟邏輯推論的複雜問題。我們的框架引入了心智圖代理,它建立一個結構化的知識圖譜來追蹤邏輯關係,改善演繹推理。此外,整合網路搜尋和編碼代理能進行即時擷取和運算分析,增強推理準確度和決策制定。在博士等級科學推理 (GPQA) 和特定領域的深入研究任務上的評估顯示,我們的做法明顯優於現有模型,包括領先的檢索增強生成 (RAG) 系統和封閉原始碼 LLM。此外,我們的結果顯示,代理推理改進了專家級知識綜合、測試時間可擴充性和結構化問題解決。程式碼在:https://github.com/theworldofagents/Agentic-Reasoning。 -##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation** -2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek +##### **Position-aware Automatic Circuit Discovery** +2502.04577v1 by Tal Haklay, Hadas Orgad, David Bau, Aaron Mueller, Yonatan Belinkov -We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS), -an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS -predicts future PHTs using transformer-based architectures. The Adaptive Risk -Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk -probabilities for clinician-defined critical events. ARES incorporates a -personalized explainability module that identifies key clinical factors -influencing risk estimates for individual patients. ARES was evaluated on the -MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its -performance against traditional early warning systems and machine learning -models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs, -with 60% including hospital admissions. The dataset contained over 357 million -tokens. 
ETHOS outperformed benchmark models in predicting hospital admissions, -ICU admissions, and prolonged hospital stays, achieving superior AUC scores. -ETHOS-based risk estimates demonstrated robustness across demographic subgroups -with strong model reliability, confirmed via calibration curves. The -personalized explainability module provides insights into patient-specific -factors contributing to risk. ARES, powered by ETHOS, advances predictive -healthcare AI by providing dynamic, real-time, and personalized risk estimation -with patient-specific explainability to enhance clinician trust. Its -adaptability and superior accuracy position it as a transformative tool for -clinical decision-making, potentially improving patient outcomes and resource -allocation in emergency and inpatient settings. We release the full code at -github.com/ipolharvard/ethos-ares to facilitate future research. +A widely used strategy to discover and understand language model mechanisms +is circuit analysis. A circuit is a minimal subgraph of a model's computation +graph that executes a specific task. We identify a gap in existing circuit +discovery methods: they assume circuits are position-invariant, treating model +components as equally relevant across input positions. This limits their +ability to capture cross-positional interactions or mechanisms that vary across +positions. To address this gap, we propose two improvements to incorporate +positionality into circuits, even on tasks containing variable-length examples. +First, we extend edge attribution patching, a gradient-based method for circuit +discovery, to differentiate between token positions. Second, we introduce the +concept of a dataset schema, which defines token spans with similar semantics +across examples, enabling position-aware circuit discovery in datasets with +variable length examples. We additionally develop an automated pipeline for +schema generation and application using large language models. 
Our approach +enables fully automated discovery of position-sensitive circuits, yielding +better trade-offs between circuit size and faithfulness compared to prior work. -摘要:我們開發了增強型健康結果模擬轉換器 (ETHOS), -一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS -使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組,可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估,並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT,其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型,並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性,並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估,以及患者特定的可解釋性來增強臨床醫生的信任,從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具,有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼,以利未來的研究。 +摘要:廣泛用於發現和了解語言模型機制的策略是電路分析。電路是模型計算圖的最小子圖,可執行特定任務。我們找出電路發現方法中的一個缺口:它們假設電路與位置無關,將模型組件視為在輸入位置中同樣相關。這限制了它們捕捉跨位置互動或在不同位置中變化的機制的能力。為了解決這個缺口,我們提出兩項改進,將位置性納入電路中,即使在包含變長範例的任務中也是如此。首先,我們擴充邊緣屬性修補,一種基於梯度的電路發現方法,以區分符號位置。其次,我們引入了資料集架構的概念,它定義了在範例中具有類似語義的符號跨距,使我們可以在具有變長範例的資料集中進行與位置相關的電路發現。此外,我們開發了一個自動化管線,用於使用大型語言模型進行架構生成和應用。我們的做法能讓位置敏感電路的發現完全自動化,與先前的研究相比,在電路大小和忠實度之間產生了更好的權衡。 -##### **Can ChatGPT Diagnose Alzheimer's Disease?** -2502.06907v1 by Quoc-Toan Nguyen, Linh Le, Xuan-The Tran, Thomas Do, Chin-Teng Lin +##### **Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems** +2502.04510v1 by Shangbin Feng, Zifeng Wang, Palash Goyal, Yike Wang, Weijia Shi, Huang Xia, Hamid Palangi, Luke Zettlemoyer, Yulia Tsvetkov, Chen-Yu Lee, Tomas Pfister -Can ChatGPT diagnose Alzheimer's Disease (AD)? AD is a devastating -neurodegenerative condition that affects approximately 1 in 9 individuals aged -65 and older, profoundly impairing memory and cognitive function. 
This paper -utilises 9300 electronic health records (EHRs) with data from Magnetic -Resonance Imaging (MRI) and cognitive tests to address an intriguing question: -As a general-purpose task solver, can ChatGPT accurately detect AD using EHRs? -We present an in-depth evaluation of ChatGPT using a black-box approach with -zero-shot and multi-shot methods. This study unlocks ChatGPT's capability to -analyse MRI and cognitive test results, as well as its potential as a -diagnostic tool for AD. By automating aspects of the diagnostic process, this -research opens a transformative approach for the healthcare system, -particularly in addressing disparities in resource-limited regions where AD -specialists are scarce. Hence, it offers a foundation for a promising method -for early detection, supporting individuals with timely interventions, which is -paramount for Quality of Life (QoL). +We propose Heterogeneous Swarms, an algorithm to design multi-LLM systems by +jointly optimizing model roles and weights. We represent multi-LLM systems as +directed acyclic graphs (DAGs) of LLMs with topological message passing for +collaborative generation. Given a pool of LLM experts and a utility function, +Heterogeneous Swarms employs two iterative steps: role-step and weight-step. +For role-step, we interpret model roles as learning a DAG that specifies the +flow of inputs and outputs between LLMs. Starting from a swarm of random +continuous adjacency matrices, we decode them into discrete DAGs, call the LLMs +in topological order, evaluate on the utility function (e.g. accuracy on a +task), and optimize the adjacency matrices with particle swarm optimization +based on the utility score. For weight-step, we assess the contribution of +individual LLMs in the multi-LLM systems and optimize model weights with swarm +intelligence. 
We propose JFK-score to quantify the individual contribution of +each LLM in the best-found DAG of the role-step, then optimize model weights +with particle swarm optimization based on the JFK-score. Experiments +demonstrate that Heterogeneous Swarms outperforms 15 role- and/or weight-based +baselines by 18.5% on average across 12 tasks. Further analysis reveals that +Heterogeneous Swarms discovers multi-LLM systems with heterogeneous model roles +and substantial collaborative gains, and benefits from the diversity of +language models. -摘要:ChatGPT 能否診斷出阿茲海默症 (AD)?AD 是一種毀滅性的神經退化性疾病,影響約 1/9 的 65 歲及以上人士,嚴重損害記憶力和認知功能。這篇論文利用了 9300 份電子健康紀錄 (EHR),其中包含磁共振成像 (MRI) 和認知測試的數據,來解決一個有趣的問題:作為一個通用任務解決器,ChatGPT 能否使用 EHR 準確地檢測出 AD?我們使用黑盒方法對 ChatGPT 進行了深入評估,採用零次嘗試和多次嘗試的方法。這項研究揭示了 ChatGPT 分析 MRI 和認知測試結果的能力,以及其作為 AD 診斷工具的潛力。通過自動化診斷過程的各個方面,這項研究為醫療保健系統開啟了一種變革性的方法,特別是在解決資源有限的地區中 AD 專家稀缺的不平等問題方面。因此,它為一種有希望的早期檢測方法奠定了基礎,通過及時干預來支持個人,這對於生活品質 (QoL) 至關重要。 +摘要:我們提出異質群體,一種演算法,透過共同最佳化模型角色和權重來設計多 LLM 系統。我們將多 LLM 系統表示為 LLM 的有向非循環圖 (DAG),並透過拓撲訊息傳遞進行協作產生。給定一組 LLM 專家和一個效用函數,異質群體使用兩個反覆步驟:角色步驟和權重步驟。對於角色步驟,我們將模型角色解釋為學習一個 DAG,它指定 LLM 之間輸入和輸出的流動。從一組隨機連續鄰接矩陣開始,我們將它們解碼為離散 DAG,以拓撲順序呼叫 LLM,根據效用函數(例如任務的準確度)進行評估,並根據效用分數使用粒子群最佳化最佳化鄰接矩陣。對於權重步驟,我們評估個別 LLM 在多 LLM 系統中的貢獻,並使用群體智慧最佳化模型權重。我們提出 JFK 分數來量化每個 LLM 在角色步驟中找到的最佳 DAG 中的個別貢獻,然後根據 JFK 分數使用粒子群最佳化最佳化模型權重。實驗表明,異質群體在 12 項任務中平均比 15 個基於角色和/或權重的基線高出 18.5%。進一步的分析表明,異質群體發現具有異質模型角色和大量協作收益的多 LLM 系統,並受益於語言模型的多樣性。 -##### **Protecting Intellectual Property of EEG-based Neural Networks with Watermarking** -2502.05931v1 by Ahmed Abdelaziz, Ahmed Fathi, Ahmed Fares +##### **MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot** +2502.04413v1 by Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao -EEG-based neural networks, pivotal in medical diagnosis and brain-computer -interfaces, face significant intellectual property (IP) risks due to their -reliance on sensitive neurophysiological data and resource-intensive 
-development. Current watermarking methods, particularly those using abstract -trigger sets, lack robust authentication and fail to address the unique -challenges of EEG models. This paper introduces a cryptographic wonder -filter-based watermarking framework tailored for EEG-based neural networks. -Leveraging collision-resistant hashing and public-key encryption, the wonder -filter embeds the watermark during training, ensuring minimal distortion ($\leq -5\%$ drop in EEG task accuracy) and high reliability (100\% watermark -detection). The framework is rigorously evaluated against adversarial attacks, -including fine-tuning, transfer learning, and neuron pruning. Results -demonstrate persistent watermark retention, with classification accuracy for -watermarked states remaining above 90\% even after aggressive pruning, while -primary task performance degrades faster, deterring removal attempts. Piracy -resistance is validated by the inability to embed secondary watermarks without -severe accuracy loss ( $>10\%$ in EEGNet and CCNN models). Cryptographic -hashing ensures authentication, reducing brute-force attack success -probabilities. Evaluated on the DEAP dataset across models (CCNN, EEGNet, -TSception), the method achieves $>99.4\%$ null-embedding accuracy, effectively -eliminating false positives. By integrating wonder filters with EEG-specific -adaptations, this work bridges a critical gap in IP protection for -neurophysiological models, offering a secure, tamper-proof solution for -healthcare and biometric applications. The framework's robustness against -adversarial modifications underscores its potential to safeguard sensitive EEG -models while maintaining diagnostic utility. +Retrieval-augmented generation (RAG) is a well-suited technique for +retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a +key module of the healthcare copilot, helping reduce misdiagnosis for +healthcare practitioners and patients. 
However, the diagnostic accuracy and +specificity of existing heuristic-based RAG models used in the medical domain +are inadequate, particularly for diseases with similar manifestations. This +paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited +reasoning for the medical domain that retrieves diagnosis and treatment +recommendations based on manifestations. MedRAG systematically constructs a +comprehensive four-tier hierarchical diagnostic KG encompassing critical +diagnostic differences of various diseases. These differences are dynamically +integrated with similar EHRs retrieved from an EHR database, and reasoned +within a large language model. This process enables more accurate and specific +decision support, while also proactively providing follow-up questions to +enhance personalized medical decision-making. MedRAG is evaluated on both a +public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD) +collected from Tan Tock Seng Hospital, and its performance is compared against +various existing RAG methods. Experimental results show that, leveraging the +information integration and relational abilities of the KG, our MedRAG provides +more specific diagnostic insights and outperforms state-of-the-art models in +reducing misdiagnosis rates. 
Our code will be available at +https://github.com/SNOWTEAM2023/MedRAG -摘要:基於 EEG 的神經網路在醫學診斷和腦電腦介面中至關重要,由於其依賴敏感的神經生理資料和資源密集型的開發,面臨重大的智慧財產權 (IP) 風險。目前的浮水印方法,特別是那些使用抽象觸發集的方法,缺乏強健的驗證,且無法解決 EEG 模型的獨特挑戰。本文介紹了一個專為基於 EEG 的神經網路量身打造的密碼學 wonder 濾波器浮水印架構。利用抗碰撞雜湊和公開金鑰加密,wonder 濾波器在訓練期間嵌入浮水印,確保最小的失真(EEG 任務準確度下降 $\leq 5\%$)和高可靠性(100% 浮水印檢測)。該架構針對對抗性攻擊進行了嚴格的評估,包括微調、遷移學習和神經元剪枝。結果證明了持續的浮水印保留,即使在激進的剪枝後,浮水印狀態的分類準確度仍保持在 90% 以上,而主要任務的性能下降得更快,阻止了移除嘗試。盜版抵抗力通過無法嵌入次要浮水印而得到驗證,而不會造成嚴重的準確度損失(在 EEGNet 和 CCNN 模型中 $>10\%$)。密碼學雜湊確保驗證,降低了暴力攻擊成功機率。在 DEAP 資料集上針對模型(CCNN、EEGNet、TSception)進行評估,該方法達到了 $>99.4\%$ 的空嵌入準確度,有效地消除了假陽性。透過將 wonder 濾波器與 EEG 特定的適應相整合,這項工作彌補了神經生理模型 IP 保護中的關鍵差距,為醫療保健和生物特徵應用提供了一個安全、防篡改的解決方案。該架構對抗敵對修改的強健性突顯了其在維護診斷效用的同時保護敏感 EEG 模型的潛力。 +摘要:檢索增強生成 (RAG) 是一種適用於檢索隱私敏感的電子健康記錄 (EHR) 的技術。它可以作為醫療保健副駕駛的一個關鍵模組,協助減少醫療保健從業人員和患者的誤診。然而,在醫療領域中使用的現有基於啟發法的 RAG 模型的診斷準確性和特異性不足,特別是對於具有類似表現的疾病。本文提出 MedRAG,一種由知識圖譜 (KG) 引發的推理增強的 RAG 模型,用於醫療領域,它根據表現檢索診斷和治療建議。MedRAG 系統性地構建了一個全面的四層階層式診斷 KG,涵蓋各種疾病的關鍵診斷差異。這些差異與從 EHR 資料庫中檢索到的類似 EHR 動態整合,並在大型語言模型中進行推理。這個過程可以實現更準確和具體的決策支援,同時主動提供後續問題,以增強個人化醫療決策制定。MedRAG 在公共資料集 DDXPlus 和從陳篤生醫院收集的私人慢性疼痛診斷資料集 (CPDD) 上進行評估,並將其效能與各種現有 RAG 方法進行比較。實驗結果顯示,利用 KG 的資訊整合和關係能力,我們的 MedRAG 提供了更具體的診斷見解,並在降低誤診率方面優於最先進的模型。我們的程式碼將在 https://github.com/SNOWTEAM2023/MedRAG 提供 -##### **Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models** -2502.05879v1 by Shiyu Teng, Jiaqing Liu, Rahul Kumar Jain, Shurong Chai, Ruibo Hou, Tomoko Tateyama, Lanfen Lin, Yen-wei Chen +##### **Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering** +2502.03992v1 by Longquan Jiang, Junbo Huang, Cedric Möller, Ricardo Usbeck -Depression is one of the leading causes of disability worldwide, posing a -severe burden on individuals, healthcare systems, and society at large. 
Recent -advancements in Large Language Models (LLMs) have shown promise in addressing -mental health challenges, including the detection of depression through -text-based analysis. However, current LLM-based methods often struggle with -nuanced symptom identification and lack a transparent, step-by-step reasoning -process, making it difficult to accurately classify and explain mental health -conditions. To address these challenges, we propose a Chain-of-Thought -Prompting approach that enhances both the performance and interpretability of -LLM-based depression detection. Our method breaks down the detection process -into four stages: (1) sentiment analysis, (2) binary depression classification, -(3) identification of underlying causes, and (4) assessment of severity. By -guiding the model through these structured reasoning steps, we improve -interpretability and reduce the risk of overlooking subtle clinical indicators. -We validate our method on the E-DAIC dataset, where we test multiple -state-of-the-art large language models. Experimental results indicate that our -Chain-of-Thought Prompting technique yields superior performance in both -classification accuracy and the granularity of diagnostic insights, compared to -baseline approaches. +Most existing Knowledge Graph Question Answering (KGQA) approaches are +designed for a specific KG, such as Wikidata, DBpedia or Freebase. Due to the +heterogeneity of the underlying graph schema, topology and assertions, most +KGQA systems cannot be transferred to unseen Knowledge Graphs (KGs) without +resource-intensive training data. We present OntoSCPrompt, a novel Large +Language Model (LLM)-based KGQA approach with a two-stage architecture that +separates semantic parsing from KG-dependent interactions. OntoSCPrompt first +generates a SPARQL query structure (including SPARQL keywords such as SELECT, +ASK, WHERE and placeholders for missing tokens) and then fills them with +KG-specific information. 
To enhance the understanding of the underlying KG, we +present an ontology-guided, hybrid prompt learning strategy that integrates KG +ontology into the learning process of hybrid prompts (e.g., discrete and +continuous vectors). We also present several task-specific decoding strategies +to ensure the correctness and executability of generated SPARQL queries in both +stages. Experimental results demonstrate that OntoSCPrompt performs as well as +SOTA approaches without retraining on a number of KGQA datasets such as CWQ, +WebQSP and LC-QuAD 1.0 in a resource-efficient manner and can generalize well +to unseen domain-specific KGs like DBLP-QuAD and CoyPu KG. Code: +https://github.com/LongquanJiang/OntoSCPrompt -摘要:憂鬱症是全球殘障的主要原因之一,對個人、醫療保健系統和整個社會造成嚴重負擔。大型語言模型 (LLM) 的最新進展已展現出解決心理健康挑戰的希望,包括透過基於文字的分析來偵測憂鬱症。然而,現有的基於 LLM 的方法通常難以辨識細微的症狀,而且缺乏透明且逐步的推理過程,這使得準確分類和解釋心理健康狀況變得困難。為了應對這些挑戰,我們提出了一種思考鏈提示方法,它增強了基於 LLM 的憂鬱症偵測的效能和可解釋性。我們的這項方法將偵測過程分解為四個階段:(1) 情緒分析,(2) 二元憂鬱症分類,(3) 找出潛在原因,以及 (4) 評估嚴重程度。透過引導模型完成這些結構化的推理步驟,我們提升了可解釋性,並降低了忽略細微臨床指標的風險。我們在 E-DAIC 資料集上驗證了我們的這項方法,並在其中測試了多種最先進的大型語言模型。實驗結果顯示,與基線方法相比,我們的思考鏈提示技術在分類準確度和診斷見解的精細度方面都表現出優異的效能。 +摘要:現有的知識圖譜問答(KGQA)方法大多是為特定 KG 而設計的,例如 Wikidata、DBpedia 或 Freebase。由於底層圖形模式、拓撲和斷言的異質性,大多數 KGQA 系統無法在沒有資源密集型訓練資料的情況下轉移到未見過的知識圖譜(KG)。我們提出 OntoSCPrompt,這是一種基於大型語言模型(LLM)的新型 KGQA 方法,採用兩階段架構,將語義解析與依賴 KG 的互動分開。OntoSCPrompt 首先生成 SPARQL 查詢結構(包括 SPARQL 關鍵字,例如 SELECT、ASK、WHERE 和缺失令牌的佔位符),然後用 KG 特定的資訊填寫它們。為了增強對底層 KG 的理解,我們提出了一種由本體指導的混合提示學習策略,將 KG 本體整合到混合提示(例如,離散和連續向量)的學習過程中。我們還提出了多種特定任務的解碼策略,以確保在兩個階段中生成的 SPARQL 查詢的正確性和可執行性。實驗結果表明,OntoSCPrompt 在 CWQ、WebQSP 和 LC-QuAD 1.0 等多個 KGQA 資料集上執行時,效能與 SOTA 方法一樣好,且資源使用效率高,並且可以很好地概括到未見過的特定領域 KG,例如 DBLP-QuAD 和 CoyPu KG。Code: +https://github.com/LongquanJiang/OntoSCPrompt -##### **LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison** -2502.06890v1 by Gabriele De Vito, Filomena Ferrucci, Athanasios 
Angelakis +##### **Multimodal Medical Code Tokenizer** +2502.04397v2 by Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik -The increasing volume of drug combinations in modern therapeutic regimens -needs reliable methods for predicting drug-drug interactions (DDIs). While -Large Language Models (LLMs) have revolutionized various domains, their -potential in pharmaceutical research, particularly in DDI prediction, remains -largely unexplored. This study thoroughly investigates LLMs' capabilities in -predicting DDIs by uniquely processing molecular structures (SMILES), target -organisms, and gene interaction data as raw text input from the latest DrugBank -dataset. We evaluated 18 different LLMs, including proprietary models (GPT-4, -Claude, Gemini) and open-source variants (from 1.5B to 72B parameters), first -assessing their zero-shot capabilities in DDI prediction. We then fine-tuned -selected models (GPT-4, Phi-3.5 2.7B, Qwen-2.5 3B, Gemma-2 9B, and Deepseek R1 -distilled Qwen 1.5B) to optimize their performance. Our comprehensive -evaluation framework included validation across 13 external DDI datasets, -comparing against traditional approaches such as l2-regularized logistic -regression. Fine-tuned LLMs demonstrated superior performance, with Phi-3.5 -2.7B achieving a sensitivity of 0.978 in DDI prediction, with an accuracy of -0.919 on balanced datasets (50% positive, 50% negative cases). This result -represents an improvement over both zero-shot predictions and state-of-the-art -machine-learning methods used for DDI prediction. Our analysis reveals that -LLMs can effectively capture complex molecular interaction patterns and cases -where drug pairs target common genes, making them valuable tools for practical -applications in pharmaceutical research and clinical settings. 
+Foundation models trained on patient electronic health records (EHRs) require +tokenizing medical data into sequences of discrete vocabulary items. Existing +tokenizers treat medical codes from EHRs as isolated textual tokens. However, +each medical code is defined by its textual description, its position in +ontological hierarchies, and its relationships to other codes, such as disease +co-occurrences and drug-treatment associations. Medical vocabularies contain +more than 600,000 codes with critical information for clinical reasoning. We +introduce MedTok, a multimodal medical code tokenizer that uses the text +descriptions and relational context of codes. MedTok processes text using a +language model encoder and encodes the relational structure with a graph +encoder. It then quantizes both modalities into a unified token space, +preserving modality-specific and cross-modality information. We integrate +MedTok into five EHR models and evaluate it on operational and clinical tasks +across in-patient and out-patient datasets, including outcome prediction, +diagnosis classification, drug recommendation, and risk stratification. +Replacing standard EHR tokenizers with MedTok improves AUPRC across all EHR +models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with +the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate +the use of the MedTok tokenizer with medical QA systems. Our results show the +potential of MedTok as a unified tokenizer for medical codes, improving +tokenization for medical foundation models. 
-摘要:現代治療方案中藥物組合的數量越來越多,需要可靠的方法來預測藥物間交互作用 (DDI)。儘管大型語言模型 (LLM) 已在各個領域掀起革命,它們在藥物研究中的潛力,特別是在 DDI 預測中的潛力,仍未得到充分探索。本研究通過獨特地處理分子結構 (SMILES)、目標生物和基因交互資料作為來自最新 DrugBank 資料集的原始文字輸入,徹底調查了 LLM 在預測 DDI 中的能力。我們評估了 18 種不同的 LLM,包括專有模型(GPT-4、Claude、Gemini)和開源變體(從 1.5B 到 72B 參數),首先評估它們在 DDI 預測中的零次學習能力。然後,我們微調選定的模型(GPT-4、Phi-3.5 2.7B、Qwen-2.5 3B、Gemma-2 9B 和 Deepseek R1 蒸餾 Qwen 1.5B)以最佳化其效能。我們的全面評估框架包括跨 13 個外部 DDI 資料集進行驗證,並與傳統方法(例如 l2 正則化邏輯迴歸)進行比較。微調後的 LLM 表現出優異的效能,其中 Phi-3.5 2.7B 在 DDI 預測中達到 0.978 的靈敏度,在平衡資料集(50% 正例,50% 反例)上的準確度為 0.919。此結果優於零次學習預測和用於 DDI 預測的最新機器學習方法。我們的分析表明,LLM 可以有效捕捉複雜的分子交互模式和藥物對靶向共同基因的情況,使其成為藥物研究和臨床環境中實用應用的寶貴工具。 +摘要:在患者电子健康记录 (EHR) 上训练的基础模型需要将医学数据标记为离散词汇项序列。现有的标记器将 EHR 中的医学代码视为孤立的文本标记。然而,每个医学代码都由其文本描述、在本体层次结构中的位置以及与其他代码的关系(例如疾病共现和药物治疗关联)来定义。医学词汇表包含超过 600,000 个代码,这些代码包含临床推理的关键信息。我们引入了 MedTok,这是一种多模态医学代码标记器,它使用文本描述和代码的关系上下文。MedTok 使用语言模型编码器处理文本,并使用图编码器对关系结构进行编码。然后,它将这两种模态量化为一个统一的标记空间,保留特定于模态和跨模态的信息。我们将 MedTok 集成到五个 EHR 模型中,并在住院和门诊数据集(包括结果预测、诊断分类、药物推荐和风险分层)上对其实施操作和临床任务进行评估。用 MedTok 替换标准 EHR 标记器可提高所有 EHR 模型的 AUPRC,在 MIMIC-III 上提高 4.10%,在 MIMIC-IV 上提高 4.78%,在 EHRShot 上提高 11.30%,其中药物推荐的增益最大。除了 EHR 建模之外,我们还演示了将 MedTok 标记器与医学问答系统结合使用。我们的结果证明了 MedTok 作为医学代码的统一标记器的潜力,改进了医学基础模型的标记化。 -##### **Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)** -2502.07815v1 by Lokesh Koli, Shubham Kalra, Karanpreet Singh +##### **Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents** +2502.04392v1 by Chenyang Shao, Xinyuan Hu, Yutang Lin, Fengli Xu -Detecting sensitive data such as Personally Identifiable Information (PII) -and Protected Health Information (PHI) is critical for data security platforms. -This study evaluates regex-based pattern matching algorithms and exact-match -search techniques to optimize detection speed, accuracy, and scalability. 
Our -benchmarking results indicate that Google RE2 provides the best balance of -speed (10-15 ms/MB), memory efficiency (8-16 MB), and accuracy (99.5%) among -regex engines, outperforming PCRE while maintaining broader hardware -compatibility than Hyperscan. For exact matching, Aho-Corasick demonstrated -superior performance (8 ms/MB) and scalability for large datasets. Performance -analysis revealed that regex processing time scales linearly with dataset size -and pattern complexity. A hybrid AI + Regex approach achieved the highest F1 -score (91. 6%) by improving recall and minimizing false positives. Device -benchmarking confirmed that our solution maintains efficient CPU and memory -usage on both high-performance and mid-range systems. Despite its -effectiveness, challenges remain, such as limited multilingual support and the -need for regular pattern updates. Future work should focus on expanding -language coverage, integrating data security and privacy management (DSPM) with -data loss prevention (DLP) tools, and enhancing regulatory compliance for -broader global adoption. +The rapid expansion of web content has made on-device AI assistants +indispensable for helping users manage the increasing complexity of online +tasks. The emergent reasoning ability of large language models offers a +promising path for next-generation on-device AI agents. However, deploying +full-scale Large Language Models (LLMs) on resource-limited local devices is +challenging. In this paper, we propose Division-of-Thoughts (DoT), a +collaborative reasoning framework leveraging the synergy between locally +deployed Smaller-scale Language Models (SLMs) and cloud-based LLMs. DoT +leverages a Task Decomposer to elicit the inherent planning abilities in +language models to decompose user queries into smaller sub-tasks, which allows +hybrid language models to fully exploit their respective strengths. 
Besides, +DoT employs a Task Scheduler to analyze the pair-wise dependency of sub-tasks +and create a dependency graph, facilitating parallel reasoning of sub-tasks and +the identification of key steps. To allocate the appropriate model based on the +difficulty of sub-tasks, DoT leverages a Plug-and-Play Adapter, which is an +additional task head attached to the SLM that does not alter the SLM's +parameters. To boost adapter's task allocation capability, we propose a +self-reinforced training method that relies solely on task execution feedback. +Extensive experiments on various benchmarks demonstrate that our DoT +significantly reduces LLM costs while maintaining competitive reasoning +accuracy. Specifically, DoT reduces the average reasoning time and API costs by +66.12% and 83.57%, while achieving comparable reasoning accuracy with the best +baseline methods. -摘要:偵測個人身分資訊 (PII) 和受保護健康資訊 (PHI) 等敏感資料,對於資料安全平台至關重要。本研究評估基於 regex 的模式配對演算法和精確配對搜尋技術,以最佳化偵測速度、準確度和可擴充性。我們的基準測試結果顯示,在 regex 引擎中,Google RE2 在速度 (10-15 ms/MB)、記憶體效率 (8-16 MB) 和準確度 (99.5%) 方面取得最佳平衡,優於 PCRE,同時比 Hyperscan 擁有更廣泛的硬體相容性。對於精確配對,Aho-Corasick 展現出優異的效能 (8 ms/MB) 和大資料集的可擴充性。效能分析顯示,regex 處理時間會隨著資料集大小和模式複雜度線性擴充。混合 AI + Regex 方法透過提升召回率和將假陽性降至最低,達到了最高的 F1 分數 (91. 
6%)。裝置基準測試確認我們的解決方案在高性能和中階系統上都能維持高效的 CPU 和記憶體使用率。儘管有效,但仍有挑戰存在,例如多語言支援有限,以及需要定期更新模式。未來的研究應著重於擴展語言涵蓋範圍,將資料安全和隱私管理 (DSPM) 與資料遺失防護 (DLP) 工具整合,以及加強法規遵循以利更廣泛的全球採用。 +摘要:網頁內容快速擴充,使得行動裝置上的 AI 助理在協助使用者管理日益複雜的線上工作上變得不可或缺。大型語言模型中浮現的推理能力為新一代行動裝置上的 AI 代理提供了一條有希望的途徑。然而,在資源有限的本機裝置上部署全規模的大型語言模型 (LLM) 是一項挑戰。在本文中,我們提出了思想分工 (DoT),一個協作推理框架,利用了本地部署的小型語言模型 (SLM) 與雲端 LLM 之間的協同效應。DoT 利用任務分解器引出語言模型中固有的規劃能力,將使用者查詢分解成較小的子任務,這允許混合語言模型充分發揮其各自的優勢。此外,DoT 雇用了一個任務排程器來分析子任務的成對依賴性並建立一個依賴性圖,促進子任務的並行推理和關鍵步驟的識別。為了根據子任務的難度分配適當的模型,DoT 利用了即插即用適配器,這是一個附加在 SLM 上的任務頭,不會改變 SLM 的參數。為了提升適配器的任務分配能力,我們提出了一種自我強化訓練方法,它僅依賴於任務執行回饋。在各種基準上的廣泛實驗表明,我們的 DoT 大幅降低了 LLM 成本,同時維持了有競爭力的推理準確度。具體來說,DoT 將平均推理時間和 API 成本分別降低了 66.12% 和 83.57%,同時達到了與最佳基準方法相當的推理準確度。 -##### **WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch** -2502.05783v1 by Ying Lei, Yancheng Cao, Will Wang, Yuanzhe Dong, Changchang Yin, Weidan Cao, Ping Zhang, Jingzhen Yang, Bingsheng Yao, Yifan Peng, Chunhua Weng, Randy Auerbach, Lena Mamykina, Dakuo Wang, Yuntao Wang, Xuhai Xu +##### **Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models** +2502.03715v1 by Rui Cai, Chao Wang, Qianyi Cai, Dazhong Shen, Hui Xiong -While just-in-time interventions (JITIs) have effectively targeted common -health behaviors, individuals often have unique needs to intervene in personal -undesirable actions that can negatively affect physical, mental, and social -well-being. We present WatchGuardian, a smartwatch-based JITI system that -empowers users to define custom interventions for these personal actions with a -small number of samples. For the model to detect new actions based on limited -new data samples, we developed a few-shot learning pipeline that finetuned a -pre-trained inertial measurement unit (IMU) model on public hand-gesture -datasets. We then designed a data augmentation and synthesis process to train -additional classification layers for customization. 
Our offline evaluation with -26 participants showed that with three, five, and ten examples, our approach -achieved an average accuracy of 76.8%, 84.7%, and 87.7%, and an F1 score of -74.8%, 84.2%, and 87.2% We then conducted a four-hour intervention study to -compare WatchGuardian against a rule-based intervention. Our results -demonstrated that our system led to a significant reduction by 64.0 +- 22.6% in -undesirable actions, substantially outperforming the baseline by 29.0%. Our -findings underscore the effectiveness of a customizable, AI-driven JITI system -for individuals in need of behavioral intervention in personal undesirable -actions. We envision that our work can inspire broader applications of -user-defined personalized intervention with advanced AI solutions. +Knowledge Graph-based recommendations have gained significant attention due +to their ability to leverage rich semantic relationships. However, constructing +and maintaining Knowledge Graphs (KGs) is resource-intensive, and the accuracy +of KGs can suffer from noisy, outdated, or irrelevant triplets. Recent +advancements in Large Language Models (LLMs) offer a promising way to improve +the quality and relevance of KGs for recommendation tasks. Despite this, +integrating LLMs into KG-based systems presents challenges, such as efficiently +augmenting KGs, addressing hallucinations, and developing effective joint +learning methods. In this paper, we propose the Confidence-aware KG-based +Recommendation Framework with LLM Augmentation (CKG-LLMA), a novel framework +that combines KGs and LLMs for recommendation tasks. The framework includes: (1) +an LLM-based subgraph augmenter for enriching KGs with high-quality +information, (2) a confidence-aware message propagation mechanism to filter +noisy triplets, and (3) a dual-view contrastive learning method to integrate +user-item interactions and KG data. 
Additionally, we employ a confidence-aware +explanation generation process to guide LLMs in producing realistic +explanations for recommendations. Finally, extensive experiments demonstrate +the effectiveness of CKG-LLMA across multiple public datasets. -摘要:雖然即時介入(JITIs)有效地針對常見的健康行為,但個人通常有獨特的需求來介入可能會對身心和社會福祉產生負面影響的個人不良行為。我們提出 WatchGuardian,這是一個基於智慧手錶的 JITI 系統,它使用少數樣本讓使用者能夠為這些個人行為定義自訂介入措施。為了讓模型根據有限的新資料樣本偵測新行為,我們開發了一個小樣本學習管道,微調了公共手勢資料集上的預訓練慣性測量單元(IMU)模型。然後,我們設計了一個資料擴充和合成流程,以訓練其他分類層以進行自訂。我們對 26 位參與者進行的離線評估顯示,我們的做法使用三個、五個和十個範例,達到了 76.8%、84.7% 和 87.7% 的平均準確度,以及 74.8%、84.2% 和 87.2% 的 F1 分數。然後,我們進行了一項為時四小時的介入研究,以將 WatchGuardian 與基於規則的介入進行比較。我們的結果表明,我們的系統導致不良行為顯著減少了 64.0 +- 22.6%,大幅優於基線 29.0%。我們的研究結果強調了可自訂、AI 驅動的 JITI 系統對需要行為介入以應對個人不良行為的個人的有效性。我們預計我們的研究可以激勵使用者定義個人化介入的更廣泛應用,並採用先進的 AI 解決方案。 +摘要:基於知識圖譜的推薦因其利用豐富語義關係的能力而備受關注。然而,構建和維護知識圖譜 (KG) 是一項資源密集型任務,而 KG 的準確性可能會受到雜訊、過時或無關的三元組的影響。大型語言模型 (LLM) 的最新進展為提高 KG 在推薦任務中的品質和相關性提供了一種有前途的方法。儘管如此,將 LLM 整合到基於 KG 的系統中會帶來挑戰,例如有效擴充 KG、處理幻覺,以及開發有效的聯合學習方法。在本文中,我們提出具有 LLM 擴充的信心感知型基於 KG 的推薦框架 (CKG-LLMA),這是一個結合 KG 和 LLM 進行推薦任務的新穎框架。該框架包括:(1) 一個基於 LLM 的子圖擴充器,用於使用高品質資訊豐富 KG,(2) 一個信心感知型訊息傳播機制,用於過濾雜訊三元組,以及 (3) 一個雙視圖對比學習方法,用於整合使用者-項目互動和 KG 資料。此外,我們採用一個信心感知型解釋產生程序,以引導 LLM 為推薦產生逼真的解釋。最後,大量的實驗證明了 CKG-LLMA 在多個公開資料集中的有效性。 -##### **RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care** -2502.05740v1 by Ziqi Yang, Yuxuan Lu, Jennifer Bagdasarian, Vedant Das Swain, Ritu Agarwal, Collin Campbell, Waddah Al-Refaire, Jehan El-Bayoumi, Guodong Gao, Dakuo Wang, Bingsheng Yao, Nawar Shara +##### **A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)** +2502.03450v1 by Yiye Chen, Harpreet Sawhney, Nicholas Gydé, Yanan Jian, Jack Saunders, Patricio Vela, Ben Lundell -Cancer surgery is a key treatment for gastrointestinal (GI) cancers, a group -of cancers that account for more than 35% of cancer-related deaths worldwide, -but 
postoperative complications are unpredictable and can be life-threatening. -In this paper, we investigate how recent advancements in large language models -(LLMs) can benefit remote patient monitoring (RPM) systems through clinical -integration by designing RECOVER, an LLM-powered RPM system for postoperative -GI cancer care. To closely engage stakeholders in the design process, we first -conducted seven participatory design sessions with five clinical staff and -interviewed five cancer patients to derive six major design strategies for -integrating clinical guidelines and information needs into LLM-based RPM -systems. We then designed and implemented RECOVER, which features an -LLM-powered conversational agent for cancer patients and an interactive -dashboard for clinical staff to enable efficient postoperative RPM. Finally, we -used RECOVER as a pilot system to assess the implementation of our design -strategies with four clinical staff and five patients, providing design -implications by identifying crucial design elements, offering insights on -responsible AI, and outlining opportunities for future LLM-powered RPM systems. +Scene graphs have emerged as a structured and serializable environment +representation for grounded spatial reasoning with Large Language Models +(LLMs). In this work, we propose SG-RwR, a Schema-Guided Retrieve-while-Reason +framework for reasoning and planning with scene graphs. Our approach employs +two cooperative, code-writing LLM agents: (1) a Reasoner for task planning and +information-query generation, and (2) a Retriever for extracting the +corresponding graph information following the queries. The two agents collaborate +iteratively, enabling sequential reasoning and adaptive attention to graph +information. 
Unlike prior works, both agents are prompted only with the scene +graph schema rather than the full graph data, which reduces hallucination +by limiting input tokens and drives the Reasoner to generate its reasoning trace +abstractly. Following the trace, the Retriever programmatically queries the scene +graph data based on its understanding of the schema, allowing dynamic and global +attention on the graph that enhances alignment between reasoning and retrieval. +Through experiments in multiple simulation environments, we show that our +framework surpasses existing LLM-based approaches in numerical Q&A and +planning tasks, and can benefit from task-level few-shot examples, even in the +absence of agent-level demonstrations. Project code will be released. -摘要:癌症手術是胃腸道 (GI) 癌症的主要治療方式,這類癌症佔全球癌症相關死亡人數的 35% 以上,但術後併發症無法預測,且可能危及生命。在本文中,我們探討大型語言模型 (LLM) 的近期進展如何透過臨床整合造福遠端病患監控 (RPM) 系統,方法是設計 RECOVER,一個由 LLM 驅動的 RPM 系統,用於術後胃腸道癌症照護。為了讓利害關係人密切參與設計流程,我們首先與五位臨床人員進行七場參與式設計會議,並訪談五位癌症患者,以找出六項整合臨床指南和資訊需求至基於 LLM 的 RPM 系統的主要設計策略。接著,我們設計並實作 RECOVER,其特色在於一個由 LLM 驅動的對話式代理人,供癌症患者使用,以及一個互動式儀表板,供臨床人員使用,以進行有效的術後 RPM。最後,我們使用 RECOVER 作為試點系統,與四位臨床人員和五位患者評估我們設計策略的實作,並透過找出重要的設計元素、提供對負責任 AI 的見解,以及概述未來由 LLM 驅動的 RPM 系統的機會,提出設計意涵。 +摘要:場景圖表已成為大型語言模型 (LLM) 以基礎空間推理為基礎的結構化且可序列化的環境表徵。在這項工作中,我們提出 SG-RwR,一個以綱要為導向的檢索與推理框架,用於場景圖表的推理和規劃。我們的做法採用了兩個協作的、編寫程式碼的 LLM 代理:一個 (1) 推論器,用於任務規劃和資訊查詢產生,以及一個 (2) 檢索器,用於根據查詢提取對應的圖形資訊。兩個代理反覆合作,實現對圖形資訊的順序推理和適應性關注。與先前的作品不同,兩個代理僅提示場景圖表綱要,而不是完整的圖形資料,這透過限制輸入代碼減少了幻覺,並驅使推論器抽象地產生推理軌跡。根據軌跡,檢索器根據綱要理解以程式化方式查詢場景圖形資料,允許對圖形進行動態和整體關注,增強推理和檢索之間的一致性。透過在多個模擬環境中的實驗,我們表明我們的框架在數值問答和規劃任務中超越了現有的基於 LLM 的方法,並且可以受益於任務級別的少次範例,即使在沒有代理級別示範的情況下也是如此。專案程式碼將會釋出。 -##### **4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis** -2502.05713v1 by An Zhao, Moucheng Xu, Ahmed H. Shahin, Wim Wuyts, Mark G. Jones, Joseph Jacob, Daniel C. 
Alexander +##### **SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs** +2502.03283v2 by Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin -Understanding the progression trajectories of diseases is crucial for early -diagnosis and effective treatment planning. This is especially vital for -life-threatening conditions such as Idiopathic Pulmonary Fibrosis (IPF), a -chronic, progressive lung disease with a prognosis comparable to many cancers. -Computed tomography (CT) imaging has been established as a reliable diagnostic -tool for IPF. Accurately predicting future CT scans of early-stage IPF patients -can aid in developing better treatment strategies, thereby improving survival -outcomes. In this paper, we propose 4D Vector Quantised Generative Adversarial -Networks (4D-VQ-GAN), a model capable of generating realistic CT volumes of IPF -patients at any time point. The model is trained using a two-stage approach. In -the first stage, a 3D-VQ-GAN is trained to reconstruct CT volumes. In the -second stage, a Neural Ordinary Differential Equation (ODE) based temporal -model is trained to capture the temporal dynamics of the quantised embeddings -generated by the encoder in the first stage. We evaluate different -configurations of our model for generating longitudinal CT scans and compare -the results against ground truth data, both quantitatively and qualitatively. -For validation, we conduct survival analysis using imaging biomarkers derived -from generated CT scans and achieve a C-index comparable to that of biomarkers -derived from the real CT scans. The survival analysis results demonstrate the -potential clinical utility inherent to generated longitudinal CT scans, showing -that they can reliably predict survival outcomes. +Recent advancements have highlighted that Large Language Models (LLMs) are +prone to hallucinations when solving complex reasoning problems, leading to +erroneous results. 
To tackle this issue, researchers incorporate Knowledge +Graphs (KGs) to improve the reasoning ability of LLMs. However, existing +methods face two limitations: 1) they typically assume that all answers to the +questions are contained in KGs, neglecting the incompleteness issue of KGs, and +2) they treat the KG as a static repository and overlook the implicit logical +reasoning structures inherent in KGs. In this paper, we introduce SymAgent, an +innovative neural-symbolic agent framework that achieves collaborative +augmentation between KGs and LLMs. We conceptualize KGs as dynamic environments +and transform complex reasoning tasks into a multi-step interactive process, +enabling KGs to participate deeply in the reasoning process. SymAgent consists +of two modules: Agent-Planner and Agent-Executor. The Agent-Planner leverages +LLM's inductive reasoning capability to extract symbolic rules from KGs, +guiding efficient question decomposition. The Agent-Executor autonomously +invokes predefined action tools to integrate information from KGs and external +documents, addressing the issues of KG incompleteness. Furthermore, we design a +self-learning framework comprising online exploration and offline iterative +policy updating phases, enabling the agent to automatically synthesize +reasoning trajectories and improve performance. Experimental results +demonstrate that SymAgent with weak LLM backbones (i.e., 7B series) yields +better or comparable performance compared to various strong baselines. Further +analysis reveals that our agent can identify missing triples, facilitating +automatic KG updates. 
-摘要:了解疾病的進程軌跡對於早期診斷和有效的治療計畫至關重要。這對於特發性肺纖維化 (IPF) 等威脅生命的疾病尤其重要,IPF 是一種慢性、進行性肺部疾病,其預後與許多癌症相當。電腦斷層掃描 (CT) 影像已被確立為 IPF 的可靠診斷工具。準確預測早期 IPF 患者的未來 CT 掃描有助於制定更好的治療策略,從而改善存活結果。在本文中,我們提出 4D 向量量化生成對抗網路 (4D-VQ-GAN),這是一個模型,能夠在任何時間點生成 IPF 患者的逼真 CT 體積。該模型使用兩階段方法進行訓練。在第一階段,訓練 3D-VQ-GAN 以重建 CT 體積。在第二階段,訓練基於神經常微分方程 (ODE) 的時間模型,以捕捉第一階段編碼器生成的量化嵌入的時間動態。我們評估了我們的模型的不同配置,以生成縱向 CT 掃描,並在定量和定性方面將結果與真實數據進行比較。為了驗證,我們使用從生成的 CT 掃描中得出的影像生物標記進行存活分析,並獲得與從真實 CT 掃描中得出的生物標記相當的 C 指數。存活分析結果證明了生成縱向 CT 掃描固有的潛在臨床效用,表明它們可以可靠地預測存活結果。 +摘要:最近的進展強調出,大型語言模型 (LLM) 在解決複雜推理問題時容易出現幻覺,導致錯誤的結果。為了解決這個問題,研究人員結合知識圖譜 (KG) 來改善 LLM 的推理能力。然而,現有方法面臨兩個限制:1) 它們通常假設問題的所有答案都包含在 KG 中,忽略了 KG 的不完整性問題,以及 2) 它們將 KG 視為一個靜態儲存庫,而忽略了 KG 中固有的隱式邏輯推理結構。在本文中,我們介紹了 SymAgent,一個創新的神經符號代理架構,它在 KG 和 LLM 之間實現了協作擴充。我們將 KG 概念化為動態環境,並將複雜的推理任務轉化為一個多步驟的互動過程,使 KG 能夠深入參與推理過程。SymAgent 包含兩個模組:代理規劃器和代理執行器。代理規劃器利用 LLM 的歸納推理能力從 KG 中提取符號規則,指導有效的問題分解。代理執行器自主地調用預定義的動作工具來整合來自 KG 和外部文件的資訊,解決 KG 不完整性的問題。此外,我們設計了一個自學習框架,包括線上探索和離線反覆的政策更新階段,使代理能夠自動合成推理軌跡並改善效能。實驗結果表明,具有弱 LLM 主幹的 SymAgent(例如,7B 系列)與各種強大的基線相比,產生了更好或相當的效能。進一步的分析表明,我們的代理可以識別遺失的三元組,促進自動 KG 更新。 -##### **KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy** -2502.05651v1 by Hyunjong Kim, Suyeon Lee, Yeongjae Cho, Eunseo Ryu, Yohan Jo, Suran Seong, Sungzoon Cho +##### **Analyze Feature Flow to Enhance Interpretation and Steering in Language Models** +2502.03032v2 by Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov -The increasing demand for mental health services has led to the rise of -AI-driven mental health chatbots, though challenges related to privacy, data -collection, and expertise persist. Motivational Interviewing (MI) is gaining -attention as a theoretical basis for boosting expertise in the development of -these chatbots. However, existing datasets are showing limitations for training -chatbots, leading to a substantial demand for publicly available resources in -the field of MI and psychotherapy. 
These challenges are even more pronounced in -non-English languages, where they receive less attention. In this paper, we -propose a novel framework that simulates MI sessions enriched with the -expertise of professional therapists. We train an MI forecaster model that -mimics the behavioral choices of professional therapists and employ Large -Language Models (LLMs) to generate utterances through prompt engineering. Then, -we present KMI, the first synthetic dataset theoretically grounded in MI, -containing 1,000 high-quality Korean Motivational Interviewing dialogues. -Through an extensive expert evaluation of the generated dataset and the -dialogue model trained on it, we demonstrate the quality, expertise, and -practicality of KMI. We also introduce novel metrics derived from MI theory in -order to evaluate dialogues from the perspective of MI. +We introduce a new approach to systematically map features discovered by +sparse autoencoder across consecutive layers of large language models, +extending earlier work that examined inter-layer feature links. By using a +data-free cosine similarity technique, we trace how specific features persist, +transform, or first appear at each stage. This method yields granular flow +graphs of feature evolution, enabling fine-grained interpretability and +mechanistic insights into model computations. Crucially, we demonstrate how +these cross-layer feature maps facilitate direct steering of model behavior by +amplifying or suppressing chosen features, achieving targeted thematic control +in text generation. Together, our findings highlight the utility of a causal, +cross-layer interpretability framework that not only clarifies how features +develop through forward passes but also provides new means for transparent +manipulation of large language models. 
-摘要:由於對心理健康服務的需求日益增加,導致以人工智慧為基礎的心理健康聊天機器人興起,儘管與隱私、資料蒐集和專業知識相關的挑戰依然存在。動機性訪談 (MI) 正作為提升這些聊天機器人在開發方面專業知識的理論基礎而備受關注。然而,現有的資料集顯示出訓練聊天機器人的限制,導致對 MI 和心理治療領域中公開可用資源的需求大幅增加。這些挑戰在非英語語言中更加明顯,因為它們受到的關注較少。在本文中,我們提出了一個新穎的架構,它模擬了豐富專業治療師專業知識的 MI 課程。我們訓練了一個 MI 預測模型,它模擬了專業治療師的行為選擇,並採用大型語言模型 (LLM) 透過提示工程來產生話語。然後,我們展示了 KMI,這是第一個理論上以 MI 為基礎的合成資料集,其中包含 1,000 個高品質的韓語動機性訪談對話。透過對所產生的資料集和在該資料集上訓練的對話模型進行廣泛的專家評估,我們展示了 KMI 的品質、專業知識和實用性。我們還引入了從 MI 理論中衍生的新指標,以便從 MI 的角度評估對話。 +摘要:我們提出了一種新方法,用於系統性地繪製大型語言模型連續層中稀疏自動編碼器發現的功能,擴展了先前研究層間特徵連結的工作。透過使用無資料餘弦相似性技術,我們追蹤特定特徵在每個階段如何持續、轉換或首次出現。此方法產生了特徵演化的細粒度流程圖,實現了細粒度的可解釋性和對模型運算的機制見解。至關重要的是,我們展示了這些跨層特徵圖如何透過放大或抑制所選特徵來促進模型行為的直接引導,在文字生成中實現目標主題控制。我們的研究結果共同突出了因果、跨層可解釋性框架的效用,不僅闡明了特徵如何透過前向傳遞發展,還提供了新的方法來透明地操作大型語言模型。 -##### **ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports** -2502.05638v1 by Aynur Guluzade, Naguib Heiba, Zeyd Boukhers, Florim Hamiti, Jahid Hasan Polash, Yehya Mohamad, Carlos A Velasco +##### **A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs** +2502.02896v1 by Bradley P. Allen, Paul T. Groth + +Evaluating large language models (LLMs) for tasks like fact extraction in +support of knowledge graph construction frequently involves computing accuracy +metrics using a ground truth benchmark based on a knowledge graph (KG). These +evaluations assume that errors represent factual disagreements. However, human +discourse frequently features metalinguistic disagreement, where agents differ +not on facts but on the meaning of the language used to express them. Given the +complexity of natural language processing and generation using LLMs, we ask: do +metalinguistic disagreements occur between LLMs and KGs? Based on an +investigation using the T-REx knowledge alignment dataset, we hypothesize that +metalinguistic disagreement does in fact occur between LLMs and KGs, with +potential relevance for the practice of knowledge graph engineering. 
We propose +a benchmark for evaluating the detection of factual and metalinguistic +disagreements between LLMs and KGs. An initial proof of concept of such a +benchmark is available on GitHub. -Europe's healthcare systems require enhanced interoperability and -digitalization, driving a demand for innovative solutions to process legacy -clinical data. This paper presents the results of our project, which aims to -leverage Large Language Models (LLMs) to extract structured information from -unstructured clinical reports, focusing on patient history, diagnoses, -treatments, and other predefined categories. We developed a workflow with a -user interface and evaluated LLMs of varying sizes through prompting strategies -and fine-tuning. Our results show that fine-tuned smaller models match or -surpass larger counterparts in performance, offering efficiency for -resource-limited settings. A new dataset of 60,000 annotated English clinical -summaries and 24,000 German translations was validated with automated and -manual checks. The evaluations used ROUGE, BERTScore, and entity-level metrics. -The work highlights the approach's viability and outlines future improvements. 
+摘要:評估大型語言模型 (LLM) 執行知識圖譜建構支援事實萃取等任務時,通常會使用基於知識圖譜 (KG) 的基準事實計算準確度指標。這些評估假設錯誤代表事實上的分歧。然而,人類話語經常出現元語言分歧,其中代理人之間的差異不在於事實,而在於用於表達事實的語言的含義。鑑於使用 LLM 處理和產生自然語言的複雜性,我們提出疑問:LLM 和 KG 之間是否會發生元語言分歧?根據使用 T-REx 知識比對資料集進行的調查,我們假設元語言分歧確實會發生在 LLM 和 KG 之間,並可能與知識圖譜工程實務有關。我們提出一個基準,用於評估 LLM 和 KG 之間的事實和元語言分歧的偵測。此基準的初步概念驗證可在 Github 上取得。 -摘要:歐洲的醫療保健系統需要增強互通性和數位化,這驅動了對創新解決方案的需求,以處理傳統的臨床數據。本文介紹了我們專案的成果,該專案旨在利用大型語言模型 (LLM) 從非結構化的臨床報告中提取結構化的資訊,重點放在病歷、診斷、治療和其他預定義類別上。我們開發了一個具有使用者介面的工作流程,並透過提示策略和微調來評估不同規模的 LLM。我們的結果顯示,微調後的較小模型在效能上與較大的模型相匹配或超越它們,為資源有限的環境提供了效率。一個包含 60,000 個註解英文臨床摘要和 24,000 個德文翻譯的新資料集已透過自動化和手動檢查進行驗證。評估使用了 ROUGE、BERTScore 和實體層級的指標。這項工作突出了這種方法的可行性,並概述了未來的改進。 +##### **Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization** +2502.02810v1 by Chanhui Lee, Yuheon Song, YongJun Jeong, Hanbum Ko, Rodrigo Hormazabal, Sehui Han, Kyunghoon Bae, Sungbin Lim, Sungwoong Kim +Recent advances in Large Language Models (LLMs) have motivated the +development of general LLMs for molecular tasks. While several studies have +demonstrated that fine-tuned LLMs can achieve impressive benchmark +performances, they are far from genuine generalist molecular LLMs due to a lack +of fundamental understanding of molecular structure. Specifically, when given +molecular task instructions, LLMs trained with naive next-token prediction +training assign similar likelihood scores to both original and negatively +corrupted molecules, revealing their lack of molecular structure understanding +that is crucial for reliable and general molecular LLMs. To overcome this +limitation and obtain a true generalist molecular LLM, we introduce a novel +multi-modal training method based on a thorough multi-modal instruction tuning +as well as a molecular structure preference optimization between chosen and +rejected graphs. 
On various molecular benchmarks, the proposed generalist +molecular LLM, called Mol-LLM, achieves state-of-the-art performances among +generalist LLMs on most tasks, at the same time, surpassing or comparable to +state-of-the-art specialist LLMs. Moreover, Mol-LLM also shows superior +generalization performances in reaction prediction tasks, demonstrating the +effect of the molecular structure understanding for generalization perspective. -### LLM -|Publish Date|Title|Authors|Homepage|Code| -| :---: | :---: | :---: | :---: | :---: | -|**2025-02-20**|**LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention**|Shang Yang et.al.|[2502.14866v1](http://arxiv.org/abs/2502.14866v1)|null| -|**2025-02-20**|**Interpretable Text Embeddings and Text Similarity Explanation: A Primer**|Juri Opitz et.al.|[2502.14862v1](http://arxiv.org/abs/2502.14862v1)|null| -|**2025-02-20**|**Aligning LLMs to Ask Good Questions A Case Study in Clinical Reasoning**|Shuyue Stella Li et.al.|[2502.14860v1](http://arxiv.org/abs/2502.14860v1)|null| -|**2025-02-20**|**FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling**|Weilin Zhao et.al.|[2502.14856v1](http://arxiv.org/abs/2502.14856v1)|null| -|**2025-02-20**|**Prompt-to-Leaderboard**|Evan Frick et.al.|[2502.14855v1](http://arxiv.org/abs/2502.14855v1)|null| -|**2025-02-20**|**CLIPPER: Compression enables long-context synthetic data generation**|Chau Minh Pham et.al.|[2502.14854v1](http://arxiv.org/abs/2502.14854v1)|null| -|**2025-02-20**|**GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks**|Jianwen Luo et.al.|[2502.14848v1](http://arxiv.org/abs/2502.14848v1)|null| -|**2025-02-20**|**Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation**|Yue Yang et.al.|[2502.14846v1](http://arxiv.org/abs/2502.14846v1)|null| -|**2025-02-20**|**Revealing and Mitigating Over-Attention in Knowledge Editing**|Pinzheng Wang 
et.al.|[2502.14838v1](http://arxiv.org/abs/2502.14838v1)|null| -|**2025-02-20**|**Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs**|Tao Ji et.al.|[2502.14837v1](http://arxiv.org/abs/2502.14837v1)|null| -|**2025-02-20**|**LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models**|Shangqing Tu et.al.|[2502.14834v1](http://arxiv.org/abs/2502.14834v1)|null| -|**2025-02-20**|**Improving the Diffusability of Autoencoders**|Ivan Skorokhodov et.al.|[2502.14831v1](http://arxiv.org/abs/2502.14831v1)|null| -|**2025-02-20**|**Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs**|Danni Liu et.al.|[2502.14830v1](http://arxiv.org/abs/2502.14830v1)|null| -|**2025-02-20**|**Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps**|Martin Tutek et.al.|[2502.14829v1](http://arxiv.org/abs/2502.14829v1)|null| -|**2025-02-20**|**Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison**|Aiswarya Baby et.al.|[2502.14827v1](http://arxiv.org/abs/2502.14827v1)|null| -|**2025-02-20**|**eC-Tab2Text: Aspect-Based Text Generation from e-Commerce Product Tables**|Luis Antonio Gutiérrez Guanilo et.al.|[2502.14820v1](http://arxiv.org/abs/2502.14820v1)|null| -|**2025-02-20**|**Optimizing Model Selection for Compound AI Systems**|Lingjiao Chen et.al.|[2502.14815v1](http://arxiv.org/abs/2502.14815v1)|null| -|**2025-02-20**|**FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis**|Fadillah Maani et.al.|[2502.14807v1](http://arxiv.org/abs/2502.14807v1)|null| -|**2025-02-20**|**From RAG to Memory: Non-Parametric Continual Learning for Large Language Models**|Bernal Jiménez Gutiérrez et.al.|[2502.14802v1](http://arxiv.org/abs/2502.14802v1)|[link](https://github.com/osu-nlp-group/hipporag)| -|**2025-02-20**|**A Survey on Text-Driven 360-Degree Panorama Generation**|Hai Wang 
et.al.|[2502.14799v1](http://arxiv.org/abs/2502.14799v1)|null| -|**2025-02-20**|**Rapid Word Learning Through Meta In-Context Learning**|Wentao Wang et.al.|[2502.14791v1](http://arxiv.org/abs/2502.14791v1)|null| -|**2025-02-20**|**SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features**|Michael Tschannen et.al.|[2502.14786v1](http://arxiv.org/abs/2502.14786v1)|[link](https://github.com/google-research/big_vision)| -|**2025-02-20**|**ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting**|Abhijit Mishra et.al.|[2502.14780v1](http://arxiv.org/abs/2502.14780v1)|null| -|**2025-02-20**|**Harnessing PDF Data for Improving Japanese Large Multimodal Models**|Jeonghun Baek et.al.|[2502.14778v1](http://arxiv.org/abs/2502.14778v1)|null| -|**2025-02-20**|**Making Universal Policies Universal**|Niklas Höpner et.al.|[2502.14777v1](http://arxiv.org/abs/2502.14777v1)|null| -|**2025-02-20**|**SurveyX: Academic Survey Automation via Large Language Models**|Xun Liang et.al.|[2502.14776v1](http://arxiv.org/abs/2502.14776v1)|null| -|**2025-02-20**|**Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning**|Tian Xie et.al.|[2502.14768v1](http://arxiv.org/abs/2502.14768v1)|null| -|**2025-02-20**|**Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis**|Priyanka Kargupta et.al.|[2502.14767v1](http://arxiv.org/abs/2502.14767v1)|null| -|**2025-02-20**|**Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning**|Juraj Vladika et.al.|[2502.14765v1](http://arxiv.org/abs/2502.14765v1)|null| -|**2025-02-20**|**EquivaMap: Leveraging LLMs for Automatic Equivalence Checking of Optimization Formulations**|Haotian Zhai et.al.|[2502.14760v1](http://arxiv.org/abs/2502.14760v1)|null| -|**2025-02-20**|**On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation 
Systems**|Juraj Vladika et.al.|[2502.14759v1](http://arxiv.org/abs/2502.14759v1)|null| -|**2025-02-20**|**MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders**|Maya Varma et.al.|[2502.14753v1](http://arxiv.org/abs/2502.14753v1)|null| -|**2025-02-20**|**TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators**|Jianling Li et.al.|[2502.14752v1](http://arxiv.org/abs/2502.14752v1)|null| -|**2025-02-20**|**Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of LLMs**|Zongxia Li et.al.|[2502.14748v1](http://arxiv.org/abs/2502.14748v1)|null| -|**2025-02-20**|**HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States**|Yilei Jiang et.al.|[2502.14744v1](http://arxiv.org/abs/2502.14744v1)|null| -|**2025-02-20**|**Multi-Agent Coordination across Diverse Applications: A Survey**|Lijun Sun et.al.|[2502.14743v1](http://arxiv.org/abs/2502.14743v1)|null| -|**2025-02-20**|**YOLOv12: A Breakdown of the Key Architectural Features**|Mujadded Al Rabbani Alif et.al.|[2502.14740v1](http://arxiv.org/abs/2502.14740v1)|null| -|**2025-02-20**|**SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines**|M-A-P Team et.al.|[2502.14739v1](http://arxiv.org/abs/2502.14739v1)|null| -|**2025-02-20**|**EAGER-LLM: Enhancing Large Language Models as Recommenders through Exogenous Behavior-Semantic Integration**|Minjie Hong et.al.|[2502.14735v1](http://arxiv.org/abs/2502.14735v1)|null| -|**2025-02-20**|**Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models**|Hongji Li et.al.|[2502.14734v1](http://arxiv.org/abs/2502.14734v1)|null| -|**2025-02-20**|**WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models**|Yifu Chen et.al.|[2502.14727v1](http://arxiv.org/abs/2502.14727v1)|null| -|**2025-02-20**|**Entity Framing and 
Role Portrayal in the News**|Tarek Mahmoud et.al.|[2502.14718v1](http://arxiv.org/abs/2502.14718v1)|null| -|**2025-02-20**|**From Knowledge Generation to Knowledge Verification: Examining the BioMedical Generative Capabilities of ChatGPT**|Ahmed Abdeen Hamed et.al.|[2502.14714v1](http://arxiv.org/abs/2502.14714v1)|null| -|**2025-02-20**|**Data-Efficient Pretraining with Group-Level Data Influence Modeling**|Zichun Yu et.al.|[2502.14709v1](http://arxiv.org/abs/2502.14709v1)|null| -|**2025-02-20**|**Human Misperception of Generative-AI Alignment: A Laboratory Experiment**|Kevin He et.al.|[2502.14708v1](http://arxiv.org/abs/2502.14708v1)|null| -|**2025-02-20**|**Not All Data are Good Labels: On the Self-supervised Labeling for Time Series Forecasting**|Yuxuan Yang et.al.|[2502.14704v1](http://arxiv.org/abs/2502.14704v1)|null| -|**2025-02-20**|**I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search**|Zujie Liang et.al.|[2502.14693v1](http://arxiv.org/abs/2502.14693v1)|null| -|**2025-02-20**|**Bridging the Gap: Transforming Natural Language Questions into SQL Queries via Abstract Query Pattern and Contextual Schema Markup**|Yonghui Kong et.al.|[2502.14682v1](http://arxiv.org/abs/2502.14682v1)|null| -|**2025-02-20**|**How to Get Your LLM to Generate Challenging Problems for Evaluation**|Arkil Patel et.al.|[2502.14678v1](http://arxiv.org/abs/2502.14678v1)|null| -|**2025-02-20**|**Data-Constrained Synthesis of Training Data for De-Identification**|Thomas Vakili et.al.|[2502.14677v1](http://arxiv.org/abs/2502.14677v1)|null| -|**2025-02-20**|**BP-SGCN: Behavioral Pseudo-Label Informed Sparse Graph Convolution Network for Pedestrian and Heterogeneous Trajectory Prediction**|Ruochen Li et.al.|[2502.14676v1](http://arxiv.org/abs/2502.14676v1)|null| -|**2025-02-20**|**Explanations of Deep Language Models Explain Language Representations in the Brain**|Maryam Rahimi et.al.|[2502.14671v1](http://arxiv.org/abs/2502.14671v1)|null| -|**2025-02-20**|**AlphaMaze: 
Enhancing Large Language Models' Spatial Intelligence via GRPO**|Alan Dao et.al.|[2502.14669v1](http://arxiv.org/abs/2502.14669v1)|null| -|**2025-02-20**|**InstructAgent: Building User Controllable Recommender via LLM Agent**|Wujiang Xu et.al.|[2502.14662v1](http://arxiv.org/abs/2502.14662v1)|null| -|**2025-02-20**|**Edit Once, Update Everywhere: A Simple Framework for Cross-Lingual Knowledge Synchronization in LLMs**|Yuchen Wu et.al.|[2502.14645v1](http://arxiv.org/abs/2502.14645v1)|null| -|**2025-02-20**|**LIFT: Improving Long Context Understanding of Large Language Models through Long Input Fine-Tuning**|Yansheng Mao et.al.|[2502.14644v1](http://arxiv.org/abs/2502.14644v1)|null| -|**2025-02-20**|**Length-Controlled Margin-Based Preference Optimization without Reference Model**|Gengxu Li et.al.|[2502.14643v1](http://arxiv.org/abs/2502.14643v1)|null| -|**2025-02-20**|**How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation**|Rui Li et.al.|[2502.14642v1](http://arxiv.org/abs/2502.14642v1)|null| -|**2025-02-20**|**NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization**|Zheyuan Zhang et.al.|[2502.14638v1](http://arxiv.org/abs/2502.14638v1)|null| -|**2025-02-20**|**ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation**|Angxiao Yue et.al.|[2502.14637v1](http://arxiv.org/abs/2502.14637v1)|[link](https://github.com/AngxiaoYue/ReQFlow)| -|**2025-02-20**|**PEARL: Towards Permutation-Resilient LLMs**|Liang Chen et.al.|[2502.14628v1](http://arxiv.org/abs/2502.14628v1)|null| -|**2025-02-20**|**ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors**|Yuguo Yin et.al.|[2502.14627v1](http://arxiv.org/abs/2502.14627v1)|null| -|**2025-02-20**|**Multi-Record Web Page Information Extraction From News Websites**|Alexander Kustenkov et.al.|[2502.14625v1](http://arxiv.org/abs/2502.14625v1)|null| 
-|**2025-02-20**|**Exploring RWKV for Sentence Embeddings: Layer-wise Analysis and Baseline Comparison for Semantic Similarity**|Xinghan Pan et.al.|[2502.14620v1](http://arxiv.org/abs/2502.14620v1)|[link](https://github.com/PStarH/RWKV-embedding)| -|**2025-02-20**|**Reward Models Identify Consistency, Not Causality**|Yuhui Xu et.al.|[2502.14619v1](http://arxiv.org/abs/2502.14619v1)|null| -|**2025-02-20**|**FIND: Fine-grained Information Density Guided Adaptive Retrieval-Augmented Generation for Disease Diagnosis**|Mingyi Jia et.al.|[2502.14614v1](http://arxiv.org/abs/2502.14614v1)|null| -|**2025-02-20**|**Behavioral Analysis of Information Salience in Large Language Models**|Jan Trienes et.al.|[2502.14613v1](http://arxiv.org/abs/2502.14613v1)|null| -|**2025-02-20**|**A Theory for Conditional Generative Modeling on Multiple Data Sources**|Rongzhen Wang et.al.|[2502.14583v1](http://arxiv.org/abs/2502.14583v1)|null| -|**2025-02-20**|**A Statistical Case Against Empirical Human-AI Alignment**|Julian Rodemann et.al.|[2502.14581v1](http://arxiv.org/abs/2502.14581v1)|null| -|**2025-02-20**|**ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification**|Hyunseok Lee et.al.|[2502.14565v1](http://arxiv.org/abs/2502.14565v1)|null| -|**2025-02-20**|**Plan-over-Graph: Towards Parallelable LLM Agent Schedule**|Shiqi Zhang et.al.|[2502.14563v1](http://arxiv.org/abs/2502.14563v1)|null| -|**2025-02-20**|**Can LLMs Predict Citation Intent? 
An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs**|Paris Koloveas et.al.|[2502.14561v1](http://arxiv.org/abs/2502.14561v1)|null| -|**2025-02-20**|**Less is More: Improving LLM Alignment via Preference Data Selection**|Xun Deng et.al.|[2502.14560v1](http://arxiv.org/abs/2502.14560v1)|null| -|**2025-02-20**|**FUIA: Model Inversion Attack against Federated Unlearning**|Lei Zhou et.al.|[2502.14558v1](http://arxiv.org/abs/2502.14558v1)|null| -|**2025-02-20**|**Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling**|Eric Egli et.al.|[2502.14553v1](http://arxiv.org/abs/2502.14553v1)|null| -|**2025-02-20**|**Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks**|Maya Bechler-Speicher et.al.|[2502.14546v1](http://arxiv.org/abs/2502.14546v1)|null| -|**2025-02-20**|**LLM-based User Profile Management for Recommender System**|Seunghwan Bang et.al.|[2502.14541v1](http://arxiv.org/abs/2502.14541v1)|null| -|**2025-02-20**|**LoRA-GGPO: Mitigating Double Descent in LoRA Fine-Tuning via Gradient-Guided Perturbation Optimization**|Yupeng Chang et.al.|[2502.14538v1](http://arxiv.org/abs/2502.14538v1)|null| -|**2025-02-20**|**CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models**|Zhenhong Zhou et.al.|[2502.14529v1](http://arxiv.org/abs/2502.14529v1)|null| -|**2025-02-20**|**Small Graph Is All You Need: DeepStateGNN for Scalable Traffic Forecasting**|Yannick Wölker et.al.|[2502.14525v1](http://arxiv.org/abs/2502.14525v1)|null| -|**2025-02-20**|**Generative adversarial networks vs large language models: a comparative study on synthetic tabular data generation**|Austin A. 
Barr et.al.|[2502.14523v1](http://arxiv.org/abs/2502.14523v1)|null| -|**2025-02-20**|**MultiSlav: Using Cross-Lingual Knowledge Transfer to Combat the Curse of Multilinguality**|Artur Kot et.al.|[2502.14509v1](http://arxiv.org/abs/2502.14509v1)|null| -|**2025-02-20**|**Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases**|Rena Gao et.al.|[2502.14507v1](http://arxiv.org/abs/2502.14507v1)|null| -|**2025-02-20**|**PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models**|Yu Meng et.al.|[2502.14504v1](http://arxiv.org/abs/2502.14504v1)|null| -|**2025-02-20**|**How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?**|Sergey Pletenev et.al.|[2502.14502v1](http://arxiv.org/abs/2502.14502v1)|null| -|**2025-02-20**|**Towards a Perspectivist Turn in Argument Quality Assessment**|Julia Romberg et.al.|[2502.14501v1](http://arxiv.org/abs/2502.14501v1)|null| -|**2025-02-20**|**MLGym: A New Framework and Benchmark for Advancing AI Research Agents**|Deepak Nathani et.al.|[2502.14499v1](http://arxiv.org/abs/2502.14499v1)|null| -|**2025-02-20**|**Stories that (are) Move(d by) Markets: A Causal Exploration of Market Shocks and Semantic Shifts across Different Partisan Groups**|Felix Drinkall et.al.|[2502.14497v1](http://arxiv.org/abs/2502.14497v1)|null| -|**2025-02-20**|**Enhancing Language Multi-Agent Learning with Multi-Agent Credit Re-Assignment for Interactive Environment Generalization**|Zhitao He et.al.|[2502.14496v1](http://arxiv.org/abs/2502.14496v1)|null| -|**2025-02-20**|**StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following**|Jinnan Li et.al.|[2502.14494v1](http://arxiv.org/abs/2502.14494v1)|null| -|**2025-02-20**|**Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk**|Elija Perrier et.al.|[2502.14491v1](http://arxiv.org/abs/2502.14491v1)|null| -|**2025-02-20**|**Temporal Misalignment and Probabilistic 
Neurons**|Velibor Bojković et.al.|[2502.14487v1](http://arxiv.org/abs/2502.14487v1)|null| -|**2025-02-20**|**How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation**|Zhuohang Long et.al.|[2502.14486v1](http://arxiv.org/abs/2502.14486v1)|null| -|**2025-02-20**|**NLoRA: Nyström-Initiated Low-Rank Adaptation for Large Language Models**|Chenlu Guo et.al.|[2502.14482v1](http://arxiv.org/abs/2502.14482v1)|null| -|**2025-02-20**|**Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression**|Haoyu Wang et.al.|[2502.14477v1](http://arxiv.org/abs/2502.14477v1)|null| -|**2025-02-20**|**Argument-Based Comparative Question Answering Evaluation Benchmark**|Irina Nikishina et.al.|[2502.14476v1](http://arxiv.org/abs/2502.14476v1)|null| -|**2025-02-20**|**Enhancing Smart Environments with Context-Aware Chatbots using Large Language Models**|Aurora Polo-Rodríguez et.al.|[2502.14469v1](http://arxiv.org/abs/2502.14469v1)|null| -|**2025-02-20**|**Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing**|Aviv Bick et.al.|[2502.14458v1](http://arxiv.org/abs/2502.14458v1)|null| -|**2025-02-20**|**Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization**|Ran Ding et.al.|[2502.14456v1](http://arxiv.org/abs/2502.14456v1)|null| -|**2025-02-20**|**Optimal word order for non-causal text generation with Large Language Models: the Spanish case**|Andrea Busto-Castiñeira et.al.|[2502.14451v1](http://arxiv.org/abs/2502.14451v1)|null| +摘要:大型語言模型 (LLM) 的近期進展激勵了針對分子任務開發通用 LLM。雖然多項研究已證明微調 LLM 可實現令人印象深刻的基準效能,但由於缺乏對分子結構的基本理解,它們遠非真正的通才分子 LLM。具體來說,當給予分子任務說明時,使用天真的下一個符號預測訓練訓練的 LLM 會將類似的可能性評分分配給原始分子和負面損壞分子,這顯示出它們缺乏對分子結構的理解,而這對於可靠且通用的分子 LLM 至關重要。為了克服這個限制並獲得真正的通才分子 LLM,我們引入了一種新穎的多模態訓練方法,該方法基於徹底的多模態說明調整以及在所選和拒絕圖形之間的分子結構偏好最佳化。在各種分子基準測試中,所提出的通才分子 LLM(稱為 Mol-LLM)在多數任務中實現了通才 LLM 中的最新效能,同時超越或與最新的專家 LLM 相當。此外,Mol-LLM 在反應預測任務中也展現出優異的泛化效能,證明了分子結構理解對泛化觀點的影響。 -#### Abstracts -##### 
**LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention** -2502.14866v1 by Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han +##### **Leveraging the true depth of LLMs** +2502.02790v1 by Ramón Calvo González, Daniele Paliotta, Matteo Pagliardini, Martin Jaggi, François Fleuret -Large language models (LLMs) have shown remarkable potential in processing -long sequences, yet efficiently serving these long-context models remains -challenging due to the quadratic computational complexity of attention in the -prefilling stage and the large memory footprint of the KV cache in the decoding -stage. To address these issues, we introduce LServe, an efficient system that -accelerates long-sequence LLM serving via hybrid sparse attention. This method -unifies different hardware-friendly, structured sparsity patterns for both -prefilling and decoding attention into a single framework, where computations -on less important tokens are skipped block-wise. LServe demonstrates the -compatibility of static and dynamic sparsity in long-context LLM attention. -This design enables multiplicative speedups by combining these optimizations. -Specifically, we convert half of the attention heads to nearly free streaming -heads in both the prefilling and decoding stages. Additionally, we find that -only a constant number of KV pages is required to preserve long-context -capabilities, irrespective of context length. We then design a hierarchical KV -page selection policy that dynamically prunes KV pages based on query-centric -similarity. On average, LServe accelerates LLM prefilling by up to 2.9x and -decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy. Code is -released at https://github.com/mit-han-lab/omniserve. +Large Language Models demonstrate remarkable capabilities at the cost of high +compute requirements. 
While recent research has shown that intermediate layers +can be removed or have their order shuffled without impacting performance +significantly, these findings have not been employed to reduce the +computational cost of inference. We investigate several potential ways to +reduce the depth of pre-trained LLMs without significantly affecting +performance. Leveraging our insights, we present a novel approach that exploits +this decoupling between layers by grouping some of them into pairs that can be +evaluated in parallel. + This modification of the computational graph -- through better parallelism -- +results in an average improvement of around 1.20x on the number of tokens +generated per second, without re-training nor fine-tuning, while retaining +95%-99% of the original accuracy. Empirical evaluation demonstrates that this +approach significantly improves serving efficiency while maintaining model +performance, offering a practical improvement for large-scale LLM deployment. -摘要:大型語言模型 (LLM) 在處理長序列方面展現出驚人的潛力,但由於預填充階段注意力的二次計算複雜度和解碼階段 KV 快取的大量記憶體使用量,有效提供這些長語境模型服務仍然具有挑戰性。為了解決這些問題,我們引入了 LServe,一個透過混合稀疏注意力加速長序列 LLM 服務的高效系統。此方法將不同的硬體友善的結構化稀疏模式統一到一個單一的架構中,用於預填充和解碼注意力,其中對較不重要的符號的運算會以區塊方式略過。LServe 證明了靜態和動態稀疏性在長語境 LLM 注意力中的相容性。此設計透過結合這些最佳化來實現倍增加速。具體來說,我們將一半的注意力頭轉換為預填充和解碼階段中幾乎免費的串流頭。此外,我們發現僅需要恆定的 KV 頁數來保留長語境功能,而與語境長度無關。然後,我們設計了一個分層式 KV 頁面選擇策略,根據以查詢為中心的相似性動態刪除 KV 頁面。平均而言,LServe 將 LLM 預填充加速了 2.9 倍,將解碼加速了 1.3-2.1 倍,同時維持長語境的準確性。程式碼已發布在 https://github.com/mit-han-lab/omniserve。 +摘要:大型语言模型展示了其强大的功能,但代价是较高的计算需求。虽然最近的研究表明,中间层可以被移除或重新排列其顺序,而不会显著影响性能,但这些发现尚未被用来降低推理的计算成本。我们研究了几种潜在的方法来减少预训练 LLM 的深度,而不会显著影响性能。利用我们的见解,我们提出了一种新颖的方法,该方法通过将其中一些分组为可以并行评估的成对来利用层之间的这种解耦。 +通过更好的并行性对计算图进行修改,平均而言,每秒生成的令牌数量提高了约 1.20 倍,而无需重新训练或微调,同时保留了 95%-99% 的原始准确性。经验评估表明,这种方法显著提高了服务效率,同时保持了模型性能,为大规模 LLM 部署提供了实际改进。 -##### **Interpretable Text Embeddings and Text Similarity Explanation: A Primer** -2502.14862v1 by Juri Opitz, Lucas Möller, Andrianos Michail, Simon Clematide +##### **Modular Training of Neural 
Networks aids Interpretability** +2502.02470v2 by Satvik Golechha, Maheep Chaudhary, Joan Velja, Alessandro Abate, Nandi Schoots -Text embeddings and text embedding models are a backbone of many AI and NLP -systems, particularly those involving search. However, interpretability -challenges persist, especially in explaining obtained similarity scores, which -is crucial for applications requiring transparency. In this paper, we give a -structured overview of interpretability methods specializing in explaining -those similarity scores, an emerging research area. We study the methods' -individual ideas and techniques, evaluating their potential for improving -interpretability of text embeddings and explaining predicted similarities. +An approach to improve neural network interpretability is via clusterability, +i.e., splitting a model into disjoint clusters that can be studied +independently. We define a measure for clusterability and show that pre-trained +models form highly enmeshed clusters via spectral graph clustering. We thus +train models to be more modular using a "clusterability loss" function that +encourages the formation of non-interacting clusters. Using automated +interpretability techniques, we show that our method can help train models that +are more modular and learn different, disjoint, and smaller circuits. We +investigate CNNs trained on MNIST and CIFAR, small transformers trained on +modular addition, and language models. Our approach provides a promising +direction for training neural networks that learn simpler functions and are +easier to interpret. 
-摘要:文字嵌入和文字嵌入模型是許多 AI 和 NLP 系統的骨幹,特別是那些涉及搜尋的系統。然而,可解釋性的挑戰依然存在,特別是在解釋獲得的相似度分數時,這對於需要透明度的應用程式至關重要。在本文中,我們對專門用於解釋這些相似度分數的可解釋性方法給予結構化的概述,這是一個新興的研究領域。我們研究了這些方法的個別想法和技術,評估它們改善文字嵌入的可解釋性和解釋預測相似度的潛力。 +摘要:一種改善神經網路可解釋性的方法是透過群集性, +也就是將模型分割成可獨立研究的不相交群集。我們定義一個群集性的度量,並顯示預訓練的 +模型透過光譜圖形群集形成高度糾纏的群集。因此,我們使用「群集性損失」函數訓練模型,使其更具模組化, +這鼓勵形成非交互群集。使用自動化可解釋性技術,我們顯示我們的模型可以幫助訓練更具模組化的模型,並學習不同、不相交且較小的電路。我們 +研究了在 MNIST 和 CIFAR 上訓練的 CNN,在模組化加法上訓練的小型Transformer,以及語言模型。我們的做法為訓練學習更簡單函數且更容易解釋的神經網路提供了有希望的方向。 -##### **Aligning LLMs to Ask Good Questions A Case Study in Clinical Reasoning** -2502.14860v1 by Shuyue Stella Li, Jimin Mun, Faeze Brahman, Jonathan S. Ilgen, Yulia Tsvetkov, Maarten Sap +##### **Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs** +2502.02362v3 by Sagnik Mukherjee, Abhinav Chinta, Takyoung Kim, Tarun Anoop Sharma, Dilek Hakkani-Tür -Large language models (LLMs) often fail to ask effective questions under -uncertainty, making them unreliable in domains where proactive -information-gathering is essential for decisionmaking. We present ALFA, a -framework that improves LLM question-asking by (i) decomposing the notion of a -"good" question into a set of theory-grounded attributes (e.g., clarity, -relevance), (ii) controllably synthesizing attribute-specific question -variations, and (iii) aligning models via preference-based optimization to -explicitly learn to ask better questions along these fine-grained attributes. -Focusing on clinical reasoning as a case study, we introduce the MediQ-AskDocs -dataset, composed of 17k real-world clinical interactions augmented with 80k -attribute-specific preference pairs of follow-up questions, as well as a novel -expert-annotated interactive healthcare QA task to evaluate question-asking -abilities. Models aligned with ALFA reduce diagnostic errors by 56.6% on -MediQ-AskDocs compared to SOTA instruction-tuned LLMs, with a question-level -win-rate of 64.4% and strong generalizability. 
Our findings suggest that -explicitly guiding question-asking with structured, fine-grained attributes -offers a scalable path to improve LLMs, especially in expert application -domains. +Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large +language models (LLMs) by enabling detailed step-by-step solutions. However, +due to the verbosity of LLMs, the resulting reasoning chains can be long, +making it harder to verify the reasoning steps and trace issues resulting from +dependencies between the steps that may be farther away in the sequence of +steps. Importantly, mathematical reasoning allows each step to be derived from +a small set of premises, which are a subset of the preceding steps in the +reasoning chain. In this paper, we present a framework that identifies the +premises for each step, to improve the evaluation of reasoning. We restructure +conventional linear reasoning chains into Premise Augmented Reasoning Chains +(PARC) by introducing premise links, resulting in a directed acyclic graph +where the nodes are the steps and the edges are the premise links. Through +experiments with a PARC-based dataset that we built, namely PERL (Premises and +ERrors identification in LLMs), we demonstrate that LLMs can reliably identify +premises within complex reasoning chains. In particular, even open-source LLMs +achieve 90% recall in premise identification. We also show that PARC helps to +identify errors in reasoning chains more reliably. The accuracy of error +identification improves by 6% to 16% absolute when step-by-step verification is +carried out in PARC under the premises. Our findings highlight the utility of +premise-centric representations in addressing complex problem-solving tasks and +open new avenues for improving the reliability of LLM-based reasoning +evaluations. 
-摘要:大型語言模型 (LLM) 經常在不確定性下無法提出有效問題,這使得它們在主動收集資訊對於決策制定至關重要的領域中不可靠。我們提出 ALFA,一個透過 (i) 將「良好」問題的概念分解成一組以理論為基礎的屬性(例如,清晰度、相關性),(ii) 可控地合成屬性特定的問題變體,以及 (iii) 透過基於偏好的最佳化調整模型,明確學習沿著這些細緻屬性提出更好的問題,來改善 LLM 提問的架構。專注於臨床推理作為案例研究,我們引入了 MediQ-AskDocs 資料集,由 17k 個真實世界的臨床互動組成,並增加了 80k 個屬性特定的後續問題偏好配對,以及一個由專家註解的互動式醫療保健問答任務來評估提問能力。與 SOTA 指令調整的 LLM 相比,與 ALFA 對齊的模型將 MediQ-AskDocs 上的診斷錯誤減少了 56.6%,問題層級的勝率為 64.4%,並且具有很強的普遍性。我們的研究結果表明,明確地以結構化、細緻的屬性來引導提問,提供了一條可擴充的途徑來改善 LLM,特別是在專家應用領域。 +摘要:思考鏈(CoT)提示透過提供詳細的逐步解法,增強大型語言模型(LLM)的數學推理能力。然而,由於 LLM 的冗長,產生的推理鏈可能很長,這使得驗證推理步驟和追蹤由步驟之間相依關係所產生的問題變得更加困難,而這些步驟可能在步驟順序中相距較遠。重要的是,數學推理允許每個步驟從一組小的前提中推導出來,這些前提是推理鏈中前一個步驟的子集。在本文中,我們提出了一個框架,用於識別每個步驟的前提,以改進推理評估。我們透過引入前提連結,將傳統的線性推理鏈重組為前提擴充推理鏈(PARC),產生一個有向無環圖,其中節點是步驟,而邊緣是前提連結。透過我們建立的基於 PARC 的資料集(即 PERL(LLM 中的前提和錯誤識別))進行的實驗,我們證明 LLM 能夠在複雜的推理鏈中可靠地識別前提。特別是,即使是開源 LLM 在前提識別中也能達到 90% 的召回率。我們還表明,PARC 有助於更可靠地識別推理鏈中的錯誤。在前提下於 PARC 中執行逐步驗證時,錯誤識別的準確度提高了 6% 到 16%。我們的研究結果突顯了以前提為中心的表示在解決複雜問題解決任務中的效用,並為改進基於 LLM 的推理評估的可靠性開闢了新途徑。 -##### **FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling** -2502.14856v1 by Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, Yuxiang Huang, Kaihuo Zhang, Weilun Zhao, Yuxuan Li, Jianyong Wang, Zhiyuan Liu, Maosong Sun +##### **AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement** +2502.02067v1 by Shivam Singh, Karthik Swaminathan, Nabanita Dash, Ramandeep Singh, Snehasis Banerjee, Mohan Sridharan, Madhava Krishna -Speculative sampling has emerged as an important technique for accelerating -the auto-regressive generation process of large language models (LLMs) by -utilizing a draft-then-verify mechanism to produce multiple tokens per forward -pass. 
While state-of-the-art speculative sampling methods use only a single -layer and a language modeling (LM) head as the draft model to achieve -impressive layer compression, their efficiency gains are substantially reduced -for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. -To address this, we present FR-Spec, a frequency-ranked speculative sampling -framework that optimizes draft candidate selection through vocabulary space -compression. By constraining the draft search to a frequency-prioritized token -subset, our method reduces LM Head computation overhead by 75% while ensuring -the equivalence of the final output distribution. Experiments across multiple -datasets demonstrate an average of 1.12$\times$ speedup over the -state-of-the-art speculative sampling method EAGLE-2. +Embodied agents assisting humans are often asked to complete a new task in a +new scenario. An agent preparing a particular dish in the kitchen based on a +known recipe may be asked to prepare a new dish or to perform cleaning tasks in +the storeroom. There may not be sufficient resources, e.g., time or labeled +examples, to train the agent for these new situations. Large Language Models +(LLMs) trained on considerable knowledge across many domains are able to +predict a sequence of abstract actions for such new tasks and scenarios, +although it may not be possible for the agent to execute this action sequence +due to task-, agent-, or domain-specific constraints. Our framework addresses +these challenges by leveraging the generic predictions provided by LLM and the +prior domain-specific knowledge encoded in a Knowledge Graph (KG), enabling an +agent to quickly adapt to new tasks and scenarios. The robot also solicits and +uses human input as needed to refine its existing knowledge. 
Based on +experimental evaluation over cooking and cleaning tasks in simulation domains, +we demonstrate that the interplay between LLM, KG, and human input leads to +substantial performance gains compared with just using the LLM output. -摘要:推測取樣已成為一種重要的技術,可用於透過利用先起草後驗證的機制來加速大型語言模型 (LLM) 的自迴歸生成過程,並在每次前向傳遞中產生多個代幣。儘管最先進的推測取樣方法只使用單一層和語言建模 (LM) 頭作為起草模型,以達成令人印象深刻的層壓縮,但對於大型詞彙表 LLM(例如詞彙表包含 128k 個代幣的 Llama-3-8B),其效率提升會大幅降低。為了解決這個問題,我們提出了 FR-Spec,這是一種頻率排序推測取樣架構,它透過詞彙空間壓縮來最佳化起草候選選取。我們的這個方法透過將起草搜尋限制在優先於頻率的代幣子集中,將 LM 頭部運算開銷減少了 75%,同時確保最終輸出分佈的等效性。透過多個資料集的實驗證明,與最先進的推測取樣方法 EAGLE-2 相比,平均提速了 1.12 倍。 +摘要:具身代理协助人类时,通常需要在新的情境中完成新的任务。基于已知食谱在厨房准备特定菜肴的代理可能会被要求准备新菜肴或在储藏室执行清洁任务。可能没有足够资源(例如时间或标记的示例)来训练代理以应对这些新情况。在许多领域接受大量知识训练的大型语言模型 (LLM) 能够预测此类新任务和情境的抽象动作序列,尽管代理可能无法执行此动作序列,因为任务、代理或特定于域的约束。我们的框架通过利用 LLM 提供的通用预测和知识图 (KG) 中编码的先前特定于域的知识来应对这些挑战,使代理能够快速适应新任务和情境。该机器人还会根据需要征求并使用人类输入来完善其现有知识。基于在模拟域中对烹饪和清洁任务的实验评估,我们证明了 LLM、KG 和人类输入之间的相互作用与仅使用 LLM 输出相比带来了巨大的性能提升。 -##### **Prompt-to-Leaderboard** -2502.14855v1 by Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anastasios N. Angelopoulos, Ion Stoica +##### **On Bob Dylan: A Computational Perspective** +2502.01772v1 by Prashant Garg -Large language model (LLM) evaluations typically rely on aggregated metrics -like accuracy or human preference, averaging across users and prompts. This -averaging obscures user- and prompt-specific variations in model performance. -To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces -leaderboards specific to a prompt. The core idea is to train an LLM taking -natural language prompts as input to output a vector of Bradley-Terry -coefficients which are then used to predict the human preference vote. The -resulting prompt-dependent leaderboards allow for unsupervised task-specific -evaluation, optimal routing of queries to models, personalization, and -automated evaluation of model strengths and weaknesses. 
Data from Chatbot Arena -suggest that P2L better captures the nuanced landscape of language model -performance than the averaged leaderboard. Furthermore, our findings suggest -that P2L's ability to produce prompt-specific evaluations follows a power law -scaling similar to that observed in LLMs themselves. In January 2025, the -router we trained based on this methodology achieved the \#1 spot in the -Chatbot Arena leaderboard. Our code is available at this GitHub link: -https://github.com/lmarena/p2l. +Cass Sunstein's essay 'On Bob Dylan' describes Dylan's 'dishabituating' style +-- a constant refusal to conform to expectation and a penchant for reinventing +his musical and lyrical identity. In this paper, I extend Sunstein's +observations through a large-scale computational analysis of Dylan's lyrics +from 1962 to 2012. Using o3-mini-high (a large language model), I extract +concept-to-concept relationships from the lyrics and construct directed +knowledge graphs that capture Dylan's thematic structure. I then quantify +shifts in sentiment, metaphorical expression, thematic diversity, and network +complexity over time. The results indicate that Dylan's lyrics increasingly +rely on metaphor, display an evolving sentiment profile, and exhibit heightened +dishabituation -- measured here as a growing variance in the network centrality +of key concepts. I also find that references to movement, protest, and mythic +imagery fluctuate in ways that align with well-known phases of Dylan's career, +reflecting the dynamic and unpredictable quality of his art. These findings not +only deepen our empirical understanding of Sunstein's thesis but also introduce +a novel computational method for analyzing an artist's evolution, offering +broader applicability to the study of cultural and creative change.
-摘要:大型語言模型 (LLM) 評估通常依賴於彙總的指標,例如準確性或人類偏好,平均值跨使用者和提示。此平均值模糊了使用者和提示特定的模型效能變異。為了解決此問題,我們提出提示到排行榜 (P2L),一種產生特定於提示的排行榜的方法。核心概念是訓練 LLM,將自然語言提示作為輸入,以輸出 Bradley-Terry 係數向量,然後用於預測人類偏好投票。產生的提示相關排行榜允許無監督任務特定評估、最佳查詢路由至模型、個人化以及模型優缺點的自動化評估。來自 Chatbot Arena 的資料表明,P2L 比平均排行榜更能捕捉語言模型效能的細微變化。此外,我們的研究結果表明,P2L 產生提示特定評估的能力遵循類似於 LLM 本身觀察到的冪律縮放。2025 年 1 月,我們根據此方法訓練的路由器在 Chatbot Arena 排行榜中獲得了第一名。我們的程式碼可在 GitHub 連結取得:https://github.com/lmarena/p2l。 +摘要:卡斯·桑斯坦的論文「論鮑伯·迪倫」描述了迪倫「去習慣化」的風格 +-- 這種風格不斷拒絕符合預期,並熱衷於重新塑造他的音樂和歌詞認同。在本文中,我透過對迪倫 1962 年至 2012 年歌詞進行大規模的運算分析,來延伸桑斯坦的觀察。使用 o3-mini-high(一個大型語言模型),我從歌詞中提取概念對概念的關係,並建構有向知識圖,以捕捉迪倫的主題結構。然後,我量化情緒、隱喻表達、主題多樣性和網路複雜性隨時間的變化。結果顯示,迪倫的歌詞越來越依賴隱喻,展現出不斷演化的情緒輪廓,並表現出高度的去習慣化 -- 在這裡測量為關鍵概念的網路中心性的變異增加。我也發現,對運動、抗議和神話意象的引用,會以與迪倫職業生涯中眾所周知階段一致的方式波動,反映了他藝術的動態和不可預測的品質。這些發現不僅加深了我們對桑斯坦論文的經驗理解,也引入了分析藝術家演變的新穎運算方法,為文化和創造性變化的研究提供了更廣泛的適用性。 -##### **CLIPPER: Compression enables long-context synthetic data generation** -2502.14854v1 by Chau Minh Pham, Yapei Chang, Mohit Iyyer +##### **VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos** +2502.01549v1 by Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang -LLM developers are increasingly reliant on synthetic data, but generating -high-quality data for complex long-context reasoning tasks remains challenging. -We introduce CLIPPER, a compression-based approach for generating synthetic -data tailored to narrative claim verification - a task that requires reasoning -over a book to verify a given claim. Instead of generating claims directly from -the raw text of the book, which results in artifact-riddled claims, CLIPPER -first compresses the book into chapter outlines and book summaries and then -uses these intermediate representations to generate complex claims and -corresponding chain-of-thoughts. Compared to naive approaches, CLIPPER produces -claims that are more valid, grounded, and complex. 
Using CLIPPER, we construct -a dataset of 19K synthetic book claims paired with their source texts and -chain-of-thought reasoning, and use it to fine-tune three open-weight models. -Our best model achieves breakthrough results on narrative claim verification -(from 28% to 76% accuracy on our test set) and sets a new state-of-the-art for -sub-10B models on the NoCha leaderboard. Further analysis shows that our models -generate more detailed and grounded chain-of-thought reasoning while also -improving performance on other narrative understanding tasks (e.g., -NarrativeQA). +Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in +enhancing Large Language Models (LLMs) through external knowledge integration, +yet its application has primarily focused on textual content, leaving the rich +domain of multi-modal video knowledge predominantly unexplored. This paper +introduces VideoRAG, the first retrieval-augmented generation framework +specifically designed for processing and understanding extremely long-context +videos. Our core innovation lies in its dual-channel architecture that +seamlessly integrates (i) graph-based textual knowledge grounding for capturing +cross-video semantic relationships, and (ii) multi-modal context encoding for +efficiently preserving visual features. This novel design empowers VideoRAG to +process unlimited-length videos by constructing precise knowledge graphs that +span multiple videos while maintaining semantic dependencies through +specialized multi-modal retrieval paradigms. Through comprehensive empirical +evaluation on our proposed LongerVideos benchmark, comprising over 160 videos +totaling 134+ hours across lecture, documentary, and entertainment +categories, VideoRAG demonstrates substantial performance compared to existing +RAG alternatives and long video understanding methods. The source code of +the VideoRAG implementation and the benchmark dataset are openly available at: +https://github.com/HKUDS/VideoRAG.
-摘要:LLM 開發人員越來越依賴合成資料,但為複雜的長語境推理任務生成高品質資料仍然具有挑戰性。我們引入了 CLIPPER,一種基於壓縮的方法,用於生成針對敘事性聲明驗證量身打造的合成資料,這項任務需要對一本書進行推理才能驗證給定的聲明。CLIPPER 沒有直接從書籍的原始文字生成聲明,這會產生充滿人工製品的聲明,而是先將書籍壓縮成章節大綱和書籍摘要,然後使用這些中間表示來生成複雜的聲明和對應的思維鏈。與天真的方法相比,CLIPPER 產生的聲明更有效、更有根據且更複雜。使用 CLIPPER,我們構建了一個包含 19K 個合成書籍聲明及其原始文字和思維鏈推理的資料集,並用於微調三個開放權重模型。我們最好的模型在敘事性聲明驗證方面取得了突破性的結果(在我們的測試集中準確率從 28% 提升到 76%),並在 NoCha 排行榜上為低於 10B 的模型設定了新的技術水準。進一步的分析表明,我們的模型生成了更詳細且有根據的思維鏈推理,同時也提高了其他敘事理解任務(例如 NarrativeQA)的效能。 +摘要:檢索增強生成 (RAG) 已證明在透過外部知識整合增強大型語言模型 (LLM) 方面取得顯著成功,但其應用主要集中在文字內容上,而豐富的多模態影片知識領域則鮮少被探索。本文介紹 VideoRAG,這是第一個檢索增強生成架構,專門設計用於處理和理解極長語境的影片。我們的核心創新在於其雙通道架構,它無縫整合 (i) 基於圖形文字知識基礎,用於擷取跨影片語義關係,以及 (ii) 多模態語境編碼,用於有效保留視覺特徵。這個新穎的設計讓 VideoRAG 能夠透過建構跨越多個影片的精確知識圖譜來處理長度不限的影片,同時透過專門的多模態檢索範例來維持語義依賴性。透過我們提出的 LongerVideos 基準的全面經驗評估,該基準包含超過 160 部影片,總時數超過 134 小時,涵蓋演講、紀錄片和娛樂類別,VideoRAG 與現有的 RAG 替代方案和長影片理解方法相比,展現出顯著的效能。VideoRAG 實作的原始碼和基準資料集已公開於:https://github.com/HKUDS/VideoRAG。 -##### **GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks** -2502.14848v1 by Jianwen Luo, Yiming Huang, Jinxiang Meng, Fangyu Lei, Shizhu He, Xiao Liu, Shanshan Jiang, Bin Dong, Jun Zhao, Kang Liu +##### **Transformers trained on proteins can learn to attend to Euclidean distance** +2502.01533v1 by Isaac Ellmen, Constantin Schneider, Matthew I. J. Raybould, Charlotte M. Deane -Large Language Models (LLMs) have shown great promise in tool-making, yet -existing frameworks often struggle to efficiently construct reliable toolsets -and are limited to single-task settings. To address these challenges, we -propose GATE (Graph-based Adaptive Tool Evolution), an adaptive framework that -dynamically constructs and evolves a hierarchical graph of reusable tools -across multiple scenarios. We evaluate GATE on open-ended tasks (Minecraft), -agent-based tasks (TextCraft, DABench), and code generation tasks (MATH, Date, -TabMWP). 
Our results show that GATE achieves up to 4.3x faster milestone -completion in Minecraft compared to the previous SOTA, and provides an average -improvement of 9.23% over existing tool-making methods in code generation tasks -and 10.03% in agent tasks. GATE demonstrates the power of adaptive evolution, -balancing tool quantity, complexity, and functionality while maintaining high -efficiency. Code and data are available at -\url{https://github.com/ayanami2003/GATE}. +While conventional Transformers generally operate on sequence data, they can +be used in conjunction with structure models, typically SE(3)-invariant or +equivariant graph neural networks (GNNs), for 3D applications such as protein +structure modelling. These hybrids typically involve either (1) +preprocessing/tokenizing structural features as input for Transformers or (2) +taking Transformer embeddings and processing them within a structural +representation. However, there is evidence that Transformers can learn to +process structural information on their own, such as the AlphaFold3 structural +diffusion model. In this work we show that Transformers can function +independently as structure models when passed linear embeddings of coordinates. +We first provide a theoretical explanation for how Transformers can learn to +filter attention as a 3D Gaussian with learned variance. We then validate this +theory using both simulated 3D points and in the context of masked token +prediction for proteins. Finally, we show that pre-training protein Transformer +encoders with structure improves performance on a downstream task, yielding +better performance than custom structural models. Together, this work provides +a basis for using standard Transformers as hybrid structure-language models. 
-摘要:大型語言模型 (LLM) 在工具製作方面展現出極大的潛力,然而現有的框架經常難以有效地建構可靠的工具組,並且僅限於單一任務設定。為了應對這些挑戰,我們提出了 GATE(基於圖形的自適應工具演化),這是一個自適應框架,可跨多個場景動態建構和演化可重複使用的工具階層圖。我們在開放式任務(Minecraft)、基於代理的任務(TextCraft、DABench)和程式碼生成任務(MATH、Date、TabMWP)上評估了 GATE。我們的結果顯示,與先前的 SOTA 相比,GATE 在 Minecraft 中實現了高達 4.3 倍的里程碑完成速度,並且在程式碼生成任務中提供了比現有工具製作方法平均提升 9.23%,在代理任務中提升了 10.03%。GATE 展示了自適應演化的力量,在保持高效率的同時,平衡了工具數量、複雜性和功能性。程式碼和資料可在 \url{https://github.com/ayanami2003/GATE} 取得。 +摘要:雖然傳統的 Transformer 通常處理序列資料,但它們可用於結構模型,通常是 SE(3) 不變式或等變式圖神經網路 (GNN),用於蛋白質結構建模等 3D 應用。這些混合模型通常包含 (1) 將結構特徵預處理/標記化為 Transformer 的輸入或 (2) 取用 Transformer 嵌入並在結構表示中處理它們。然而,有證據表明 Transformer 可以自行學習處理結構資訊,例如 AlphaFold3 結構擴散模型。在這項工作中,我們展示了 Transformer 在傳遞座標的線性嵌入時,可以獨立作為結構模型運作。我們首先提供了 Transformer 如何學習將注意力濾波為具有學習變異的 3D 高斯的理論解釋。然後我們使用模擬 3D 點和在蛋白質遮罩標記預測的背景下驗證此理論。最後,我們展示了使用結構預訓練蛋白質 Transformer 編碼器會改善下游任務的效能,產生比自訂結構模型更好的效能。綜合來說,這項工作提供了使用標準 Transformer 作為混合結構語言模型的基礎。 + + +### Medical explainable AI +|Publish Date|Title|Authors|Homepage|Code| +| :---: | :---: | :---: | :---: | :---: | +|**2025-02-19**|**Towards a perturbation-based explanation for medical AI as differentiable programs**|Takeshi Abe et.al.|[2502.14001v1](http://arxiv.org/abs/2502.14001v1)|null| +|**2025-02-14**|**3D ReX: Causal Explanations in 3D Neuroimaging Classification**|Melane Navaratnarajah et.al.|[2502.12181v1](http://arxiv.org/abs/2502.12181v1)|null| +|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null| +|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null| +|**2025-01-27**|**An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases**|Shaheer Ahmad Khan et.al.|[2501.15969v1](http://arxiv.org/abs/2501.15969v1)|null| +|**2025-01-23**|**Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation 
of Spurious Model Behavior and Associated Data**|Frederik Pahde et.al.|[2501.13818v1](http://arxiv.org/abs/2501.13818v1)|[link](https://github.com/frederikpahde/medical-ai-safety)| +|**2025-01-19**|**Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model**|Mohaiminul Islam Bhuiyan et.al.|[2501.11094v1](http://arxiv.org/abs/2501.11094v1)|null| +|**2025-01-17**|**SEANN: A Domain-Informed Neural Network for Epidemiological Insights**|Jean-Baptiste Guimbaud et.al.|[2501.10273v1](http://arxiv.org/abs/2501.10273v1)|null| +|**2025-01-16**|**Artificial Intelligence-Driven Clinical Decision Support Systems**|Muhammet Alkan et.al.|[2501.09628v2](http://arxiv.org/abs/2501.09628v2)|null| +|**2025-01-12**|**MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis**|Sadia Kamal et.al.|[2501.06887v1](http://arxiv.org/abs/2501.06887v1)|null| +|**2025-01-06**|**Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis**|Mary Ogbuka Kenneth et.al.|[2501.02891v1](http://arxiv.org/abs/2501.02891v1)|null| +|**2024-12-28**|**The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support**|Alessandro De Grandi et.al.|[2412.20068v1](http://arxiv.org/abs/2412.20068v1)|null| +|**2024-12-27**|**A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation**|Jana Zakall et.al.|[2412.19688v1](http://arxiv.org/abs/2412.19688v1)|null| +|**2024-12-23**|**Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models**|Badaru I. 
Olumuyiwa et.al.|[2412.17527v1](http://arxiv.org/abs/2412.17527v1)|null| +|**2024-12-20**|**Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG**|Hasan Md Tusfiqur Alam et.al.|[2412.16086v2](http://arxiv.org/abs/2412.16086v2)|[link](https://github.com/tifat58/irr-with-cbm-rag)| +|**2024-12-20**|**Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models**|Shamus Sim et.al.|[2412.15748v1](http://arxiv.org/abs/2412.15748v1)|null| +|**2024-12-18**|**Cognition Chain for Explainable Psychological Stress Detection on Social Media**|Xin Wang et.al.|[2412.14009v1](http://arxiv.org/abs/2412.14009v1)|null| +|**2024-11-30**|**2-Factor Retrieval for Improved Human-AI Decision Making in Radiology**|Jim Solomon et.al.|[2412.00372v1](http://arxiv.org/abs/2412.00372v1)|null| +|**2024-11-28**|**Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance**|Philipp Brauner et.al.|[2411.19356v1](http://arxiv.org/abs/2411.19356v1)|null| +|**2024-11-26**|**Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset**|Yujie Dai et.al.|[2411.17645v2](http://arxiv.org/abs/2411.17645v2)|null| +|**2024-11-18**|**Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care**|Jeffrey N. 
Clark et.al.|[2411.11774v1](http://arxiv.org/abs/2411.11774v1)|null| +|**2024-11-15**|**Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning**|Mohammed Yaseen Jabarulla et.al.|[2411.10255v1](http://arxiv.org/abs/2411.10255v1)|null| +|**2024-11-01**|**Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering**|Mehdi Hosseini Chagahi et.al.|[2411.00916v2](http://arxiv.org/abs/2411.00916v2)|null| +|**2024-10-25**|**A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection**|Muath Alsuhaibani et.al.|[2410.19898v1](http://arxiv.org/abs/2410.19898v1)|null| +|**2024-10-23**|**An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems**|Shruthi Chari et.al.|[2410.17504v1](http://arxiv.org/abs/2410.17504v1)|[link](https://github.com/tetherless-world/metaexplainer)| +|**2024-10-22**|**Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study**|Lukas Hughes-Noehrer et.al.|[2410.16879v1](http://arxiv.org/abs/2410.16879v1)|null| +|**2024-10-19**|**Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer**|Gesa Mittmann et.al.|[2410.15012v1](http://arxiv.org/abs/2410.15012v1)|null| +|**2024-10-15**|**Explainable AI Methods for Multi-Omics Analysis: A Survey**|Ahmad Hussein et.al.|[2410.11910v1](http://arxiv.org/abs/2410.11910v1)|null| +|**2024-10-14**|**Study on the Helpfulness of Explainable Artificial Intelligence**|Tobias Labarta et.al.|[2410.11896v1](http://arxiv.org/abs/2410.11896v1)|[link](https://github.com/tlabarta/helpfulnessofxai)| +|**2024-10-12**|**Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health**|Abdullah Mamun 
et.al.|[2410.09635v1](http://arxiv.org/abs/2410.09635v1)|[link](https://github.com/ab9mamun/aimen)| +|**2024-10-10**|**Artificial intelligence techniques in inherited retinal diseases: A review**|Han Trinh et.al.|[2410.09105v1](http://arxiv.org/abs/2410.09105v1)|null| +|**2024-10-07**|**CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures**|Ekaterina Sviridova et.al.|[2410.05235v2](http://arxiv.org/abs/2410.05235v2)|[link](https://github.com/ixa-ehu/antidote-casimedicos)| +|**2024-10-01**|**Explainable Diagnosis Prediction through Neuro-Symbolic Integration**|Qiuhao Lu et.al.|[2410.01855v2](http://arxiv.org/abs/2410.01855v2)|null| +|**2024-10-01**|**Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare**|Prasenjit Maji et.al.|[2410.00366v1](http://arxiv.org/abs/2410.00366v1)|null| +|**2024-09-20**|**Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study**|Tirtha Chanda et.al.|[2409.13476v1](http://arxiv.org/abs/2409.13476v1)|null| +|**2024-09-19**|**Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data**|Suryansh Vidya et.al.|[2409.15374v1](http://arxiv.org/abs/2409.15374v1)|null| +|**2024-09-19**|**Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition**|Daniel Flores-Araiza et.al.|[2409.12883v1](http://arxiv.org/abs/2409.12883v1)|null| +|**2024-09-18**|**Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques**|Yubo Li et.al.|[2409.12087v3](http://arxiv.org/abs/2409.12087v3)|null| +|**2024-09-13**|**Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases**|Mercy Asiedu et.al.|[2409.09201v3](http://arxiv.org/abs/2409.09201v3)|null| +|**2024-09-09**|**Explainable AI: Definition and attributes of a good explanation 
for health AI**|Evangelia Kyrimi et.al.|[2409.15338v1](http://arxiv.org/abs/2409.15338v1)|null| +|**2024-08-30**|**Exploring the Effect of Explanation Content and Format on User Comprehension and Trust in Healthcare**|Antonio Rago et.al.|[2408.17401v2](http://arxiv.org/abs/2408.17401v2)|null| +|**2024-08-29**|**A Survey for Large Language Models in Biomedicine**|Chong Wang et.al.|[2409.00133v1](http://arxiv.org/abs/2409.00133v1)|null| +|**2024-08-27**|**Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis**|Francesco Sovrano et.al.|[2408.15121v1](http://arxiv.org/abs/2408.15121v1)|null| +|**2024-08-24**|**Towards Case-based Interpretability for Medical Federated Learning**|Laura Latorre et.al.|[2408.13626v1](http://arxiv.org/abs/2408.13626v1)|null| +|**2024-08-22**|**AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines**|Douwe J. Spaanderman et.al.|[2408.12491v1](http://arxiv.org/abs/2408.12491v1)|null| +|**2024-08-14**|**Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy**|Kimji N. 
Pellano et.al.|[2409.00001v1](http://arxiv.org/abs/2409.00001v1)|null| +|**2024-08-06**|**MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy**|Hanchen David Wang et.al.|[2408.11837v1](http://arxiv.org/abs/2408.11837v1)|null| +|**2024-08-05**|**The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development**|Joshua Morriss et.al.|[2408.05239v1](http://arxiv.org/abs/2408.05239v1)|null| +|**2024-08-05**|**Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns**|Chi Him Ng et.al.|[2408.02709v1](http://arxiv.org/abs/2408.02709v1)|null| +|**2024-08-05**|**Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability**|Masoud Muhammed Hassan et.al.|[2408.02706v1](http://arxiv.org/abs/2408.02706v1)|null| +|**2024-07-26**|**MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI**|Shyam Dongre et.al.|[2407.20284v1](http://arxiv.org/abs/2407.20284v1)|null| +|**2024-07-25**|**Introducing δ-XAI: a novel sensitivity-based method for local AI explanations**|Alessandro De Carlo et.al.|[2407.18343v2](http://arxiv.org/abs/2407.18343v2)|null| +|**2024-07-24**|**Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population**|Nikolaos Ntampakis et.al.|[2407.17324v2](http://arxiv.org/abs/2407.17324v2)|null| +|**2024-07-24**|**Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition**|Michele Fiori et.al.|[2408.06352v1](http://arxiv.org/abs/2408.06352v1)|null| +|**2024-07-21**|**Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions**|Naseem Khan 
et.al.|[2408.03335v1](http://arxiv.org/abs/2408.03335v1)|null| +|**2024-07-18**|**A Comparative Study on Automatic Coding of Medical Letters with Explainability**|Jamie Glen et.al.|[2407.13638v1](http://arxiv.org/abs/2407.13638v1)|[link](https://github.com/Glenj01/Medical-Coding)| +|**2024-07-09**|**Explainable AI for Enhancing Efficiency of DL-based Channel Estimation**|Abdul Karim Gizzini et.al.|[2407.07009v1](http://arxiv.org/abs/2407.07009v1)|null| +|**2024-07-07**|**Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification**|P. N. Karthikayan et.al.|[2407.05440v2](http://arxiv.org/abs/2407.05440v2)|null| +|**2024-07-03**|**A Survey on Trustworthiness in Foundation Models for Medical Image Analysis**|Congzhen Shi et.al.|[2407.15851v2](http://arxiv.org/abs/2407.15851v2)|null| +|**2024-07-01**|**The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data**|Ximing Wen et.al.|[2407.06206v1](http://arxiv.org/abs/2407.06206v1)|null| +|**2024-06-28**|**Can GPT-4 Help Detect Quit Vaping Intentions? 
An Exploration of Automatic Data Annotation Approach**|Sai Krishna Revanth Vuruma et.al.|[2407.00167v1](http://arxiv.org/abs/2407.00167v1)|null| +|**2024-06-25**|**Towards Compositional Interpretability for XAI**|Sean Tull et.al.|[2406.17583v1](http://arxiv.org/abs/2406.17583v1)|null| +|**2024-06-17**|**Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods**|Vincent Olesen et.al.|[2406.12142v2](http://arxiv.org/abs/2406.12142v2)|[link](https://github.com/volesen/slicing-through-bias)| +|**2024-06-11**|**Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health**|Fatemeh Ebrahimzadeh et.al.|[2406.07114v2](http://arxiv.org/abs/2406.07114v2)|null| +|**2024-06-10**|**AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI**|K M Tawsik Jawad et.al.|[2406.06728v2](http://arxiv.org/abs/2406.06728v2)|null| +|**2024-06-10**|**Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook**|Yusif Ibrahimov et.al.|[2406.05984v1](http://arxiv.org/abs/2406.05984v1)|null| +|**2024-06-09**|**Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance**|Zhan Zhang et.al.|[2406.05746v1](http://arxiv.org/abs/2406.05746v1)|null| +|**2024-06-07**|**Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability**|Faseela Abdullakutty et.al.|[2406.12897v1](http://arxiv.org/abs/2406.12897v1)|null| +|**2024-06-04**|**Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection**|Dinuka Sandun Udayantha et.al.|[2406.16908v3](http://arxiv.org/abs/2406.16908v3)|[link](https://github.com/dinuka-1999/braineocare)| +|**2024-06-01**|**Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques**|Samita Bai 
et.al.|[2406.00532v1](http://arxiv.org/abs/2406.00532v1)|null| +|**2024-06-01**|**Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition**|Alaa Nfissi et.al.|[2406.01624v2](http://arxiv.org/abs/2406.01624v2)|[link](https://github.com/alaanfissi/unveiling-hidden-factors-explainable-ai-for-feature-boosting-in-speech-emotion-recognition)| +|**2024-05-31**|**The Explanation Necessity for Healthcare AI**|Michail Mamalakis et.al.|[2406.00216v1](http://arxiv.org/abs/2406.00216v1)|null| +|**2024-05-29**|**Interdisciplinary Expertise to Advance Equitable Explainable AI**|Chloe R. Bennett et.al.|[2406.18563v1](http://arxiv.org/abs/2406.18563v1)|null| +|**2024-05-27**|**"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts**|Hubert D. Zając et.al.|[2407.11978v1](http://arxiv.org/abs/2407.11978v1)|null| +|**2024-05-26**|**Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making**|Min Hun Lee et.al.|[2405.16424v1](http://arxiv.org/abs/2405.16424v1)|null| +|**2024-05-26**|**Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach**|Ziming Liu et.al.|[2405.17502v1](http://arxiv.org/abs/2405.17502v1)|null| +|**2024-05-24**|**Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone**|Catalina Gomez et.al.|[2407.11974v1](http://arxiv.org/abs/2407.11974v1)|null| +|**2024-05-23**|**Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery**|Yingying Fang et.al.|[2406.18552v1](http://arxiv.org/abs/2406.18552v1)|null| +|**2024-05-21**|**The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach**|Mohsen Jozani et.al.|[2405.13099v1](http://arxiv.org/abs/2405.13099v1)|null| +|**2024-05-17**|**ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education**|Harris Bin Munawar 
et.al.|[2405.10645v1](http://arxiv.org/abs/2405.10645v1)|null| +|**2024-05-13**|**Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data**|Camelia Oprea et.al.|[2405.07590v1](http://arxiv.org/abs/2405.07590v1)|null| +|**2024-05-10**|**XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare**|Fatemeh Nazary et.al.|[2405.06270v3](http://arxiv.org/abs/2405.06270v3)|null| +|**2024-05-09**|**To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems**|Miquel Miró-Nicolau et.al.|[2405.05766v1](http://arxiv.org/abs/2405.05766v1)|null| +|**2024-05-05**|**Region-specific Risk Quantification for Interpretable Prognosis of COVID-19**|Zhusi Zhong et.al.|[2405.02815v1](http://arxiv.org/abs/2405.02815v1)|[link](https://github.com/zzs95/RSP_COVID)| +|**2024-04-26**|**Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics**|Francesco Prinzi et.al.|[2405.02334v2](http://arxiv.org/abs/2405.02334v2)|null| +|**2024-04-25**|**Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability**|Yunfei Ge et.al.|[2404.16957v1](http://arxiv.org/abs/2404.16957v1)|null| +|**2024-04-19**|**Explainable AI for Fair Sepsis Mortality Predictive Model**|Chia-Hsuan Chang et.al.|[2404.13139v1](http://arxiv.org/abs/2404.13139v1)|null| +|**2024-04-19**|**Multi Class Depression Detection Through Tweets using Artificial Intelligence**|Muhammad Osama Nusrat et.al.|[2404.13104v1](http://arxiv.org/abs/2404.13104v1)|[link](https://github.com/mnusrat786/masters-thesis)| +|**2024-04-19**|**COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images**|Dmytro Shvetsov et.al.|[2404.12832v2](http://arxiv.org/abs/2404.12832v2)|[link](https://github.com/dmytro-shvetsov/counterfactual-search)| +|**2024-04-15**|**Hybrid Intelligence for Digital 
Humanities**|Victor de Boer et.al.|[2406.15374v1](http://arxiv.org/abs/2406.15374v1)|null| +|**2024-04-14**|**Ethical Framework for Responsible Foundational Models in Medical Imaging**|Abhijit Das et.al.|[2406.11868v1](http://arxiv.org/abs/2406.11868v1)|null| +|**2024-04-09**|**Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis**|Milad Yousefi et.al.|[2404.07239v1](http://arxiv.org/abs/2404.07239v1)|null| +|**2024-04-06**|**Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI**|Taminul Islam et.al.|[2404.04686v1](http://arxiv.org/abs/2404.04686v1)|null| +|**2024-04-05**|**Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI**|Maryam Ahmed et.al.|[2404.03892v3](http://arxiv.org/abs/2404.03892v3)|null| +|**2024-03-30**|**Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives**|Xingrui Gu et.al.|[2404.00320v2](http://arxiv.org/abs/2404.00320v2)|null| +|**2024-03-26**|**Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach**|Andrea Ferrario et.al.|[2403.17873v1](http://arxiv.org/abs/2403.17873v1)|null| +|**2024-03-26**|**Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification**|Han Yuan et.al.|[2403.18871v1](http://arxiv.org/abs/2403.18871v1)|[link](https://github.com/han-yuan-med/template-explanation)| +|**2024-03-03**|**Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures**|Séamus Lankford et.al.|[2403.01580v1](http://arxiv.org/abs/2403.01580v1)|null| +|**2024-02-28**|**Cause and Effect: Can Large Language Models Truly Understand Causality?**|Swagata Ashwani et.al.|[2402.18139v3](http://arxiv.org/abs/2402.18139v3)|null| 
+|**2024-02-28**|**Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina**|Yasin Sadeghi Bazargani et.al.|[2402.18600v1](http://arxiv.org/abs/2402.18600v1)|null| -##### **Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation** -2502.14846v1 by Yue Yang, Ajay Patel, Matt Deitke, Tanmay Gupta, Luca Weihs, Andrew Head, Mark Yatskar, Chris Callison-Burch, Ranjay Krishna, Aniruddha Kembhavi, Christopher Clark +#### Abstracts +##### **Towards a perturbation-based explanation for medical AI as differentiable programs** +2502.14001v1 by Takeshi Abe, Yoshiyuki Asai -Reasoning about images with rich text, such as charts and documents, is a -critical application of vision-language models (VLMs). However, VLMs often -struggle in these domains due to the scarcity of diverse text-rich -vision-language data. To address this challenge, we present CoSyn, a framework -that leverages the coding capabilities of text-only large language models -(LLMs) to automatically create synthetic text-rich multimodal data. Given input -text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts -an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic -images. With the underlying code as textual representations of the synthetic -images, CoSyn can generate high-quality instruction-tuning data, again relying -on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K -images and 2.7M rows of vision-language instruction-tuning data. Comprehensive -experiments on seven benchmarks demonstrate that models trained on our -synthetic data achieve state-of-the-art performance among competitive -open-source models, including Llama 3.2, and surpass proprietary models such as -GPT-4V and Gemini 1.5 Flash. 
Furthermore, CoSyn can produce synthetic pointing -data, enabling VLMs to ground information within input images, showcasing its -potential for developing multimodal agents capable of acting in real-world -environments. +Recent advancement in machine learning algorithms reaches a point where +medical devices can be equipped with artificial intelligence (AI) models for +diagnostic support and routine automation in clinical settings. In medicine and +healthcare, there is a particular demand for sufficient and objective +explainability of the outcome generated by AI models. However, AI models are +generally considered as black boxes due to their complexity, and the +computational process leading to their response is often opaque. Although +several methods have been proposed to explain the behavior of models by +evaluating the importance of each feature in discrimination and prediction, +they may suffer from biases and opacities arising from the scale and sampling +protocol of the dataset used for training or testing. To overcome the +shortcomings of existing methods, we explore an alternative approach to provide +an objective explanation of AI models that can be defined independently of the +learning process and does not require additional data. As a preliminary study +for this direction of research, this work examines a numerical availability of +the Jacobian matrix of deep learning models that measures how stably a model +responses against small perturbations added to the input. The indicator, if +available, are calculated from a trained AI model for a given target input. +This is a first step towards a perturbation-based explanation, which will +assist medical practitioners in understanding and interpreting the response of +the AI model in its clinical application. 
-摘要:透過豐富文字(例如圖表和文件)對影像進行推理,是視覺語言模型 (VLM) 的重要應用。然而,由於多元化文字豐富的視覺語言資料稀少,VLM 在這些領域中經常會遇到困難。為了應對這個挑戰,我們提出了 CoSyn,一個利用純文字大型語言模型 (LLM) 的編碼能力來自動建立合成文字豐富多模態資料的架構。給定描述目標網域的輸入文字(例如「營養成分標籤」),CoSyn 會提示 LLM 產生用於合成影像渲染的程式碼(Python、HTML、LaTeX 等)。透過將底層程式碼作為合成影像的文字表示,CoSyn 可以產生高品質的指令調整資料,再次依賴純文字 LLM。使用 CoSyn,我們建構了一個包含 40 萬張影像和 270 萬列視覺語言指令調整資料的資料集。在七個基準上的全面實驗證明,在我們的合成資料上訓練的模型在競爭對手的開源模型(包括 Llama 3.2)中達到了最先進的效能,並超越了 GPT-4V 和 Gemini 1.5 Flash 等專有模型。此外,CoSyn 可以產生合成指向資料,讓 VLM 能在輸入影像中建立資訊基礎,展示其在開發能夠在真實世界環境中運作的多模態代理方面的潛力。 +摘要:機器學習演算法的最新進展已達到一個階段,醫療裝置可以配備人工智慧 (AI) 模型,以在臨床環境中提供診斷支援和例行自動化。在醫學和保健領域,對於 AI 模型產生的結果有足夠且客觀的可解釋性有特別的需求。然而,由於 AI 模型的複雜性,它們通常被視為黑盒子,而導致其反應的運算過程通常是不透明的。儘管已經提出多種方法來解釋模型的行為,方法是評估每個特徵在判別和預測中的重要性,但它們可能會受到訓練或測試所用資料集的規模和抽樣協定的偏差和不透明性的影響。為了克服現有方法的缺點,我們探索一種替代方法,以提供 AI 模型的客觀解釋,這種方法可以獨立於學習過程定義,而且不需要額外的資料。作為這個研究方向的初步研究,這項工作探討了深度學習模型的雅可比矩陣的數值可用性,它衡量了模型對輸入中新增的小擾動的穩定反應程度。如果可用,指標會從訓練好的 AI 模型計算得出,以取得給定的目標輸入。這是基於擾動的解釋的第一步,它將協助醫療從業人員了解和詮釋 AI 模型在其臨床應用中的反應。 -##### **Revealing and Mitigating Over-Attention in Knowledge Editing** -2502.14838v1 by Pinzheng Wang, Zecheng Tang, Keyan Zhou, Juntao Li, Qiaoming Zhu, Min Zhang +##### **3D ReX: Causal Explanations in 3D Neuroimaging Classification** +2502.12181v1 by Melane Navaratnarajah, Sophie A. Martin, David A. Kelly, Nathan Blake, Hana Chocker -Large Language Models have demonstrated superior performance across a wide -range of tasks, but they still exhibit undesirable errors due to incorrect -knowledge learned from the training data. To avoid this, knowledge editing -methods emerged to precisely edit the specific model knowledge via efficiently -modifying a very small percentage of parameters. % However, those methods can -lead to the problem of Specificity Failure: when the content related to the -edited knowledge occurs in the context, it can inadvertently corrupt other -pre-existing knowledge. 
However, those methods can lead to the problem of -Specificity Failure, where the existing knowledge and capabilities are severely -degraded due to editing. Our preliminary indicates that Specificity Failure -primarily stems from the model's attention heads assigning excessive attention -scores to entities related to the edited knowledge, thereby unduly focusing on -specific snippets within the context, which we denote as the Attention Drift -phenomenon. To mitigate such Attention Drift issue, we introduce a simple yet -effective method Selective Attention Drift Restriction}(SADR), which introduces -an additional regularization term during the knowledge editing process to -restrict changes in the attention weight distribution, thereby preventing undue -focus on the edited entity. Experiments on five frequently used strong LLMs -demonstrate the effectiveness of our method, where SADR can significantly -mitigate Specificity Failure in the predominant knowledge editing tasks. +Explainability remains a significant problem for AI models in medical +imaging, making it challenging for clinicians to trust AI-driven predictions. +We introduce 3D ReX, the first causality-based post-hoc explainability tool for +3D models. 3D ReX uses the theory of actual causality to generate +responsibility maps which highlight the regions most crucial to the model's +decision. We test 3D ReX on a stroke detection model, providing insight into +the spatial distribution of features relevant to stroke. 
-摘要:大型語言模型已在廣泛任務中展現出卓越的效能,但由於從訓練資料中學習到不正確的知識,它們仍會出現令人不滿意的錯誤。為避免此情況,知識編輯方法應運而生,透過有效修改極少數參數來精準編輯特定模型知識。% 然而,這些方法可能會導致特異性失敗問題:當與已編輯知識相關的內容出現在文中時,可能會無意間損害其他既有知識。然而,這些方法可能會導致特異性失敗問題,因為現有知識和能力會因編輯而嚴重降低。我們的初步研究表明,特異性失敗主要源於模型的注意力權重將過度注意力分數分配給與已編輯知識相關的實體,從而過度關注文中特定的片段,我們將此現象稱為注意力偏移。為減輕這種注意力偏移問題,我們引入了一個簡單但有效的方法選擇性注意力偏移限制}(SADR),在知識編輯過程中引入一個額外的正則化項來限制注意力權重分配的變動,從而防止過度關注已編輯實體。在五個經常使用的強大 LLM 上進行的實驗證明了我們方法的有效性,其中 SADR 可以顯著減輕主要知識編輯任務中的特異性失敗。 +摘要:解釋性仍然是醫療影像中 AI 模型的一大問題,這使得臨床醫生難以信任 AI 驅動的預測。 +我們引入了 3D ReX,這是第一個用於 3D 模型的基於因果關係的事後解釋性工具。3D ReX 使用實際因果關係理論來生成責任圖,該圖突出了對模型決策至關重要的區域。我們在中風檢測模型上測試了 3D ReX,提供了與中風相關特徵的空間分佈的見解。 -##### **Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs** -2502.14837v1 by Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui +##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration** +2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano -Multi-head Latent Attention (MLA) is an innovative architecture proposed by -DeepSeek, designed to ensure efficient and economical inference by -significantly compressing the Key-Value (KV) cache into a latent vector. -Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its -variants such as Grouped-Query Attention (GQA) exhibit significant cost -disadvantages. Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA -without pre-training from scratch is both meaningful and challenging. This -paper proposes the first data-efficient fine-tuning method for transitioning -from MHA to MLA (MHA2MLA), which includes two key components: for partial-RoPE, -we remove RoPE from dimensions of queries and keys that contribute less to the -attention scores, for low-rank approximation, we introduce joint SVD -approximations based on the pre-trained parameters of keys and values. 
These -carefully designed strategies enable MHA2MLA to recover performance using only -a small fraction (0.3% to 0.6%) of the data, significantly reducing inference -costs while seamlessly integrating with compression techniques such as KV cache -quantization. For example, the KV cache size of Llama2-7B is reduced by 92.19%, -with only a 0.5% drop in LongBench performance. +This paper presents a complete explainable system that interprets a set of +data, abstracts the underlying features and describes them in a natural +language of choice. The system relies on two crucial stages: (i) identifying +emerging properties from data and transforming them into abstract concepts, and +(ii) converting these concepts into natural language. Despite the impressive +natural language generation capabilities demonstrated by Large Language Models, +their statistical nature and the intricacy of their internal mechanism still +force us to employ these techniques as black boxes, forgoing trustworthiness. +Developing an explainable pipeline for data interpretation would allow +facilitating its use in safety-critical environments like processing medical +information and allowing non-experts and visually impaired people to access +narrated information. To this end, we believe that the fields of knowledge +representation and automated reasoning research could present a valid +alternative. Expanding on prior research that tackled the first stage (i), we +focus on the second stage, named Concept2Text. Being explainable, data +translation is easily modeled through logic-based rules, once again emphasizing +the role of declarative programming in achieving AI explainability. This paper +explores a Prolog/CLP-based rewriting system to interpret concepts-articulated +in terms of classes and relations, plus common knowledge-derived from a generic +ontology, generating natural language text. 
Its main features include +hierarchical tree rewritings, modular multilingual generation, support for +equivalent variants across semantic, grammar, and lexical levels, and a +transparent rule-based system. We outline the architecture and demonstrate its +flexibility through some examples capable of generating numerous diverse and +equivalent rewritings based on the input concept. -摘要:多頭潛在注意力 (MLA) 是 DeepSeek 提出的一種創新架構,旨在通過將鍵值 (KV) 快取大幅壓縮成潛在向量,確保有效率且經濟的推論。與 MLA 相比,採用多頭注意力 (MHA) 及其變體(例如分組查詢注意力 (GQA))的標準 LLM 會出現顯著的成本劣勢。讓訓練完善的 LLM(例如 Llama)能夠快速適應 MLA,而無需從頭開始預訓練,這既有意義又具有挑戰性。本文提出了第一個資料有效微調方法,用於從 MHA 轉換到 MLA (MHA2MLA),其中包含兩個關鍵組成部分:對於部分 RoPE,我們從查詢和鍵的維度中移除對注意力分數貢獻較小的 RoPE,對於低秩近似,我們基於鍵和值的預訓練參數引入聯合 SVD 近似。這些經過仔細設計的策略讓 MHA2MLA 能夠僅使用一小部分資料 (0.3% 至 0.6%) 來恢復效能,大幅降低推論成本,同時與壓縮技術(例如 KV 快取量化)無縫整合。例如,Llama2-7B 的 KV 快取大小減少了 92.19%,而 LongBench 效能僅下降了 0.5%。 +摘要:這篇論文提出了一個完整的可解釋系統,它可以解釋一組資料,抽象出基礎特徵,並以選擇的自然語言描述它們。系統依賴兩個關鍵階段:(i) 從資料中識別新興屬性,並將它們轉換為抽象概念,以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力,但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子,放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它,例如處理醫療資訊,並允許非專家和視障人士存取敘述資訊。為此,我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上,我們專注於第二階段,稱為 Concept2Text。由於具有可解釋性,資料翻譯很容易透過基於邏輯的規則建模,再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統,以解釋概念,這些概念以類別和關係的形式表達,再加上從通用本体衍生的常識,產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體,以及一個透明的基於規則的系統。我們概述了架構,並透過一些範例展示了它的靈活性,這些範例能夠根據輸入概念生成許多不同的等效重寫。 -##### **LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models** -2502.14834v1 by Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li +##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation** +2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. 
Bates, Arkadiusz Sitek -Existing Large Vision-Language Models (LVLMs) can process inputs with context -lengths up to 128k visual and text tokens, yet they struggle to generate -coherent outputs beyond 1,000 words. We find that the primary limitation is the -absence of long output examples during supervised fine-tuning (SFT). To tackle -this issue, we introduce LongWriter-V-22k, a SFT dataset comprising 22,158 -examples, each with multiple input images, an instruction, and corresponding -outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that -maintain high-fidelity to the input images, we employ Direct Preference -Optimization (DPO) to the SFT model. Given the high cost of collecting human -feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which -breaks long outputs into segments and uses iterative corrections to form -preference pairs with the original outputs. Additionally, we develop -MMLongBench-Write, a benchmark featuring six tasks to evaluate the -long-generation capabilities of VLMs. Our 7B parameter model, trained with -LongWriter-V-22k and IterDPO, achieves impressive performance on this -benchmark, outperforming larger proprietary models like GPT-4o. Code and data: -https://github.com/THU-KEG/LongWriter-V +We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS), +an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS +predicts future PHTs using transformer-based architectures. The Adaptive Risk +Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk +probabilities for clinician-defined critical events. ARES incorporates a +personalized explainability module that identifies key clinical factors +influencing risk estimates for individual patients. ARES was evaluated on the +MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its +performance against traditional early warning systems and machine learning +models. 
We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs, +with 60% including hospital admissions. The dataset contained over 357 million +tokens. ETHOS outperformed benchmark models in predicting hospital admissions, +ICU admissions, and prolonged hospital stays, achieving superior AUC scores. +ETHOS-based risk estimates demonstrated robustness across demographic subgroups +with strong model reliability, confirmed via calibration curves. The +personalized explainability module provides insights into patient-specific +factors contributing to risk. ARES, powered by ETHOS, advances predictive +healthcare AI by providing dynamic, real-time, and personalized risk estimation +with patient-specific explainability to enhance clinician trust. Its +adaptability and superior accuracy position it as a transformative tool for +clinical decision-making, potentially improving patient outcomes and resource +allocation in emergency and inpatient settings. We release the full code at +github.com/ipolharvard/ethos-ares to facilitate future research. 
-摘要:現有的大型視覺語言模型 (LVLMs) 能處理長度達 128k 視覺和文字符號的輸入內容,但卻難以產生超過 1,000 字的連貫輸出。我們發現,主要限制在於監督微調 (SFT) 期間缺少長輸出範例。為了解決此問題,我們引入了 LongWriter-V-22k,這是一個 SFT 資料集,包含 22,158 個範例,每個範例都有多個輸入影像、一個說明和對應的輸出,範圍從 0 到 10,000 字。此外,為了產生與輸入影像高度保真的長輸出,我們對 SFT 模型採用直接偏好最佳化 (DPO)。考量到收集人類回饋的成本很高(例如 3,000 字),我們提出 IterDPO,它會將長輸出區分成幾個區塊,並使用反覆修正來形成與原始輸出的偏好配對。此外,我們開發了 MMLongBench-Write,這是一個基準,包含六項任務,用於評估 VLM 的長生成能力。我們的 7B 參數模型使用 LongWriter-V-22k 和 IterDPO 進行訓練,在這個基準上取得令人印象深刻的效能,超越了 GPT-4o 等大型專有模型。程式碼和資料:https://github.com/THU-KEG/LongWriter-V +摘要:我們開發了增強型健康結果模擬轉換器 (ETHOS), +一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS +使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組,可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估,並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT,其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型,並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性,並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估,以及患者特定的可解釋性來增強臨床醫生的信任,從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具,有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼,以利未來的研究。 -##### **Improving the Diffusability of Autoencoders** -2502.14831v1 by Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, Aliaksandr Siarohin +##### **An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases** +2501.15969v1 by Shaheer Ahmad Khan, Muhammad Usamah Shahid, Ahmad Abdullah, Ibrahim Hashmat, Muddassar Farooq -Latent diffusion models have emerged as the leading approach for generating -high-quality images and videos, utilizing compressed latent representations to -reduce the computational burden of the diffusion process. 
While recent -advancements have primarily focused on scaling diffusion backbones and -improving autoencoder reconstruction quality, the interaction between these -components has received comparatively less attention. In this work, we perform -a spectral analysis of modern autoencoders and identify inordinate -high-frequency components in their latent spaces, which are especially -pronounced in the autoencoders with a large bottleneck channel size. We -hypothesize that this high-frequency component interferes with the -coarse-to-fine nature of the diffusion synthesis process and hinders the -generation quality. To mitigate the issue, we propose scale equivariance: a -simple regularization strategy that aligns latent and RGB spaces across -frequencies by enforcing scale equivariance in the decoder. It requires minimal -code changes and only up to 20K autoencoder fine-tuning steps, yet -significantly improves generation quality, reducing FID by 19% for image -generation on ImageNet-1K 256x256 and FVD by at least 44% for video generation -on Kinetics-700 17x256x256. +This study addresses a critical gap in the healthcare system by developing a +clinically meaningful, practical, and explainable disease surveillance system +for multiple chronic diseases, utilizing routine EHR data from multiple U.S. +practices integrated with CureMD's EMR/EHR system. Unlike traditional +systems--using AI models that rely on features from patients' labs--our +approach focuses on routinely available data, such as medical history, vitals, +diagnoses, and medications, to preemptively assess the risks of chronic +diseases in the next year. We trained three distinct models for each chronic +disease: prediction models that forecast the risk of a disease 3, 6, and 12 +months before a potential diagnosis. 
We developed Random Forest models, which +were internally validated using F1 scores and AUROC as performance metrics and +further evaluated by a panel of expert physicians for clinical relevance based +on inferences grounded in medical knowledge. Additionally, we discuss our +implementation of integrating these models into a practical EMR system. Beyond +using Shapley attributes and surrogate models for explainability, we also +introduce a new rule-engineering framework to enhance the intrinsic +explainability of Random Forests. -摘要:潛在擴散模型已成為生成高品質影像和影片的主流方法,利用壓縮潛在表示來降低擴散過程的計算負擔。雖然近期的進展主要集中在擴充擴散主幹並提升自編碼器重建品質,但這些組成之間的交互作用卻鮮少受到關注。在這項研究中,我們對現代自編碼器進行頻譜分析,並在它們的潛在空間中找出不適當的高頻率組成,這在瓶頸通道尺寸較大的自編碼器中特別明顯。我們假設這種高頻率組成會干擾擴散合成過程由粗到細的性質,並阻礙生成品質。為了緩解這個問題,我們提出規模等變性:一種簡單的正則化策略,透過在解碼器中強制執行規模等變性,使潛在空間和 RGB 空間在各個頻率中保持一致。它只需要最小的程式碼變更,且僅需最多 20K 個自編碼器微調步驟,就能顯著提升生成品質,將 ImageNet-1K 256x256 上的影像生成的 FID 降低 19%,並將 Kinetics-700 17x256x256 上的影片生成的 FVD 降低至少 44%。 +摘要:本研究透過開發一個臨床有意義、實用且可解釋的多重慢性疾病疾病監測系統,來解決醫療保健系統中的重大缺口,利用整合 CureMD 的 EMR/EHR 系統,來自多個美國實務的例行 EHR 資料。與傳統系統不同的是,我們的做法著重在例行可得的資料,例如病歷、生命徵象、診斷和藥物,以預先評估未來一年慢性疾病的風險,而非仰賴病患實驗室特徵的 AI 模型。我們針對每種慢性疾病訓練了三個不同的模型:預測模型,用以預測在潛在診斷前 3、6 和 12 個月的疾病風險。我們開發了隨機森林模型,並使用 F1 分數和 AUROC 作為效能指標,進行內部驗證,並進一步由專家醫師小組根據植基於醫學知識的推論,評估其臨床相關性。此外,我們討論了將這些模型整合到實用 EMR 系統中的實作方式。除了使用 Shapley 屬性和代理模型來解釋外,我們還引進了一個新的規則工程架構,以增強隨機森林的內在可解釋性。 -##### **Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs** -2502.14830v1 by Danni Liu, Jan Niehues +##### **Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data** +2501.13818v1 by Frederik Pahde, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek -While large language models demonstrate remarkable capabilities at -task-specific applications through fine-tuning, extending these benefits across -diverse languages is essential for broad accessibility. 
However, effective -cross-lingual transfer is hindered by LLM performance gaps across languages and -the scarcity of fine-tuning data in many languages. Through analysis of LLM -internal representations from over 1,000+ language pairs, we discover that -middle layers exhibit the strongest potential for cross-lingual alignment. -Building on this finding, we propose a middle-layer alignment objective -integrated into task-specific training. Our experiments on slot filling, -machine translation, and structured text generation show consistent -improvements in cross-lingual transfer, especially to lower-resource languages. -The method is robust to the choice of alignment languages and generalizes to -languages unseen during alignment. Furthermore, we show that separately trained -alignment modules can be merged with existing task-specific modules, improving -cross-lingual capabilities without full re-training. Our code is publicly -available (https://github.com/dannigt/mid-align). +Deep neural networks are increasingly employed in high-stakes medical +applications, despite their tendency for shortcut learning in the presence of +spurious correlations, which can have potentially fatal consequences in +practice. Detecting and mitigating shortcut behavior is a challenging task that +often requires significant labeling efforts from domain experts. To alleviate +this problem, we introduce a semi-automated framework for the identification of +spurious behavior from both data and model perspective by leveraging insights +from eXplainable Artificial Intelligence (XAI). This allows the retrieval of +spurious data points and the detection of model circuits that encode the +associated prediction rules. Moreover, we demonstrate how these shortcut +encodings can be used for XAI-based sample- and pixel-level data annotation, +providing valuable information for bias mitigation methods to unlearn the +undesired shortcut behavior. 
We show the applicability of our framework using +four medical datasets across two modalities, featuring controlled and +real-world spurious correlations caused by data artifacts. We successfully +identify and mitigate these biases in VGG16, ResNet50, and contemporary Vision +Transformer models, ultimately increasing their robustness and applicability +for real-world medical tasks. -摘要:儘管大型語言模型在特定任務應用中透過微調展現出卓越的能力,但要讓這些好處擴及各種語言,對於廣泛的可及性來說至關重要。然而,有效的跨語言轉移受到跨語言 LLM 效能差距以及許多語言中微調資料的稀少性所阻礙。透過分析來自 1,000 多種語言對的 LLM 內部表示,我們發現中間層展現出最強的跨語言對齊潛力。根據這個發現,我們提出一個整合到特定任務訓練中的中間層對齊目標。我們在插槽填補、機器翻譯和結構化文字生成方面的實驗顯示,跨語言轉移持續改善,特別是對於低資源語言。此方法對於對齊語言的選擇具有穩健性,並推廣到對齊期間未曾見過的語言。此外,我們展示了單獨訓練的對齊模組可以與現有的特定任務模組合併,在不重新訓練的情況下改善跨語言能力。我們的程式碼已公開(https://github.com/dannigt/mid-align)。 +摘要:深度神经网络越来越多地用于高风险医疗应用中,尽管它们在存在虚假相关性的情况下倾向于捷径学习,这在实践中可能产生致命的后果。检测和缓解捷径行为是一项艰巨的任务,通常需要领域专家的大量标记工作。为了缓解这个问题,我们引入了一个半自动框架,用于从数据和模型的角度识别虚假行为,方法是利用可解释人工智能 (XAI) 的见解。这允许检索虚假数据点并检测对关联预测规则进行编码的模型电路。此外,我们演示了如何使用这些捷径编码进行基于 XAI 的样本和像素级数据注释,为偏差缓解方法提供有价值的信息,以消除不需要的捷径行为。我们使用跨越两种方式的四个医学数据集展示了我们框架的适用性,这些数据集具有由数据伪像引起的受控和真实世界虚假相关性。我们成功地识别并减轻了 VGG16、ResNet50 和当代 Vision Transformer 模型中的这些偏差,最终提高了它们的鲁棒性和在真实世界医疗任务中的适用性。 -##### **Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps** -2502.14829v1 by Martin Tutek, Fateme Hashemi Chaleshtori, Ana Marasović, Yonatan Belinkov +##### **Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model** +2501.11094v1 by Mohaiminul Islam Bhuiyan, Nur Shazwani Kamarudin, Nur Hafieza Ismail -When prompted to think step-by-step, language models (LMs) produce a chain of -thought (CoT), a sequence of reasoning steps that the model supposedly used to -produce its prediction. However, despite much work on CoT prompting, it is -unclear if CoT reasoning is faithful to the models' parameteric beliefs. 
We -introduce a framework for measuring parametric faithfulness of generated -reasoning, and propose Faithfulness by Unlearning Reasoning steps (FUR), an -instance of this framework. FUR erases information contained in reasoning steps -from model parameters. We perform experiments unlearning CoTs of four LMs -prompted on four multi-choice question answering (MCQA) datasets. Our -experiments show that FUR is frequently able to change the underlying models' -prediction by unlearning key steps, indicating when a CoT is parametrically -faithful. Further analysis shows that CoTs generated by models post-unlearning -support different answers, hinting at a deeper effect of unlearning. -Importantly, CoT steps identified as important by FUR do not align well with -human notions of plausbility, emphasizing the need for specialized alignment +Suicidal ideation detection is crucial for preventing suicides, a leading +cause of death worldwide. Many individuals express suicidal thoughts on social +media, offering a vital opportunity for early detection through advanced +machine learning techniques. The identification of suicidal ideation in social +media text is improved by utilising a hybrid framework that integrates +Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory +(BiLSTM), enhanced with an attention mechanism. To enhance the interpretability +of the model's predictions, Explainable AI (XAI) methods are applied, with a +particular focus on SHapley Additive exPlanations (SHAP), are incorporated. At +first, the model managed to reach an accuracy of 92.81%. By applying +fine-tuning and early stopping techniques, the accuracy improved to 94.29%. The +SHAP analysis revealed key features influencing the model's predictions, such +as terms related to mental health struggles. This level of transparency boosts +the model's credibility while helping mental health professionals understand +and trust the predictions. 
This work highlights the potential for improving the +accuracy and interpretability of detecting suicidal tendencies, making a +valuable contribution to the progress of mental health monitoring systems. It +emphasizes the significance of blending powerful machine learning methods with +explainability to develop reliable and impactful mental health solutions. -摘要:当提示逐步思考时,语言模型 (LM) 会产生一系列思考 (CoT),这是模型用来产生预测的一系列推理步骤。然而,尽管在 CoT 提示上做了很多工作,但尚不清楚 CoT 推理是否符合模型的参数化信念。我们引入了一个框架来衡量生成推理的参数化保真度,并提出了通过取消学习推理步骤 (FUR) 的保真度,这是该框架的一个实例。FUR 从模型参数中擦除推理步骤中包含的信息。我们执行实验,取消学习提示在四个多项选择问答 (MCQA) 数据集上的四个 LM 的 CoT。我们的实验表明,FUR 经常能够通过取消学习关键步骤来改变底层模型的预测,表明 CoT 在参数上是保真的。进一步的分析表明,模型在取消学习后生成的 CoT 支持不同的答案,暗示取消学习具有更深层次的影响。重要的是,FUR 确定的 CoT 步骤与人类对合理性的概念不太一致,强调了专门对齐的必要性 +摘要:自殺意念偵測對於預防自殺至關重要,而自殺是全球主要的死亡原因。許多人在社群媒體上表達自殺念頭,這提供了透過進階機器學習技術進行早期偵測的重要機會。透過整合卷積神經網路 (CNN) 和雙向長短期記憶 (BiLSTM) 的混合架構,並加入注意力機制,可以提升在社群媒體文字中辨識自殺意念的能力。為了加強模型預測的可解釋性,我們採用可解釋人工智慧 (XAI) 方法,特別著重於 SHapley 加法解釋 (SHAP)。一開始,模型成功達到 92.81% 的準確度。透過套用微調和早期停止技術,準確度提升至 94.29%。SHAP 分析揭露了影響模型預測的關鍵特徵,例如與心理健康困境相關的詞彙。這種透明度提升了模型的可信度,同時協助心理健康專業人員理解和信賴預測結果。這項工作突顯了提升偵測自殺傾向的準確度和可解釋性的潛力,為心理健康監控系統的進展做出寶貴的貢獻。它強調了將強大的機器學習方法與可解釋性相結合以開發可靠且有影響力的心理健康解決方案的重要性。 -##### **Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison** -2502.14827v1 by Aiswarya Baby, Tintu Thankom Koshy +##### **SEANN: A Domain-Informed Neural Network for Epidemiological Insights** +2501.10273v1 by Jean-Baptiste Guimbaud, Marc Plantevit, Léa Maître, Rémy Cazabet -Visual Question Answering (VQA) has emerged as a pivotal task in the -intersection of computer vision and natural language processing, requiring -models to understand and reason about visual content in response to natural -language questions. Analyzing VQA datasets is essential for developing robust -models that can handle the complexities of multimodal reasoning. 
Several -approaches have been developed to examine these datasets, each offering -distinct perspectives on question diversity, answer distribution, and -visual-textual correlations. Despite significant progress, existing VQA models -face challenges related to dataset bias, limited model complexity, commonsense -reasoning gaps, rigid evaluation methods, and generalization to real world -scenarios. This paper presents a comprehensive comparative study of five -advanced VQA models: ABC-CNN, KICNLE, Masked Vision and Language Modeling, -BLIP-2, and OFA, each employing distinct methodologies to address these -challenges. +In epidemiology, traditional statistical methods such as logistic regression, +linear regression, and other parametric models are commonly employed to +investigate associations between predictors and health outcomes. However, +non-parametric machine learning techniques, such as deep neural networks +(DNNs), coupled with explainable AI (XAI) tools, offer new opportunities for +this task. Despite their potential, these methods face challenges due to the +limited availability of high-quality, high-quantity data in this field. To +address these challenges, we introduce SEANN, a novel approach for informed +DNNs that leverages a prevalent form of domain-specific knowledge: Pooled +Effect Sizes (PES). PESs are commonly found in published Meta-Analysis studies, +in different forms, and represent a quantitative form of a scientific +consensus. By direct integration within the learning procedure using a custom +loss, we experimentally demonstrate significant improvements in the +generalizability of predictive performances and the scientific plausibility of +extracted relationships compared to a domain-knowledge agnostic neural network +in a scarce and noisy data setting. 
-摘要:視覺問答 (VQA) 已成為電腦視覺與自然語言處理交會中的關鍵任務,要求模型理解和推理視覺內容以回應自然語言問題。分析 VQA 資料集對於開發健全的模型至關重要,這些模型能夠處理多模態推理的複雜性。已經開發出多種方法來檢驗這些資料集,每種方法都提供有關問題多樣性、答案分佈和視覺文本關聯性的不同觀點。儘管有顯著進展,現有的 VQA 模型仍面臨與資料集偏差、模型複雜性有限、常識推理差距、僵化的評估方法和推廣到現實世界場景相關的挑戰。本文對五個先進的 VQA 模型進行了全面的比較研究:ABC-CNN、KICNLE、Masked Vision and Language Modeling、BLIP-2 和 OFA,每個模型都採用不同的方法來應對這些挑戰。 +摘要:在流行病學中,傳統的統計方法,例如邏輯迴歸、線性迴歸和其他參數模型通常用於調查預測因子與健康結果之間的關聯。然而,非參數機器學習技術,例如深度神經網路 (DNN),結合可解釋的 AI (XAI) 工具,為這項任務提供了新的機會。儘管這些方法具有潛力,但由於該領域缺乏高品質、高數量資料,因此這些方法面臨挑戰。為了應對這些挑戰,我們引入了 SEANN,這是一種新穎的方法,用於獲取知識的 DNN,它利用了一種流行的領域特定知識形式:彙總效應量 (PES)。PES 通常以不同的形式出現在已發表的 Meta 分析研究中,並代表科學共識的量化形式。通過使用自訂損失函數直接整合在學習程序中,我們以實驗方式證明了預測效能的概括性以及與從缺乏領域知識的神經網路中提取的關係相比,科學合理性的顯著提升,且是在稀少且有雜訊的資料設定中。 -##### **eC-Tab2Text: Aspect-Based Text Generation from e-Commerce Product Tables** -2502.14820v1 by Luis Antonio Gutiérrez Guanilo, Mir Tafseer Nayeem, Cristian López, Davood Rafiei +##### **Artificial Intelligence-Driven Clinical Decision Support Systems** +2501.09628v2 by Muhammet Alkan, Idris Zakariyya, Samuel Leighton, Kaushik Bhargav Sivangi, Christos Anagnostopoulos, Fani Deligianni -Large Language Models (LLMs) have demonstrated exceptional versatility across -diverse domains, yet their application in e-commerce remains underexplored due -to a lack of domain-specific datasets. To address this gap, we introduce -eC-Tab2Text, a novel dataset designed to capture the intricacies of e-commerce, -including detailed product attributes and user-specific queries. Leveraging -eC-Tab2Text, we focus on text generation from product tables, enabling LLMs to -produce high-quality, attribute-specific product reviews from structured -tabular data. Fine-tuned models were rigorously evaluated using standard -Table2Text metrics, alongside correctness, faithfulness, and fluency -assessments. 
Our results demonstrate substantial improvements in generating -contextually accurate reviews, highlighting the transformative potential of -tailored datasets and fine-tuning methodologies in optimizing e-commerce -workflows. This work highlights the potential of LLMs in e-commerce workflows -and the essential role of domain-specific datasets in tailoring them to -industry-specific challenges. +As artificial intelligence (AI) becomes increasingly embedded in healthcare +delivery, this chapter explores the critical aspects of developing reliable and +ethical Clinical Decision Support Systems (CDSS). Beginning with the +fundamental transition from traditional statistical models to sophisticated +machine learning approaches, this work examines rigorous validation strategies +and performance assessment methods, including the crucial role of model +calibration and decision curve analysis. The chapter emphasizes that creating +trustworthy AI systems in healthcare requires more than just technical +accuracy; it demands careful consideration of fairness, explainability, and +privacy. The challenge of ensuring equitable healthcare delivery through AI is +stressed, discussing methods to identify and mitigate bias in clinical +predictive models. The chapter then delves into explainability as a cornerstone +of human-centered CDSS. This focus reflects the understanding that healthcare +professionals must not only trust AI recommendations but also comprehend their +underlying reasoning. The discussion advances in an analysis of privacy +vulnerabilities in medical AI systems, from data leakage in deep learning +models to sophisticated attacks against model explanations. The text explores +privacy-preservation strategies such as differential privacy and federated +learning, while acknowledging the inherent trade-offs between privacy +protection and model performance. 
This progression, from technical validation +to ethical considerations, reflects the multifaceted challenges of developing +AI systems that can be seamlessly and reliably integrated into daily clinical +practice while maintaining the highest standards of patient care and data +protection. -摘要:大型語言模型 (LLM) 在各種領域展現出非凡的多功能性,但由於缺乏特定領域的資料集,因此它們在電子商務中的應用仍未得到充分探索。為了解決這個差距,我們引入了 eC-Tab2Text,這是一個新穎的資料集,旨在捕捉電子商務的複雜性,包括詳細的產品屬性和使用者特定的查詢。利用 eC-Tab2Text,我們專注於從產品表格中產生文字,使 LLM 能夠從結構化的表格資料中產生高品質、特定屬性的產品評論。微調模型使用標準的 Table2Text 指標,以及正確性、忠實度和流利度評估進行嚴格評估。我們的結果證明在產生符合語境的準確評論方面有顯著的進步,突顯了客製化資料集和微調方法在最佳化電子商務工作流程中的轉型潛力。這項工作突顯了 LLM 在電子商務工作流程中的潛力,以及特定領域資料集在因應產業特定挑戰中至關重要的角色。 +摘要:隨著人工智慧(AI)在醫療保健服務中日益普及,本章探討了開發可靠且符合道德的臨床決策支援系統 (CDSS) 的關鍵面向。從傳統統計模型轉變到複雜機器學習方法的基本原理開始,這項工作探討了嚴謹的驗證策略和效能評估方法,包括模型校準和決策曲線分析的關鍵角色。本章強調,在醫療保健中建立值得信賴的 AI 系統不僅需要技術準確性;它需要仔細考量公平性、可解釋性和隱私。本章強調了透過 AI 確保公平醫療保健服務的挑戰,並討論了識別和減輕臨床預測模型中偏差的方法。接著,本章深入探討可解釋性作為以人為中心的 CDSS 的基石。這種關注反映了對醫療保健專業人員不僅必須信任 AI 建議,還必須理解其背後推理的理解。討論進展到對醫療 AI 系統中隱私漏洞的分析,從深度學習模型中的資料外洩到針對模型解釋的複雜攻擊。本文探討了隱私保護策略,例如差分隱私和聯合學習,同時承認隱私保護和模型效能之間的固有權衡。從技術驗證到道德考量,這種進展反映了開發 AI 系統的多方面挑戰,這些系統可以無縫且可靠地整合到日常臨床實務中,同時維持最高標準的患者照護和資料保護。 -##### **Optimizing Model Selection for Compound AI Systems** -2502.14815v1 by Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Matei Zaharia, James Zou, Ion Stoica +##### **MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis** +2501.06887v1 by Sadia Kamal, Tim Oates -Compound AI systems that combine multiple LLM calls, such as self-refine and -multi-agent-debate, achieve strong performance on many AI tasks. We address a -core question in optimizing compound systems: for each LLM call or module in -the system, how should one decide which LLM to use? We show that these LLM -choices have a large effect on quality, but the search space is exponential. 
We -propose LLMSelector, an efficient framework for model selection in compound -systems, which leverages two key empirical insights: (i) end-to-end performance -is often monotonic in how well each module performs, with all other modules -held fixed, and (ii) per-module performance can be estimated accurately by an -LLM. Building upon these insights, LLMSelector iteratively selects one module -and allocates to it the model with the highest module-wise performance, as -estimated by an LLM, until no further gain is possible. LLMSelector is -applicable to any compound system with a bounded number of modules, and its -number of API calls scales linearly with the number of modules, achieving -high-quality model allocation both empirically and theoretically. Experiments -with popular compound systems such as multi-agent debate and self-refine using -LLMs such as GPT-4o, Claude 3.5 Sonnet and Gemini 1.5 show that LLMSelector -confers 5%-70% accuracy gains compared to using the same LLM for all modules. +As deep learning models gain traction in medical data, ensuring transparent +and trustworthy decision-making is essential. In skin cancer diagnosis, while +advancements in lesion detection and classification have improved accuracy, the +black-box nature of these methods poses challenges in understanding their +decision processes, leading to trust issues among physicians. This study +leverages the CLIP (Contrastive Language-Image Pretraining) model, trained on +different skin lesion datasets, to capture meaningful relationships between +visual features and diagnostic criteria terms. To further enhance transparency, +we propose a method called MedGrad E-CLIP, which builds on gradient-based +E-CLIP by incorporating a weighted entropy mechanism designed for complex +medical imaging like skin lesions. This approach highlights critical image +regions linked to specific diagnostic descriptions.
The developed integrated +pipeline not only classifies skin lesions by matching corresponding +descriptions but also adds an essential layer of explainability developed +especially for medical data. By visually explaining how different features in +an image relate to diagnostic criteria, this approach demonstrates the +potential of advanced vision-language models in medical image analysis, +ultimately improving transparency, robustness, and trust in AI-driven +diagnostic systems. -摘要:複合式 AI 系統結合多個 LLM 呼叫,例如自我精煉和多代理辯論,在許多 AI 任務中都能獲得強大的效能。我們解決了最佳化複合式系統中的核心問題:對於系統中的每個 LLM 呼叫或模組,應該如何決定要使用哪個 LLM?我們表明這些 LLM 選擇對品質有很大的影響,但搜尋空間是呈指數增長的。我們提出 LLMSelector,一種用於複合式系統中模型選擇的有效架構,它利用了兩個主要的經驗見解:(i) 端對端效能通常會隨著每個模組執行得有多好而單調變化,而其他所有模組保持固定,以及 (ii) 每個模組的效能都可以由 LLM 精準估計。LLMSelector 建立在這些見解之上,反覆選擇一個模組,並根據 LLM 估計的模組最佳效能,將模型分配給它,直到無法再進一步提升為止。LLMSelector 適用於任何具有有限數量的模組的複合式系統,其 API 呼叫數量與模組數量成線性比例,在經驗和理論上都實現了高品質的模型配置。使用 GPT-4o、Claude 3.5 Sonnet 和 Gemini 1.5 等 LLM,對多代理辯論和自我精煉等熱門複合式系統進行的實驗表明,與對所有模組使用相同的 LLM 相比,LLMSelector 可帶來 5%-70% 的準確度提升。 +摘要:随着深度学习模型在医学数据中获得关注,确保透明且值得信赖的决策至关重要。在皮肤癌诊断中,虽然病灶检测和分类的进步提高了准确性,但这些方法的黑盒性质对理解其决策过程构成了挑战,导致医生之间的信任问题。本研究利用在不同皮肤病变数据集上训练的 CLIP(对比语言图像预训练)模型,以捕捉视觉特征和诊断标准术语之间的有意义关系。为了进一步提高透明度,我们提出了一种名为 MedGrad E-CLIP 的方法,该方法通过结合专为皮肤病变等复杂医学影像设计的加权熵机制,建立在基于梯度的 E-CLIP 之上。此方法突出了与特定诊断描述相关联的关键图像区域。开发的集成管道不仅通过匹配相应的描述对皮肤病变进行分类,还添加了一层专门为医学数据开发的基本可解释性。通过直观地解释图像中不同特征与诊断标准的关系,这种方法展示了高级视觉语言模型在医学图像分析中的潜力,最终提高了透明度、稳健性和对人工智能驱动的诊断系统的信任。 -##### **FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis** -2502.14807v1 by Fadillah Maani, Numan Saeed, Tausifa Saleem, Zaid Farooq, Hussain Alasmawi, Werner Diehl, Ameera Mohammad, Gareth Waring, Saudabi Valappi, Leanne Bricker, Mohammad Yaqub +##### **Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis** +2501.02891v1 by Mary Ogbuka Kenneth, Foaad Khosmood, Abbas Edalat -Foundation models are becoming increasingly effective in the medical domain, -offering pre-trained
models on large datasets that can be readily adapted for -downstream tasks. Despite progress, fetal ultrasound images remain a -challenging domain for foundation models due to their inherent complexity, -often requiring substantial additional training and facing limitations due to -the scarcity of paired multimodal data. To overcome these challenges, here we -introduce FetalCLIP, a vision-language foundation model capable of generating -universal representation of fetal ultrasound images. FetalCLIP was pre-trained -using a multimodal learning approach on a diverse dataset of 210,035 fetal -ultrasound images paired with text. This represents the largest paired dataset -of its kind used for foundation model development to date. This unique training -approach allows FetalCLIP to effectively learn the intricate anatomical -features present in fetal ultrasound images, resulting in robust -representations that can be used for a variety of downstream applications. In -extensive benchmarking across a range of key fetal ultrasound applications, -including classification, gestational age estimation, congenital heart defect -(CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all -baselines while demonstrating remarkable generalizability and strong -performance even with limited labeled data. We plan to release the FetalCLIP -model publicly for the benefit of the broader scientific community. +Humour styles can have either a negative or a positive impact on well-being. +Given the importance of these styles to mental health, significant research has +been conducted on their automatic identification. However, the automated +machine learning models used for this purpose are black boxes, making their +prediction decisions opaque. Clarity and transparency are vital in the field of +mental health. This paper presents an explainable AI (XAI) framework for +understanding humour style classification, building upon previous work in +computational humour analysis. 
Using the best-performing single model +(ALI+XGBoost) from prior research, we apply comprehensive XAI techniques to +analyse how linguistic, emotional, and semantic features contribute to humour +style classification decisions. Our analysis reveals distinct patterns in how +different humour styles are characterised and misclassified, with particular +emphasis on the challenges in distinguishing affiliative humour from other +styles. Through detailed examination of feature importance, error patterns, and +misclassification cases, we identify key factors influencing model decisions, +including emotional ambiguity, context misinterpretation, and target +identification. The framework demonstrates significant utility in understanding +model behaviour, achieving interpretable insights into the complex interplay of +features that define different humour styles. Our findings contribute to both +the theoretical understanding of computational humour analysis and practical +applications in mental health, content moderation, and digital humanities +research. 
-摘要:基礎模型在醫療領域正變得越來越有效, -提供在大型資料集上預先訓練的模型,可輕鬆適應 -下游任務。儘管有進展,但胎兒超音波影像仍然是 -基礎模型的挑戰領域,因為它們固有的複雜性, -通常需要大量的額外訓練,並且由於配對多模態數據的稀缺而面臨限制。為了克服這些挑戰,我們在此 -介紹 FetalCLIP,一種能夠產生 -胎兒超音波影像通用表示的視覺語言基礎模型。FetalCLIP 使用多模態學習方法在包含 210,035 張胎兒 -超音波影像與文字配對的多樣化資料集上進行預訓練。這代表迄今為止用於基礎模型開發的最大配對資料集。這種獨特的訓練 -方法使 FetalCLIP 能夠有效地學習胎兒超音波影像中存在的複雜解剖特徵,從而產生強大的 -表示,可應用於各種下游應用。在涵蓋一系列關鍵胎兒超音波應用(包括分類、胎齡估算、先天性心臟缺陷 -(CHD) 偵測和胎兒結構分割)的廣泛基準測試中,FetalCLIP 在展現出卓越的泛化能力和強勁的 -效能,即使標記資料有限,也優於所有基準。我們計畫公開發布 FetalCLIP 模型,造福廣大的科學界。 +摘要:幽默風格對幸福感可能產生負面或正面的影響。 +鑑於這些風格對心理健康的重要性,已經對其自動識別進行了大量研究。然而,用於此目的的自動機器學習模型是黑盒子,使得其預測決策不透明。清晰度和透明度在心理健康領域至關重要。本文提出了一個可解釋的 AI (XAI) 框架,用於理解幽默風格分類,建立在計算幽默分析的先前工作之上。使用先前研究中表現最好的單一模型 (ALI+XGBoost),我們應用全面的 XAI 技術來分析語言、情緒和語義特徵如何影響幽默風格分類決策。我們的分析揭示了不同幽默風格如何被表徵和錯誤分類的不同模式,特別強調了區分聯屬幽默與其他風格的挑戰。通過仔細檢查特徵重要性、錯誤模式和錯誤分類案例,我們確定了影響模型決策的關鍵因素,包括情緒模糊、情境誤解和目標識別。該框架展示了在理解模型行為方面的顯著效用,實現了對定義不同幽默風格的特徵之間複雜相互作用的可解釋見解。我們的發現有助於計算幽默分析的理論理解和心理健康、內容審核和數字人文研究中的實際應用。 + +##### **The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support** +2412.20068v1 by Alessandro De Grandi, Federico Ravenda, Andrea Raballo, Fabio Crestani + +The increasing demand for mental health services has highlighted the need for +innovative solutions, particularly in the realm of psychological conversational +AI, where the availability of sensitive data is scarce. In this work, we +explored the development of a system tailored for mental health support with a +novel approach to psychological assessment based on explainable emotional +profiles in combination with empathetic conversational models, offering a +promising tool for augmenting traditional care, particularly where immediate +expertise is unavailable. Our work can be divided into two main parts, +intrinsically connected to each other.
First, we present RACLETTE, a +conversational system that demonstrates superior emotional accuracy compared to +state-of-the-art benchmarks in both understanding users' emotional states and +generating empathetic responses during conversations, while progressively +building an emotional profile of the user through their interactions. Second, +we show how the emotional profiles of a user can be used as interpretable +markers for mental health assessment. These profiles can be compared with +characteristic emotional patterns associated with different mental disorders, +providing a novel approach to preliminary screening and support. -##### **From RAG to Memory: Non-Parametric Continual Learning for Large Language Models** -2502.14802v1 by Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, Yu Su +摘要:隨著對心理健康服務需求的增加,凸顯了創新解決方案的需求,特別是在心理對話式人工智慧領域,那裡缺乏敏感資料。在這項工作中,我們探索了開發一個針對心理健康支持的系統,採用一種基於可解釋的情緒特徵的新方法進行心理評估,結合同理心對話模式,提供了一個有前途的工具,用於擴充傳統照護,特別是在無法立即獲得專業知識的情況下。我們的工作可以分為兩個主要部分,彼此內在相關。首先,我們展示了 RACLETTE,一個對話系統,與最先進的基準相比,在理解使用者情緒狀態和在對話中產生同理心回應方面表現出優越的情緒準確性,同時透過他們的互動逐漸建立使用者的情緒特徵。其次,我們展示了使用者的情緒特徵如何可用作心理健康評估的可解釋標記。這些特徵可以與與不同心理疾病相關的典型情緒模式進行比較,提供了一種初步篩選和支持的新方法。 -Our ability to continuously acquire, organize, and leverage knowledge is a -key feature of human intelligence that AI systems must approximate to unlock -their full potential. Given the challenges in continual learning with large -language models (LLMs), retrieval-augmented generation (RAG) has become the -dominant way to introduce new information. However, its reliance on vector -retrieval hinders its ability to mimic the dynamic and interconnected nature of -human long-term memory. Recent RAG approaches augment vector embeddings with -various structures like knowledge graphs to address some of these gaps, namely -sense-making and associativity. However, their performance on more basic -factual memory tasks drops considerably below standard RAG. 
We address this -unintended deterioration and propose HippoRAG 2, a framework that outperforms -standard RAG comprehensively on factual, sense-making, and associative memory -tasks. HippoRAG 2 builds upon the Personalized PageRank algorithm used in -HippoRAG and enhances it with deeper passage integration and more effective -online use of an LLM. This combination pushes this RAG system closer to the -effectiveness of human long-term memory, achieving a 7% improvement in -associative memory tasks over the state-of-the-art embedding model while also -exhibiting superior factual knowledge and sense-making memory capabilities. -This work paves the way for non-parametric continual learning for LLMs. Our -code and data will be released at https://github.com/OSU-NLP-Group/HippoRAG. +##### **A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation** +2412.19688v1 by Jana Zakall, Birgit Pohn, Antonia Graf, Daniel Kovatchki, Arezoo Borji, Ragib Shahriar Islam, Hossam Haick, Heinz Strohmer, Sepideh Hatamikia -摘要:我們持續獲取、組織和利用知識的能力是人類智慧的一項關鍵特徵,而人工智慧系統必須近似於此才能發揮其全部潛力。由於大型語言模型 (LLM) 持續學習的挑戰,檢索增強生成 (RAG) 已成為引入新資訊的主流方式。然而,它依賴向量檢索阻礙了它模擬人類長期記憶的動態和相互連結的本質。最近的 RAG 方法用各種結構(如知識圖譜)增強向量嵌入,以解決其中一些差距,即意義建構和聯想性。然而,它們在更基本的實際記憶任務上的表現遠低於標準 RAG。我們解決了這種意外的惡化,並提出了 HippoRAG 2,這是一個在實際、意義建構和聯想記憶任務上全面優於標準 RAG 的框架。HippoRAG 2 建立在 HippoRAG 中使用的 Personalized PageRank 演算法之上,並透過更深入的段落整合和更有效的 LLM 線上使用來增強它。這種組合將此 RAG 系統推向更接近人類長期記憶的效能,在聯想記憶任務上比最先進的嵌入模型提升了 7%,同時也展現出優異的實際知識和意義建構記憶能力。這項工作為 LLM 的非參數持續學習鋪平了道路。我們的程式碼和資料將在 https://github.com/OSU-NLP-Group/HippoRAG 上發布。 +Artificial intelligence (AI) has emerged as a powerful tool to enhance +decision-making and optimize treatment protocols in in vitro fertilization +(IVF). In particular, AI shows significant promise in supporting +decision-making during the ovarian stimulation phase of the IVF process. 
This +review evaluates studies focused on the applications of AI combined with +medical imaging in ovarian stimulation, examining methodologies, outcomes, and +current limitations. Our analysis of 13 studies on this topic reveals that, +while AI algorithms demonstrated notable potential in predicting +optimal hormonal dosages, trigger timing, and oocyte retrieval outcomes, the +medical imaging data utilized predominantly came from two-dimensional (2D) +ultrasound which mainly involved basic quantifications, such as follicle size +and number, with limited use of direct feature extraction or advanced image +analysis techniques. This points to an underexplored opportunity where advanced +image analysis approaches, such as deep learning, and more diverse imaging +modalities, like three-dimensional (3D) ultrasound, could unlock deeper +insights. Additionally, the lack of explainable AI (XAI) in most studies raises +concerns about the transparency and traceability of AI-driven decisions - key +factors for clinical adoption and trust. Furthermore, many studies relied on +single-center designs and small datasets, which limit the generalizability of +their findings. This review highlights the need for integrating advanced +imaging analysis techniques with explainable AI methodologies, as well as the +importance of leveraging multicenter collaborations and larger datasets. +Addressing these gaps has the potential to enhance ovarian stimulation +management, paving the way for efficient, personalized, and data-driven +treatment pathways that improve IVF outcomes.
-##### **A Survey on Text-Driven 360-Degree Panorama Generation** -2502.14799v1 by Hai Wang, Xiaoyu Xiang, Weihao Xia, Jing-Hao Xue +摘要:人工智慧(AI)已成為增強體外受精(IVF)決策制定和優化治療方案的強大工具。特別是,AI 在支持 IVF 過程中卵巢刺激階段的決策制定方面顯示出顯著的前景。本綜述評估了專注於 AI 結合卵巢刺激中的醫學影像應用、檢驗方法、結果和當前限制的研究。我們對 13 項關於此主題的研究分析顯示,雖然 AI 演算法在預測最佳荷爾蒙劑量、觸發時機和卵子取出結果方面表現出顯著的潛力,但所利用的醫學影像數據主要來自於二次元(2D)超音波,而二次元超音波主要涉及基本量化,例如濾泡大小和數量,且有限使用直接特徵提取或進階影像分析技術。這指向一個尚未探索的機會,例如深度學習等進階影像分析方法,以及更多元的影像模式,例如三維(3D)超音波,可以解鎖更深入的見解。此外,大多數研究缺乏可解釋 AI(XAI),這引起了人們對 AI 驅動決策的透明度和可追溯性的擔憂,而透明度和可追溯性是臨床採用和信任的關鍵因素。此外,許多研究依賴於單中心設計和小型數據集,這限制了其發現的普遍性。本綜述強調了將進階影像分析技術與可解釋 AI 方法整合起來的必要性,以及利用多中心合作和大型數據集的重要性。解決這些差距有可能增強卵巢刺激管理,為有效、個人化和數據驅動的治療途徑鋪平道路,進而改善 IVF 結果。 -The advent of text-driven 360-degree panorama generation, enabling the -synthesis of 360-degree panoramic images directly from textual descriptions, -marks a transformative advancement in immersive visual content creation. This -innovation significantly simplifies the traditionally complex process of -producing such content. Recent progress in text-to-image diffusion models has -accelerated the rapid development in this emerging field. This survey presents -a comprehensive review of text-driven 360-degree panorama generation, offering -an in-depth analysis of state-of-the-art algorithms and their expanding -applications in 360-degree 3D scene generation. Furthermore, we critically -examine current limitations and propose promising directions for future -research. A curated project page with relevant resources and research papers is -available at https://littlewhitesea.github.io/Text-Driven-Pano-Gen/. +##### **Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models** +2412.17527v1 by Badaru I. Olumuyiwa, The Anh Han, Zia U. 
Shamszaman -摘要:文字驅動 360 度全景圖生成技術的出現,使能從文字描述中直接合成 360 度全景圖像,標誌著沉浸式視覺內容創作的變革性進展。這項創新顯著簡化了傳統上複雜的製作此類內容的過程。最近在文字轉圖像擴散模型方面的進展加速了這個新興領域的快速發展。本調查提供了對文字驅動 360 度全景圖生成的全面回顧,深入分析了最先進的演算法及其在 360 度 3D 場景生成中的擴展應用。此外,我們批判性地審視了當前的限制,並提出了未來研究的有希望的方向。一個精選的專案頁面,其中包含相關資源和研究論文,可在 https://littlewhitesea.github.io/Text-Driven-Pano-Gen/ 獲得。 +This research presents an innovative approach to cancer diagnosis and +prediction using explainable Artificial Intelligence (XAI) and deep learning +techniques. With cancer causing nearly 10 million deaths globally in 2020, +early and accurate diagnosis is crucial. Traditional methods often face +challenges in cost, accuracy, and efficiency. Our study develops an AI model +that provides precise outcomes and clear insights into its decision-making +process, addressing the "black box" problem of deep learning models. By +employing XAI techniques, we enhance interpretability and transparency, +building trust among healthcare professionals and patients. Our approach +leverages neural networks to analyse extensive datasets, identifying patterns +for cancer detection. This model has the potential to revolutionise diagnosis +by improving accuracy, accessibility, and clarity in medical decision-making, +possibly leading to earlier detection and more personalised treatment +strategies. Furthermore, it could democratise access to high-quality +diagnostics, particularly in resource-limited settings, contributing to global +health equity. The model's applications extend beyond cancer diagnosis, +potentially transforming various aspects of medical decision-making and saving +millions of lives worldwide. -##### **Rapid Word Learning Through Meta In-Context Learning** -2502.14791v1 by Wentao Wang, Guangyuan Jiang, Tal Linzen, Brenden M. 
Lake +摘要:本研究提出了一個創新的癌症診斷和預測方法,使用可解釋的人工智慧 (XAI) 和深度學習技術。由於癌症在 2020 年造成全球近 1,000 萬人死亡,因此早期準確的診斷至關重要。傳統方法通常面臨成本、準確性和效率方面的挑戰。我們的研究開發了一個 AI 模型,它提供精確的結果並清楚地了解其決策過程,解決了深度學習模型的「黑箱」問題。通過採用 XAI 技術,我們增強了解釋性和透明度,在醫療專業人員和患者之間建立信任。我們的做法利用神經網路分析廣泛的數據集,識別癌症檢測模式。這個模型有可能通過提高醫療決策的準確性、可及性和清晰度來革新診斷,可能導致更早的檢測和更個性化的治療策略。此外,它可以使更多人獲得高品質的診斷,特別是在資源有限的環境中,有助於全球健康公平。該模型的應用範圍不僅限於癌症診斷,還可能轉變醫療決策的各個方面,並拯救全球數百萬人的生命。 -Humans can quickly learn a new word from a few illustrative examples, and -then systematically and flexibly use it in novel contexts. Yet the abilities of -current language models for few-shot word learning, and methods for improving -these abilities, are underexplored. In this study, we introduce a novel method, -Meta-training for IN-context learNing Of Words (Minnow). This method trains -language models to generate new examples of a word's usage given a few -in-context examples, using a special placeholder token to represent the new -word. This training is repeated on many new words to develop a general -word-learning ability. We find that training models from scratch with Minnow on -human-scale child-directed language enables strong few-shot word learning, -comparable to a large language model (LLM) pre-trained on orders of magnitude -more data. Furthermore, through discriminative and generative evaluations, we -demonstrate that finetuning pre-trained LLMs with Minnow improves their ability -to discriminate between new words, identify syntactic categories of new words, -and generate reasonable new usages and definitions for new words, based on one -or a few in-context examples. These findings highlight the data efficiency of -Minnow and its potential to improve language model performance in word learning -tasks. 
+##### **Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG** +2412.16086v2 by Hasan Md Tusfiqur Alam, Devansh Srivastav, Md Abdul Kadir, Daniel Sonntag -摘要:人類可以從幾個說明性的範例中快速學習一個新字詞,然後系統性且靈活地將其用於新的脈絡中。然而,目前語言模型在少量字詞學習中的能力,以及改善這些能力的方法,尚未得到充分探討。在這項研究中,我們引入了一種新方法,即「用於字詞情境學習的元訓練」(Minnow)。此方法訓練語言模型在給定幾個情境範例的情況下,產生字詞用法的範例,並使用特殊佔位符標記來表示新的字詞。此訓練會在許多新字詞上重複進行,以培養一般的字詞學習能力。我們發現,從頭開始使用 Minnow 在人類規模的兒童導向語言上訓練模型,可以實現強大的少量字詞學習能力,這與預先在大量資料上訓練的大型語言模型 (LLM) 相當。此外,透過區辨性和生成性評估,我們證明使用 Minnow 微調預先訓練的 LLM 可以提升其區辨新字詞、識別新字詞的句法類別,以及根據一個或幾個情境範例產生合理的新用法和定義的能力。這些發現突顯了 Minnow 的資料效率,以及它在字詞學習任務中提升語言模型效能的潛力。 +Deep learning has advanced medical image classification, but interpretability +challenges hinder its clinical adoption. This study enhances interpretability +in Chest X-ray (CXR) classification by using concept bottleneck models (CBMs) +and a multi-agent Retrieval-Augmented Generation (RAG) system for report +generation. By modeling relationships between visual features and clinical +concepts, we create interpretable concept vectors that guide a multi-agent RAG +system to generate radiology reports, enhancing clinical relevance, +explainability, and transparency. Evaluation of the generated reports using an +LLM-as-a-judge confirmed the interpretability and clinical utility of our +model's outputs. On the COVID-QU dataset, our model achieved 81% classification +accuracy and demonstrated robust report generation performance, with five key +metrics ranging between 84% and 90%. This interpretable multi-agent framework +bridges the gap between high-performance AI and the explainability required for +reliable AI-driven CXR analysis in clinical settings. Our code is available at +https://github.com/tifat58/IRR-with-CBM-RAG.git. 
-##### **SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features** -2502.14786v1 by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai +摘要:深度學習已提升醫學影像分類,但可解釋性挑戰阻礙其臨床應用。本研究透過使用概念瓶頸模型 (CBM) 和多代理檢索增強生成 (RAG) 系統進行報告生成,來增強胸部 X 光 (CXR) 分類的可解釋性。透過建模視覺特徵與臨床概念之間的關係,我們建立可解釋的概念向量,引導多代理 RAG 系統生成放射報告,增強臨床相關性、可解釋性和透明度。使用 LLM 作為評審員對生成報告進行評估,確認了我們模型輸出的可解釋性和臨床效用。在 COVID-QU 資料集上,我們的模型達到了 81% 的分類準確率,並展示了穩健的報告生成效能,五項關鍵指標介於 84% 至 90% 之間。這個可解釋的多代理架構彌合了高性能 AI 與臨床環境中可靠的 AI 驅動 CXR 分析所需的解釋性之間的差距。我們的程式碼可於 https://github.com/tifat58/IRR-with-CBM-RAG.git 取得。 -We introduce SigLIP 2, a family of new multilingual vision-language encoders -that build on the success of the original SigLIP. In this second iteration, we -extend the original image-text training objective with several prior, -independently developed techniques into a unified recipe -- this includes -captioning-based pretraining, self-supervised losses (self-distillation, masked -prediction) and online data curation. With these changes, SigLIP 2 models -outperform their SigLIP counterparts at all model scales in core capabilities, -including zero-shot classification, image-text retrieval, and transfer -performance when extracting visual representations for Vision-Language Models -(VLMs). Furthermore, the new training recipe leads to significant improvements -on localization and dense prediction tasks. We also train variants which -support multiple resolutions and preserve the input's native aspect ratio. -Finally, we train on a more diverse data-mixture that includes de-biasing -techniques, leading to much better multilingual understanding and improved -fairness. 
To allow users to trade off inference cost with performance, we -release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), -and g (1B). +##### **Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models** +2412.15748v1 by Shamus Sim, Tyrone Chen -摘要:我們推出了 SigLIP 2,這是一個新的多語言視覺語言編碼器系列,它建立在 SigLIP 的成功基礎上。在這個第二個版本中,我們將原來的圖像文字訓練目標與幾個先前獨立開發的技術擴展到一個統一的配方中,其中包括基於標題的預訓練、自我監督損失(自我蒸餾、遮罩預測)和線上數據策展。有了這些改變,SigLIP 2 模型在所有模型規模上都超越了 SigLIP 的對應模型,包括零次分類、圖像文字檢索和在為視覺語言模型 (VLM) 提取視覺表示時傳輸效能。此外,新的訓練配方也大幅改善了定位和密集預測任務。我們還訓練了支援多種解析度和保留輸入原生長寬比的變體。最後,我們在一個更為多樣化的數據組合上進行訓練,其中包括去偏見技術,從而大幅提升多語言理解力並改善公平性。為了讓使用者權衡推理成本與效能,我們發布了四種大小的模型檢查點:ViT-B (86M)、L (303M)、So400m (400M) 和 g (1B)。 +Background: Despite the current ubiquity of Large Language Models (LLMs) +across the medical domain, there is a surprising lack of studies which address +their reasoning behaviour. We emphasise the importance of understanding +reasoning behaviour as opposed to high-level prediction accuracies, since it is +equivalent to explainable AI (XAI) in this context. In particular, achieving +XAI in medical LLMs used in the clinical domain will have a significant impact +across the healthcare sector. Results: Therefore, we define the concept of +reasoning behaviour in the specific context of medical LLMs. We then categorise +and discuss the current state of the art of methods which evaluate reasoning +behaviour in medical LLMs. Finally, we propose theoretical frameworks which can +empower medical professionals or machine learning engineers to gain insight +into the low-level reasoning operations of these previously obscure models. 
+Conclusion: The subsequent increased transparency and trust in medical machine +learning models by clinicians as well as patients will accelerate the +integration, application as well as further development of medical AI for the +healthcare system as a whole -##### **ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting** -2502.14780v1 by Abhijit Mishra, Richard Noh, Hsiang Fu, Mingda Li, Minji Kim +摘要:背景:儘管大型語言模型 (LLM) 目前在醫療領域無所不在,但令人驚訝的是,探討其推理行為的研究卻相當缺乏。我們強調了解推理行為而非高層級的預測準確度非常重要,因為在這種情況下,這等同於可解釋 AI (XAI)。尤其是在臨床領域中使用的醫療 LLM 中實現 XAI,將對整個醫療保健產業產生重大影響。結果:因此,我們在醫療 LLM 的特定背景下定義了推理行為的概念。接著我們分類並探討當前評估醫療 LLM 中推理行為的方法的最新技術。最後,我們提出理論架構,讓醫療專業人員或機器學習工程師得以深入了解這些先前模糊模型的低層級推理運算。結論:臨床醫生和患者對醫療機器學習模型的透明度和信任度隨之提升,將加速醫療 AI 在整個醫療保健系統中的整合、應用和進一步發展。 -Efficient and privacy-preserving multimodal interaction is essential as AR, -VR, and modern smartphones with powerful cameras become primary interfaces for -human-computer communication. Existing powerful large vision-language models -(VLMs) enabling multimodal interaction often rely on cloud-based processing, -raising significant concerns about (1) visual privacy by transmitting sensitive -vision data to servers, and (2) their limited real-time, on-device usability. -This paper explores Visual Instruction Rewriting, a novel approach that -transforms multimodal instructions into text-only commands, allowing seamless -integration of lightweight on-device instruction rewriter VLMs (250M -parameters) with existing conversational AI systems, enhancing vision data -privacy. To achieve this, we present a dataset of over 39,000 examples across -14 domains and develop a compact VLM, pretrained on image captioning datasets -and fine-tuned for instruction rewriting. 
Experimental results, evaluated -through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic -parsing analysis, demonstrate that even a quantized version of the model -(<500MB storage footprint) can achieve effective instruction rewriting, thus -enabling privacy-focused, multimodal AI applications. +##### **Cognition Chain for Explainable Psychological Stress Detection on Social Media** +2412.14009v1 by Xin Wang, Boyan Gao, Yi Dai, Lei Cao, Liang Zhao, Yibo Yang, David Clifton -摘要:高效且重視隱私的多模態互動至關重要,因為 AR、VR 和配備強大相機的現代智慧型手機已成為人機溝通的主要介面。現有的強大大型視覺語言模型 (VLM) 能支援多模態互動,通常仰賴雲端處理,這引發了重大的疑慮,包括:(1) 將敏感的視覺資料傳輸至伺服器,會造成視覺隱私問題,以及 (2) 其有限的即時、裝置上可用性。本文探討視覺指令改寫,這是一種新穎的方法,可將多模態指令轉換為純文字指令,讓輕量級的裝置上指令改寫 VLM (250M 參數) 與現有的對話式 AI 系統無縫整合,進而強化視覺資料的隱私。為達成此目標,我們提供一個跨越 14 個領域、超過 39,000 個範例的資料集,並開發一個精簡的 VLM,在圖片標題資料集上進行預訓練,並針對指令改寫進行微調。實驗結果透過 NLG 指標(例如 BLEU、METEOR 和 ROUGE)以及語意解析分析進行評估,證明即使是模型的量化版本(<500MB 儲存空間佔用量)也能有效執行指令改寫,進而支援注重隱私的多模態 AI 應用程式。 +Stress is a pervasive global health issue that can lead to severe mental +health problems. Early detection offers timely intervention and prevention of +stress-related disorders. The current early detection models perform "black +box" inference suffering from limited explainability and trust which blocks the +real-world clinical application. Thanks to the generative properties introduced +by the Large Language Models (LLMs), the decision and the prediction from such +models are semi-interpretable through the corresponding description. However, +the existing LLMs are mostly trained for general purposes without the guidance +of psychological cognitive theory. To this end, we first highlight the +importance of prior theory with the observation of performance boosted by the +chain-of-thoughts tailored for stress detection. 
This method termed Cognition +Chain explicates the generation of stress through a step-by-step cognitive +perspective based on cognitive appraisal theory with a progress pipeline: +Stimulus $\rightarrow$ Evaluation $\rightarrow$ Reaction $\rightarrow$ Stress +State, guiding LLMs to provide comprehensive reasoning explanations. We further +study the benefits brought by the proposed Cognition Chain format by utilising +it as a synthetic dataset generation template for LLMs instruction-tuning and +introduce CogInstruct, an instruction-tuning dataset for stress detection. This +dataset is developed using a three-stage self-reflective annotation pipeline +that enables LLMs to autonomously generate and refine instructional data. By +instruction-tuning Llama3 with CogInstruct, we develop CogLLM, an explainable +stress detection model. Evaluations demonstrate that CogLLM achieves +outstanding performance while enhancing explainability. Our work contributes a +novel approach by integrating cognitive theories into LLM reasoning processes, +offering a promising direction for future explainable AI research. 
-##### **Harnessing PDF Data for Improving Japanese Large Multimodal Models** -2502.14778v1 by Jeonghun Baek, Akiko Aizawa, Kiyoharu Aizawa +摘要:壓力是一個普遍的全球性健康問題,可能會導致嚴重的精神 +健康問題。早期發現提供及時的干預和預防 +壓力相關疾病。目前的早期發現模型執行「黑 +盒子」推論,存在可解釋性和信任度有限的問題,阻礙了 +現實世界的臨床應用。多虧了大型語言模型 (LLM) 引入的生成屬性,此類 +模型的決策和預測通過對應描述具有半可解釋性。然而, +現有的 LLM 主要針對一般用途進行訓練,沒有心理認知理論的指導。為此,我們首先強調 +先驗理論的重要性,並觀察到針對壓力檢測量身定制的思想鏈提升了性能。這種方法稱為認知 +鏈通過基於認知評估理論的循序漸進的認知視角闡明了壓力的產生,並具有進度管道: +刺激 $\rightarrow$ 評估 $\rightarrow$ 反應 $\rightarrow$ 壓力 +狀態,指導 LLM 提供全面的推理解釋。我們進一步 +通過將其用作 LLM 指令調整的合成數據集生成模板來研究所提出的認知鏈格式帶來的優點,並介紹 CogInstruct,這是一個針對壓力檢測的指令調整數據集。這個 +數據集是使用一個三階段的自省標註管道開發的,使 LLM 能夠自主生成和優化指令數據。通過 +使用 CogInstruct 對 Llama3 進行指令調整,我們開發了 CogLLM,這是一個可解釋的 +壓力檢測模型。評估表明,CogLLM 在提高可解釋性的同時實現了出色的性能。我們的研究通過將認知理論整合到 LLM 推理過程中,提出了一種新穎的方法, +為未來的可解釋人工智能研究提供了一個有希望的方向。 -Large Multimodal Models (LMMs) have demonstrated strong performance in -English, but their effectiveness in Japanese remains limited due to the lack of -high-quality training data. Current Japanese LMMs often rely on translated -English datasets, restricting their ability to capture Japan-specific cultural -knowledge. To address this, we explore the potential of Japanese PDF data as a -training resource, an area that remains largely underutilized. We introduce a -fully automated pipeline that leverages pretrained models to extract image-text -pairs from PDFs through layout analysis, OCR, and vision-language pairing, -removing the need for manual annotation. Additionally, we construct instruction -data from extracted image-text pairs to enrich the training data. To evaluate -the effectiveness of PDF-derived data, we train Japanese LMMs and assess their -performance on the Japanese LMM Benchmark. Our results demonstrate substantial -improvements, with performance gains ranging from 3.9% to 13.8% on Heron-Bench. 
-Further analysis highlights the impact of PDF-derived data on various factors, -such as model size and language models, reinforcing its value as a multimodal -resource for Japanese LMMs. We plan to make the source code and data publicly -available upon acceptance. +##### **2-Factor Retrieval for Improved Human-AI Decision Making in Radiology** +2412.00372v1 by Jim Solomon, Laleh Jalilian, Alexander Vilesov, Meryl Mathew, Tristan Grogan, Arash Bedayat, Achuta Kadambi + +Human-machine teaming in medical AI requires us to understand to what degree +a trained clinician should weigh AI predictions. While previous work has shown +the potential of AI assistance at improving clinical predictions, existing +clinical decision support systems either provide no explainability of their +predictions or use techniques like saliency and Shapley values, which do not +allow for physician-based verification. To address this gap, this study +compares previously used explainable AI techniques with a newly proposed +technique termed '2-factor retrieval (2FR)', which is a combination of +interface design and search retrieval that returns similarly labeled data +without processing this data. This results in a 2-factor security blanket +where: (a) correct images need to be retrieved by the AI; and (b) humans should +associate the retrieved images with the current pathology under test. We find +that when tested on chest X-ray diagnoses, 2FR leads to increases in clinician +accuracy, with particular improvements when clinicians are radiologists and +have low confidence in their decision. Our results highlight the importance of +understanding how different modes of human-AI decision making may impact +clinician accuracy in clinical decision support systems. 
-摘要:大型多模態模型 (LMM) 已在英語中表現出強勁的效能,但由於缺乏高品質的訓練資料,它們在日語中的效能仍然有限。目前的日語 LMM 通常依賴於翻譯後的英語資料集,限制了它們擷取特定於日本的文化知識的能力。為了解決這個問題,我們探索了日語 PDF 資料作為訓練資源的潛力,這個領域在很大程度上仍然未被充分利用。我們引入了一個全自動的管道,利用預先訓練好的模型透過版面分析、光學字元辨識和視覺語言配對從 PDF 中擷取影像文字對,消除了手動註解的需要。此外,我們從擷取的影像文字對中建構說明資料,以豐富訓練資料。為了評估 PDF 衍生資料的效能,我們訓練了日語 LMM,並在日語 LMM 基準上評估它們的效能。我們的結果證明了顯著的進步,在 Heron-Bench 上的效能提升幅度從 3.9% 到 13.8%。進一步的分析重點說明了 PDF 衍生資料對各種因素的影響,例如模型大小和語言模型,加強了其作為日語 LMM 的多模態資源的價值。我們計畫在接受後公開原始程式碼和資料。 +摘要:人機協作在醫療 AI 中,需要我們理解受過訓練的臨床醫生在多大程度上應重視 AI 預測。雖然先前的研究顯示 AI 輔助在改善臨床預測方面的潛力,但現有的臨床決策支援系統,要不就沒有提供預測的可解釋性,要不就是使用像顯著性和 Shapley 值之類的技術,這些技術不允許基於醫生的驗證。為了解決這個差距,本研究將先前使用的可解釋 AI 技術與一種新提出的稱為「2 因子檢索 (2FR)」的技術進行比較,後者是一種介面設計和搜尋檢索的組合,它會傳回標籤相似的資料,而不會處理這些資料。這會產生一個 2 因子安全機制,其中:(a) 正確的影像需要由 AI 檢索;(b) 人類應將檢索的影像與正在測試中的病理聯想起來。我們發現,當在胸部 X 光診斷上進行測試時,2FR 會提高臨床醫生的準確度,特別是在臨床醫生是放射科醫生且對其決策信心不足時,會有顯著的改善。我們的結果強調了理解人機決策的不同模式如何影響臨床醫生在臨床決策支援系統中的準確性的重要性。 -##### **Making Universal Policies Universal** -2502.14777v1 by Niklas Höpner, David Kuric, Herke van Hoof +##### **Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance** +2411.19356v1 by Philipp Brauner, Felix Glawe, Gian Luca Liehner, Luisa Vervier, Martina Ziefle -The development of a generalist agent capable of solving a wide range of -sequential decision-making tasks remains a significant challenge. We address -this problem in a cross-agent setup where agents share the same observation -space but differ in their action spaces. Our approach builds on the universal -policy framework, which decouples policy learning into two stages: a -diffusion-based planner that generates observation sequences and an inverse -dynamics model that assigns actions to these plans. We propose a method for -training the planner on a joint dataset composed of trajectories from all -agents. 
This method offers the benefit of positive transfer by pooling data -from different agents, while the primary challenge lies in adapting shared -plans to each agent's unique constraints. We evaluate our approach on the -BabyAI environment, covering tasks of varying complexity, and demonstrate -positive transfer across agents. Additionally, we examine the planner's -generalisation ability to unseen agents and compare our method to traditional -imitation learning approaches. By training on a pooled dataset from multiple -agents, our universal policy achieves an improvement of up to $42.20\%$ in task -completion accuracy compared to a policy trained on a dataset from a single -agent. +Understanding public perception of artificial intelligence (AI) and the +tradeoffs between potential risks and benefits is crucial, as these perceptions +might shape policy decisions, influence innovation trajectories for successful +market strategies, and determine individual and societal acceptance of AI +technologies. Using a representative sample of 1100 participants from Germany, +this study examines mental models of AI. Participants quantitatively evaluated +71 statements about AI's future capabilities (e.g., autonomous driving, medical +care, art, politics, warfare, and societal divides), assessing the expected +likelihood of occurrence, perceived risks, benefits, and overall value. We +present rankings of these projections alongside visual mappings illustrating +public risk-benefit tradeoffs. While many scenarios were deemed likely, +participants often associated them with high risks, limited benefits, and low +overall value. Across all scenarios, 96.4% ($r^2=96.4\%$) of the variance in +value assessment can be explained by perceived risks ($\beta=-.504$) and +perceived benefits ($\beta=+.710$), with no significant relation to expected +likelihood. 
Demographics and personality traits influenced perceptions of +risks, benefits, and overall evaluations, underscoring the importance of +increasing AI literacy and tailoring public information to diverse user needs. +These findings provide actionable insights for researchers, developers, and +policymakers by highlighting critical public concerns and individual factors +essential to align AI development with individual values. -摘要:開發一種能夠解決廣泛順序決策任務的通才代理仍然是一項重大挑戰。我們在跨代理設置中解決這個問題,其中代理共享相同的觀察空間,但在其動作空間中有所不同。我們的做法建立在通用策略框架之上,該框架將策略學習解耦為兩個階段:生成觀察序列的基於擴散的規劃器和將動作分配給這些計劃的逆動態模型。我們提出了一種在由所有代理的軌跡組成的聯合數據集上訓練規劃器的方法。這種方法提供了通過彙總來自不同代理的數據來進行正向傳輸的好處,而主要的挑戰在於將共享計劃適應於每個代理的唯一約束。我們在 BabyAI 環境中評估了我們的做法,涵蓋了不同複雜程度的任務,並展示了跨代理的正向傳輸。此外,我們檢查了規劃器對未見代理的概括能力,並將我們的做法與傳統的模仿學習方法進行了比較。通過在來自多個代理的彙總數據集上進行訓練,我們的通用策略在任務完成準確度方面實現了高達 42.20% 的改進,而從單個代理的數據集上訓練的策略。 +摘要:了解公眾對人工智慧 (AI) 的認知以及潛在風險與好處之間的權衡至關重要,因為這些認知可能會影響政策決策、影響成功市場策略的創新軌跡,並決定個人和社會對 AI 技術的接受度。本研究使用來自德國的 1100 名參與者的代表性樣本,探討了 AI 的心智模型。參與者對 71 項關於 AI 未來能力的陳述(例如,自動駕駛、醫療保健、藝術、政治、戰爭和社會分歧)進行了定量評估,評估預期的發生可能性、感知風險、好處和整體價值。我們展示了這些預測的排名,並附上視覺化映射,說明了公眾的風險收益權衡。儘管許多場景被認為是可能的,但參與者通常將它們與高風險、有限的好處和低整體價值聯繫起來。在所有場景中,96.4% ($r^2=96.4\%$) 的價值評估差異可以用感知風險 ($\beta=-.504$) 和感知好處 ($\beta=+.710$) 來解釋,與預期的可能性沒有顯著關係。人口統計和人格特質影響了對風險、好處和整體評估的看法,這凸顯了提高 AI 素養和根據不同的使用者需求調整公共資訊的重要性。這些發現通過強調關鍵的公共關注和與個人價值觀一致的 AI 開發必不可少的個人因素,為研究人員、開發人員和政策制定者提供了可行的見解。 -##### **SurveyX: Academic Survey Automation via Large Language Models** -2502.14776v1 by Xun Liang, Jiawei Yang, Yezhaohui Wang, Chen Tang, Zifan Zheng, Simin Niu, Shichao Song, Hanyu Wang, Bo Tang, Feiyu Xiong, Keming Mao, Zhiyu li +##### **Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset** +2411.17645v2 by Yujie Dai, Brian Sullivan, Axel Montout, Amy Dillon, Chris Waller, Peter Acs, Rachel Denholm, Philip Williams, Alastair D Hay, Raul Santos-Rodriguez, Andrew Dowsey -Large Language Models (LLMs) have demonstrated exceptional comprehension -capabilities and a vast knowledge base, 
suggesting that LLMs can serve as -efficient tools for automated survey generation. However, recent research -related to automated survey generation remains constrained by some critical -limitations like finite context window, lack of in-depth content discussion, -and absence of systematic evaluation frameworks. Inspired by human writing -processes, we propose SurveyX, an efficient and organized system for automated -survey generation that decomposes the survey composing process into two phases: -the Preparation and Generation phases. By innovatively introducing online -reference retrieval, a pre-processing method called AttributeTree, and a -re-polishing process, SurveyX significantly enhances the efficacy of survey -composition. Experimental evaluation results show that SurveyX outperforms -existing automated survey generation systems in content quality (0.259 -improvement) and citation quality (1.76 enhancement), approaching human expert -performance across multiple evaluation dimensions. Examples of surveys -generated by SurveyX are available on www.surveyx.cn +The use of machine learning and AI on electronic health records (EHRs) holds +substantial potential for clinical insight. However, this approach faces +challenges due to data heterogeneity, sparsity, temporal misalignment, and +limited labeled outcomes. In this context, we leverage a linked EHR dataset of +approximately one million de-identified individuals from Bristol, North +Somerset, and South Gloucestershire, UK, to characterize urinary tract +infections (UTIs). We implemented a data pre-processing and curation pipeline +that transforms the raw EHR data into a structured format suitable for +developing predictive models focused on data fairness, accountability and +transparency. Given the limited availability and biases of ground truth UTI +outcomes, we introduce a UTI risk estimation framework informed by clinical +expertise to estimate UTI risk across individual patient timelines. 
Pairwise +XGBoost models are trained using this framework to differentiate UTI risk +categories with explainable AI techniques applied to identify key predictors +and support interpretability. Our findings reveal differences in clinical and +demographic predictors across risk groups. While this study highlights the +potential of AI-driven insights to support UTI clinical decision-making, +further investigation of patient sub-strata and extensive validation are needed +to ensure robustness and applicability in clinical practice. -摘要:大型語言模型 (LLM) 已展現出卓越的理解能力和廣泛的知識庫,表示 LLM 可作為自動調查生成的有用工具。然而,與自動調查生成相關的最新研究仍受到一些關鍵限制的約束,例如有限的上下文視窗、缺乏深入的內容討論以及系統評估架構的缺失。受到人類寫作過程的啟發,我們提出 SurveyX,這是一個用於自動調查生成的有效且有組織的系統,它將調查組成過程分解為兩個階段:準備和生成階段。透過創新地引入線上參考檢索、一種稱為 AttributeTree 的預處理方法和重新潤飾過程,SurveyX 大幅提升了調查組成的效能。實驗評估結果顯示,SurveyX 在內容品質(提升 0.259)和引用品質(提升 1.76)方面優於現有的自動調查生成系統,在多個評估面向中接近人類專家的表現。由 SurveyX 生成的調查範例可在 www.surveyx.cn 取得 +摘要:電子健康紀錄 (EHR) 中機器學習和 AI 的使用對於臨床見解具有相當大的潛力。然而,由於資料異質性、稀疏性、時間錯位和標籤結果有限,此方法面臨挑戰。在此背景下,我們利用來自英國布里斯托、北薩默塞特和南格洛斯特郡約一百萬名去識別個人連結的 EHR 資料集,來描述尿路感染 (UTI)。我們實施了將原始 EHR 資料轉換為結構化格式的資料前處理和整理管線,適合開發專注於資料公平性、問責制和透明度的預測模型。鑑於 UTI 真實結果的可用性有限和偏差,我們引入了由臨床專業知識告知的 UTI 風險評估架構,以估計個別患者時間軸上的 UTI 風險。成對的 XGBoost 模型使用此架構進行訓練,以區分 UTI 風險類別,並應用可解釋的 AI 技術來識別關鍵預測因子並支持可解釋性。我們的研究結果揭示了不同風險群組在臨床和人口統計預測因子上的差異。雖然這項研究強調了 AI 驅動見解在支援 UTI 臨床決策制定方面的潛力,但仍需要進一步調查患者子群體和廣泛驗證,以確保在臨床實務中的穩健性和適用性。 -##### **Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning** -2502.14768v1 by Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, Chong Luo +##### **Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care** +2411.11774v1 by Jeffrey N. Clark, Matthew Wragg, Emily Nielsen, Miquel Perello-Nieto, Nawid Keshtmand, Michael Ambler, Shiv Sharma, Christopher P. 
Bourdeaux, Amberly Brigden, Raul Santos-Rodriguez -Inspired by the success of DeepSeek-R1, we explore the potential of -rule-based reinforcement learning (RL) in large reasoning models. To analyze -reasoning dynamics, we use synthetic logic puzzles as training data due to -their controllable complexity and straightforward answer verification. We make -some key technical contributions that lead to effective and stable RL training: -a system prompt that emphasizes the thinking and answering process, a stringent -format reward function that penalizes outputs for taking shortcuts, and a -straightforward training recipe that achieves stable convergence. Our 7B model -develops advanced reasoning skills-such as reflection, verification, and -summarization-that are absent from the logic corpus. Remarkably, after training -on just 5K logic problems, it demonstrates generalization abilities to the -challenging math benchmarks AIME and AMC. +There is a growing need to understand how digital systems can support +clinical decision-making, particularly as artificial intelligence (AI) models +become increasingly complex and less human-interpretable. This complexity +raises concerns about trustworthiness, impacting safe and effective adoption of +such technologies. Improved understanding of decision-making processes and +requirements for explanations coming from decision support tools is a vital +component in providing effective explainable solutions. This is particularly +relevant in the data-intensive, fast-paced environments of intensive care units +(ICUs). To explore these issues, group interviews were conducted with seven ICU +clinicians, representing various roles and experience levels. Thematic analysis +revealed three core themes: (T1) ICU decision-making relies on a wide range of +factors, (T2) the complexity of patient state is challenging for shared +decision-making, and (T3) requirements and capabilities of AI decision support +systems. 
We include design recommendations from clinical input, providing +insights to inform future AI systems for intensive care. -摘要:在 DeepSeek-R1 成功案例的启发下,我们探索了基于规则的强化学习 (RL) 在大型推理模型中的潜力。为了分析推理动态,我们使用合成逻辑难题作为训练数据,因为它们的可控复杂性和直接的答案验证。我们做出了一些关键的技术贡献,这些贡献导致了有效且稳定的 RL 训练:一个强调思考和回答过程的系统提示、一个严格的格式奖励函数,用于惩罚采取捷径的输出,以及一个实现稳定收敛的直接训练配方。我们的 7B 模型发展了高级推理技能,例如反射、验证和总结,这些技能在逻辑语料库中是不存在的。值得注意的是,在仅对 5K 个逻辑问题进行训练后,它展示了对具有挑战性的数学基准 AIME 和 AMC 的泛化能力。 +摘要:隨著人工智慧 (AI) 模型變得越來越複雜,且越來越難以被人理解,了解數位系統如何支援臨床決策的需求也日益增加。這種複雜性引發了對可信度的疑慮,影響了此類技術的安全且有效採用。改善對決策制定流程的理解,以及對決策支援工具所提供說明的要求,是提供有效可解釋解決方案的重要組成部分。這在資料密集、快節奏的加護病房 (ICU) 環境中特別相關。為了探討這些問題,對七位 ICU 臨床醫師進行了小組訪談,這些醫師代表了不同的角色和經驗層級。主題分析揭露了三個核心主題:(T1) ICU 決策制定依賴於廣泛的因素,(T2) 病患狀態的複雜性對共同決策制定構成挑戰,以及 (T3) AI 決策支援系統的要求和能力。我們納入了臨床輸入的設計建議,提供見解以提供資訊給未來用於加護的 AI 系統。 -##### **Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis** -2502.14767v1 by Priyanka Kargupta, Ishika Agarwal, Tal August, Jiawei Han +##### **Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning** +2411.10255v1 by Mohammed Yaseen Jabarulla, Theodor Uden, Thomas Jack, Philipp Beerbaum, Steffen Oeltze-Jafra -With the exponential growth of research facilitated by modern technology and -improved accessibility, scientific discoveries have become increasingly -fragmented within and across fields. This makes it challenging to assess the -significance, novelty, incremental findings, and equivalent ideas between -related works, particularly those from different research communities. Large -language models (LLMs) have recently demonstrated strong quantitative and -qualitative reasoning abilities, and multi-agent LLM debates have shown promise -in handling complex reasoning tasks by exploring diverse perspectives and -reasoning paths. 
Inspired by this, we introduce Tree-of-Debate (ToD), a -framework which converts scientific papers into LLM personas that debate their -respective novelties. To emphasize structured, critical reasoning rather than -focusing solely on outcomes, ToD dynamically constructs a debate tree, enabling -fine-grained analysis of independent novelty arguments within scholarly -articles. Through experiments on scientific literature across various domains, -evaluated by expert researchers, we demonstrate that ToD generates informative -arguments, effectively contrasts papers, and supports researchers in their -literature review. +Pediatric heart diseases present a broad spectrum of congenital and acquired +diseases. More complex congenital malformations require a differentiated and +multimodal decision-making process, usually including echocardiography as a +central imaging method. Artificial intelligence (AI) offers considerable +promise for clinicians by facilitating automated interpretation of pediatric +echocardiography data. However, adapting AI technologies for pediatric +echocardiography analysis has challenges such as limited public data +availability, data privacy, and AI model transparency. Recently, researchers +have focused on disruptive technologies, such as federated learning (FL) and +explainable AI (XAI), to improve automatic diagnostic and decision support +workflows. This study offers a comprehensive overview of the limitations and +opportunities of AI in pediatric echocardiography, emphasizing the synergistic +workflow and role of XAI and FL, identifying research gaps, and exploring +potential future developments. Additionally, three relevant clinical use cases +demonstrate the functionality of XAI and FL with a focus on (i) view +recognition, (ii) disease classification, (iii) segmentation of cardiac +structures, and (iv) quantitative assessment of cardiac function. 
-摘要:隨著現代科技促進的研究呈指數成長,加上可近性的提升,科學發現已在各領域內外變得越來越分散。這使得評估相關作品之間的重要性、新穎性、漸進式發現和等價概念變得具有挑戰性,特別是來自不同研究社群的作品。大型語言模型 (LLM) 近期已展現出強大的量化和質化推理能力,而多重代理 LLM 辯論已在處理複雜推理任務方面展現出潛力,方法是探索不同的觀點和推理路徑。受到此啟發,我們引入了辯論樹 (ToD),這是一個將科學論文轉換為 LLM 人格的架構,這些人格會辯論各自的新穎性。為了強調結構化、批判性推理,而非僅專注於結果,ToD 會動態建構一個辯論樹,讓使用者能夠深入分析學術文章中獨立的新穎性論點。透過在不同領域的科學文獻上進行實驗,並由專家研究員進行評估,我們證明了 ToD 能產生有見地的論點、有效對比論文,並在研究人員的文獻回顧中提供協助。 +摘要:小兒心臟疾病呈現先天性與後天性疾病的廣泛光譜。較複雜的先天性畸形需要一個差異化且多模式的決策過程,通常包括超音波檢查作為主要的影像方法。人工智慧 (AI) 為臨床醫生提供了相當大的希望,因為它可以促進小兒超音波檢查資料的自動化解讀。然而,將人工智慧技術應用於小兒超音波檢查分析有許多挑戰,例如有限的公開資料可用性、資料隱私和人工智慧模型透明度。最近,研究人員專注於破壞性技術,例如聯合學習 (FL) 和可解釋人工智慧 (XAI),以改善自動診斷和決策支援工作流程。本研究提供了人工智慧在小兒超音波檢查中的限制和機會的全面概述,強調了 XAI 和 FL 的協同工作流程和角色,找出研究差距並探討潛在的未來發展。此外,三個相關的臨床使用案例展示了 XAI 和 FL 的功能,重點在於 (i) 檢視辨識、(ii) 疾病分類、(iii) 心臟結構分割和 (iv) 心臟功能的量化評估。 -##### **Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning** -2502.14765v1 by Juraj Vladika, Ivana Hacajová, Florian Matthes +##### **Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering** +2411.00916v2 by Mehdi Hosseini Chagahi, Saeed Mohammadi Dashtaki, Niloufar Delfan, Nadia Mohammadi, Alireza Samari, Behzad Moshiri, Md. Jalil Piran, Oliver Faust -Fact verification (FV) aims to assess the veracity of a claim based on -relevant evidence. The traditional approach for automated FV includes a -three-part pipeline relying on short evidence snippets and encoder-only -inference models. More recent approaches leverage the multi-turn nature of LLMs -to address FV as a step-by-step problem where questions inquiring additional -context are generated and answered until there is enough information to make a -decision. This iterative method makes the verification process rational and -explainable. While these methods have been tested for encyclopedic claims, -exploration on domain-specific and realistic claims is missing. 
In this work, -we apply an iterative FV system on three medical fact-checking datasets and -evaluate it with multiple settings, including different LLMs, external web -search, and structured reasoning using logic predicates. We demonstrate -improvements in the final performance over traditional approaches and the high -potential of step-by-step FV systems for domain-specific claims. +Osteoporosis is a common condition that increases fracture risk, especially +in older adults. Early diagnosis is vital for preventing fractures, reducing +treatment costs, and preserving mobility. However, healthcare providers face +challenges like limited labeled data and difficulties in processing medical +images. This study presents a novel multi-modal learning framework that +integrates clinical and imaging data to improve diagnostic accuracy and model +interpretability. The model utilizes three pre-trained networks-VGG19, +InceptionV3, and ResNet50-to extract deep features from X-ray images. These +features are transformed using PCA to reduce dimensionality and focus on the +most relevant components. A clustering-based selection process identifies the +most representative components, which are then combined with preprocessed +clinical data and processed through a fully connected network (FCN) for final +classification. A feature importance plot highlights key variables, showing +that Medical History, BMI, and Height were the main contributors, emphasizing +the significance of patient-specific data. While imaging features were +valuable, they had lower importance, indicating that clinical data are crucial +for accurate predictions. This framework promotes precise and interpretable +predictions, enhancing transparency and building trust in AI-driven diagnoses +for clinical integration. 
-摘要:事實驗證 (FV) 旨在根據相關證據評估主張的真實性。自動化 FV 的傳統方法包括依賴於短證據片段和僅編碼器推論模型的三部分管道。最近的方法利用 LLM 的多輪特性,將 FV 視為一個逐步問題,其中會產生問題來詢問額外背景並回答,直到有足夠的資訊可以做出決定。這種迭代方法使驗證過程合理且可解釋。雖然這些方法已針對百科全書式主張進行測試,但缺乏對特定領域和現實主張的探討。在這項工作中,我們在三個醫學事實查核資料集上應用了一個迭代 FV 系統,並使用多種設定對其進行評估,包括不同的 LLM、外部網路搜尋和使用邏輯謂詞的結構化推理。我們展示了傳統方法的最終效能改進,以及逐步 FV 系統對特定領域主張的高潛力。 +摘要:骨質疏鬆症是一種常見的疾病,會增加骨折的風險,特別是老年人。早期診斷對於預防骨折、降低治療成本和維持行動能力至關重要。然而,醫療保健提供者面臨著標記數據有限和處理醫學影像困難等挑戰。本研究提出了一個新穎的多模式學習框架,該框架整合了臨床和影像數據,以提高診斷準確性和模型可解釋性。該模型利用三個預訓練的網路,VGG19、InceptionV3 和 ResNet50,從 X 射線影像中提取深度特徵。這些特徵使用 PCA 轉換以降低維度並專注於最相關的組成部分。基於聚類的選擇過程識別出最具代表性的組成部分,然後將這些組成部分與預處理的臨床數據結合,並通過全連接網路 (FCN) 進行最終分類。特徵重要性圖突出了關鍵變數,表明病史、BMI 和身高是主要貢獻因素,強調了患者特定數據的重要性。雖然影像特徵很有價值,但它們的重要性較低,這表明臨床數據對於準確預測至關重要。此框架促进了準確且可解釋的預測,提高了透明度,並建立了對 AI 驅動診斷在臨床整合中的信任。 -##### **EquivaMap: Leveraging LLMs for Automatic Equivalence Checking of Optimization Formulations** -2502.14760v1 by Haotian Zhai, Connor Lawless, Ellen Vitercik, Liu Leqi +##### **A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection** +2410.19898v1 by Muath Alsuhaibani, Ali Pourramezan Fard, Jian Sun, Farida Far Poor, Peter S. Pressman, Mohammad H. Mahoor -A fundamental problem in combinatorial optimization is identifying equivalent -formulations, which can lead to more efficient solution strategies and deeper -insights into a problem's computational complexity. The need to automatically -identify equivalence between problem formulations has grown as optimization -copilots--systems that generate problem formulations from natural language -descriptions--have proliferated. However, existing approaches to checking -formulation equivalence lack grounding, relying on simple heuristics which are -insufficient for rigorous validation. Inspired by Karp reductions, in this work -we introduce quasi-Karp equivalence, a formal criterion for determining when -two optimization formulations are equivalent based on the existence of a -mapping between their decision variables. 
We propose EquivaMap, a framework -that leverages large language models to automatically discover such mappings, -enabling scalable and reliable equivalence verification. To evaluate our -approach, we construct the first open-source dataset of equivalent optimization -formulations, generated by applying transformations such as adding slack -variables or valid inequalities to existing formulations. Empirically, -EquivaMap significantly outperforms existing methods, achieving substantial -improvements in correctly identifying formulation equivalence. +This review paper explores recent advances in deep learning approaches for +non-invasive cognitive impairment detection. We examine various non-invasive +indicators of cognitive decline, including speech and language, facial, and +motoric mobility. The paper provides an overview of relevant datasets, +feature-extracting techniques, and deep-learning architectures applied to this +domain. We have analyzed the performance of different methods across modalities +and observed that speech and language-based methods generally achieved the +highest detection performance. Studies combining acoustic and linguistic +features tended to outperform those using a single modality. Facial analysis +methods showed promise for visual modalities but were less extensively studied. +Most papers focused on binary classification (impaired vs. non-impaired), with +fewer addressing multi-class or regression tasks. Transfer learning and +pre-trained language models emerged as popular and effective techniques, +especially for linguistic analysis. Despite significant progress, several +challenges remain, including data standardization and accessibility, model +explainability, longitudinal analysis limitations, and clinical adaptation. 
+Lastly, we propose future research directions, such as investigating +language-agnostic speech analysis methods, developing multi-modal diagnostic +systems, and addressing ethical considerations in AI-assisted healthcare. By +synthesizing current trends and identifying key obstacles, this review aims to +guide further development of deep learning-based cognitive impairment detection +systems to improve early diagnosis and ultimately patient outcomes. -摘要:組合優化中的基本問題在於識別等效公式,這可能導致更有效的解決策略,並更深入地了解問題的計算複雜性。隨著優化輔助系統(從自然語言描述中產生問題公式的系統)的普及,自動識別問題公式之間等價性的需求也隨之增加。然而,現有的公式等價性檢查方法缺乏依據,依賴於簡單的啟發法,而這對於嚴格驗證來說是不夠的。受 Karp 遞減啟發,我們在這項工作中引入了準 Karp 等價性,這是一個正式標準,用於根據決策變數之間的映射存在性來確定兩個優化公式何時等效。我們提出了 EquivaMap,一個利用大型語言模型自動發現此類映射的框架,實現可擴充且可靠的等價性驗證。為了評估我們的做法,我們構建了第一個等效優化公式的開源資料集,該資料集是通過對現有公式套用轉換(例如添加鬆弛變數或有效不等式)產生的。根據經驗,EquivaMap 明顯優於現有方法,在正確識別公式等價性方面取得了顯著進展。 +摘要:本篇評論探討了深度學習方法在非侵入式認知功能障礙檢測上的最新進展。我們檢視了各種非侵入式的認知衰退指標,包括語言和語言、面部和運動機能。本文概述了與此領域相關的資料集、特徵提取技術和深度學習架構。我們分析了不同方法在不同方式上的表現,並觀察到基於語言和語言的方法通常能達到最高的檢測表現。結合聲學和語言特徵的研究往往優於使用單一方式的研究。面部分析方法顯示出視覺方式的潛力,但研究較少。大多數論文專注於二元分類(受損與未受損),較少探討多類或回歸任務。遷移學習和預訓練語言模型已成為流行且有效的技術,特別是對於語言分析。儘管取得了重大進展,但仍存在一些挑戰,包括資料標準化和可及性、模型可解釋性、縱向分析限制和臨床適應性。最後,我們提出了未來的研究方向,例如調查與語言無關的語音分析方法、開發多模式診斷系統,以及解決人工智慧輔助醫療保健中的倫理考量。透過綜合目前的趨勢和找出關鍵障礙,本篇評論旨在引導深度學習為基礎的認知功能障礙檢測系統的進一步發展,以改善早期診斷,並最終改善患者的治療結果。 -##### **On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation Systems** -2502.14759v1 by Juraj Vladika, Florian Matthes +##### **An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems** +2410.17504v1 by Shruthi Chari -Retrieval-augmented generation (RAG) has emerged as an approach to augment -large language models (LLMs) by reducing their reliance on static knowledge and -improving answer factuality. RAG retrieves relevant context snippets and -generates an answer based on them. 
Despite its increasing industrial adoption, -systematic exploration of RAG components is lacking, particularly regarding the -ideal size of provided context, and the choice of base LLM and retrieval -method. To help guide development of robust RAG systems, we evaluate various -context sizes, BM25 and semantic search as retrievers, and eight base LLMs. -Moving away from the usual RAG evaluation with short answers, we explore the -more challenging long-form question answering in two domains, where a good -answer has to utilize the entire context. Our findings indicate that final QA -performance improves steadily with up to 15 snippets but stagnates or declines -beyond that. Finally, we show that different general-purpose LLMs excel in the -biomedical domain than the encyclopedic one, and that open-domain evidence -retrieval in large corpora is challenging. +Explainable Artificial Intelligence (AI) focuses on helping humans understand +the working of AI systems or their decisions and has been a cornerstone of AI +for decades. Recent research in explainability has focused on explaining the +workings of AI models or model explainability. There have also been several +position statements and review papers detailing the needs of end-users for +user-centered explainability but fewer implementations. Hence, this thesis +seeks to bridge some gaps between model and user-centered explainability. We +create an explanation ontology (EO) to represent literature-derived explanation +types via their supporting components. We implement a knowledge-augmented +question-answering (QA) pipeline to support contextual explanations in a +clinical setting. Finally, we are implementing a system to combine explanations +from different AI methods and data modalities. Within the EO, we can represent +fifteen different explanation types, and we have tested these representations +in six exemplar use cases. 
We find that knowledge augmentations improve the +performance of base large language models in the contextualized QA, and the +performance is variable across disease groups. In the same setting, clinicians +also indicated that they prefer to see actionability as one of the main foci in +explanations. In our explanations combination method, we plan to use similarity +metrics to determine the similarity of explanations in a chronic disease +detection setting. Overall, through this thesis, we design methods that can +support knowledge-enabled explanations across different use cases, accounting +for the methods in today's AI era that can generate the supporting components +of these explanations and domain knowledge sources that can enhance them. -摘要:檢索增強生成 (RAG) 已成為一種方法,可透過減少大型語言模型 (LLM) 對靜態知識的依賴,並改善答案的真實性,來增強大型語言模型 (LLM)。RAG 會擷取相關的內容片段,並根據這些片段產生答案。儘管其產業採用率不斷提高,但缺乏對 RAG 組成的系統性探討,特別是在提供的內容的理想大小,以及基礎 LLM 和檢索方法的選擇方面。為了協助引導穩健 RAG 系統的開發,我們評估了各種內容大小、BM25 和語意搜尋作為檢索器,以及八個基礎 LLM。我們不再使用簡短答案進行常見的 RAG 評估,而是探討在兩個領域中更具挑戰性的長篇問答,其中一個好的答案必須利用整個內容。我們的研究結果指出,最終的問答效能會隨著多達 15 個片段而穩定提升,但在超過這個數量後就會停滯或下降。最後,我們表明不同的通用 LLM 在生物醫學領域比百科全書領域更為出色,而且在大型語料庫中進行開放領域證據檢索具有挑戰性。 +摘要:可解釋人工智慧(AI)專注於協助人類了解 AI 系統運作或其決策,數十年來一直是 AI 的基石。最近的可解釋性研究專注於解釋 AI 模型或模型可解釋性的運作。也有幾份立場聲明和評論論文詳細說明了最終使用者對以使用者為中心的可解釋性的需求,但實作較少。因此,本論文旨在彌補模型和以使用者為中心的可解釋性之間的一些差距。我們建立一個解釋本體(EO)以透過其支援元件來表示從文獻中衍生的解釋類型。我們實作一個知識增強的問答(QA)管線,以在臨床環境中支援情境解釋。最後,我們正在實作一個系統,以結合來自不同 AI 方法和資料模式的解釋。在 EO 中,我們可以表示 15 種不同的解釋類型,並且我們已在六個範例使用案例中測試這些表示。我們發現,知識增強改善了基礎大型語言模型在情境化 QA 中的效能,並且效能因疾病群組而異。在相同的環境中,臨床醫生也表示他們希望將可操作性視為解釋中的主要焦點之一。在我們的解釋組合方法中,我們計畫使用相似性指標來確定慢性病偵測環境中解釋的相似性。總體而言,透過本論文,我們設計了可以在不同使用案例中支援知識啟用解釋的方法,考量到當今 AI 時代中可以產生這些解釋的支援元件和可以增強這些解釋的領域知識來源的方法。 -##### **MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders** -2502.14753v1 by Maya Varma, Ashwin Kumar, Rogier van der Sluijs, Sophie Ostmeier, Louis Blankemeier, Pierre Chambon, Christian Bluethgen, Jip Prince, Curtis Langlotz, Akshay Chaudhari +##### 
**Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study** +2410.16879v1 by Lukas Hughes-Noehrer, Leda Channer, Gabriel Strain, Gregory Yates, Richard Body, Caroline Jay -Medical images are acquired at high resolutions with large fields of view in -order to capture fine-grained features necessary for clinical decision-making. -Consequently, training deep learning models on medical images can incur large -computational costs. In this work, we address the challenge of downsizing -medical images in order to improve downstream computational efficiency while -preserving clinically-relevant features. We introduce MedVAE, a family of six -large-scale 2D and 3D autoencoders capable of encoding medical images as -downsized latent representations and decoding latent representations back to -high-resolution images. We train MedVAE autoencoders using a novel two-stage -training approach with 1,052,730 medical images. Across diverse tasks obtained -from 20 medical image datasets, we demonstrate that (1) utilizing MedVAE latent -representations in place of high-resolution images when training downstream -models can lead to efficiency benefits (up to 70x improvement in throughput) -while simultaneously preserving clinically-relevant features and (2) MedVAE can -decode latent representations back to high-resolution images with high -fidelity. Our work demonstrates that large-scale, generalizable autoencoders -can help address critical efficiency challenges in the medical domain. Our code -is available at https://github.com/StanfordMIMI/MedVAE. +Objectives: To investigate clinicians' attitudes towards current automated +interpretation of ECG and novel AI technologies and their perception of +computer-assisted interpretation. Materials and Methods: We conducted a series +of interviews with clinicians in the UK. 
Our study: (i) explores the potential +for AI, specifically future 'human-like' computing approaches, to facilitate +ECG interpretation and support clinical decision making, and (ii) elicits their +opinions about the importance of explainability and trustworthiness of AI +algorithms. Results: We performed inductive thematic analysis on interview +transcriptions from 23 clinicians and identified the following themes: (i) a +lack of trust in current systems, (ii) positive attitudes towards future AI +applications and requirements for these, (iii) the relationship between the +accuracy and explainability of algorithms, and (iv) opinions on education, +possible deskilling, and the impact of AI on clinical competencies. Discussion: +Clinicians do not trust current computerised methods, but welcome future 'AI' +technologies. Where clinicians trust future AI interpretation to be accurate, +they are less concerned that it is explainable. They also preferred ECG +interpretation that demonstrated the results of the algorithm visually. Whilst +clinicians do not fear job losses, they are concerned about deskilling and the +need to educate the workforce to use AI responsibly. Conclusion: Clinicians are +positive about the future application of AI in clinical decision-making. +Accuracy is a key factor of uptake and visualisations are preferred over +current computerised methods. This is viewed as a potential means of training +and upskilling, in contrast to the deskilling that automation might be +perceived to bring. 
-摘要:医学影像以高解析度和广阔的视野获取,以便捕捉临床决策所需的细微特征。因此,在医学影像上训练深度学习模型可能会产生巨大的计算成本。在这项工作中,我们解决了缩小医学影像以提高下游计算效率同时保留临床相关特征的挑战。我们介绍了 MedVAE,这是一个由六个大型 2D 和 3D 自动编码器组成的系列,能够将医学影像编码为缩小的潜在表示,并将潜在表示解码回高分辨率影像。我们使用一种新颖的两阶段训练方法,利用 1,052,730 张医学影像来训练 MedVAE 自动编码器。在从 20 个医学影像数据集获得的不同任务中,我们证明了 (1) 在训练下游模型时,利用 MedVAE 潜在表示代替高分辨率影像可以带来效率优势(吞吐量提高高达 70 倍),同时保留临床相关特征;(2) MedVAE 可以将潜在表示解码回高分辨率影像,且保真度高。我们的工作表明,大规模、可推广的自动编码器可以帮助解决医学领域的重大效率挑战。我们的代码可在 https://github.com/StanfordMIMI/MedVAE 获得。 +摘要:目的:調查臨床醫生對目前自動化心電圖解讀和新的人工智慧技術的態度,以及他們對電腦輔助解讀的看法。材料和方法:我們對英國的臨床醫生進行了一系列訪談。我們的研究:(i) 探討人工智慧的潛力,特別是未來的「類人類」運算方法,以促進心電圖解讀並支持臨床決策制定,以及 (ii) 徵求他們對人工智慧演算法的可解釋性和可信度的看法。結果:我們對 23 位臨床醫生的訪談記錄進行了歸納主題分析,並找出以下主題:(i) 對目前系統缺乏信任,(ii) 對未來人工智慧應用和對這些應用的要求持正面態度,(iii) 演算法的準確性和可解釋性之間的關係,以及 (iv) 對教育、可能的技能退化,以及人工智慧對臨床能力的影響的看法。討論:臨床醫生不信任目前的電腦化方法,但歡迎未來的「人工智慧」技術。在臨床醫生相信未來的 AI 解讀準確的情況下,他們不太擔心它是否可解釋。他們也比較喜歡能以視覺方式呈現演算法結果的心電圖解讀。雖然臨床醫生不害怕失業,但他們擔心技能退化,以及需要教育員工負責任地使用人工智慧。結論:臨床醫生對人工智慧在臨床決策制定中的未來應用持正面態度。準確性是採用人工智慧的一個關鍵因素,而視覺化比目前的電腦化方法更受青睞。這被視為一種潛在的培訓和提升技能的方法,與自動化可能帶來的技能退化形成對比。 -##### **TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators** -2502.14752v1 by Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, Zhiyuan Liu, Maosong Sun +##### **Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer** +2410.15012v1 by Gesa Mittmann, Sara Laiouar-Pedari, Hendrik A. Mehrtens, Sarah Haggenmüller, Tabea-Clara Bucher, Tirtha Chanda, Nadine T. Gaisa, Mathias Wagner, Gilbert Georg Klamminger, Tilman T. Rau, Christina Neppl, Eva Maria Compérat, Andreas Gocht, Monika Hämmerle, Niels J. Rupp, Jula Westhoff, Irene Krücken, Maximillian Seidl, Christian M. Schürch, Marcus Bauer, Wiebke Solass, Yu Chun Tam, Florian Weber, Rainer Grobholz, Jaroslaw Augustyniak, Thomas Kalinski, Christian Hörner, Kirsten D. 
Mertz, Constanze Döring, Andreas Erbersdobler, Gabriele Deubler, Felix Bremmer, Ulrich Sommer, Michael Brodhun, Jon Griffin, Maria Sarah L. Lenon, Kiril Trpkov, Liang Cheng, Fei Chen, Angelique Levi, Guoping Cai, Tri Q. Nguyen, Ali Amin, Alessia Cimadamore, Ahmed Shabaik, Varsha Manucha, Nazeel Ahmad, Nidia Messias, Francesca Sanguedolce, Diana Taheri, Ezra Baraban, Liwei Jia, Rajal B. Shah, Farshid Siadat, Nicole Swarbrick, Kyung Park, Oudai Hassan, Siamak Sakhaie, Michelle R. Downes, Hiroshi Miyamoto, Sean R. Williamson, Tim Holland-Letz, Carolin V. Schneider, Jakob Nikolas Kather, Yuri Tolkach, Titus J. Brinker -Triton, a high-level Python-like language designed for building efficient GPU -kernels, is widely adopted in deep learning frameworks due to its portability, -flexibility, and accessibility. However, programming and parallel optimization -still require considerable trial and error from Triton developers. Despite -advances in large language models (LLMs) for conventional code generation, -these models struggle to generate accurate, performance-optimized Triton code, -as they lack awareness of its specifications and the complexities of GPU -programming. More critically, there is an urgent need for systematic -evaluations tailored to Triton. In this work, we introduce TritonBench, the -first comprehensive benchmark for Triton operator generation. TritonBench -features two evaluation channels: a curated set of 184 real-world operators -from GitHub and a collection of operators aligned with PyTorch interfaces. -Unlike conventional code benchmarks prioritizing functional correctness, -TritonBench also profiles efficiency performance on widely deployed GPUs -aligned with industry applications. Our study reveals that current -state-of-the-art code LLMs struggle to generate efficient Triton operators, -highlighting a significant gap in high-performance code generation. TritonBench -will be available at https://github.com/thunlp/TritonBench. 
+The aggressiveness of prostate cancer, the most common cancer in men +worldwide, is primarily assessed based on histopathological data using the +Gleason scoring system. While artificial intelligence (AI) has shown promise in +accurately predicting Gleason scores, these predictions often lack inherent +explainability, potentially leading to distrust in human-machine interactions. +To address this issue, we introduce a novel dataset of 1,015 tissue microarray +core images, annotated by an international group of 54 pathologists. The +annotations provide detailed localized pattern descriptions for Gleason grading +in line with international guidelines. Utilizing this dataset, we develop an +inherently explainable AI system based on a U-Net architecture that provides +predictions leveraging pathologists' terminology. This approach circumvents +post-hoc explainability methods while maintaining or exceeding the performance +of methods trained directly for Gleason pattern segmentation (Dice score: 0.713 +$\pm$ 0.003 trained on explanations vs. 0.691 $\pm$ 0.010 trained on Gleason +patterns). By employing soft labels during training, we capture the intrinsic +uncertainty in the data, yielding strong results in Gleason pattern +segmentation even in the context of high interobserver variability. With the +release of this dataset, we aim to encourage further research into segmentation +in medical tasks with high levels of subjectivity and to advance the +understanding of pathologists' reasoning processes. 
-摘要:Triton 是一種高階的類 Python 語言,專門用於建構高效的 GPU 核心,由於其可移植性、靈活性及可存取性,已廣泛採用於深度學習框架中。然而,編程和並行最佳化仍需要 Triton 開發人員進行大量的試驗和錯誤。儘管大型語言模型 (LLM) 在傳統程式碼產生方面取得了進展,但這些模型在產生準確且效能最佳化的 Triton 程式碼時仍面臨困難,因為它們缺乏對其規格和 GPU 編程複雜性的認識。更重要的是,迫切需要針對 Triton 量身打造的系統性評估。在這項工作中,我們介紹 TritonBench,這是第一個針對 Triton 算子產生進行全面評比的基準。TritonBench 具有兩個評估管道:一組來自 GitHub 的 184 個真實世界算子,以及一組與 PyTorch 介面對齊的算子。與優先考慮功能正確性的傳統程式碼基準不同,TritonBench 還剖析了與產業應用對齊的廣泛部署 GPU 上的效能表現。我們的研究表明,目前最先進的程式碼 LLM 難以產生高效的 Triton 算子,突顯了高性能程式碼產生中的重大差距。TritonBench 將在 https://github.com/thunlp/TritonBench 提供。 +摘要:前列腺癌是全球男性最常見的癌症,其惡性程度主要根據 Gleason 評分系統使用組織病理學數據進行評估。雖然人工智慧 (AI) 在準確預測 Gleason 評分方面已展現潛力,但這些預測通常缺乏內在的可解釋性,可能會導致對人機互動的不信任。為了解決這個問題,我們引進了一個由 54 位病理學家組成的國際團隊註解的 1,015 個組織微陣列核心影像的新穎資料集。這些註解提供了詳細的局部模式描述,用於符合國際準則的 Gleason 分級。利用這個資料集,我們開發了一個基於 U-Net 架構的內在可解釋 AI 系統,該系統提供了利用病理學家術語進行預測。這種方法規避了事後可解釋性方法,同時維持或超越了直接訓練用於 Gleason 模式分割的方法的效能(Dice 分數:0.713 ± 0.003,訓練於解釋,相對於 0.691 ± 0.010,訓練於 Gleason 模式)。透過在訓練期間採用軟標籤,我們捕捉了資料中的內在不確定性,即使在觀察者間變異性高的情況下,也能在 Gleason 模式分割中產生強大的結果。透過釋出這個資料集,我們旨在鼓勵進一步研究主觀性高的醫療任務中的分割,並增進對病理學家推理過程的理解。 -##### **Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of LLMs** -2502.14748v1 by Zongxia Li, Lorena Calvo-Bartolomé, Alexander Hoyle, Paiheng Xu, Alden Dima, Juan Francisco Fung, Jordan Boyd-Graber +##### **Explainable AI Methods for Multi-Omics Analysis: A Survey** +2410.11910v1 by Ahmad Hussein, Mukesh Prasad, Ali Braytee -A common use of NLP is to facilitate the understanding of large document -collections, with a shift from using traditional topic models to Large Language -Models. Yet the effectiveness of using LLM for large corpus understanding in -real-world applications remains under-explored. This study measures the -knowledge users acquire with unsupervised, supervised LLM-based exploratory -approaches or traditional topic models on two datasets. 
While LLM-based methods -generate more human-readable topics and show higher average win probabilities -than traditional models for data exploration, they produce overly generic -topics for domain-specific datasets that do not easily allow users to learn -much about the documents. Adding human supervision to the LLM generation -process improves data exploration by mitigating hallucination and -over-genericity but requires greater human effort. In contrast, traditional -models like Latent Dirichlet Allocation (LDA) remain effective for exploration -but are less user-friendly. We show that LLMs struggle to describe the haystack -of large corpora without human help, particularly domain-specific data, and -face scaling and hallucination limitations due to context length constraints. -Dataset available at https://huggingface.co/datasets/zli12321/Bills. +Advancements in high-throughput technologies have led to a shift from +traditional hypothesis-driven methodologies to data-driven approaches. +Multi-omics refers to the integrative analysis of data derived from multiple +'omes', such as genomics, proteomics, transcriptomics, metabolomics, and +microbiomics. This approach enables a comprehensive understanding of biological +systems by capturing different layers of biological information. Deep learning +methods are increasingly utilized to integrate multi-omics data, offering +insights into molecular interactions and enhancing research into complex +diseases. However, these models, with their numerous interconnected layers and +nonlinear relationships, often function as black boxes, lacking transparency in +decision-making processes. To overcome this challenge, explainable artificial +intelligence (xAI) methods are crucial for creating transparent models that +allow clinicians to interpret and work with complex data more effectively.
This +review explores how xAI can improve the interpretability of deep learning +models in multi-omics research, highlighting its potential to provide +clinicians with clear insights, thereby facilitating the effective application +of such models in clinical settings. -摘要:NLP 的常見用途是促進對大型文件集合的理解,從使用傳統主題模型轉向大型語言模型。然而,在現實世界的應用中使用 LLM 了解大型語料庫的有效性仍未得到充分探索。本研究衡量了使用者在兩個資料集上使用無監督、監督的基於 LLM 的探索性方法或傳統主題模型獲得的知識。雖然基於 LLM 的方法會產生更多人類可讀的主題,並且顯示出比傳統模型更高的平均獲勝機率,但它們會為特定領域的資料集產生過於通用的主題,而這些主題不容易讓使用者對文件有深入了解。在 LLM 生成過程中加入人類監督可透過減輕幻覺和過度泛化來改善資料探索,但需要更多的人力。相反地,傳統模型(如潛在狄利克雷配置 (LDA))仍然有效於探索,但使用者友善度較低。我們表明,LLM 難以在沒有人類幫助的情況下描述大型語料庫的乾草堆,特別是特定領域的資料,並且會因上下文長度限制而面臨擴充性和幻覺限制。資料集可於 https://huggingface.co/datasets/zli12321/Bills 取得。 +摘要:高通量技術的進步導致從傳統的假設驅動方法轉變為資料驅動的方法。多組學是指整合分析來自多個「組學」的資料,例如基因組學、蛋白質組學、轉錄組學、代謝組學和微生物組學。此方法透過擷取生物資訊的不同層面,能全面了解生物系統。深度學習方法愈來愈常被用於整合多組學資料,提供分子交互作用的洞察力,並加強對複雜疾病的研究。然而,這些模型具有許多相互連接的層級和非線性關係,通常會像黑盒子一樣運作,缺乏決策過程的透明度。為了克服此挑戰,可解釋人工智慧 (xAI) 方法對於建立透明模型至關重要,讓臨床醫生可以更有效地解釋和處理複雜資料。此評論探討 xAI 如何能改善多組學研究中深度學習模型的可解釋性,強調其提供臨床醫生明確見解的潛力,進而促進此類模型在臨床環境中的有效應用。 -##### **HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States** -2502.14744v1 by Yilei Jiang, Xinyan Gao, Tianshuo Peng, Yingshui Tan, Xiaoyong Zhu, Bo Zheng, Xiangyu Yue +##### **Study on the Helpfulness of Explainable Artificial Intelligence** +2410.11896v1 by Tobias Labarta, Elizaveta Kulicheva, Ronja Froelian, Christian Geißler, Xenia Melman, Julian von Klitzing -The integration of additional modalities increases the susceptibility of -large vision-language models (LVLMs) to safety risks, such as jailbreak -attacks, compared to their language-only counterparts. While existing research -primarily focuses on post-hoc alignment techniques, the underlying safety -mechanisms within LVLMs remain largely unexplored. In this work, we -investigate whether LVLMs inherently encode safety-relevant signals within -their internal activations during inference.
Our findings reveal that LVLMs -exhibit distinct activation patterns when processing unsafe prompts, which can -be leveraged to detect and mitigate adversarial inputs without requiring -extensive fine-tuning. Building on this insight, we introduce HiddenDetect, a -novel tuning-free framework that harnesses internal model activations to -enhance safety. Experimental results show that HiddenDetect surpasses -state-of-the-art methods in detecting jailbreak attacks against LVLMs. By -utilizing intrinsic safety-aware patterns, our method provides an efficient and -scalable solution for strengthening LVLM robustness against multimodal threats. -Our code will be released publicly at -https://github.com/leigest519/HiddenDetect. +Explainable Artificial Intelligence (XAI) is essential for building advanced +machine learning-powered applications, especially in critical domains such as +medical diagnostics or autonomous driving. Legal, business, and ethical +requirements motivate using effective XAI, but the increasing number of +different methods makes it challenging to pick the right ones. Further, as +explanations are highly context-dependent, measuring the effectiveness of XAI +methods without users can only reveal a limited amount of information, +excluding human factors such as the ability to understand it. We propose to +evaluate XAI methods via the user's ability to successfully perform a proxy +task, designed such that a good performance is an indicator for the explanation +to provide helpful information. In other words, we address the helpfulness of +XAI for human decision-making. Further, a user study on state-of-the-art +methods was conducted, showing differences in their ability to generate trust +and skepticism and the ability to judge the rightfulness of an AI decision +correctly.
Based on the results, we highly recommend using and extending this +approach for more objective-based human-centered user studies to measure XAI +performance in an end-to-end fashion. -摘要:整合其他模态会增加大型视觉语言模型 (LVLMs) 对安全风险的敏感性,例如越狱攻击,与仅语言的对应模型相比。虽然现有的研究主要集中于事后对齐技术,但 LVLMs 内部的基本安全机制在很大程度上仍未得到探索。在这项工作中,我们调查了 LVLMs 在推理过程中是否在其内部激活中固有地编码了与安全相关的信号。我们的研究结果表明,LVLMs 在处理不安全提示时表现出不同的激活模式,这可以用来检测和缓解对抗性输入,而无需进行广泛的微调。基于这一见解,我们引入了 HiddenDetect,这是一个新颖的无调优框架,利用内部模型激活来增强安全性。实验结果表明,{HiddenDetect} 在检测针对 LVLMs 的越狱攻击方面超越了最先进的方法。通过利用内在的安全感知模式,我们的方法为加强 LVLM 对多模态威胁的鲁棒性提供了一种高效且可扩展的解决方案。我们的代码将在 https://github.com/leigest519/HiddenDetect 公开发布。 +摘要:可解釋人工智慧 (XAI) 對於建構先進的機器學習驅動應用程式至關重要,特別是在醫療診斷或自動駕駛等關鍵領域。法律、商業和倫理要求促使使用有效的 XAI,但數量日益增加的不同方法使得挑選正確的方法具有挑戰性。此外,由於解釋高度依賴於背景,在沒有使用者的情況下衡量 XAI 方法的有效性只能揭示有限的資訊,排除人類因素,例如理解它的能力。我們建議透過使用者成功執行代理任務的能力來評估 XAI 方法,設計使得良好的執行表現是解釋提供有用資訊的指標。換句話說,我們探討 XAI 對人類決策制定的幫助。此外,對最先進的方法進行使用者研究,顯示出它們在產生信任和懷疑的能力以及正確判斷 AI 決策是否正確的能力方面存在差異。根據結果,我們強烈建議使用和擴充這種方法,以進行更多以目標為基礎的人為中心使用者研究,以終端到終端的方式衡量 XAI 效能。 -##### **Multi-Agent Coordination across Diverse Applications: A Survey** -2502.14743v1 by Lijun Sun, Yijun Yang, Qiqi Duan, Yuhui Shi, Chao Lyu, Yu-Cheng Chang, Chin-Teng Lin, Yang Shen +##### **Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health** +2410.09635v1 by Abdullah Mamun, Lawrence D. Devoe, Mark I. Evans, David W. Britt, Judith Klein-Seetharaman, Hassan Ghasemzadeh -Multi-agent coordination studies the underlying mechanism enabling the -trending spread of diverse multi-agent systems (MAS) and has received -increasing attention, driven by the expansion of emerging applications and -rapid AI advances. This survey outlines the current state of coordination -research across applications through a unified understanding that answers four -fundamental coordination questions: (1) what is coordination; (2) why -coordination; (3) who to coordinate with; and (4) how to coordinate. 
Our -purpose is to explore existing ideas and expertise in coordination and their -connections across diverse applications, while identifying and highlighting -emerging and promising research directions. First, general coordination -problems that are essential to varied applications are identified and analyzed. -Second, a number of MAS applications are surveyed, ranging from widely studied -domains, e.g., search and rescue, warehouse automation and logistics, and -transportation systems, to emerging fields including humanoid and -anthropomorphic robots, satellite systems, and large language models (LLMs). -Finally, open challenges about the scalability, heterogeneity, and learning -mechanisms of MAS are analyzed and discussed. In particular, we identify the -hybridization of hierarchical and decentralized coordination, human-MAS -coordination, and LLM-based MAS as promising future directions. +Early detection of intrapartum risk enables interventions to potentially +prevent or mitigate adverse labor outcomes such as cerebral palsy. Currently, +there is no accurate automated system to predict such events to assist with +clinical decision-making. To fill this gap, we propose "Artificial Intelligence +(AI) for Modeling and Explaining Neonatal Health" (AIMEN), a deep learning +framework that not only predicts adverse labor outcomes from maternal, fetal, +obstetrical, and intrapartum risk factors but also provides the model's +reasoning behind the predictions made. The latter can provide insights into +what modifications in the input variables of the model could have changed the +predicted outcome. We address the challenges of imbalance and small datasets by +synthesizing additional training data using Adaptive Synthetic Sampling +(ADASYN) and Conditional Tabular Generative Adversarial Networks (CTGAN). AIMEN +uses an ensemble of fully-connected neural networks as the backbone for its +classification with the data augmentation supported by either ADASYN or CTGAN. 
+AIMEN, supported by CTGAN, outperforms AIMEN supported by ADASYN in +classification. AIMEN can predict a high risk for adverse labor outcomes with +an average F1 score of 0.784. It also provides counterfactual explanations that +can be achieved by changing 2 to 3 attributes on average. Resources available: +https://github.com/ab9mamun/AIMEN. -摘要:多智能體協調研究探討了促成各種多智能體系統 (MAS) 流行擴散的底層機制,並隨著新興應用擴展和 AI 快速進展而受到越來越多的關注。這項調查透過統一的理解來概述協調研究的現狀,回答了四個基本的協調問題:(1) 什麼是協調;(2) 為什麼協調;(3) 與誰協調;以及 (4) 如何協調。我們的目的是探索協調中現有的想法和專業知識,以及它們在不同應用中的關聯,同時找出並強調新興且有前景的研究方向。首先,找出並分析了對各種應用至關重要的協調問題。其次,調查了許多 MAS 應用,範圍從廣泛研究的領域(例如搜尋和救援、倉庫自動化和物流,以及運輸系統),到新興領域,包括人形機器人和擬人機器人、衛星系統和大語言模型 (LLM)。最後,分析並討論了有關 MAS 的可擴充性、異質性和學習機制的開放挑戰。特別是,我們將分層協調和分散式協調、人類-MAS 協調和基於 LLM 的 MAS 的混合視為有前景的未來方向。 +摘要:產程中風險的早期偵測有助於進行干預措施,以預防或減輕不利的生產結果,例如腦性麻痺。目前,沒有準確的自動化系統可以預測此類事件,以協助臨床決策。為了填補這一空白,我們提出「用於建模和解釋新生兒健康的人工智慧」(AIMEN),這是一個深度學習架構,它不僅可以根據孕產婦、胎兒、產科和產程風險因素預測不利的生產結果,還能提供模型做出預測背後的原因。後者可以提供見解,說明模型輸入變數中的哪些修改可能會改變預測結果。我們透過使用適應性合成抽樣 (ADASYN) 和條件表格生成對抗網路 (CTGAN) 來合成額外的訓練資料,以解決不平衡和小型資料集的挑戰。AIMEN 使用全連接神經網路的集合作為其分類的骨幹,並透過 ADASYN 或 CTGAN 支援資料擴充。由 CTGAN 支援的 AIMEN 在分類方面優於由 ADASYN 支援的 AIMEN。AIMEN 可以預測不利的生產結果的高風險,平均 F1 分數為 0.784。它還提供反事實解釋,可透過平均變更 2 至 3 個屬性來達成。可用資源:https://github.com/ab9mamun/AIMEN。 -##### **YOLOv12: A Breakdown of the Key Architectural Features** -2502.14740v1 by Mujadded Al Rabbani Alif, Muhammad Hussain +##### **Artificial intelligence techniques in inherited retinal diseases: A review** +2410.09105v1 by Han Trinh, Jordan Vice, Jason Charng, Zahra Tajbakhsh, Khyber Alam, Fred K. Chen, Ajmal Mian -This paper presents an architectural analysis of YOLOv12, a significant -advancement in single-stage, real-time object detection building upon the -strengths of its predecessors while introducing key improvements. The model -incorporates an optimised backbone (R-ELAN), 7x7 separable convolutions, and -FlashAttention-driven area-based attention, improving feature extraction, -enhanced efficiency, and robust detections. 
With multiple model variants, -similar to its predecessors, YOLOv12 offers scalable solutions for both -latency-sensitive and high-accuracy applications. Experimental results manifest -consistent gains in mean average precision (mAP) and inference speed, making -YOLOv12 a compelling choice for applications in autonomous systems, security, -and real-time analytics. By achieving an optimal balance between computational -efficiency and performance, YOLOv12 sets a new benchmark for real-time computer -vision, facilitating deployment across diverse hardware platforms, from edge -devices to high-performance clusters. +Inherited retinal diseases (IRDs) are a diverse group of genetic disorders +that lead to progressive vision loss and are a major cause of blindness in +working-age adults. The complexity and heterogeneity of IRDs pose significant +challenges in diagnosis, prognosis, and management. Recent advancements in +artificial intelligence (AI) offer promising solutions to these challenges. +However, the rapid development of AI techniques and their varied applications +have led to fragmented knowledge in this field. This review consolidates +existing studies, identifies gaps, and provides an overview of AI's potential +in diagnosing and managing IRDs. It aims to structure pathways for advancing +clinical applications by exploring AI techniques like machine learning and deep +learning, particularly in disease detection, progression prediction, and +personalized treatment planning. Special focus is placed on the effectiveness +of convolutional neural networks in these areas. Additionally, the integration +of explainable AI is discussed, emphasizing its importance in clinical settings +to improve transparency and trust in AI-based systems. The review addresses the +need to bridge existing gaps in focused studies on AI's role in IRDs, offering +a structured analysis of current AI techniques and outlining future research +directions. 
It concludes with an overview of the challenges and opportunities +in deploying AI for IRDs, highlighting the need for interdisciplinary +collaboration and the continuous development of robust, interpretable AI models +to advance clinical applications. -摘要:本文提出 YOLOv12 的架構分析,這是在單階段即時物件偵測領域的重大進展,它建立在前任的優勢之上,同時引入了關鍵改進。該模型結合了最佳化的主幹 (R-ELAN)、7x7 可分離卷積和 FlashAttention 驅動的基於區域的注意力,改進了特徵提取、增強了效率和穩健的偵測。與其前身類似,YOLOv12 具有多種模型變體,為低延遲敏感型和高準確度應用程式提供了可擴充的解決方案。實驗結果顯示在平均準確度 (mAP) 和推論速度方面都有顯著的提升,這使得 YOLOv12 成為自動化系統、安全性和即時分析應用程式的理想選擇。透過在運算效率和效能之間取得最佳平衡,YOLOv12 為即時電腦視覺樹立了新的基準,促進了在各種硬體平台(從邊緣裝置到高性能叢集)上的部署。 +摘要:遺傳性視網膜疾病 (IRD) 是一組多樣化的遺傳疾病, +會導致視力逐漸喪失,是工作年齡成人失明的主要原因。IRD 的複雜性和異質性對診斷、預後和管理提出了重大挑戰。最近人工智能 (AI) 的進步為這些挑戰提供了有希望的解決方案。 +然而,AI 技術的快速發展及其多種應用導致了該領域的知識分散。本綜述整合了現有研究,找出差距,並概述了 AI 在診斷和管理 IRD 中的潛力。它旨在通過探索機器學習和深度學習等 AI 技術,特別是在疾病檢測、進程預測和個性化治療計劃中,為推進臨床應用構建途徑。特別關注這些領域中卷積神經網路的有效性。此外,討論了可解釋 AI 的整合,強調了其在臨床環境中提高透明度和對基於 AI 的系統的信任的重要性。該綜述解決了彌合 AI 在 IRD 中作用的重點研究中現有差距的必要性,提供了對當前 AI 技術的結構化分析,並概述了未來的研究方向。最後概述了在 IRD 中部署 AI 的挑戰和機遇,強調了跨學科合作和持續開發強大、可解釋的 AI 模型以推進臨床應用的必要性。 -##### **SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines** -2502.14739v1 by M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixing Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming Xu, Zhenzhu Yang, Zekun Moore Wang, Junting Zhou, Yuelin Bai, Xingyuan Bu, Chenglin Cai, Liang Chen, Yifan Chen, Chengtuo Cheng, Tianhao Cheng, Keyi Ding, Siming Huang, Yun Huang, Yaoru Li, Yizhe Li, Zhaoqun Li, Tianhao Liang, Chengdong Lin, Hongquan Lin, Yinghao Ma, Zhongyuan Peng, Zifan Peng, Qige Qi, Shi Qiu, Xingwei Qu, Yizhou Tan, Zili Wang, Chenqing Wang, Hao Wang, Yiya Wang, Yubo Wang, Jiajun Xu, Kexin Yang, Ruibin Yuan, Yuanhao Yue, Tianyang Zhan, Chun Zhang, Jingyang Zhang, Xiyue Zhang, 
Xingjian Zhang, Yue Zhang, Yongchi Zhao, Xiangyu Zheng, Chenghua Zhong, Yang Gao, Zhoujun Li, Dayiheng Liu, Qian Liu, Tianyu Liu, Shiwen Ni, Junran Peng, Yujia Qin, Wenbo Su, Guoyin Wang, Shi Wang, Jian Yang, Min Yang, Meng Cao, Xiang Yue, Zhaoxiang Zhang, Wangchunshu Zhou, Jiaheng Liu, Qunshu Lin, Wenhao Huang, Ge Zhang +##### **CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures** +2410.05235v2 by Ekaterina Sviridova, Anar Yeginbergen, Ainara Estarrona, Elena Cabrio, Serena Villata, Rodrigo Agerri -Large language models (LLMs) have demonstrated remarkable proficiency in -mainstream academic disciplines such as mathematics, physics, and computer -science. However, human knowledge encompasses over 200 specialized disciplines, -far exceeding the scope of existing benchmarks. The capabilities of LLMs in -many of these specialized fields-particularly in light industry, agriculture, -and service-oriented disciplines-remain inadequately evaluated. To address this -gap, we present SuperGPQA, a comprehensive benchmark that evaluates -graduate-level knowledge and reasoning capabilities across 285 disciplines. Our -benchmark employs a novel Human-LLM collaborative filtering mechanism to -eliminate trivial or ambiguous questions through iterative refinement based on -both LLM responses and expert feedback. Our experimental results reveal -significant room for improvement in the performance of current state-of-the-art -LLMs across diverse knowledge domains (e.g., the reasoning-focused model -DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting -the considerable gap between current model capabilities and artificial general -intelligence. 
Additionally, we present comprehensive insights from our -management of a large-scale annotation process, involving over 80 expert -annotators and an interactive Human-LLM collaborative system, offering valuable -methodological guidance for future research initiatives of comparable scope. +Explaining Artificial Intelligence (AI) decisions is a major challenge +nowadays in AI, in particular when applied to sensitive scenarios like medicine +and law. However, the need to explain the rationale behind decisions is a main +issue also for human-based deliberation as it is important to justify +*why* a certain decision has been taken. Resident medical doctors for +instance are required not only to provide a (possibly correct) diagnosis, but +also to explain how they reached a certain conclusion. Developing new tools to +aid residents to train their explanation skills is therefore a central +objective of AI in education. In this paper, we follow this direction, and we +present, to the best of our knowledge, the first multilingual dataset for +Medical Question Answering where correct and incorrect diagnoses for a clinical +case are enriched with a natural language explanation written by doctors. These +explanations have been manually annotated with argument components (i.e., +premise, claim) and argument relations (i.e., attack, support), resulting in +the Multilingual CasiMedicos-Arg dataset which consists of 558 clinical cases +in four languages (English, Spanish, French, Italian) with explanations, where +we annotated 5021 claims, 2313 premises, 2431 support relations, and 1106 +attack relations. We conclude by showing how competitive baselines perform over +this challenging dataset for the argument mining task.
-摘要:大型語言模型 (LLM) 已展現出在主流學術領域(如數學、物理和電腦科學)的卓越能力。然而,人類知識包含超過 200 個專業領域,遠遠超過現有基準的範圍。LLM 在許多這些專業領域(特別是在輕工業、農業和服務導向領域)的能力仍未得到充分評估。為了解決這個差距,我們提出了 SuperGPQA,這是一個綜合基準,用於評估 285 個領域的研究生級知識和推理能力。我們的基準採用新穎的人類-LLM 協同過濾機制,透過基於 LLM 回應和專家回饋的迭代改進,來消除瑣碎或模稜兩可的問題。我們的實驗結果顯示,當前最先進的 LLM 在不同知識領域的表現仍有很大的改進空間(例如,以推理為重點的模型 DeepSeek-R1 在 SuperGPQA 上達到了 61.82% 的最高準確度),突顯了當前模型能力與人工通用智慧之間的巨大差距。此外,我們從管理大型註釋過程(涉及 80 多位專家註釋者和一個互動式人類-LLM 協作系統)中提出了全面的見解,為未來具有可比規模的研究計畫提供了寶貴的方法論指導。 +摘要:解釋人工智慧 (AI) 的決策是現在 AI 的一項重大挑戰,特別是應用於像醫學和法律等敏感情境時。然而,解釋決策背後理由的需求也是基於人類的考量的一個主要問題,因為有必要證明為什麼做出某個決策。例如,住院醫師不僅需要提供(可能是正確的)診斷,還需要解釋他們如何達成某個結論。因此,開發新的工具來幫助住院醫師訓練他們的解釋技巧是教育中 AI 的一項核心目標。在本文中,我們遵循這個方向,並且根據我們的了解,提出第一個多語言醫學問答資料集,其中臨床病例的正確和不正確診斷都附有由醫生撰寫的自然語言解釋。這些解釋已使用論證組成(即前提、主張)和論證關係(即攻擊、支持)進行手動註解,產生多語言 CasiMedicos-Arg 資料集,其中包含 558 個具有解釋的四種語言(英語、西班牙語、法語、義大利語)的臨床病例,我們註解了 5021 個主張、2313 個前提、2431 個支持關係和 1106 個攻擊關係。我們最後展示了競爭基準如何針對論證探勘任務執行此具挑戰性的資料集。 -##### **EAGER-LLM: Enhancing Large Language Models as Recommenders through Exogenous Behavior-Semantic Integration** -2502.14735v1 by Minjie Hong, Yan Xia, Zehan Wang, Jieming Zhu, Ye Wang, Sihang Cai, Xiaoda Yang, Quanyu Dai, Zhenhua Dong, Zhimeng Zhang, Zhou Zhao +##### **Explainable Diagnosis Prediction through Neuro-Symbolic Integration** +2410.01855v2 by Qiuhao Lu, Rui Li, Elham Sagheb, Andrew Wen, Jinlian Wang, Liwei Wang, Jungwei W. Fan, Hongfang Liu -Large language models (LLMs) are increasingly leveraged as foundational -backbones in the development of advanced recommender systems, offering enhanced -capabilities through their extensive knowledge and reasoning. Existing -LLM-based recommender systems (RSs) often face challenges due to the -significant differences between the linguistic semantics of pre-trained LLMs -and the collaborative semantics essential for RSs. These systems use -pre-trained linguistic semantics but learn collaborative semantics from scratch -via the LLM backbone.
However, LLMs are not designed for recommendations, -leading to inefficient collaborative learning, weak result correlations, and -poor integration of traditional RS features. To address these challenges, we -propose EAGER-LLM, a decoder-only LLM-based generative recommendation framework -that integrates endogenous and exogenous behavioral and semantic information in -a non-intrusive manner. Specifically, we propose 1) dual-source knowledge-rich -item indices that integrate indexing sequences for exogenous signals, enabling -efficient link-wide processing; 2) non-invasive multiscale alignment -reconstruction tasks that guide the model toward a deeper understanding of both -collaborative and semantic signals; 3) an annealing adapter designed to finely -balance the model's recommendation performance with its comprehension -capabilities. We demonstrate EAGER-LLM's effectiveness through rigorous testing -on three public benchmarks. +Diagnosis prediction is a critical task in healthcare, where timely and +accurate identification of medical conditions can significantly impact patient +outcomes. Traditional machine learning and deep learning models have achieved +notable success in this domain but often lack interpretability, which is a +crucial requirement in clinical settings. In this study, we explore the use of +neuro-symbolic methods, specifically Logical Neural Networks (LNNs), to develop +explainable models for diagnosis prediction. Essentially, we design and +implement LNN-based models that integrate domain-specific knowledge through +logical rules with learnable thresholds. Our models, particularly +$M_{\text{multi-pathway}}$ and $M_{\text{comprehensive}}$, demonstrate superior +performance over traditional models such as Logistic Regression, SVM, and +Random Forest, achieving higher accuracy (up to 80.52\%) and AUROC scores (up +to 0.8457) in the case study of diabetes prediction.
The learned weights and +thresholds within the LNN models provide direct insights into feature +contributions, enhancing interpretability without compromising predictive +power. These findings highlight the potential of neuro-symbolic approaches in +bridging the gap between accuracy and explainability in healthcare AI +applications. By offering transparent and adaptable diagnostic models, our work +contributes to the advancement of precision medicine and supports the +development of equitable healthcare solutions. Future research will focus on +extending these methods to larger and more diverse datasets to further validate +their applicability across different medical conditions and populations. -摘要:大型語言模型(LLM)正日益被用作先進推薦系統開發中的基礎主幹,透過其廣泛的知識和推理能力提供增強功能。現有的基於 LLM 的推薦系統(RS)通常會因為預先訓練的 LLM 語言語義與 RS 必備的協作語義之間的顯著差異而面臨挑戰。這些系統使用預先訓練的語言語義,但透過 LLM 主幹從頭學習協作語義。然而,LLM 並非專為推薦而設計,導致協作學習效率低落、結果關聯性薄弱,以及與傳統 RS 功能整合不佳。為了應對這些挑戰,我們提出 EAGER-LLM,這是一種僅解碼器、基於 LLM 的生成推薦架構,能以非侵入性方式整合內生和外生行為和語義資訊。具體來說,我們提出 1) 雙來源、知識豐富的項目索引,它整合了外生訊號的索引序列,實現了高效的鏈路廣泛處理;2) 非侵入式多尺度對齊重建任務引導模型更深入地理解協作和語義訊號;3) 退火適配器旨在精細地平衡模型的推薦效能與其理解能力。我們透過在三個公共基準上的嚴格測試證明了 EAGER-LLM 的有效性。 +摘要:診斷預測是醫療保健中的關鍵任務,及時且準確地識別醫療狀況會顯著影響患者的結果。傳統的機器學習和深度學習模型已在這個領域取得顯著成功,但通常缺乏可解釋性,這在臨床環境中是一項關鍵要求。在本研究中,我們探討了神經符號方法的應用,特別是邏輯神經網路 (LNN),以開發用於診斷預測的可解釋模型。基本上,我們設計並實作了基於 LNN 的模型,這些模型透過具有可學習閾值的邏輯規則整合領域特定知識。我們的模型,特別是 $M_{\text{multi-pathway}}$ 和 $M_{\text{comprehensive}}$,表現出優於傳統模型(例如邏輯迴歸、SVM 和隨機森林)的優異效能,在糖尿病預測的案例研究中達到了更高的準確度(高達 80.52%)和 AUROC 分數(高達 0.8457)。LNN 模型中學習到的權重和閾值提供了對特徵貢獻的直接見解,增強了可解釋性,同時不影響預測能力。這些發現突顯了神經符號方法在彌合醫療保健 AI 應用中準確性和可解釋性差距方面的潛力。透過提供透明且適應性強的診斷模型,我們的研究有助於推進精準醫療,並支援公平醫療保健解決方案的開發。未來的研究將專注於將這些方法擴展到更大且更多樣化的資料集,以進一步驗證其在不同醫療狀況和人群中的適用性。 -##### **Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models** -2502.14734v1 by Hongji Li, Andrianos Michail, Reto Gubelmann, Simon Clematide, Juri Opitz +##### **Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart 
healthcare** +2410.00366v1 by Prasenjit Maji, Amit Kumar Mondal, Hemanta Kumar Mondal, Saraju P. Mohanty -We propose the Sentence Smith framework that enables controlled and specified -manipulation of text meaning. It consists of three main steps: 1. Parsing a -sentence into a semantic graph, 2. Applying human-designed semantic -manipulation rules, and 3. Generating text from the manipulated graph. A final -filtering step (4.) ensures the validity of the applied transformation. To -demonstrate the utility of Sentence Smith in an application study, we use it to -generate hard negative pairs that challenge text embedding models. Since the -controllable generation makes it possible to clearly isolate different types of -semantic shifts, we can gain deeper insights into the specific strengths and -weaknesses of widely used text embedding models, also addressing an issue in -current benchmarking where linguistic phenomena remain opaque. Human validation -confirms that the generations produced by Sentence Smith are highly accurate. +The rapid advancements in artificial intelligence (AI) have revolutionized +smart healthcare, driving innovations in wearable technologies, continuous +monitoring devices, and intelligent diagnostic systems. However, security, +explainability, robustness, and performance optimization challenges remain +critical barriers to widespread adoption in clinical environments. This +research presents an innovative algorithmic method using the Adaptive Feature +Evaluator (AFE) algorithm to improve feature selection in healthcare datasets +and overcome problems. AFE integrating Genetic Algorithms (GA), Explainable +Artificial Intelligence (XAI), and Permutation Combination Techniques (PCT), +the algorithm optimizes Clinical Decision Support Systems (CDSS), thereby +enhancing predictive accuracy and interpretability. 
The proposed method is +validated across three diverse healthcare datasets using six distinct machine +learning algorithms, demonstrating its robustness and superiority over +conventional feature selection techniques. The results underscore the +transformative potential of AFE in smart healthcare, enabling personalized and +transparent patient care. Notably, the AFE algorithm, when combined with a +Multi-layer Perceptron (MLP), achieved an accuracy of up to 98.5%, highlighting +its capability to improve clinical decision-making processes in real-world +healthcare applications. -摘要:我們提出 Sentence Smith 框架,它能控制並指定文本含義的處理。它包含三個主要步驟:1. 將句子解析成語義圖形,2. 套用人為設計的語義處理規則,3. 從處理過的圖形生成文本。最後的過濾步驟 (4.) 確保套用轉換的有效性。為了在應用研究中展示 Sentence Smith 的效用,我們使用它來產生挑戰文本嵌入模型的困難負面對。由於可控生成能清楚地隔離不同類型的語義轉移,我們能更深入地了解廣泛使用的文本嵌入模型的具體優點和缺點,同時也解決了語言現象在當前基準測試中仍然不透明的問題。人為驗證確認 Sentence Smith 產生的生成高度準確。 +摘要:人工智慧 (AI) 的快速進展徹底改變了智慧醫療保健,推動了可穿戴技術、持續監控裝置和智慧診斷系統的創新。然而,安全性、可解釋性、穩健性和效能最佳化挑戰仍然是臨床環境中廣泛採用的關鍵障礙。本研究提出一個創新的演算法方法,使用自適應特徵評估器 (AFE) 演算法來改善醫療保健資料集中的特徵選取並克服問題。AFE 整合了遺傳演算法 (GA)、可解釋人工智慧 (XAI) 和排列組合技術 (PCT),該演算法最佳化了臨床決策支援系統 (CDSS),從而提高了預測準確性和可解釋性。所提出的方法使用六種不同的機器學習演算法驗證了三個不同的醫療保健資料集,證明了其穩健性和優於傳統特徵選取技術。結果強調了 AFE 在智慧醫療保健中的轉變潛力,實現了個人化和透明的患者照護。值得注意的是,AFE 演算法與多層感知器 (MLP) 結合使用時,準確度高達 98.5%,突顯了其改善實際醫療保健應用中臨床決策制定流程的能力。 -##### **WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models** -2502.14727v1 by Yifu Chen, Shengpeng Ji, Haoxiao Wang, Ziqing Wang, Siyu Chen, Jinzheng He, Jin Xu, Zhou Zhao +##### **Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study** +2409.13476v1 by Tirtha Chanda, Sarah Haggenmueller, Tabea-Clara Bucher, Tim Holland-Letz, Harald Kittler, Philipp Tschandl, Markus V. Heppt, Carola Berking, Jochen S. Utikal, Bastian Schilling, Claudia Buerger, Cristian Navarrete-Dechent, Matthias Goebeler, Jakob Nikolas Kather, Carolin V. 
Schneider, Benjamin Durani, Hendrike Durani, Martin Jansen, Juliane Wacker, Joerg Wacker, Reader Study Consortium, Titus J. Brinker -Retrieval Augmented Generation (RAG) has gained widespread adoption owing to -its capacity to empower large language models (LLMs) to integrate external -knowledge. However, existing RAG frameworks are primarily designed for -text-based LLMs and rely on Automatic Speech Recognition to process speech -input, which discards crucial audio information, risks transcription errors, -and increases computational overhead. Therefore, we introduce WavRAG, the first -retrieval augmented generation framework with native, end-to-end audio support. -WavRAG offers two key features: 1) Bypassing ASR, WavRAG directly processes raw -audio for both embedding and retrieval. 2) WavRAG integrates audio and text -into a unified knowledge representation. Specifically, we propose the -WavRetriever to facilitate the retrieval from a text-audio hybrid knowledge -base, and further enhance the in-context capabilities of spoken dialogue models -through the integration of chain-of-thought reasoning. In comparison to -state-of-the-art ASR-Text RAG pipelines, WavRAG achieves comparable retrieval -performance while delivering a 10x acceleration. Furthermore, WavRAG's unique -text-audio hybrid retrieval capability extends the boundaries of RAG to the -audio modality. +Artificial intelligence (AI) systems have substantially improved +dermatologists' diagnostic accuracy for melanoma, with explainable AI (XAI) +systems further enhancing clinicians' confidence and trust in AI-driven +decisions. Despite these advancements, there remains a critical need for +objective evaluation of how dermatologists engage with both AI and XAI tools. +In this study, 76 dermatologists participated in a reader study, diagnosing 16 +dermoscopic images of melanomas and nevi using an XAI system that provides +detailed, domain-specific explanations. 
Eye-tracking technology was employed to +assess their interactions. Diagnostic performance was compared with that of a +standard AI system lacking explanatory features. Our findings reveal that XAI +systems improved balanced diagnostic accuracy by 2.8 percentage points relative +to standard AI. Moreover, diagnostic disagreements with AI/XAI systems and +complex lesions were associated with elevated cognitive load, as evidenced by +increased ocular fixations. These insights have significant implications for +clinical practice, the design of AI tools for visual tasks, and the broader +development of XAI in medical diagnostics. -摘要:檢索增強生成 (RAG) 因其賦能大型語言模型 (LLM) 整合外部知識的能力而獲得廣泛採用。然而,現有的 RAG 框架主要設計用於基於文字的 LLM,並依賴自動語音辨識處理語音輸入,這會捨棄重要的音訊資訊、有轉錄錯誤的風險,並增加運算負擔。因此,我們引入了 WavRAG,這是第一個具備原生端對端音訊支援的檢索增強生成框架。WavRAG 提供兩個主要功能:1) 繞過 ASR,WavRAG 直接處理原始音訊以進行嵌入和檢索。2) WavRAG 將音訊和文字整合到統一的知識表示中。具體來說,我們提出了 WavRetriever 以利於從文字音訊混合知識庫中進行檢索,並透過整合思考鏈推理進一步增強對話模型的語境能力。與最先進的 ASR 文字 RAG 管線相比,WavRAG 達到了相當的檢索效能,同時提供了 10 倍的加速。此外,WavRAG 獨特的文字音訊混合檢索能力將 RAG 的界線延伸到音訊模式。 +摘要:人工智慧 (AI) 系統已大幅改善皮膚科醫師對黑色素瘤的診斷準確度,而可解釋 AI (XAI) 系統進一步提升臨床醫師對 AI 驅動決策的信心與信賴。儘管有這些進展,對於皮膚科醫師如何使用 AI 和 XAI 工具,仍有客觀評估的迫切需求。在這項研究中,76 位皮膚科醫師參與了一項讀者研究,使用 XAI 系統診斷 16 張黑色素瘤和痣的皮膚鏡影像,該系統提供詳細的領域特定說明。採用眼球追蹤技術來評估他們的互動。將診斷表現與缺乏說明功能的標準 AI 系統進行比較。我們的研究結果顯示,XAI 系統相較於標準 AI,將平衡診斷準確度提升了 2.8 個百分點。此外,與 AI/XAI 系統的診斷分歧和複雜的病灶與認知負擔升高有關,這由增加的眼睛注視次數所證實。這些見解對臨床實務、視覺任務 AI 工具的設計和醫學診斷中 XAI 的廣泛發展具有重大意義。 -##### **Entity Framing and Role Portrayal in the News** -2502.14718v1 by Tarek Mahmoud, Zhuohan Xie, Dimitar Dimitrov, Nikolaos Nikolaidis, Purificação Silvano, Roman Yangarber, Shivam Sharma, Elisa Sartori, Nicolas Stefanovitch, Giovanni Da San Martino, Jakub Piskorski, Preslav Nakov +##### **Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data** +2409.15374v1 by Suryansh Vidya, Kush Gupta, Amir Aly, Andy Wills, Emmanuel Ifeachor, Rohit Shankar -We introduce a novel multilingual hierarchical corpus annotated for entity -framing and role 
portrayal in news articles. The dataset uses a unique taxonomy -inspired by storytelling elements, comprising 22 fine-grained roles, or -archetypes, nested within three main categories: protagonist, antagonist, and -innocent. Each archetype is carefully defined, capturing nuanced portrayals of -entities such as guardian, martyr, and underdog for protagonists; tyrant, -deceiver, and bigot for antagonists; and victim, scapegoat, and exploited for -innocents. The dataset includes 1,378 recent news articles in five languages -(Bulgarian, English, Hindi, European Portuguese, and Russian) focusing on two -critical domains of global significance: the Ukraine-Russia War and Climate -Change. Over 5,800 entity mentions have been annotated with role labels. This -dataset serves as a valuable resource for research into role portrayal and has -broader implications for news analysis. We describe the characteristics of the -dataset and the annotation process, and we report evaluation results on -fine-tuned state-of-the-art multilingual transformers and hierarchical -zero-shot learning using LLMs at the level of a document, a paragraph, and a -sentence. +Early diagnosis and intervention for Autism Spectrum Disorder (ASD) has been +shown to significantly improve the quality of life of autistic individuals. +However, diagnostics methods for ASD rely on assessments based on clinical +presentation that are prone to bias and can be challenging to arrive at an +early diagnosis. There is a need for objective biomarkers of ASD which can help +improve diagnostic accuracy. Deep learning (DL) has achieved outstanding +performance in diagnosing diseases and conditions from medical imaging data. +Extensive research has been conducted on creating models that classify ASD +using resting-state functional Magnetic Resonance Imaging (fMRI) data. However, +existing models lack interpretability. 
This research aims to improve the +accuracy and interpretability of ASD diagnosis by creating a DL model that can +not only accurately classify ASD but also provide explainable insights into its +working. The dataset used is a preprocessed version of the Autism Brain Imaging +Data Exchange (ABIDE) with 884 samples. Our findings show a model that can +accurately classify ASD and highlight critical brain regions differing between +ASD and typical controls, with potential implications for early diagnosis and +understanding of the neural basis of ASD. These findings are validated by +studies in the literature that use different datasets and modalities, +confirming that the model actually learned characteristics of ASD and not just +the dataset. This study advances the field of explainable AI in medical imaging +by providing a robust and interpretable model, thereby contributing to a future +with objective and reliable ASD diagnostics. -摘要:我們引進一個新穎的多語言層級語料庫,其中註解了新聞文章中的實體框架和角色描繪。此資料集使用了一個獨特的分類法,其靈感來自講故事元素,包含 22 個細緻的角色或原型,嵌套在三個主要類別中:主角、對手和無辜者。每個原型都經過仔細定義,捕捉了實體的細微描繪,例如主角的監護人、烈士和弱者;對手的暴君、欺騙者和偏執狂;以及無辜者的受害者、替罪羊和被剝削者。該資料集包括五種語言(保加利亞語、英語、印地語、歐洲葡萄牙語和俄語)中的 1,378 篇近期新聞文章,重點關注兩個具有全球意義的關鍵領域:烏克蘭-俄羅斯戰爭和氣候變遷。超過 5,800 個實體提及已註解為角色標籤。此資料集作為角色描繪研究的寶貴資源,並對新聞分析有更廣泛的影響。我們描述了資料集的特徵和註解過程,並報告了對使用 LLM 在文件、段落和句子層級進行微調的最新多語言轉換器和層級零次學習的評估結果。 +摘要:自閉症譜系障礙 (ASD) 的早期診斷和介入已被證實能顯著改善自閉症患者的生活品質。然而,ASD 的診斷方法依賴於基於臨床表現的評估,容易產生偏見,且可能難以做出早期診斷。有必要找出 ASD 的客觀生物標記,以幫助提高診斷準確性。深度學習 (DL) 在從醫學影像資料診斷疾病和病症方面取得傑出的表現。已經針對建立使用靜態功能性磁振造影 (fMRI) 資料對 ASD 進行分類的模型進行廣泛的研究。然而,現有的模型缺乏可解釋性。本研究旨在透過建立一個不僅能準確分類 ASD,還能提供可解釋見解說明其運作原理的 DL 模型,來改善 ASD 診斷的準確性和可解釋性。所使用的資料集是自閉症大腦影像資料交換 (ABIDE) 的預處理版本,包含 884 個樣本。我們的研究結果顯示,該模型能準確分類 ASD,並強調 ASD 與典型對照組之間存在差異的關鍵腦區,對於 ASD 的早期診斷和神經基礎的理解具有潛在的意義。這些研究結果已由使用不同資料集和方式的文獻研究驗證,證實該模型實際上學習了 ASD 的特徵,而不僅僅是資料集。本研究透過提供一個強健且可解釋的模型,推動了醫學影像中可解釋 AI 的領域,從而為未來提供客觀且可靠的 ASD 診斷做出貢獻。 -##### **From Knowledge Generation to Knowledge Verification: Examining the BioMedical Generative Capabilities of ChatGPT** -2502.14714v1 by 
Ahmed Abdeen Hamed, Byung Suk Lee +##### **Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition** +2409.12883v1 by Daniel Flores-Araiza, Francisco Lopez-Tiro, Clément Larose, Salvador Hinojosa, Andres Mendez-Vazquez, Miguel Gonzalez-Mendoza, Gilberto Ochoa-Ruiz, Christian Daul -The generative capabilities of LLM models present opportunities in -accelerating tasks and concerns with the authenticity of the knowledge it -produces. To address the concerns, we present a computational approach that -systematically evaluates the factual accuracy of biomedical knowledge that an -LLM model has been prompted to generate. Our approach encompasses two -processes: the generation of disease-centric associations and the verification -of them using the semantic knowledge of the biomedical ontologies. Using -ChatGPT as the select LLM model, we designed a set of prompt-engineering -processes to generate linkages between diseases, drugs, symptoms, and genes to -establish grounds for assessments. Experimental results demonstrate high -accuracy in identifying disease terms (88%-97%), drug names (90%-91%), and -genetic information (88%-98%). The symptom term identification accuracy was -notably lower (49%-61%), as verified against the DOID, ChEBI, SYMPTOM, and GO -ontologies accordingly. The verification of associations reveals literature -coverage rates of (89%-91%) among disease-drug and disease-gene associations. -The low identification accuracy for symptom terms also contributed to the -verification of symptom-related associations (49%-62%). +The in-vivo identification of the kidney stone types during an ureteroscopy +would be a major medical advance in urology, as it could reduce the time of the +tedious renal calculi extraction process, while diminishing infection risks. +Furthermore, such an automated procedure would make possible to prescribe +anti-recurrence treatments immediately. 
Nowadays, only few experienced +urologists are able to recognize the kidney stone types in the images of the +videos displayed on a screen during the endoscopy. Thus, several deep learning +(DL) models have recently been proposed to automatically recognize the kidney +stone types using ureteroscopic images. However, these DL models are of black +box nature which limits their applicability in clinical settings. This +contribution proposes a case-based reasoning DL model which uses prototypical +parts (PPs) and generates local and global descriptors. The PPs encode for each +class (i.e., kidney stone type) visual feature information (hue, saturation, +intensity and textures) similar to that used by biologists. The PPs are +optimally generated due to a new loss function used during the model training. +Moreover, the local and global descriptors of PPs allow to explain the +decisions ("what" information, "where in the images") in an understandable way +for biologists and urologists. The proposed DL model has been tested on a +database including images of the six most widespread kidney stone types. The +overall average classification accuracy was 90.37. When comparing these results +with that of the eight other DL models of the kidney stone state-of-the-art, it +can be seen that the valuable gain in explainability was not reached at the +expense of accuracy which was even slightly increased with respect to that +(88.2) of the best method of the literature. These promising and interpretable +results also encourage urologists to put their trust in AI-based solutions. 
-摘要:LLM 模型的生成能力為加速任務和對其產生的知識真實性的疑慮提供了機會。為了解決這些疑慮,我們提出了計算方法,系統性評估 LLM 模型受提示而產生的生物醫學知識的事實準確性。我們的做法包括兩個過程:生成以疾病為中心的關聯,並使用生物醫學本体的語義知識驗證它們。使用 ChatGPT 作為選定的 LLM 模型,我們設計了一組提示工程流程,以生成疾病、藥物、症狀和基因之間的關聯,作為評估的依據。實驗結果證明在識別疾病術語 (88%-97%)、藥物名稱 (90%-91%) 和遺傳資訊 (88%-98%) 方面具有很高的準確性。症狀術語識別準確性顯著較低 (49%-61%),並根據 DOID、ChEBI、SYMPTOM 和 GO 本体進行驗證。關聯驗證顯示疾病-藥物和疾病-基因關聯的文獻覆蓋率為 (89%-91%)。症狀術語的低識別準確性也影響了症狀相關關聯的驗證 (49%-62%)。 +摘要:尿路鏡檢查中腎結石類型的體內識別將是泌尿科的一項重大進展,因為它可以減少繁瑣的腎結石取出過程的時間,同時降低感染風險。此外,這種自動化程序將使立即開立抗復發治療成為可能。如今,只有少數經驗豐富的泌尿科醫生能夠在內視鏡檢查期間屏幕上顯示的視頻圖像中識別腎結石類型。因此,最近已提出多種深度學習 (DL) 模型,以使用輸尿管鏡圖像自動識別腎結石類型。然而,這些 DL 模型本質上是黑盒子,這限制了它們在臨床環境中的應用性。本文提出了一個基於案例推理的 DL 模型,它使用原型部分 (PP) 並生成局部和全局描述符。PP 為每種類型(即腎結石類型)編碼視覺特徵信息(色調、飽和度、強度和紋理),類似於生物學家使用的信息。由於在模型訓練期間使用的新損失函數,PP 得到了最佳生成。此外,PP 的局部和全局描述符允許以生物學家和泌尿科醫生可以理解的方式解釋決策(“什麼”信息,“圖像中的什麼位置”)。所提出的 DL 模型已在一個包含六種最廣泛的腎結石類型圖像的數據庫上進行了測試。總體平均分類準確率為 90.37。將此結果與腎結石最先進的八個其他 DL 模型的結果進行比較時,可以看出,可解釋性的寶貴增益並未以準確性為代價,甚至略有增加與文獻中最好的方法 (88.2) 相比。這些有希望且可解釋的結果也鼓勵泌尿科醫生相信基於人工智能的解決方案。 -##### **Data-Efficient Pretraining with Group-Level Data Influence Modeling** -2502.14709v1 by Zichun Yu, Fei Peng, Jie Lei, Arnold Overwijk, Wen-tau Yih, Chenyan Xiong +##### **Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques** +2409.12087v3 by Yubo Li, Saba Al-Sayouri, Rema Padman -Data-efficient pretraining has shown tremendous potential to elevate scaling -laws. This paper argues that effective pretraining data should be curated at -the group level, treating a set of data points as a whole rather than as -independent contributors. To achieve that, we propose Group-Level Data -Influence Modeling (Group-MATES), a novel data-efficient pretraining method -that captures and optimizes group-level data utility. Specifically, Group-MATES -collects oracle group-level influences by locally probing the pretraining model -with data sets. 
It then fine-tunes a relational data influence model to -approximate oracles as relationship-weighted aggregations of individual -influences. The fine-tuned model selects the data subset by maximizing its -group-level influence prediction, with influence-aware clustering to enable -efficient inference. Experiments on the DCLM benchmark demonstrate that -Group-MATES achieves a 10% relative core score improvement on 22 downstream -tasks over DCLM-Baseline and 5% over individual-influence-based methods, -establishing a new state-of-the-art. Further analyses highlight the -effectiveness of relational data influence models in capturing intricate -interactions between data points. +This study explores the potential of utilizing administrative claims data, +combined with advanced machine learning and deep learning techniques, to +predict the progression of Chronic Kidney Disease (CKD) to End-Stage Renal +Disease (ESRD). We analyze a comprehensive, 10-year dataset provided by a major +health insurance organization to develop prediction models for multiple +observation windows using traditional machine learning methods such as Random +Forest and XGBoost as well as deep learning approaches such as Long Short-Term +Memory (LSTM) networks. Our findings demonstrate that the LSTM model, +particularly with a 24-month observation window, exhibits superior performance +in predicting ESRD progression, outperforming existing models in the +literature. We further apply SHapley Additive exPlanations (SHAP) analysis to +enhance interpretability, providing insights into the impact of individual +features on predictions at the individual patient level. This study underscores +the value of leveraging administrative claims data for CKD management and +predicting ESRD progression. 
-摘要:資料有效的預訓練已展現出提升規模化定律的巨大潛力。本文認為,有效的預訓練資料應在群組層級中進行策展,將資料點集合視為一個整體,而非獨立的貢獻者。為達成此目的,我們提出群組層級資料影響建模(Group-MATES),這是一種新穎的資料有效預訓練方法,可擷取和最佳化群組層級資料效用。具體而言,Group-MATES 透過使用資料集在區域探測預訓練模型,收集神諭群組層級影響。接著,微調關係資料影響模型,以關係加權聚合個別影響來近似神諭。微調模型透過最大化其群組層級影響預測,選取資料子集,並透過考量影響的群集,啟用有效率的推論。在 DCLM 基準上的實驗證明,與 DCLM-Baseline 相比,Group-MATES 在 22 個下游任務上達成 10% 的相對核心分數提升,並比基於個別影響的方法高出 5%,建立了新的技術水準。進一步的分析強調了關係資料影響模型在擷取資料點之間的複雜互動上的有效性。 +摘要:本研究探討利用行政申報資料,結合先進機器學習與深度學習技術,預測慢性腎臟病 (CKD) 進展至末期腎臟疾病 (ESRD) 的可能性。我們分析一家大型健康保險組織提供的 10 年綜合資料集,使用傳統機器學習方法(例如隨機森林和 XGBoost)以及深度學習方法(例如長期短期記憶 (LSTM) 網路)開發多個觀察視窗的預測模型。我們的研究結果顯示,LSTM 模型(尤其是 24 個月觀察視窗)在預測 ESRD 進展方面表現優異,優於文獻中的現有模型。我們進一步應用 SHapley 可加性解釋 (SHAP) 分析以增強可解釋性,深入了解個別特徵對個別患者層級預測的影響。本研究強調了利用行政申報資料進行 CKD 管理和預測 ESRD 進展的價值。 -##### **Human Misperception of Generative-AI Alignment: A Laboratory Experiment** -2502.14708v1 by Kevin He, Ran Shorrer, Mengjia Xia +##### **Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases** +2409.09201v3 by Mercy Asiedu, Nenad Tomasev, Chintan Ghate, Tiya Tiyasirichokchai, Awa Dieng, Oluwatosin Akande, Geoffrey Siwo, Steve Adudans, Sylvanus Aitkins, Odianosen Ehiakhamen, Eric Ndombi, Katherine Heller -We conduct an incentivized laboratory experiment to study people's perception -of generative artificial intelligence (GenAI) alignment in the context of -economic decision-making. Using a panel of economic problems spanning the -domains of risk, time preference, social preference, and strategic -interactions, we ask human subjects to make choices for themselves and to -predict the choices made by GenAI on behalf of a human user. We find that -people overestimate the degree of alignment between GenAI's choices and human -choices. In every problem, human subjects' average prediction about GenAI's -choice is substantially closer to the average human-subject choice than it is -to the GenAI choice. 
At the individual level, different subjects' predictions -about GenAI's choice in a given problem are highly correlated with their own -choices in the same problem. We explore the implications of people -overestimating GenAI alignment in a simple theoretical model. +While large language models (LLMs) have shown promise for medical question +answering, there is limited work focused on tropical and infectious +disease-specific exploration. We build on an opensource tropical and infectious +diseases (TRINDs) dataset, expanding it to include demographic and semantic +clinical and consumer augmentations yielding 11000+ prompts. We evaluate LLM +performance on these, comparing generalist and medical LLMs, as well as LLM +outcomes to human experts. We demonstrate through systematic experimentation, +the benefit of contextual information such as demographics, location, gender, +risk factors for optimal LLM response. Finally we develop a prototype of +TRINDs-LM, a research tool that provides a playground to navigate how context +impacts LLM outputs for health. 
-摘要:我們進行一項誘因實驗室實驗,以研究人們對生成式人工智慧 (GenAI) 在經濟決策制定中的對齊認知。使用涵蓋風險、時間偏好、社會偏好和策略性互動領域的經濟問題小組,我們要求受試者為自己做出選擇,並預測 GenAI 代表人類使用者做出的選擇。我們發現人們高估了 GenAI 選擇和人類選擇之間的對齊程度。在每個問題中,受試者對 GenAI 選擇的平均預測都比對 GenAI 選擇的預測更接近於平均人類受試者選擇。在個人層面上,不同受試者對特定問題中 GenAI 選擇的預測與他們在同一個問題中的選擇高度相關。我們在一個簡單的理論模型中探討了人們高估 GenAI 對齊的影響。 +摘要:儘管大型語言模型 (LLM) 在醫療問題解答方面展現出前景,但專注於熱帶和傳染病特定探索的研究有限。我們建立在一個開放原始碼熱帶和傳染病 (TRINDs) 資料集上,並將其擴展為納入人口統計和語義臨床和消費者擴充,產生超過 11000 個提示。我們評估了 LLM 在這些方面的效能,比較了通才和醫療 LLM,以及 LLM 結果與人類專家的比較。我們透過系統性實驗證明了背景資訊(例如人口統計、位置、性別、最佳 LLM 回應的風險因素)的好處。最後,我們開發了 TRINDs-LM 的原型,這是一個研究工具,提供一個探索背景如何影響 LLM 健康輸出的平台。 -##### **Not All Data are Good Labels: On the Self-supervised Labeling for Time Series Forecasting** -2502.14704v1 by Yuxuan Yang, Dalin Zhang, Yuxuan Liang, Hua Lu, Huan Li, Gang Chen +##### **Explainable AI: Definition and attributes of a good explanation for health AI** +2409.15338v1 by Evangelia Kyrimi, Scott McLachlan, Jared M Wohlgemut, Zane B Perkins, David A. Lagnado, William Marsh, the ExAIDSS Expert Group -Time Series Forecasting (TSF) is a crucial task in various domains, yet -existing TSF models rely heavily on high-quality data and insufficiently -exploit all available data. This paper explores a novel self-supervised -approach to re-label time series datasets by inherently constructing candidate -datasets. During the optimization of a simple reconstruction network, -intermediates are used as pseudo labels in a self-supervised paradigm, -improving generalization for any predictor. We introduce the Self-Correction -with Adaptive Mask (SCAM), which discards overfitted components and selectively -replaces them with pseudo labels generated from reconstructions. Additionally, -we incorporate Spectral Norm Regularization (SNR) to further suppress -overfitting from a loss landscape perspective. Our experiments on eleven -real-world datasets demonstrate that SCAM consistently improves the performance -of various backbone models. 
This work offers a new perspective on constructing -datasets and enhancing the generalization of TSF models through self-supervised -learning. +Proposals of artificial intelligence (AI) solutions based on increasingly +complex and accurate predictive models are becoming ubiquitous across many +disciplines. As the complexity of these models grows, transparency and users' +understanding often diminish. This suggests that accurate prediction alone is +insufficient for making an AI-based solution truly useful. In the development +of healthcare systems, this introduces new issues related to accountability and +safety. Understanding how and why an AI system makes a recommendation may +require complex explanations of its inner workings and reasoning processes. +Although research on explainable AI (XAI) has significantly increased in recent +years and there is high demand for XAI in medicine, defining what constitutes a +good explanation remains ad hoc, and providing adequate explanations continues +to be challenging. To fully realize the potential of AI, it is critical to +address two fundamental questions about explanations for safety-critical AI +applications, such as health-AI: (1) What is an explanation in health-AI? and +(2) What are the attributes of a good explanation in health-AI? In this study, +we examined published literature and gathered expert opinions through a +two-round Delphi study. The research outputs include (1) a definition of what +constitutes an explanation in health-AI and (2) a comprehensive list of +attributes that characterize a good explanation in health-AI. 
-摘要:時間序列預測 (TSF) 在各個領域中都是一項重要的任務,但現有的 TSF 模型極度依賴高品質的資料,且無法充分利用所有可用的資料。本文探討了一種新穎的自監督方法,藉由內建地建構候選資料集來重新標記時間序列資料集。在最佳化一個簡單的重建網路過程中,中間產物會在自監督範例中作為偽標籤,進而改善任何預測器的概化能力。我們引入了帶有自適應遮罩 (SCAM) 的自我修正,它會捨棄過度擬合的組成,並選擇性地以從重建產生的偽標籤取代它們。此外,我們納入了頻譜範數正規化 (SNR) 來進一步抑制從損失景觀觀點來看產生的過度擬合。我們在 11 個真實世界的資料集上進行的實驗,證明 SCAM 持續改善各種主幹模型的效能。這項工作提供了建構資料集和透過自監督學習來提升 TSF 模型概化能力的新觀點。 +摘要:隨著越來越複雜且準確的預測模型,基於人工智慧 (AI) 解決方案的提案在許多領域中變得無處不在。隨著這些模型複雜性的增加,透明度和使用者的理解力往往會降低。這表示僅有準確的預測並不足以讓 AI 解決方案真正有用。在醫療保健系統的開發中,這引入了與問責制和安全性相關的新問題。瞭解 AI 系統如何以及為何提出建議可能需要對其內部運作和推理過程進行複雜的說明。儘管近年來對可解釋 AI (XAI) 的研究已大幅增加,且醫學領域對 XAI 有很高的需求,但定義什麼構成一個好的解釋仍是臨時性的,而提供適當的解釋仍然具有挑戰性。為了充分發揮 AI 的潛力,對於安全關鍵型 AI 應用(例如健康 AI)的解釋,探討兩個基本問題至關重要:(1) 什麼是健康 AI 中的解釋?以及 (2) 健康 AI 中一個好的解釋有哪些屬性?在本研究中,我們檢視了已發表的文獻,並透過兩輪德爾菲研究收集了專家意見。研究成果包括:(1) 健康 AI 中什麼構成解釋的定義,以及 (2) 健康 AI 中一個好解釋的屬性清單。 -##### **I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search** -2502.14693v1 by Zujie Liang, Feng Wei, Wujiang Xu, Lin Chen, Yuxi Qian, Xinhui Wu +##### **Exploring the Effect of Explanation Content and Format on User Comprehension and Trust in Healthcare** +2408.17401v2 by Antonio Rago, Bence Palfi, Purin Sukpanichnant, Hannibal Nabli, Kavyesh Vivek, Olga Kostopoulou, James Kinross, Francesca Toni -Recent advancements in large language models (LLMs) have shown remarkable -potential in automating machine learning tasks. However, existing LLM-based -agents often struggle with low-diversity and suboptimal code generation. While -recent work has introduced Monte Carlo Tree Search (MCTS) to address these -issues, limitations persist in the quality and diversity of thoughts generated, -as well as in the scalar value feedback mechanisms used for node selection. In -this study, we introduce Introspective Monte Carlo Tree Search (I-MCTS), a -novel approach that iteratively expands tree nodes through an introspective -process that meticulously analyzes solutions and results from parent and -sibling nodes. 
This facilitates a continuous refinement of the node in the -search tree, thereby enhancing the overall decision-making process. Furthermore, -we integrate a Large Language Model (LLM)-based value model to facilitate -direct evaluation of each node's solution prior to conducting comprehensive -computational rollouts. A hybrid rewarding mechanism is implemented to -seamlessly transition the Q-value from LLM-estimated scores to actual -performance scores. This allows higher-quality nodes to be traversed -earlier. Applied to the various ML tasks, our approach demonstrates a 6\% -absolute improvement in performance compared to the strong open-source AutoML -agents, showcasing its effectiveness in enhancing agentic AutoML systems. +AI-driven tools for healthcare are widely acknowledged as potentially +beneficial to health practitioners and patients, e.g. the QCancer regression +tool for cancer risk prediction. However, for these tools to be trusted, they +need to be supplemented with explanations. We examine how explanations' content +and format affect user comprehension and trust when explaining QCancer's +predictions. Regarding content, we deploy SHAP and Occlusion-1. Regarding +format, we present SHAP explanations, conventionally, as charts (SC) and +Occlusion-1 explanations as charts (OC) as well as text (OT), to which their +simpler nature lends itself. We conduct experiments with two sets of +stakeholders: the general public (representing patients) and medical students +(representing healthcare practitioners). Our experiments showed higher +subjective comprehension and trust for Occlusion-1 over SHAP explanations based +on content. However, when controlling for format, only OT outperformed SC, +suggesting this trend is driven by preferences for text. Other findings +corroborated that explanation format, rather than content, is often the +critical factor. 
-摘要:大型語言模型 (LLM) 的最新進展已展現出自動化機器學習任務的顯著潛力。然而,現有的基於 LLM 的代理通常會遇到低多樣性和次優代碼生成的問題。雖然最近的工作已引入蒙地卡羅樹搜尋 (MCTS) 來解決這些問題,但仍存在於所產生想法的品質和多樣性,以及用於節點選擇的標量值回饋機制中。在本研究中,我們介紹了內省蒙地卡羅樹搜尋 (I-MCTS),這是一種透過內省過程反覆擴展樹節點的新方法,該過程會細緻地分析來自父節點和同層節點的解決方案和結果。這有助於持續改善搜尋樹中的節點,進而增強整體決策制定過程。此外,我們整合了一個基於大型語言模型 (LLM) 的值模型,以便在進行全面運算展開之前直接評估每個節點的解決方案。實作了一種混合獎勵機制,以無縫地將 Q 值從 LLM 估計分數轉換為實際效能分數。這允許較高品質的節點更早被遍歷。應用於各種 ML 任務,我們的做法展示出比強大的開源 AutoML 代理高出 6% 的絕對效能提升,證明了其在增強代理式 AutoML 系統方面的有效性。 +摘要:由 AI 驅動的醫療保健工具被廣泛認為對醫療從業者和患者有潛在好處,例如用於癌症風險預測的 QCancer 回歸工具。然而,對於這些工具,如果要讓人們信賴,就需要補充說明。我們研究了說明的內容和格式如何影響使用者在解釋 QCancer 預測時的理解和信任。關於內容,我們部署了 SHAP 和 Occlusion-1。關於格式,我們以圖表 (SC) 的形式呈現 SHAP 說明,以圖表 (OC) 和文字 (OT) 的形式呈現 Occlusion-1 說明,因為它們的性質較為簡單。我們對兩組利害關係人進行了實驗:一般民眾(代表患者)和醫學生(代表醫療從業者)。我們的實驗結果顯示,基於內容,Occlusion-1 比 SHAP 說明具有更高的主觀理解和信任。然而,在控制格式時,只有 OT 優於 SC,這表明這種趨勢是由對文字的偏好所驅動的。其他發現證實了說明格式,而不是內容,通常是關鍵因素。 -##### **Bridging the Gap: Transforming Natural Language Questions into SQL Queries via Abstract Query Pattern and Contextual Schema Markup** -2502.14682v1 by Yonghui Kong, Hongbing Hu, Dan Zhang, Siyuan Chai, Fan Zhang, Wei Wang +##### **A Survey for Large Language Models in Biomedicine** +2409.00133v1 by Chong Wang, Mengyao Li, Junjun He, Zhongruo Wang, Erfan Darzi, Zan Chen, Jin Ye, Tianbin Li, Yanzhou Su, Jing Ke, Kaili Qu, Shuxin Li, Yi Yu, Pietro Liò, Tianyun Wang, Yu Guang Wang, Yiqing Shen -Large language models have demonstrated excellent performance in many tasks, -including Text-to-SQL, due to their powerful in-context learning capabilities. -They are becoming the mainstream approach for Text-to-SQL. However, these -methods still have a significant gap compared to human performance, especially -on complex questions. As the complexity of questions increases, the gap between -questions and SQLs increases. We identify two important gaps: the structural -mapping gap and the lexical mapping gap. 
To tackle these two gaps, we propose -PAS-SQL, an efficient SQL generation pipeline based on LLMs, which alleviates -gaps through Abstract Query Pattern (AQP) and Contextual Schema Markup (CSM). -AQP aims to obtain the structural pattern of the question by removing -database-related information, which enables us to find structurally similar -demonstrations. CSM aims to associate database-related text span in the -question with specific tables or columns in the database, which alleviates the -lexical mapping gap. Experimental results on the Spider and BIRD datasets -demonstrate the effectiveness of our proposed method. Specifically, PAS-SQL + -GPT-4o sets a new state-of-the-art on the Spider benchmark with an execution -accuracy of 87.9\%, and achieves leading results on the BIRD dataset with an -execution accuracy of 64.67\%. +Recent breakthroughs in large language models (LLMs) offer unprecedented +natural language understanding and generation capabilities. However, existing +surveys on LLMs in biomedicine often focus on specific applications or model +architectures, lacking a comprehensive analysis that integrates the latest +advancements across various biomedical domains. This review, based on an +analysis of 484 publications sourced from databases including PubMed, Web of +Science, and arXiv, provides an in-depth examination of the current landscape, +applications, challenges, and prospects of LLMs in biomedicine, distinguishing +itself by focusing on the practical implications of these models in real-world +biomedical contexts. Firstly, we explore the capabilities of LLMs in zero-shot +learning across a broad spectrum of biomedical tasks, including diagnostic +assistance, drug discovery, and personalized medicine, among others, with +insights drawn from 137 key studies. 
Then, we discuss adaptation strategies of +LLMs, including fine-tuning methods for both uni-modal and multi-modal LLMs to +enhance their performance in specialized biomedical contexts where zero-shot +fails to achieve, such as medical question answering and efficient processing +of biomedical literature. Finally, we discuss the challenges that LLMs face in +the biomedicine domain including data privacy concerns, limited model +interpretability, issues with dataset quality, and ethics due to the sensitive +nature of biomedical data, the need for highly reliable model outputs, and the +ethical implications of deploying AI in healthcare. To address these +challenges, we also identify future research directions of LLM in biomedicine +including federated learning methods to preserve data privacy and integrating +explainable AI methodologies to enhance the transparency of LLMs. + +摘要:大型語言模型 (LLM) 的最新突破提供了前所未有的自然語言理解和生成能力。然而,現有關於生物醫學中 LLM 的調查通常專注於特定應用或模型架構,缺乏整合各種生物醫學領域最新進展的全面分析。本綜述基於對來自 PubMed、Web of Science 和 arXiv 等數據庫的 484 篇出版物的分析,深入探討了生物醫學中 LLM 的當前現況、應用、挑戰和前景,其特點是關注這些模型在現實世界生物醫學背景中的實際應用。首先,我們探討了 LLM 在廣泛的生物醫學任務中的零次學習能力,包括診斷輔助、藥物發現和個性化醫療等,並從 137 項關鍵研究中汲取見解。然後,我們討論了 LLM 的適應策略,包括單模態和多模態 LLM 的微調方法,以增強它們在零次學習無法實現的專業生物醫學背景中的性能,例如醫療問題解答和生物醫學文獻的有效處理。最後,我們討論了 LLM 在生物醫學領域面臨的挑戰,包括數據隱私問題、模型可解釋性有限、數據集質量問題以及由於生物醫學數據的敏感性、對高度可靠模型輸出的需求以及在醫療保健中部署 AI 的倫理影響而產生的倫理問題。為了應對這些挑戰,我們還確定了生物醫學中 LLM 未來的研究方向,包括用於保護數據隱私的聯合學習方法以及整合可解釋 AI 方法以增強 LLM 的透明度。 + +##### **Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis** +2408.15121v1 by Francesco Sovrano, Michael Lognoul, Giulia Vilone + +Significant investment and development have gone into integrating Artificial +Intelligence (AI) in medical and healthcare applications, leading to advanced +control systems in medical technology. However, the opacity of AI systems +raises concerns about essential characteristics needed in such sensitive +applications, like transparency and trustworthiness. 
Our study addresses these +concerns by investigating a process for selecting the most adequate Explainable +AI (XAI) methods to comply with the explanation requirements of key EU +regulations in the context of smart bioelectronics for medical devices. The +adopted methodology starts with categorising smart devices by their control +mechanisms (open-loop, closed-loop, and semi-closed-loop systems) and delving +into their technology. Then, we analyse these regulations to define their +explainability requirements for the various devices and related goals. +Simultaneously, we classify XAI methods by their explanatory objectives. This +allows for matching legal explainability requirements with XAI explanatory +goals and determining the suitable XAI algorithms for achieving them. Our +findings provide a nuanced understanding of which XAI algorithms align better +with EU regulations for different types of medical devices. We demonstrate this +through practical case studies on different neural implants, from chronic +disease management to advanced prosthetics. This study fills a crucial gap in +aligning XAI applications in bioelectronics with stringent provisions of EU +regulations. It provides a practical framework for developers and researchers, +ensuring their AI innovations advance healthcare technology and adhere to legal +and ethical standards. 
-摘要:大型語言模型在許多任務中表現出色,包括文字轉 SQL,這歸功於它們強大的情境學習能力。它們正成為文字轉 SQL 的主流方法。然而,這些方法與人類的表現仍有顯著差距,特別是在複雜的問題上。隨著問題的複雜性增加,問題和 SQL 之間的差距也隨之增加。我們找出兩個重要的差距:結構對應差距和詞彙對應差距。為了解決這兩個差距,我們提出 PAS-SQL,一種基於 LLM 的高效 SQL 產生管道,它透過抽象查詢模式 (AQP) 和情境架構標記 (CSM) 來縮小差距。AQP 旨在透過移除與資料庫相關的資訊來取得問題的結構模式,這使我們能夠找到結構上相似的範例。CSM 旨在將問題中與資料庫相關的文字範圍與資料庫中的特定表格或欄位關聯起來,這可以縮小詞彙對應差距。在 Spider 和 BIRD 資料集上的實驗結果證明了我們所提出的方法的有效性。具體來說,PAS-SQL + GPT-4o 在 Spider 基準測試中設定了一個新的技術水準,執行準確度為 87.9%,並在 BIRD 資料集上取得領先的結果,執行準確度為 64.67%。 +摘要:人工智慧(AI)在醫療和保健應用中投入了大量的投資和開發,進而導致醫療技術中的先進控制系統。然而,AI 系統的不透明性引發了對此類敏感應用中所需基本特性的擔憂,例如透明度和可信度。我們的研究透過調查一個程序來解決這些問題,用於選擇最充分的可解釋 AI(XAI)方法,以符合歐盟法規在醫療器材的智慧型生物電子學中的說明要求。採用的方法從透過其控制機制(開迴路、閉迴路和半閉迴路系統)對智慧型裝置進行分類,並深入探討其技術開始。然後,我們分析這些法規以定義其對各種裝置和相關目標的可解釋性要求。同時,我們透過其說明目標對 XAI 方法進行分類。這允許將法律可解釋性要求與 XAI 說明目標相匹配,並確定適當的 XAI 演算法來達成它們。我們的研究結果提供了對哪些 XAI 演算法更符合歐盟法規以適用於不同類型的醫療器材的細緻理解。我們透過不同神經植入物的實際案例研究來證明這一點,從慢性疾病管理到先進的義肢。這項研究填補了將生物電子學中的 XAI 應用與歐盟法規的嚴格規定相符的重要空白。它為開發人員和研究人員提供了一個實用的架構,確保其 AI 創新能促進醫療技術並遵守法律和道德標準。 -##### **How to Get Your LLM to Generate Challenging Problems for Evaluation** -2502.14678v1 by Arkil Patel, Siva Reddy, Dzmitry Bahdanau +##### **Towards Case-based Interpretability for Medical Federated Learning** +2408.13626v1 by Laura Latorre, Liliana Petrychenko, Regina Beets-Tan, Taisiya Kopytova, Wilson Silva -The pace of evolution of Large Language Models (LLMs) necessitates new -approaches for rigorous and comprehensive evaluation. Traditional human -annotation is increasingly impracticable due to the complexities and costs -involved in generating high-quality, challenging problems. In this work, we -introduce CHASE, a unified framework to synthetically generate challenging -problems using LLMs without human involvement. For a given task, our approach -builds a hard problem in a bottom-up manner from simpler components. Moreover, -our framework decomposes the generation process into independently verifiable -sub-tasks, thereby ensuring a high level of quality and correctness. 
We -implement CHASE to create evaluation benchmarks across three diverse domains: -(1) document-based question answering, (2) repository-level code completion, -and (3) math reasoning. The performance of state-of-the-art LLMs on these -synthetic benchmarks lies in the range of 40-60% accuracy, thereby -demonstrating the effectiveness of our framework at generating challenging -problems. We publicly release our benchmarks and code. +We explore deep generative models to generate case-based explanations in a +medical federated learning setting. Explaining AI model decisions through +case-based interpretability is paramount to increasing trust and allowing +widespread adoption of AI in clinical practice. However, medical AI training +paradigms are shifting towards federated learning settings in order to comply +with data protection regulations. In a federated scenario, past data is +inaccessible to the current user. Thus, we use a deep generative model to +generate synthetic examples that protect privacy and explain decisions. Our +proof-of-concept focuses on pleural effusion diagnosis and uses publicly +available Chest X-ray data. 
-摘要:大型語言模型 (LLM) 的演化速度需要新的方法來進行嚴謹且全面的評估。由於產生高品質、具挑戰性的問題所涉及的複雜性和成本,傳統的人工標註正變得越來越不可行。在這項工作中,我們介紹了 CHASE,一個統一的框架,用於使用 LLM 合成產生具有挑戰性的問題,而無需人工參與。對於給定的任務,我們的做法是以自下而上的方式從更簡單的組成部分來建立一個困難的問題。此外,我們的框架將生成過程分解為獨立可驗證的子任務,從而確保高品質和正確性。我們實作 CHASE 來建立三個不同領域的評估基準:(1) 基於文件的問答、(2) 儲存庫層級的程式碼完成,以及 (3) 數學推理。最先進的 LLM 在這些合成基準上的效能落在 40-60% 的準確度範圍內,從而證明了我們的框架在產生具有挑戰性的問題上的有效性。我們公開發布我們的基準和程式碼。 +摘要:我們探索深度生成模型,在醫療聯邦學習設置中生成基於案例的說明。透過基於案例的可解釋性來解釋 AI 模型決策,對於增加信任並允許 AI 在臨床實務中廣泛採用至關重要。然而,醫療 AI 訓練範例正轉向聯邦學習設置,以符合資料保護法規。在聯邦情境中,過去的資料對目前的使用者而言是無法取得的。因此,我們使用深度生成模型來產生保護隱私和解釋決策的合成範例。我們的概念驗證著重於胸腔積液診斷,並使用公開可取得的胸部 X 光資料。 -##### **Data-Constrained Synthesis of Training Data for De-Identification** -2502.14677v1 by Thomas Vakili, Aron Henriksson, Hercules Dalianis +##### **AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines** +2408.12491v1 by Douwe J. Spaanderman, Matthew Marzetti, Xinyi Wan, Andrew F. Scarsbrook, Philip Robinson, Edwin H. G. Oei, Jacob J. Visser, Robert Hemke, Kirsten van Langevelde, David F. Hanff, Geert J. L. H. van Leenders, Cornelis Verhoef, Dirk J. Grünhagen, Wiro J. Niessen, Stefan Klein, Martijn P. A. Starmans -Many sensitive domains -- such as the clinical domain -- lack widely -available datasets due to privacy risks. The increasing generative capabilities -of large language models (LLMs) have made synthetic datasets a viable path -forward. In this study, we domain-adapt LLMs to the clinical domain and -generate synthetic clinical texts that are machine-annotated with tags for -personally identifiable information using capable encoder-based NER models. The -synthetic corpora are then used to train synthetic NER models. The results show -that training NER models using synthetic corpora incurs only a small drop in -predictive performance. The limits of this process are investigated in a -systematic ablation study -- using both Swedish and Spanish data. 
Our analysis -shows that smaller datasets can be sufficient for domain-adapting LLMs for data -synthesis. Instead, the effectiveness of this process is almost entirely -contingent on the performance of the machine-annotating NER models trained -using the original data. +Soft-tissue and bone tumours (STBT) are rare, diagnostically challenging +lesions with variable clinical behaviours and treatment approaches. This +systematic review provides an overview of Artificial Intelligence (AI) methods +using radiological imaging for diagnosis and prognosis of these tumours, +highlighting challenges in clinical translation, and evaluating study alignment +with the Checklist for AI in Medical Imaging (CLAIM) and the FUTURE-AI +international consensus guidelines for trustworthy and deployable AI to promote +the clinical translation of AI methods. The review covered literature from +several bibliographic databases, including papers published before 17/07/2024. +Original research in peer-reviewed journals focused on radiology-based AI for +diagnosing or prognosing primary STBT was included. Exclusion criteria were +animal, cadaveric, or laboratory studies, and non-English papers. Abstracts +were screened by two of three independent reviewers for eligibility. Eligible +papers were assessed against guidelines by one of three independent reviewers. +The search identified 15,015 abstracts, from which 325 articles were included +for evaluation. Most studies performed moderately on CLAIM, averaging a score +of 28.9$\pm$7.5 out of 53, but poorly on FUTURE-AI, averaging 5.1$\pm$2.1 out +of 30. Imaging-AI tools for STBT remain at the proof-of-concept stage, +indicating significant room for improvement. Future efforts by AI developers +should focus on design (e.g. define unmet clinical need, intended clinical +setting and how AI would be integrated in clinical workflow), development (e.g. +build on previous work, explainability), evaluation (e.g. 
evaluating and +addressing biases, evaluating AI against best practices), and data +reproducibility and availability (making documented code and data publicly +available). Following these recommendations could improve clinical translation +of AI methods. -摘要:許多敏感領域(例如臨床領域)由於隱私風險而缺乏廣泛可用的資料集。大型語言模型 (LLM) 不斷增強的生成能力已使合成資料集成為可行的途徑。在這項研究中,我們將領域適應 LLM 應用於臨床領域,並生成使用具備編碼器功能的 NER 模型以個人可識別資訊標籤進行機器標註的合成臨床文本。然後使用合成語料庫來訓練合成 NER 模型。結果顯示,使用合成語料庫訓練 NER 模型僅會導致預測效能略微下降。在系統消融研究中調查此程序的限制,同時使用瑞典語和西班牙語資料。我們的分析顯示,較小的資料集足以用於領域適應 LLM 以進行資料合成。相反地,此程序的有效性幾乎完全取決於使用原始資料訓練的機器標註 NER 模型的效能。 +摘要:軟組織和骨骼腫瘤(STBT)是罕見、診斷具有挑戰性的病灶,其臨床行為和治療方法各不相同。這篇系統性回顧提供了使用放射影像進行診斷和預後的人工智慧 (AI) 方法的概觀,重點說明了臨床轉譯的挑戰,並評估研究與醫療影像 AI 核查表 (CLAIM) 和 FUTURE-AI 可信賴且可部署 AI 的國際共識準則的一致性,以促進 AI 方法的臨床轉譯。這篇回顧涵蓋了幾個書目資料庫中的文獻,包括在 2024 年 7 月 17 日之前發表的論文。納入了以放射為基礎的 AI 診斷或預後原發性 STBT 的同行評審期刊中的原始研究。排除標準是動物、屍體或實驗室研究,以及非英文論文。摘要由三位獨立審查員中的兩位篩選資格。合格的論文由三位獨立審查員中的一位根據準則進行評估。搜索識別出 15,015 篇摘要,其中 325 篇文章被納入評估。大多數研究在 CLAIM 中表現中等,平均得分為 53 分中的 28.9±7.5 分,但在 FUTURE-AI 中表現不佳,平均得分為 30 分中的 5.1±2.1 分。STBT 的影像 AI 工具仍處於概念驗證階段,表明有顯著的改進空間。AI 開發人員未來的努力應集中在設計(例如定義未滿足的臨床需求、預期的臨床環境以及 AI 如何整合到臨床工作流程中)、開發(例如建立在先前的工作、可解釋性)、評估(例如評估和解決偏差、評估 AI 與最佳實務)、以及數據可複製性和可用性(公開提供文件化的代碼和數據)。遵循這些建議可以改善 AI 方法的臨床轉譯。 -##### **BP-SGCN: Behavioral Pseudo-Label Informed Sparse Graph Convolution Network for Pedestrian and Heterogeneous Trajectory Prediction** -2502.14676v1 by Ruochen Li, Stamos Katsigiannis, Tae-Kyun Kim, Hubert P. H. Shum +##### **Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy** +2409.00001v1 by Kimji N. Pellano, Inga Strümke, Daniel Groos, Lars Adde, Espen Alexander F. Ihlen -Trajectory prediction allows better decision-making in applications of -autonomous vehicles or surveillance by predicting the short-term future -movement of traffic agents. It is classified into pedestrian or heterogeneous -trajectory prediction. 
The former exploits the relatively consistent behavior -of pedestrians, but is limited in real-world scenarios with heterogeneous -traffic agents such as cyclists and vehicles. The latter typically relies on -extra class label information to distinguish the heterogeneous agents, but such -labels are costly to annotate and cannot be generalized to represent different -behaviors within the same class of agents. In this work, we introduce the -behavioral pseudo-labels that effectively capture the behavior distributions of -pedestrians and heterogeneous agents solely based on their motion features, -significantly improving the accuracy of trajectory prediction. To implement the -framework, we propose the Behavioral Pseudo-Label Informed Sparse Graph -Convolution Network (BP-SGCN) that learns pseudo-labels and informs to a -trajectory predictor. For optimization, we propose a cascaded training scheme, -in which we first learn the pseudo-labels in an unsupervised manner, and then -perform end-to-end fine-tuning on the labels in the direction of increasing the -trajectory prediction accuracy. Experiments show that our pseudo-labels -effectively model different behavior clusters and improve trajectory -prediction. Our proposed BP-SGCN outperforms existing methods using both -pedestrian (ETH/UCY, pedestrian-only SDD) and heterogeneous agent datasets -(SDD, Argoverse 1). +Early detection of Cerebral Palsy (CP) is crucial for effective intervention +and monitoring. This paper tests the reliability and applicability of +Explainable AI (XAI) methods using a deep learning method that predicts CP by +analyzing skeletal data extracted from video recordings of infant movements. +Specifically, we use XAI evaluation metrics -- namely faithfulness and +stability -- to quantitatively assess the reliability of Class Activation +Mapping (CAM) and Gradient-weighted Class Activation Mapping (Grad-CAM) in this +specific medical application. 
We utilize a unique dataset of infant movements +and apply skeleton data perturbations without distorting the original dynamics +of the infant movements. Our CP prediction model utilizes an ensemble approach, +so we evaluate the XAI metrics performances for both the overall ensemble and +the individual models. Our findings indicate that both XAI methods effectively +identify key body points influencing CP predictions and that the explanations +are robust against minor data perturbations. Grad-CAM significantly outperforms +CAM in the RISv metric, which measures stability in terms of velocity. In +contrast, CAM performs better in the RISb metric, which relates to bone +stability, and the RRS metric, which assesses internal representation +robustness. Individual models within the ensemble show varied results, and +neither CAM nor Grad-CAM consistently outperform the other, with the ensemble +approach providing a representation of outcomes from its constituent models. -摘要:軌跡預測允許在自動駕駛車輛或監視應用中做出更好的決策,藉由預測交通代理的短期未來移動。它被分類為行人或異質軌跡預測。前者利用行人相對一致的行為,但受限於與自行車騎士和車輛等異質交通代理的真實世界場景。後者通常依賴額外的類別標籤資訊來區分異質代理,但此類標籤的註解成本很高,且無法概括為表示同一類別代理中的不同行為。在這項工作中,我們引入了行為偽標籤,它僅根據行人和異質代理的運動特徵有效捕捉行為分佈,顯著提升軌跡預測的準確度。為實作架構,我們提出了行為偽標籤告知稀疏圖形卷積網路 (BP-SGCN),它學習偽標籤並告知軌跡預測器。針對最佳化,我們提出了一種串聯訓練方案,其中我們首先以非監督的方式學習偽標籤,然後在標籤上執行端到端微調,朝著提升軌跡預測準確度的方向進行。實驗顯示我們的偽標籤有效建模不同的行為叢集,並提升軌跡預測。我們提出的 BP-SGCN 使用行人 (ETH/UCY,僅限行人的 SDD) 和異質代理資料集 (SDD,Argoverse 1) 都優於現有方法。 +摘要:腦性麻痺 (CP) 的早期偵測對於有效的介入和監測至關重要。本文測試了可解釋 AI (XAI) 方法的可靠性和適用性,使用深度學習方法,透過分析從嬰兒動作影片記錄中提取的骨骼資料來預測 CP。具體來說,我們使用 XAI 評估指標(即忠實度和穩定性)來量化評估類別激活映射 (CAM) 和梯度加權類別激活映射 (Grad-CAM) 在這個特定醫療應用中的可靠性。我們利用一個獨特的嬰兒動作資料集,並應用骨骼資料擾動,而不會扭曲嬰兒動作的原始動力。我們的 CP 預測模型利用整體方法,因此我們評估了整體整體和個別模型的 XAI 指標表現。我們的研究結果表明,兩種 XAI 方法都能有效識別影響 CP 預測的關鍵身體部位,並且這些解釋對於微小的資料擾動具有魯棒性。Grad-CAM 在 RISv 指標中顯著優於 CAM,該指標衡量速度方面的穩定性。相比之下,CAM 在 RISb 指標中表現得更好,該指標與骨骼穩定性有關,而 RRS 指標則評估內部表示的魯棒性。整體中的個別模型顯示出不同的結果,CAM 和 Grad-CAM 都不一致地優於另一種,整體方法提供了其組成模型結果的表示。 -##### **Explanations of Deep Language Models Explain Language Representations in 
the Brain** -2502.14671v1 by Maryam Rahimi, Yadollah Yaghoobzadeh, Mohammad Reza Daliri +##### **MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy** +2408.11837v1 by Hanchen David Wang, Nibraas Khan, Anna Chen, Nilanjan Sarkar, Pamela Wisniewski, Meiyi Ma -Recent advances in artificial intelligence have given rise to large language -models (LLMs) that not only achieve human-like performance but also share -computational principles with the brain's language processing mechanisms. While -previous research has primarily focused on aligning LLMs' internal -representations with neural activity, we introduce a novel approach that -leverages explainable AI (XAI) methods to forge deeper connections between the -two domains. Using attribution methods, we quantified how preceding words -contribute to an LLM's next-word predictions and employed these explanations to -predict fMRI recordings from participants listening to the same narratives. Our -findings demonstrate that attribution methods robustly predict brain activity -across the language network, surpassing traditional internal representations in -early language areas. This alignment is hierarchical: early-layer explanations -correspond to the initial stages of language processing in the brain, while -later layers align with more advanced stages. Moreover, the layers more -influential on LLM next-word prediction$\unicode{x2014}$those with higher -attribution scores$\unicode{x2014}$exhibited stronger alignment with neural -activity. This work establishes a bidirectional bridge between AI and -neuroscience. First, we demonstrate that attribution methods offer a powerful -lens for investigating the neural mechanisms of language comprehension, -revealing how meaning emerges from preceding context. Second, we propose using -brain alignment as a metric to evaluate the validity of attribution methods, -providing a framework for assessing their biological plausibility. 
+Recent global estimates suggest that as many as 2.41 billion individuals have +health conditions that would benefit from rehabilitation services. Home-based +Physical Therapy (PT) faces significant challenges in providing interactive +feedback and meaningful observation for therapists and patients. To fill this +gap, we present MicroXercise, which integrates micro-motion analysis with +wearable sensors, providing therapists and patients with a comprehensive +feedback interface, including video, text, and scores. Crucially, it employs +multi-dimensional Dynamic Time Warping (DTW) and attribution-based explainable +methods to analyze the existing deep learning neural networks in monitoring +exercises, focusing on a high granularity of exercise. This synergistic +approach is pivotal, providing output matching the input size to precisely +highlight critical subtleties and movements in PT, thus transforming complex AI +analysis into clear, actionable feedback. By highlighting these micro-motions +in different metrics, such as stability and range of motion, MicroXercise +significantly enhances the understanding and relevance of feedback for +end-users. Comparative performance metrics underscore its effectiveness over +traditional methods, such as a 39% and 42% improvement in Feature Mutual +Information (FMI) and Continuity. MicroXercise is a step ahead in home-based +physical therapy, providing a technologically advanced and intuitively helpful +solution to enhance patient care and outcomes. 
-摘要:最近的人工智能的進展產生了大型語言模型 (LLM),它不僅達到類似人類的表現,還與大腦的語言處理機制共享計算原理。雖然先前的研究主要集中於將 LLM 的內部表徵與神經活動對齊,但我們引入了一種新穎的方法,該方法利用可解釋 AI (XAI) 方法在兩個域之間建立更深層的聯繫。使用歸因方法,我們量化了前一個單詞如何促成 LLM 的下一個單詞預測,並利用這些解釋來預測參與者在聆聽相同敘述時的大腦功能性磁共振造影 (fMRI) 記錄。我們的發現表明,歸因方法可以穩健地預測整個語言網路中的大腦活動,超越了早期語言區域中的傳統內部表徵。這種對齊是分層的:早期層次解釋對應於大腦中語言處理的初始階段,而後續層次則與更進階的階段對齊。此外,對 LLM 下一個單詞預測影響力較大的層次(即歸因分數較高的層次)表現出與神經活動更強的對齊。這項工作在 AI 與神經科學之間建立了一個雙向橋樑。首先,我們證明歸因方法提供了一個強大的視角,用於研究語言理解的神經機制,揭示意義如何從先前的脈絡中產生。其次,我們建議使用大腦對齊作為評估歸因方法有效性的指標,提供了一個評估其生物學合理性的框架。 +摘要:最近的全球估計表明,多達 24.1 億人有 +健康狀況可從復健服務中受益。居家 +物理治療 (PT) 在提供互動式 +回饋和有意義的觀察方面面臨重大挑戰,供治療師和患者使用。為了填補這 +個缺口,我們提出 MicroXercise,它將微動作分析與 +可穿戴式感測器整合在一起,為治療師和患者提供一個全面的 +回饋介面,包括影片、文字和分數。至關重要的是,它採用 +多維動態時間規整 (DTW) 和基於歸因的可解釋 +方法來分析監控運動中現有的深度學習神經網路,專注於運動的高粒度。這種協同 +方法至關重要,提供與輸入大小匹配的輸出,以精確地 +突出 PT 中關鍵的細微差別和動作,從而將複雜的 AI +分析轉換為清晰、可操作的回饋。透過在不同指標中突顯這些微動作,例如穩定性和動作範圍,MicroXercise +顯著提升最終使用者對回饋的理解和相關性。比較效能指標強調其優於 +傳統方法的有效性,例如特徵互惠資訊 (FMI) 和連續性分別提升了 39% 和 42%。MicroXercise 在居家 +物理治療方面更進一步,提供技術先進且直覺有用的 +解決方案,以提升患者照護和結果。 -##### **AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO** -2502.14669v1 by Alan Dao, Dinh Bach Vu +##### **The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development** +2408.05239v1 by Joshua Morriss, Tod Brindle, Jessica Bah Rösman, Daniel Reibsamen, Andreas Enz -Large Language Models (LLMs) have demonstrated impressive capabilities in -language processing, yet they often struggle with tasks requiring genuine -visual spatial reasoning. In this paper, we introduce a novel two-stage -training framework designed to equip standard LLMs with visual reasoning -abilities for maze navigation. First, we leverage Supervised Fine Tuning (SFT) -on a curated dataset of tokenized maze representations to teach the model to -predict step-by-step movement commands. 
Next, we apply Group Relative Policy -Optimization (GRPO)-a technique used in DeepSeekR1-with a carefully crafted -reward function to refine the model's sequential decision-making and encourage -emergent chain-of-thought behaviors. Experimental results on synthetically -generated mazes show that while a baseline model fails to navigate the maze, -the SFT-trained model achieves 86% accuracy, and further GRPO fine-tuning -boosts accuracy to 93%. Qualitative analyses reveal that GRPO fosters more -robust and self-corrective reasoning, highlighting the potential of our -approach to bridge the gap between language models and visual spatial tasks. -These findings offer promising implications for applications in robotics, -autonomous navigation, and other domains that require integrated visual and -sequential reasoning. +Systematic literature reviews are the highest quality of evidence in +research. However, the review process is hindered by significant resource and +data constraints. The Literature Review Network (LRN) is the first of its kind +explainable AI platform adhering to PRISMA 2020 standards, designed to automate +the entire literature review process. LRN was evaluated in the domain of +surgical glove practices using 3 search strings developed by experts to query +PubMed. A non-expert trained all LRN models. Performance was benchmarked +against an expert manual review. Explainability and performance metrics +assessed LRN's ability to replicate the experts' review. Concordance was +measured with the Jaccard index and confusion matrices. Researchers were +blinded to the other's results until study completion. Overlapping studies were +integrated into an LRN-generated systematic review. LRN models demonstrated +superior classification accuracy without expert training, achieving 84.78% and +85.71% accuracy. 
The highest performance model achieved high interrater +reliability (k = 0.4953) and explainability metrics, linking 'reduce', +'accident', and 'sharp' with 'double-gloving'. Another LRN model covered 91.51% +of the relevant literature despite diverging from the non-expert's judgments (k += 0.2174), with the terms 'latex', 'double' (gloves), and 'indication'. LRN +outperformed the manual review (19,920 minutes over 11 months), reducing the +entire process to 288.6 minutes over 5 days. This study demonstrates that +explainable AI does not require expert training to successfully conduct +PRISMA-compliant systematic literature reviews like an expert. LRN summarized +the results of surgical glove studies and identified themes that were nearly +identical to the clinical researchers' findings. Explainable AI can accurately +expedite our understanding of clinical practices, potentially revolutionizing +healthcare research. -摘要:大型語言模型(LLM)在語言處理方面展現出令人印象深刻的能力,但它們經常難以應付需要真正視覺空間推理的任務。在本文中,我們介紹了一種新穎的兩階段訓練架構,旨在為標準 LLM 提供迷宮導航的視覺推理能力。首先,我們在標記化迷宮表示的策展資料集上利用監督微調(SFT)來教導模型預測逐步移動指令。接下來,我們使用 DeepSeekR1 中使用的技術,即群體相對策略最佳化(GRPO),並搭配精心設計的獎勵函數來優化模型的順序決策制定,並鼓勵出現連貫的思考行為。在合成產生的迷宮上進行的實驗結果顯示,雖然基準模型無法導航迷宮,但經過 SFT 訓練的模型達到 86% 的準確度,而進一步的 GRPO 微調將準確度提升至 93%。定性分析顯示,GRPO 促進更強健且自我修正的推理,凸顯了我們的方法在彌合語言模型與視覺空間任務之間差距的潛力。這些發現為機器人、自主導航和其他需要整合視覺和順序推理的領域的應用提供了有希望的啟示。 +摘要:系統性文獻回顧是研究中證據品質最高的。然而,回顧過程受到顯著資源和資料限制的阻礙。文獻回顧網路 (LRN) 是第一個遵循 PRISMA 2020 標準的可解釋 AI 平台,旨在自動化整個文獻回顧過程。LRN 在外科手套實務領域中進行評估,使用專家開發的 3 個搜尋字串來查詢 PubMed。非專家訓練所有 LRN 模型。效能以專家手動回顧作為基準。可解釋性和效能指標評估 LRN 複製專家回顧的能力。一致性以 Jaccard 指數和混淆矩陣測量。研究人員在研究完成前對彼此的結果保密。重疊的研究整合到 LRN 生成的系統性回顧中。LRN 模型在沒有專家訓練的情況下展現出優異的分類準確率,達到 84.78% 和 85.71% 的準確率。效能最高的模型達到了高評分者間信賴度 (k = 0.4953) 和可解釋性指標,將「減少」、「意外」和「銳利」與「雙重戴手套」連結在一起。另一個 LRN 模型涵蓋了 91.51% 的相關文獻,儘管與非專家的判斷不同 (k = 0.2174),但包含了「乳膠」、「雙重」(手套)和「適應症」等詞彙。LRN 優於手動回顧(11 個月超過 19,920 分鐘),將整個過程縮短為 5 天超過 288.6 分鐘。這項研究顯示,可解釋的 AI 不需要專家訓練即可成功進行專家等級的 PRISMA 相容系統性文獻回顧。LRN 總結了外科手套研究的結果,並找出與臨床研究人員發現幾乎相同的主题。可解釋的 AI 可以準確地加快我們對臨床實務的理解,有潛力革新醫療保健研究。 
-##### **InstructAgent: Building User Controllable Recommender via LLM Agent** -2502.14662v1 by Wujiang Xu, Yunxiao Shi, Zujie Liang, Xuying Ning, Kai Mei, Kun Wang, Xi Zhu, Min Xu, Yongfeng Zhang +##### **Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns** +2408.02709v1 by Chi Him Ng -Traditional recommender systems usually take the user-platform paradigm, -where users are directly exposed under the control of the platform's -recommendation algorithms. However, the defect of recommendation algorithms may -put users in very vulnerable positions under this paradigm. First, many -sophisticated models are often designed with commercial objectives in mind, -focusing on the platform's benefits, which may hinder their ability to protect -and capture users' true interests. Second, these models are typically optimized -using data from all users, which may overlook individual user's preferences. -Due to these shortcomings, users may experience several disadvantages under the -traditional user-platform direct exposure paradigm, such as lack of control -over the recommender system, potential manipulation by the platform, echo -chamber effects, or lack of personalization for less active users due to the -dominance of active users during collaborative learning. Therefore, there is an -urgent need to develop a new paradigm to protect user interests and alleviate -these issues. Recently, some researchers have introduced LLM agents to simulate -user behaviors, these approaches primarily aim to optimize platform-side -performance, leaving core issues in recommender systems unresolved. To address -these limitations, we propose a new user-agent-platform paradigm, where agent -serves as the protective shield between user and recommender system that -enables indirect exposure. To this end, we first construct four recommendation -datasets, denoted as $\dataset$, along with user instructions for each record. 
+This study analyzes hybrid AI systems' design patterns and their +effectiveness in clinical decision-making using the boxology framework. It +categorizes and compares various architectures combining machine learning and +rule-based reasoning to provide insights into their structural foundations and +healthcare applications. Addressing two main questions, how to categorize these +systems against established design patterns and how to extract insights through +comparative analysis, the study uses design patterns from software engineering +to understand and optimize healthcare AI systems. Boxology helps identify +commonalities and create reusable solutions, enhancing these systems' +scalability, reliability, and performance. Five primary architectures are +examined: REML, MLRB, RBML, RMLT, and PERML. Each has unique strengths and +weaknesses, highlighting the need for tailored approaches in clinical tasks. +REML excels in high-accuracy prediction for datasets with limited data; MLRB in +handling large datasets and complex data integration; RBML in explainability +and trustworthiness; RMLT in managing high-dimensional data; and PERML, though +limited in analysis, shows promise in urgent care scenarios. The study +introduces four new patterns, creates five abstract categorization patterns, +and refines those five further to specific systems. These contributions enhance +Boxology's taxonomical organization and offer novel approaches to integrating +expert knowledge with machine learning. Boxology's structured, modular approach +offers significant advantages in developing and analyzing hybrid AI systems, +revealing commonalities, and promoting reusable solutions. In conclusion, this +study underscores hybrid AI systems' crucial role in advancing healthcare and +Boxology's potential to drive further innovation in AI integration, ultimately +improving clinical decision support and patient outcomes. 
-摘要:傳統推薦系統通常採用使用者-平台範例, -其中使用者直接暴露在平台推薦演算法的控制之下。然而,推薦演算法的缺陷可能會讓使用者在這個範例中處於非常脆弱的位置。首先,許多精密的模型通常在設計時就考慮到商業目標,專注於平台的利益,這可能會阻礙它們保護和掌握使用者真正興趣的能力。其次,這些模型通常使用所有使用者的資料進行最佳化,這可能會忽略個別使用者的偏好。由於這些缺點,使用者可能會在傳統使用者-平台直接暴露範例中遇到一些缺點,例如缺乏對推薦系統的控制、平台的潛在操縱、同溫層效應,或由於活躍使用者在協作學習中的主導地位而缺乏針對較不活躍使用者的個人化。因此,迫切需要開發一種新的範例來保護使用者利益並緩解這些問題。最近,一些研究人員引入了 LLM 代理程式來模擬使用者行為,這些方法主要旨在最佳化平台端的效能,而未解決推薦系統中的核心問題。為了解決這些限制,我們提出了一種新的使用者-代理程式-平台範例,其中代理程式作為使用者和推薦系統之間的保護盾,實現間接暴露。為此,我們首先構建了四個推薦資料集,表示為 $\dataset$,以及每條記錄的使用者說明。 +摘要:本研究使用盒子學框架分析混合人工智慧系統的設計模式及其在臨床決策中的有效性。它分類並比較結合機器學習和基於規則的推理的各種架構,以深入了解其結構基礎和醫療保健應用。針對兩個主要問題,如何根據既定的設計模式對這些系統進行分類,以及如何通過比較分析提取見解,本研究使用軟體工程中的設計模式來了解和優化醫療保健人工智慧系統。盒子學有助於識別共性並建立可重複使用的解決方案,從而增強這些系統的可擴充性、可靠性和效能。檢查了五種主要的架構:REML、MLRB、RBML、RMLT 和 PERML。每種架構都有獨特的優缺點,強調了在臨床任務中需要量身打造的方法。REML 在資料有限的資料集中表現出高精度的預測;MLRB 在處理大型資料集和複雜資料整合方面表現出色;RBML 在可解釋性和可信度方面表現出色;RMLT 在管理高維資料方面表現出色;而 PERML 儘管在分析方面有限,但在緊急照護場景中表現出潛力。本研究引入了四種新模式,建立了五種抽象分類模式,並進一步將這五種模式細化為具體的系統。這些貢獻增強了盒子學的分類組織,並提供了將專家知識與機器學習整合的新方法。盒子學的結構化、模組化方法在開發和分析混合人工智慧系統、揭示共性以及推廣可重複使用的解決方案方面具有顯著優勢。總之,本研究強調了混合人工智慧系統在推進醫療保健中的關鍵作用,以及盒子學在推動人工智慧整合進一步創新方面的潛力,最終改善臨床決策支援和患者的治療成果。 -##### **Edit Once, Update Everywhere: A Simple Framework for Cross-Lingual Knowledge Synchronization in LLMs** -2502.14645v1 by Yuchen Wu, Liang Ding, Li Shen, Dacheng Tao +##### **Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability** +2408.02706v1 by Masoud Muhammed Hassan -Knowledge editing allows for efficient adaptation of large language models -(LLMs) to new information or corrections without requiring full retraining. -However, prior methods typically focus on either single-language editing or -basic multilingual editing, failing to achieve true cross-linguistic knowledge -synchronization. To address this, we present a simple and practical -state-of-the-art (SOTA) recipe Cross-Lingual Knowledge Democracy Edit (X-KDE), -designed to propagate knowledge from a dominant language to other languages -effectively. 
Our X-KDE comprises two stages: (i) Cross-lingual Edition -Instruction Tuning (XE-IT), which fine-tunes the model on a curated parallel -dataset to modify in-scope knowledge while preserving unrelated information, -and (ii) Target-language Preference Optimization (TL-PO), which applies -advanced optimization techniques to ensure consistency across languages, -fostering the transfer of updates. Additionally, we contribute a high-quality, -cross-lingual dataset, specifically designed to enhance knowledge transfer -across languages. Extensive experiments on the Bi-ZsRE and MzsRE benchmarks -show that X-KDE significantly enhances cross-lingual performance, achieving an -average improvement of +8.19%, while maintaining high accuracy in monolingual -settings. +Because of its strong predictive skills, deep learning has emerged as an +essential tool in many industries, including healthcare. Traditional deep +learning models, on the other hand, frequently lack interpretability and fail +to take prediction uncertainty into account, two crucial components of clinical +decision making. In order to produce explainable and uncertainty aware +predictions, this study presents a novel framework called Bayesian Kolmogorov +Arnold Networks (BKANs), which combines the expressive capacity of Kolmogorov +Arnold Networks with Bayesian inference. We employ BKANs on two medical +datasets, which are widely used benchmarks for assessing machine learning +models in medical diagnostics: the Pima Indians Diabetes dataset and the +Cleveland Heart Disease dataset. Our method provides useful insights into +prediction confidence and decision boundaries and outperforms traditional deep +learning models in terms of prediction accuracy. Moreover, BKANs' capacity to +represent aleatoric and epistemic uncertainty guarantees doctors receive more +solid and trustworthy decision support. 
Our Bayesian strategy improves the +interpretability of the model and considerably minimises overfitting, which is +important for tiny and imbalanced medical datasets, according to experimental +results. We present possible expansions to further use BKANs in more +complicated multimodal datasets and address the significance of these +discoveries for future research in building reliable AI systems for healthcare. +This work paves the way for a new paradigm in deep learning model deployment in +vital sectors where transparency and reliability are crucial. -摘要:知識編輯允許大語言模型 (LLM) 有效地適應新資訊或修正,而無需進行完整的再訓練。 -然而,先前的做法通常專注於單一語言編輯或基本的語音編輯,未能實現真正的跨語言知識同步。為了解決這個問題,我們提出了一個簡單且實用的最先進 (SOTA) 配方,即跨語言知識民主編輯 (X-KDE),旨在有效地從主導語言傳播知識到其他語言。我們的 X-KDE 包含兩個階段:(i) 跨語言版本指令調整 (XE-IT),它微調模型,在經過整理的平行資料集上修改範圍內的知識,同時保留不相關的資訊,以及 (ii) 目標語言偏好最佳化 (TL-PO),它應用先進的最佳化技術,以確保跨語言的一致性,促進更新的傳輸。此外,我們貢獻了一個高品質的跨語言資料集,特別設計用於增強跨語言的知識傳輸。在 Bi-ZsRE 和 MzsRE 基準上的廣泛實驗表明,X-KDE 大幅提升了跨語言效能,在單語言設定中維持高準確度的同時,平均提升了 +8.19%。 +摘要:由於其強大的預測能力,深度學習已成為許多產業中不可或缺的工具,包括醫療保健。然而,傳統的深度學習模型通常缺乏可解釋性,並且忽略了將預測不確定性納入考量,而這兩個因素是臨床決策制定的關鍵組成部分。為了產生可解釋且具有不確定性意識的預測,本研究提出了一個名為貝氏柯爾莫哥洛夫阿諾德網路 (BKAN) 的新架構,它結合了柯爾莫哥洛夫阿諾德網路的表達能力與貝氏推論。我們在兩個醫學資料集上使用 BKAN,這些資料集是評估機器學習模型在醫學診斷中的廣泛使用基準:皮馬印第安人糖尿病資料集和克里夫蘭心臟病資料集。我們的模型提供了對預測信心和決策邊界的有益見解,並且在預測準確度方面優於傳統的深度學習模型。此外,BKAN 表現隨機和認識不確定性的能力,可確保醫生獲得更可靠且值得信賴的決策支援。根據實驗結果,我們的貝氏策略提高了模型的可解釋性,並大幅減少了過度擬合,這對於小型且不平衡的醫學資料集非常重要。我們提出了可能的擴充功能,以進一步將 BKAN 用於更複雜的多模式資料集,並探討這些發現對於未來建立可靠的醫療保健 AI 系統研究的重要性。這項工作為深度學習模型部署在透明度和可靠性至關重要的重要領域中開啟了一個新的典範。 -##### **LIFT: Improving Long Context Understanding of Large Language Models through Long Input Fine-Tuning** -2502.14644v1 by Yansheng Mao, Yufei Xu, Jiaqi Li, Fanxu Meng, Haotong Yang, Zilong Zheng, Xiyuan Wang, Muhan Zhang +##### **MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI** +2407.20284v1 by Shyam Dongre, Ritesh Chandra, Sonali Agarwal -Long context understanding remains challenging for large language models due 
-to their limited context windows. This paper presents Long Input Fine-Tuning -(LIFT), a novel framework for long-context modeling that can improve the -long-context performance of arbitrary (short-context) LLMs by dynamically -adapting model parameters based on the long input. Importantly, LIFT, rather -than endlessly extending the context window size to accommodate increasingly -longer inputs in context, chooses to store and absorb the long input in -parameter. By fine-tuning the long input into model parameters, LIFT allows -short-context LLMs to answer questions even when the required information is -not provided in the context during inference. Furthermore, to enhance LIFT -performance while maintaining the original in-context learning (ICL) -capabilities, we introduce Gated Memory, a specialized attention adapter that -automatically balances long input memorization and ICL. We provide a -comprehensive analysis of the strengths and limitations of LIFT on long context -understanding, offering valuable directions for future research. +In modern healthcare, addressing the complexities of accurate disease +prediction and personalized recommendations is both crucial and challenging. +This research introduces MLtoGAI, which integrates Semantic Web technology with +Machine Learning (ML) to enhance disease prediction and offer user-friendly +explanations through ChatGPT. The system comprises three key components: a +reusable disease ontology that incorporates detailed knowledge about various +diseases, a diagnostic classification model that uses patient symptoms to +detect specific diseases accurately, and the integration of Semantic Web Rule +Language (SWRL) with ontology and ChatGPT to generate clear, personalized +health advice. This approach significantly improves prediction accuracy and +ensures results that are easy to understand, addressing the complexity of +diseases and diverse symptoms. 
The MLtoGAI system demonstrates substantial +advancements in accuracy and user satisfaction, contributing to developing more +intelligent and accessible healthcare solutions. This innovative approach +combines the strengths of ML algorithms with the ability to provide +transparent, human-understandable explanations through ChatGPT, achieving +significant improvements in prediction accuracy and user comprehension. By +leveraging semantic technology and explainable AI, the system enhances the +accuracy of disease prediction and ensures that the recommendations are +relevant and easily understood by individual patients. Our research highlights +the potential of integrating advanced technologies to overcome existing +challenges in medical diagnostics, paving the way for future developments in +intelligent healthcare systems. Additionally, the system is validated using 200 +synthetic patient data records, ensuring robust performance and reliability. -摘要:由於大型語言模型的上下文視窗有限,因此對於它們而言,長語境理解仍然具有挑戰性。本文提出了長輸入微調 (LIFT),這是一個用於長語境建模的新穎架構,它可以通過根據長輸入動態調整模型參數來改善任意(短語境)LLM 的長語境效能。重要的是,LIFT 沒有無限擴充上下文視窗大小以容納語境中越來越長的輸入,而是選擇將長輸入儲存在參數中並吸收它。通過將長輸入微調到模型參數中,LIFT 允許短語境 LLM 回答問題,即使在推理期間語境中沒有提供所需資訊也是如此。此外,為了在保持原始語境中學習 (ICL) 能力的同時增強 LIFT 效能,我們引入了閘控記憶體,這是一個自動平衡長輸入記憶和 ICL 的特殊注意力適配器。我們對 LIFT 在長語境理解方面的優缺點進行了全面的分析,為未來的研究提供了有價值的方向。 +摘要:在現代醫療保健中,解決準確疾病預測和個性化建議的複雜性既至關重要又具有挑戰性。本研究引入了 MLtoGAI,它將語義網路技術與機器學習 (ML) 相結合,以增強疾病預測並透過 ChatGPT 提供使用者友善的說明。該系統包含三個關鍵組成部分:一個可重複使用的疾病本体,其中包含有關各種疾病的詳細知識;一個診斷分類模型,它使用患者症狀來準確檢測特定疾病;以及語義網路規則語言 (SWRL) 與本体和 ChatGPT 的整合,以產生清晰、個性化的健康建議。這種方法顯著提高了預測準確性,並確保了易於理解的結果,解決了疾病和不同症狀的複雜性。MLtoGAI 系統展示了準確性和使用者滿意度的實質性進步,有助於開發更智慧且更易於取得的醫療保健解決方案。這種創新的方法結合了 ML 演算法的優點,以及透過 ChatGPT 提供透明且人類可以理解的說明的能力,在預測準確性和使用者理解方面取得了顯著的進步。透過利用語義技術和可解釋的 AI,該系統提高了疾病預測的準確性,並確保了建議與個別患者相關且易於理解。我們的研究強調了整合先進技術以克服醫療診斷中現有挑戰的潛力,為智慧醫療保健系統的未來發展鋪路。此外,該系統使用 200 個合成患者資料記錄進行驗證,確保了穩健的效能和可靠性。 -##### **Length-Controlled Margin-Based Preference Optimization without Reference Model** -2502.14643v1 by Gengxu Li, Tingyu Xia, Yi Chang, Yuan 
Wu +##### **Introducing δ-XAI: a novel sensitivity-based method for local AI explanations** +2407.18343v2 by Alessandro De Carlo, Enea Parimbelli, Nicola Melillo, Giovanna Nicora -Direct Preference Optimization (DPO) is a widely adopted offline algorithm -for preference-based reinforcement learning from human feedback (RLHF), -designed to improve training simplicity and stability by redefining reward -functions. However, DPO is hindered by several limitations, including length -bias, memory inefficiency, and probability degradation. To address these -challenges, we propose Length-Controlled Margin-Based Preference Optimization -(LMPO), a more efficient and robust alternative. LMPO introduces a uniform -reference model as an upper bound for the DPO loss, enabling a more accurate -approximation of the original optimization objective. Additionally, an average -log-probability optimization strategy is employed to minimize discrepancies -between training and inference phases. A key innovation of LMPO lies in its -Length-Controlled Margin-Based loss function, integrated within the -Bradley-Terry framework. This loss function regulates response length while -simultaneously widening the margin between preferred and rejected outputs. By -doing so, it mitigates probability degradation for both accepted and discarded -responses, addressing a significant limitation of existing methods. We evaluate -LMPO against state-of-the-art preference optimization techniques on two -open-ended large language models, Mistral and LLaMA3, across six conditional -benchmarks. Our experimental results demonstrate that LMPO effectively controls -response length, reduces probability degradation, and outperforms existing -approaches. The code is available at \url{https://github.com/gengxuli/LMPO}. +Explainable Artificial Intelligence (XAI) is central to the debate on +integrating Artificial Intelligence (AI) and Machine Learning (ML) algorithms +into clinical practice. 
High-performing AI/ML models, such as ensemble learners +and deep neural networks, often lack interpretability, hampering clinicians' +trust in their predictions. To address this, XAI techniques are being developed +to describe AI/ML predictions in human-understandable terms. One promising +direction is the adaptation of sensitivity analysis (SA) and global sensitivity +analysis (GSA), which inherently rank model inputs by their impact on +predictions. Here, we introduce a novel delta-XAI method that provides local +explanations of ML model predictions by extending the delta index, a GSA +metric. The delta-XAI index assesses the impact of each feature's value on the +predicted output for individual instances in both regression and classification +problems. We formalize the delta-XAI index and provide code for its +implementation. The delta-XAI method was evaluated on simulated scenarios using +linear regression models, with Shapley values serving as a benchmark. Results +showed that the delta-XAI index is generally consistent with Shapley values, +with notable discrepancies in models with highly impactful or extreme feature +values. The delta-XAI index demonstrated higher sensitivity in detecting +dominant features and handling extreme feature values. Qualitatively, the +delta-XAI provides intuitive explanations by leveraging probability density +functions, making feature rankings clearer and more explainable for +practitioners. Overall, the delta-XAI method appears promising for robustly +obtaining local explanations of ML model predictions. Further investigations in +real-world clinical settings will be conducted to evaluate its impact on +AI-assisted clinical workflows. 
-摘要:直接偏好優化 (DPO) 是一種廣泛採用的離線演算法,用於從人類回饋 (RLHF) 中進行基於偏好的強化學習,旨在透過重新定義獎勵函數來提升訓練的簡潔性和穩定性。然而,DPO 受到若干限制的阻礙,包括長度偏差、記憶體效率低下和機率下降。為了解決這些挑戰,我們提出長度控制邊際偏好優化 (LMPO),一種更有效率且穩健的替代方案。LMPO 引入統一參考模型作為 DPO 損失的上限,能夠更準確地近似原始最佳化目標。此外,採用平均對數機率最佳化策略來最小化訓練和推論階段之間的差異。LMPO 的一項關鍵創新在於其長度控制邊際損失函數,整合在 Bradley-Terry 架構中。此損失函數調節回應長度,同時擴大偏好和拒絕輸出之間的邊際。藉由這麼做,它減輕了已接受和已捨棄回應的機率下降,解決了現有方法的重大限制。我們在兩個開放式大型語言模型 Mistral 和 LLaMA3 上,針對六個條件基準,評估 LMPO 與最先進的偏好優化技術。我們的實驗結果證明,LMPO 有效控制回應長度,減少機率下降,並優於現有方法。程式碼可在 \url{https://github.com/gengxuli/LMPO} 取得。 +摘要:可解釋人工智慧 (XAI) 是將人工智慧 (AI) 和機器學習 (ML) 演算法整合到臨床實務中的辯論核心。高執行效能的 AI/ML 模型,例如整體學習器和深度神經網路,通常缺乏可解釋性,阻礙臨床醫生對其預測的信任。為了解決這個問題,正在開發 XAI 技術,以人類可以理解的術語描述 AI/ML 預測。一個有希望的方向是採用敏感度分析 (SA) 和全球敏感度分析 (GSA),它們本質上會依據模型輸入對預測的影響來對其進行排名。在此,我們介紹一種新的 delta-XAI 方法,透過擴充 GSA 指標 delta 指數來提供 ML 模型預測的局部解釋。delta-XAI 指數評估每個特徵值對回歸和分類問題中個別例項的預測輸出之影響。我們將 delta-XAI 指數形式化,並提供其實作的程式碼。使用線性回歸模型對模擬情境評估 delta-XAI 方法,並以 Shapley 值作為基準。結果顯示 delta-XAI 指數通常與 Shapley 值一致,但在具有高度影響力或極端特徵值的模型中存在顯著差異。delta-XAI 指數在偵測主要特徵和處理極端特徵值方面表現出更高的敏感度。定性地來說,delta-XAI 透過利用機率密度函數提供直觀的解釋,使特徵排名更清晰且對從業人員來說更具可解釋性。總體而言,delta-XAI 方法對於穩健地取得 ML 模型預測的局部解釋似乎很有希望。將在真實世界的臨床環境中進行進一步調查,以評估其對 AI 輔助臨床工作流程的影響。 -##### **How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation** -2502.14642v1 by Rui Li, Heming Xia, Xinfeng Yuan, Qingxiu Dong, Lei Sha, Wenjie Li, Zhifang Sui +##### **Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population** +2407.17324v2 by Nikolaos Ntampakis, Konstantinos Diamantaras, Ioanna Chouvarda, Vasileios Argyriou, Panagiotis Sarigianndis -Recently, LLMs have garnered increasing attention across academic disciplines -for their potential as human digital twins, virtual proxies designed to -replicate individuals and autonomously perform tasks such as decision-making, -problem-solving, and reasoning on their behalf. 
However, current evaluations of -LLMs primarily emphasize dialogue simulation while overlooking human behavior -simulation, which is crucial for digital twins. To address this gap, we -introduce BehaviorChain, the first benchmark for evaluating LLMs' ability to -simulate continuous human behavior. BehaviorChain comprises diverse, -high-quality, persona-based behavior chains, totaling 15,846 distinct behaviors -across 1,001 unique personas, each with detailed history and profile metadata. -For evaluation, we integrate persona metadata into LLMs and employ them to -iteratively infer contextually appropriate behaviors within dynamic scenarios -provided by BehaviorChain. Comprehensive evaluation results demonstrated that -even state-of-the-art models struggle with accurately simulating continuous -human behavior. +Dementia, a debilitating neurological condition affecting millions worldwide, +presents significant diagnostic challenges. In this work, we introduce a novel +methodology for the classification of demented and non-demented elderly +patients using 3D brain Magnetic Resonance Imaging (MRI) scans. Our approach +features a unique technique for selectively processing MRI slices, focusing on +the most relevant brain regions and excluding less informative sections. This +methodology is complemented by a confidence-based classification committee +composed of three custom deep learning models: Dem3D ResNet, Dem3D CNN, and +Dem3D EfficientNet. These models work synergistically to enhance +decision-making accuracy, leveraging their collective strengths. Tested on the +Open Access Series of Imaging Studies(OASIS) dataset, our method achieved an +impressive accuracy of 94.12%, surpassing existing methodologies. Furthermore, +validation on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset +confirmed the robustness and generalizability of our approach. 
The use of +explainable AI (XAI) techniques and comprehensive ablation studies further +substantiate the effectiveness of our techniques, providing insights into the +decision-making process and the importance of our methodology. This research +offers a significant advancement in dementia diagnosis, providing a highly +accurate and efficient tool for clinical applications. + +摘要:失智症是一種影響全球數百萬人的衰弱性神經疾病,在診斷上具有重大挑戰。在這項工作中,我們提出了一種新的方法,用於對失智和非失智老年患者進行分類,使用 3D 大腦磁振造影 (MRI) 掃描。我們的做法採用了一種獨特技術,用於選擇性處理 MRI 切片,重點關注最相關的大腦區域,並排除信息量較少的部分。這種方法由一個基於信心的分類委員會補充,該委員會由三個自定義深度學習模型組成:Dem3D ResNet、Dem3D CNN 和 Dem3D EfficientNet。這些模型協同工作以增強決策的準確性,利用它們的集體優勢。在影像研究開放存取系列 (OASIS) 資料集上進行測試,我們的模型達到了 94.12% 的驚人準確度,超過了現有方法。此外,在阿茲海默症神經影像倡議 (ADNI) 資料集上的驗證證實了我們方法的穩健性和普遍性。可解釋 AI (XAI) 技術和全面的消融研究進一步證實了我們技術的有效性,提供了對決策過程和我們方法重要性的見解。這項研究為失智症診斷提供了重大進展,為臨床應用提供了一個高度準確且高效的工具。 + +##### **Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition** +2408.06352v1 by Michele Fiori, Gabriele Civitarese, Claudio Bettini + +Recognizing daily activities with unobtrusive sensors in smart environments +enables various healthcare applications. Monitoring how subjects perform +activities at home and their changes over time can reveal early symptoms of +health issues, such as cognitive decline. Most approaches in this field use +deep learning models, which are often seen as black boxes mapping sensor data +to activities. However, non-expert users like clinicians need to trust and +understand these models' outputs. Thus, eXplainable AI (XAI) methods for Human +Activity Recognition have emerged to provide intuitive natural language +explanations from these models. Different XAI methods generate different +explanations, and their effectiveness is typically evaluated through user +surveys, that are often challenging in terms of costs and fairness. 
This paper +proposes an automatic evaluation method using Large Language Models (LLMs) to +identify, in a pool of candidates, the best XAI approach for non-expert users. +Our preliminary results suggest that LLM evaluation aligns with user surveys. -摘要:最近,LLM 在各個學科中備受關注,因為它們具有作為人類數位雙胞胎的潛力,也就是虛擬代理人,旨在複製個人並自主執行任務,例如代表他們進行決策、解決問題和推理。然而,LLM 目前的評估主要強調對話模擬,同時忽視了人類行為模擬,這對數位雙胞胎至關重要。為了解決這個差距,我們引入了 BehaviorChain,這是第一個用於評估 LLM 模擬連續人類行為能力的基準。BehaviorChain 包含多樣化、高品質、基於角色的行為鏈,總共涵蓋 1,001 個獨特角色的 15,846 種不同行為,每個角色都有詳細的歷史和個人資料元數據。在評估中,我們將角色元數據整合到 LLM 中,並使用它們在 BehaviorChain 提供的動態場景中反覆推斷出在情境中適當的行為。全面的評估結果表明,即使是最先進的模型在準確模擬連續人類行為方面也存在困難。 +摘要:藉由智慧環境中不引人注目的感測器辨識日常活動,能啟用各種醫療保健應用。監控受試者在家中如何執行活動,以及其隨著時間的變化,可以揭示健康問題的早期症狀,例如認知能力下降。此領域中的大多數方法都使用深度學習模型,這些模型通常被視為將感測器資料對應至活動的黑盒子。然而,非專家使用者(例如臨床醫師)需要信任並了解這些模型的輸出。因此,人類活動辨識的可解釋 AI (XAI) 方法應運而生,以提供來自這些模型的直覺自然語言說明。不同的 XAI 方法會產生不同的說明,而其有效性通常透過使用者調查來評估,這在成本和公平性方面通常具有挑戰性。本文提出使用大型語言模型 (LLM) 的自動評估方法,以在候選者中找出最適合非專家使用者的 XAI 方法。我們的初步結果表明,LLM 評估與使用者調查一致。 -##### **NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization** -2502.14638v1 by Zheyuan Zhang, Runze Li, Tasnim Kabir, Jordan Boyd-Graber +##### **Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions** +2408.03335v1 by Naseem Khan, Kashif Ahmad, Aref Al Tamimi, Mohammed M. Alani, Amine Bermak, Issa Khalil -Image geo-localization is the task of predicting the specific location of an -image and requires complex reasoning across visual, geographical, and cultural -contexts. While prior Vision Language Models (VLMs) have the best accuracy at -this task, there is a dearth of high-quality datasets and models for analytical -reasoning. We first create NaviClues, a high-quality dataset derived from -GeoGuessr, a popular geography game, to supply examples of expert reasoning -from language. 
Using this dataset, we present Navig, a comprehensive image -geo-localization framework integrating global and fine-grained image -information. By reasoning with language, Navig reduces the average distance -error by 14% compared to previous state-of-the-art models while requiring fewer -than 1000 training samples. Our dataset and code are available at -https://github.com/SparrowZheyuan18/Navig/. +Industry 5.0, which focuses on human and Artificial Intelligence (AI) +collaboration for performing different tasks in manufacturing, involves a +higher number of robots, Internet of Things (IoTs) devices and +interconnections, Augmented/Virtual Reality (AR), and other smart devices. The +huge involvement of these devices and interconnection in various critical +areas, such as economy, health, education and defense systems, poses several +types of potential security flaws. AI itself has been proven a very effective +and powerful tool in different areas of cybersecurity, such as intrusion +detection, malware detection, and phishing detection, among others. Just as in +many application areas, cybersecurity professionals were reluctant to accept +black-box ML solutions for cybersecurity applications. This reluctance pushed +forward the adoption of eXplainable Artificial Intelligence (XAI) as a tool +that helps explain how decisions are made in ML-based systems. In this survey, +we present a comprehensive study of different XAI-based intrusion detection +systems for industry 5.0, and we also examine the impact of explainability and +interpretability on Cybersecurity practices through the lens of Adversarial +XIDS (Adv-XIDS) approaches. Furthermore, we analyze the possible opportunities +and challenges in XAI cybersecurity systems for industry 5.0 that elicit future +research toward XAI-based solutions to be adopted by high-stakes industry 5.0 +applications. 
We believe this rigorous analysis will establish a foundational +framework for subsequent research endeavors within the specified domain. -摘要:影像地理定位是預測影像特定位置的任務,需要跨視覺、地理和文化脈絡進行複雜的推理。雖然先前的視覺語言模型 (VLM) 在此任務中擁有最佳準確度,但缺乏高品質的資料集和分析推理模型。我們首先建立 NaviClues,這是一個源自 GeoGuessr 的高品質資料集,GeoGuessr 是一款流行的地理遊戲,可提供來自語言的專家推理範例。使用此資料集,我們提出 Navig,這是一個綜合性的影像地理定位架構,整合了全球和細緻的影像資訊。透過語言推理,Navig 將平均距離誤差減少了 14%,與先前的最先進模型相比,同時只需要不到 1000 個訓練樣本。我們的資料集和程式碼可在 https://github.com/SparrowZheyuan18/Navig/ 取得。 +摘要:工業 5.0 著重於人類與人工智慧 (AI) 合作執行製造中的不同任務,涉及更多機器人、物聯網 (IoT) 裝置和互連、擴增/虛擬實境 (AR) 和其他智慧裝置。這些裝置和互連在經濟、醫療保健、教育和國防系統等各種關鍵領域的廣泛參與,引發了多種類型的潛在安全漏洞。AI 本身已被證明是網路安全不同領域中非常有效且強大的工具,例如入侵偵測、惡意軟體偵測和網路釣魚偵測等。就像在許多應用領域一樣,網路安全專業人員不願意接受黑盒 ML 解決方案來應用於網路安全。這種不願意促使可解釋人工智慧 (XAI) 作為一種工具被採用,有助於說明在基於 ML 的系統中如何做出決策。在這項調查中,我們對工業 5.0 的不同基於 XAI 的入侵偵測系統進行了全面的研究,並且我們也透過對抗式 XIDS (Adv-XIDS) 方法的觀點來探討可解釋性和可詮釋性對網路安全實務的影響。此外,我們分析了工業 5.0 的 XAI 網路安全系統中可能存在的機會和挑戰,引發了未來針對 XAI 基礎解決方案的研究,以供高風險的工業 5.0 應用採用。我們相信這項嚴謹的分析將為指定領域內的後續研究工作建立基礎架構。 -##### **ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation** -2502.14637v1 by Angxiao Yue, Zichong Wang, Hongteng Xu +##### **A Comparative Study on Automatic Coding of Medical Letters with Explainability** +2407.13638v1 by Jamie Glen, Lifeng Han, Paul Rayson, Goran Nenadic -Protein backbone generation plays a central role in de novo protein design -and is significant for many biological and medical applications. Although -diffusion and flow-based generative models provide potential solutions to this -challenging task, they often generate proteins with undesired designability and -suffer computational inefficiency. In this study, we propose a novel rectified -quaternion flow (ReQFlow) matching method for fast and high-quality protein -backbone generation. 
In particular, our method generates a local translation -and a 3D rotation from random noise for each residue in a protein chain, which -represents each 3D rotation as a unit quaternion and constructs its flow by -spherical linear interpolation (SLERP) in an exponential format. We train the -model by quaternion flow (QFlow) matching with guaranteed numerical stability -and rectify the QFlow model to accelerate its inference and improve the -designability of generated protein backbones, leading to the proposed ReQFlow -model. Experiments show that ReQFlow achieves state-of-the-art performance in -protein backbone generation while requiring much fewer sampling steps and -significantly less inference time (e.g., being 37x faster than RFDiffusion and -62x faster than Genie2 when generating a backbone of length 300), demonstrating -its effectiveness and efficiency. The code is available at -https://github.com/AngxiaoYue/ReQFlow. +This study aims to explore the implementation of Natural Language Processing +(NLP) and machine learning (ML) techniques to automate the coding of medical +letters with visualised explainability and light-weighted local computer +settings. Currently in clinical settings, coding is a manual process that +involves assigning codes to each condition, procedure, and medication in a +patient's paperwork (e.g., 56265001 heart disease using SNOMED CT code). There +are preliminary research on automatic coding in this field using +state-of-the-art ML models; however, due to the complexity and size of the +models, the real-world deployment is not achieved. To further facilitate the +possibility of automatic coding practice, we explore some solutions in a local +computer setting; in addition, we explore the function of explainability for +transparency of AI models. We used the publicly available MIMIC-III database +and the HAN/HLAN network models for ICD code prediction purposes. 
We also +experimented with the mapping between ICD and SNOMED CT knowledge bases. In our +experiments, the models provided useful information for 97.98\% of codes. The +result of this investigation can shed some light on implementing automatic +clinical coding in practice, such as in hospital settings, on the local +computers used by clinicians. Project page: +\url{https://github.com/Glenj01/Medical-Coding}. -摘要:蛋白骨架生成在從頭蛋白質設計中扮演核心角色,且對於許多生物和醫學應用來說意義重大。儘管擴散和基於流的生成模型提供了解決此項挑戰性任務的潛在方案,但它們經常生成具有不受歡迎的可設計性的蛋白質,且遭受運算效率不彰之苦。在本研究中,我們提出了一種新穎的修正四元數流 (ReQFlow) 匹配方法,用於快速且高品質的蛋白質骨架生成。特別是,我們的模型會為蛋白質鏈中的每個殘基從隨機雜訊中生成一個局部平移和一個 3D 旋轉,將每個 3D 旋轉表示為單位四元數,並以指數格式透過球面線性插值 (SLERP) 建構其流。我們透過四元數流 (QFlow) 匹配訓練模型,並保證數值穩定性,並修正 QFlow 模型以加速其推論並改善生成蛋白質骨架的可設計性,進而提出建議的 ReQFlow 模型。實驗顯示,ReQFlow 在蛋白質骨架生成中達成最先進的效能,同時所需採樣步驟少得多,且推論時間大幅減少(例如,在生成長度為 300 的骨架時比 RFDiffusion 快 37 倍,比 Genie2 快 62 倍),證明其有效性和效率。程式碼可在 https://github.com/AngxiaoYue/ReQFlow 取得。 +摘要:本研究旨在探討將自然語言處理 (NLP) 和機器學習 (ML) 技術實作於醫療信函編碼自動化,並具備視覺化說明能力和輕量化的本地電腦設定。目前在臨床環境中,編碼是一種手動流程,涉及為病患文件中的每項病症、程序和藥物指派代碼 (例如,使用 SNOMED CT 代碼 56265001 表示心臟病)。此領域有使用最新 ML 模型進行自動編碼的初步研究;然而,由於模型的複雜性和大小,並未實現實際部署。為了進一步促進自動編碼實務的可能性,我們在本地電腦設定中探討了一些解決方案;此外,我們探討了說明功能在 AI 模型透明度中的功能。我們使用公開的 MIMIC-III 資料庫和 HAN/HLAN 網路模型進行 ICD 代碼預測。我們還試驗了 ICD 和 SNOMED CT 知識庫之間的對應。在我們的實驗中,這些模型提供了 97.98% 代碼的有用資訊。這項調查結果可以為實務中的自動臨床編碼實作提供一些見解,例如在醫院環境中,由臨床醫生使用的本地電腦,專案頁面 \url{https://github.com/Glenj01/Medical-Coding}。 -##### **PEARL: Towards Permutation-Resilient LLMs** -2502.14628v1 by Liang Chen, Li Shen, Yang Deng, Xiaoyan Zhao, Bin Liang, Kam-Fai Wong +##### **Explainable AI for Enhancing Efficiency of DL-based Channel Estimation** +2407.07009v1 by Abdul Karim Gizzini, Yahia Medjahdi, Ali J. Ghandour, Laurent Clavier -The in-context learning (ICL) capability of large language models (LLMs) -enables them to perform challenging tasks using provided demonstrations. -However, ICL is highly sensitive to the ordering of demonstrations, leading to -instability in predictions. 
This paper shows that this vulnerability can be -exploited to design a natural attack - difficult for model providers to detect -- that achieves nearly 80% success rate on LLaMA-3 by simply permuting the -demonstrations. Existing mitigation methods primarily rely on post-processing -and fail to enhance the model's inherent robustness to input permutations, -raising concerns about safety and reliability of LLMs. To address this issue, -we propose Permutation-resilient learning (PEARL), a novel framework based on -distributionally robust optimization (DRO), which optimizes model performance -against the worst-case input permutation. Specifically, PEARL consists of a -permutation-proposal network (P-Net) and the LLM. The P-Net generates the most -challenging permutations by treating it as an optimal transport problem, which -is solved using an entropy-constrained Sinkhorn algorithm. Through minimax -optimization, the P-Net and the LLM iteratively optimize against each other, -progressively improving the LLM's robustness. Experiments on synthetic -pre-training and real-world instruction tuning tasks demonstrate that PEARL -effectively mitigates permutation attacks and enhances performance. Notably, -despite being trained on fewer shots and shorter contexts, PEARL achieves -performance gains of up to 40% when scaled to many-shot and long-context -scenarios, highlighting its efficiency and generalization capabilities. +The support of artificial intelligence (AI) based decision-making is a key +element in future 6G networks, where the concept of native AI will be +introduced. Moreover, AI is widely employed in different critical applications +such as autonomous driving and medical diagnosis. In such applications, using +AI as black-box models is risky and challenging. Hence, it is crucial to +understand and trust the decisions taken by these models. 
Tackling this issue +can be achieved by developing explainable AI (XAI) schemes that aim to explain +the logic behind the black-box model behavior, and thus, ensure its efficient +and safe deployment. Recently, we proposed a novel perturbation-based XAI-CHEST +framework that is oriented toward channel estimation in wireless +communications. The core idea of the XAI-CHEST framework is to identify the +relevant model inputs by inducing high noise on the irrelevant ones. This +manuscript provides the detailed theoretical foundations of the XAI-CHEST +framework. In particular, we derive the analytical expressions of the XAI-CHEST +loss functions and the noise threshold fine-tuning optimization problem. Hence +the designed XAI-CHEST delivers a smart input feature selection methodology +that can further improve the overall performance while optimizing the +architecture of the employed model. Simulation results show that the XAI-CHEST +framework provides valid interpretations, where it offers an improved bit error +rate performance while reducing the required computational complexity in +comparison to the classical DL-based channel estimation. 
-摘要:大型語言模型 (LLM) 的語境學習 (ICL) 能力使其能夠透過提供的示範來執行具有挑戰性的任務。然而,ICL 對示範的排序非常敏感,導致預測不穩定。本文顯示,可以利用此漏洞來設計一種自然攻擊,讓模型提供者難以偵測,透過簡單地排列示範,在 LLaMA-3 上達到近 80% 的成功率。現有的緩解方法主要依賴後處理,且無法增強模型對輸入排列的固有穩健性,引發了對 LLM 的安全性與可靠性的疑慮。為了解決此問題,我們提出了一種基於分配穩健最佳化 (DRO) 的新型架構,稱為排列彈性學習 (PEARL),它針對最差情況的輸入排列來最佳化模型效能。具體來說,PEARL 包含排列建議網路 (P-Net) 和 LLM。P-Net 將其視為最優傳輸問題來產生最具挑戰性的排列,並使用熵約束 Sinkhorn 演算法來解決。透過極小極大最佳化,P-Net 和 LLM 迭代地相互最佳化,逐步改善 LLM 的穩健性。在合成預訓練和真實世界指令調整任務上的實驗證明,PEARL 有效地減輕了排列攻擊並增強了效能。值得注意的是,儘管在較少的次數和較短的語境中進行訓練,但 PEARL 在擴展到多重次數和長語境場景時仍可獲得高達 40% 的效能提升,突顯了其效率和泛化能力。 +摘要:人工智能 (AI) 支持的決策制定是未來 6G 網路中的關鍵元素,其中將引入原生 AI 的概念。此外,AI 廣泛用於不同的關鍵應用中,例如自動駕駛和醫療診斷。在這些應用中,使用 AI 作為黑盒模型是有風險且具有挑戰性的。因此,理解和信任這些模型做出的決策至關重要。解決此問題的方法是開發可解釋 AI (XAI) 架構,旨在解釋黑盒模型行為背後的邏輯,從而確保其有效且安全的部署。最近,我們提出了一個新的基於擾動的 XAI-CHEST 框架,該框架面向無線通信中的信道估計。XAI-CHEST 框架的核心思想是通過在無關輸入上引入高噪聲來識別相關模型輸入。這份手稿提供了 XAI-CHEST 框架的詳細理論基礎。特別是,我們推導了 XAI-CHEST 損失函數和噪聲閾值微調優化問題的解析表達式。因此,設計的 XAI-CHEST 提供了一種智能輸入特徵選擇方法,可以在優化所用模型的架構的同時進一步提高整體性能。模擬結果表明,XAI-CHEST 框架提供了有效的解釋,在降低所需的計算複雜度的同時,提供了改進的比特錯誤率性能,而這與基於傳統 DL 的信道估計相比。 -##### **ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors** -2502.14627v1 by Yuguo Yin, Yuxin Xie, Wenyuan Yang, Dongchao Yang, Jinghan Ru, Xianwei Zhuang, Liming Liang, Yuexian Zou +##### **Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification** +2407.05440v2 by P. N. Karthikayan, Yoga Sri Varshan V, Hitesh Gupta Kattamuri, Umarani Jayaraman -Multilingual audio-text retrieval (ML-ATR) is a challenging task that aims to -retrieve audio clips or multilingual texts from databases. However, existing -ML-ATR schemes suffer from inconsistencies for instance similarity matching -across languages. We theoretically analyze the inconsistency in terms of both -multilingual modal alignment direction error and weight error, and propose the -theoretical weight error upper bound for quantifying the inconsistency. 
Based -on the analysis of the weight error upper bound, we find that the inconsistency -problem stems from the data distribution error caused by random sampling of -languages. We propose a consistent ML-ATR scheme using 1-to-k contrastive -learning and audio-English co-anchor contrastive learning, aiming to mitigate -the negative impact of data distribution error on recall and consistency in -ML-ATR. Experimental results on the translated AudioCaps and Clotho datasets -show that our scheme achieves state-of-the-art performance on recall and -consistency metrics for eight mainstream languages, including English. Our code -will be available at https://github.com/ATRI-ACL/ATRI-ACL. +This paper presents dilated Residual Network (ResNet) models for disease +classification from retinal fundus images. Dilated convolution filters are used +to replace normal convolution filters in the higher layers of the ResNet model +(dilated ResNet) in order to improve the receptive field compared to the normal +ResNet model for disease classification. This study introduces +computer-assisted diagnostic tools that employ deep learning, enhanced with +explainable AI techniques. These techniques aim to make the tool's +decision-making process transparent, thereby enabling medical professionals to +understand and trust the AI's diagnostic decision. They are particularly +relevant in today's healthcare landscape, where there is a growing demand for +transparency in AI applications to ensure their reliability and ethical use. +The dilated ResNet is used as a replacement for the normal ResNet to enhance +the classification accuracy of retinal eye diseases and reduce the required +computing time. The dataset used in this work is the Ocular Disease Intelligent +Recognition (ODIR) dataset which is a structured ophthalmic database with eight +classes covering most of the common retinal eye diseases. The evaluation +metrics used in this work include precision, recall, accuracy, and F1 score. 
In +this work, a comparative study has been made between normal ResNet models and +dilated ResNet models on five variants namely ResNet-18, ResNet-34, ResNet-50, +ResNet-101, and ResNet-152. The dilated ResNet model shows promising results as +compared to normal ResNet with an average F1 score of 0.71, 0.70, 0.69, 0.67, +and 0.70 respectively for the above respective variants in ODIR multiclass +disease classification. -摘要:多模態多語言音訊文字檢索 (ML-ATR) 是一項具有挑戰性的任務,旨在從資料庫中檢索音訊片段或多語言文字。然而,現有的 ML-ATR 架構存在不一致的情況,例如跨語言的相似性比對。我們在理論上分析了不一致性,包括多模態多語言對齊方向誤差和權重誤差,並提出理論權重誤差上限以量化不一致性。根據權重誤差上限的分析,我們發現不一致性問題源於由語言隨機取樣造成的資料分佈誤差。我們提出一個一致的 ML-ATR 架構,採用 1 對 k 對比學習和音訊-英語共同錨點對比學習,旨在減輕資料分佈誤差對 ML-ATR 中召回率和一致性的負面影響。在已翻譯的 AudioCaps 和 Clotho 資料集上的實驗結果顯示,我們的架構在包括英語在內的八種主流語言的召回率和一致性指標上達到了最先進的效能。我們的程式碼將在 https://github.com/ATRI-ACL/ATRI-ACL 中提供。 +摘要:这篇论文提出了用于从视网膜眼底图像进行疾病分类的扩张残差网络 (ResNet) 模型。扩张卷积滤波器用于替换 ResNet 模型较高层中的正常卷积滤波器(扩张 ResNet),以改善感知场,从而针对疾病分类对正常 ResNet 模型进行改进。本研究引入了采用深度学习的计算机辅助诊断工具,并通过可解释的 AI 技术进行了增强。这些技术旨在使该工具的决策过程透明化,从而使医学专业人士能够理解和信任 AI 的诊断决策。它们与当今的医疗保健领域尤为相关,在该领域,对 AI 应用的透明度需求不断增长,以确保其可靠性和合乎道德的使用。扩张 ResNet 用作正常 ResNet 的替代品,以提高视网膜眼部疾病的分类准确性并减少所需的计算时间。本工作中使用的数据集是眼科疾病智能识别 (ODIR) 数据集,这是一个结构化的眼科数据库,包含八类涵盖大多数常见视网膜眼部疾病。本工作中使用的评估指标包括精确度、召回率、准确度和 F1 得分。在这项工作中,对 ResNet-18、ResNet-34、ResNet-50、ResNet-101 和 ResNet-152 五个变体的正常 ResNet 模型和扩张 ResNet 模型进行了比较研究。与正常 ResNet 相比,扩张 ResNet 模型显示出有希望的结果,在 ODIR 多类疾病分类中,上述各个变体的平均 F1 得分为 0.71、0.70、0.69、0.67 和 0.70。 -##### **Multi-Record Web Page Information Extraction From News Websites** -2502.14625v1 by Alexander Kustenkov, Maksim Varlamov, Alexander Yatskov +##### **A Survey on Trustworthiness in Foundation Models for Medical Image Analysis** +2407.15851v2 by Congzhen Shi, Ryan Rezai, Jiaxi Yang, Qi Dou, Xiaoxiao Li -In this paper, we focused on the problem of extracting information from web -pages containing many records, a task of growing importance in the era of -massive web data. 
Recently, the development of neural network methods has -improved the quality of information extraction from web pages. Nevertheless, -most of the research and datasets are aimed at studying detailed pages. This -has left multi-record "list pages" relatively understudied, despite their -widespread presence and practical significance. - To address this gap, we created a large-scale, open-access dataset -specifically designed for list pages. This is the first dataset for this task -in the Russian language. Our dataset contains 13,120 web pages with news lists, -significantly exceeding existing datasets in both scale and complexity. Our -dataset contains attributes of various types, including optional and -multi-valued, providing a realistic representation of real-world list pages. -These features make our dataset a valuable resource for studying information -extraction from pages containing many records. - Furthermore, we proposed our own multi-stage information extraction methods. -In this work, we explore and demonstrate several strategies for applying -MarkupLM to the specific challenges of multi-record web pages. Our experiments -validate the advantages of our methods. - By releasing our dataset to the public, we aim to advance the field of -information extraction from multi-record pages. +The rapid advancement of foundation models in medical imaging represents a +significant leap toward enhancing diagnostic accuracy and personalized +treatment. However, the deployment of foundation models in healthcare +necessitates a rigorous examination of their trustworthiness, encompassing +privacy, robustness, reliability, explainability, and fairness. The current +body of survey literature on foundation models in medical imaging reveals +considerable gaps, particularly in the area of trustworthiness. 
Additionally, +existing surveys on the trustworthiness of foundation models do not adequately +address their specific variations and applications within the medical imaging +domain. This survey aims to fill that gap by presenting a novel taxonomy of +foundation models used in medical imaging and analyzing the key motivations for +ensuring their trustworthiness. We review current research on foundation models +in major medical imaging applications, focusing on segmentation, medical report +generation, medical question and answering (Q\&A), and disease diagnosis. These +areas are highlighted because they have seen a relatively mature and +substantial number of foundation models compared to other applications. We +focus on literature that discusses trustworthiness in medical image analysis +manuscripts. We explore the complex challenges of building trustworthy +foundation models for each application, summarizing current concerns and +strategies for enhancing trustworthiness. Furthermore, we examine the potential +of these models to revolutionize patient care. Our analysis underscores the +imperative for advancing towards trustworthy AI in medical image analysis, +advocating for a balanced approach that fosters innovation while ensuring +ethical and equitable healthcare delivery. 
-摘要:在本文中,我們專注於從包含大量記錄的網頁中提取資訊的問題,這項任務在海量網路資料的時代中越來越重要。最近,神經網路方法的發展已改善從網頁中提取資訊的品質。儘管如此,大多數的研究和資料集都旨在研究詳細的網頁。儘管多記錄「清單網頁」廣泛存在且具有實用意義,但它們相對來說研究較少。 -為了解決這個差距,我們建立了一個專門針對清單網頁設計的大規模、開放存取的資料集。這是俄語中第一個針對此任務的資料集。我們的資料集包含 13,120 個包含新聞清單的網頁,在規模和複雜度上都遠遠超過現有的資料集。我們的資料集包含各種類型的屬性,包括可選和多值,提供真實世界清單網頁的實際表示。這些特點使我們的資料集成為研究從包含大量記錄的網頁中提取資訊的寶貴資源。 -此外,我們提出了我們自己的多階段資訊提取方法。在這項工作中,我們探討並展示了將 MarkupLM 應用於多記錄網頁特定挑戰的幾種策略。我們的實驗驗證了我們方法的優點。 -透過向公眾發布我們的資料集,我們旨在推進從多記錄網頁中提取資訊的領域。 +摘要:基礎模型在醫學影像方面的快速進展,代表著在加強診斷準確性和個人化治療方面邁出一大步。然而,基礎模型在醫療保健中的部署需要對其可信度進行嚴格的審查,包括隱私、穩健性、可靠性、可解釋性和公平性。目前關於醫學影像中基礎模型的調查文獻中顯示出相當大的差距,特別是在可信度方面。此外,現有關於基礎模型可信度的調查並未充分解決其在醫學影像領域中的特定變化和應用。本調查旨在通過提出醫學影像中使用的基礎模型的新分類法並分析確保其可信度的關鍵動機,來填補這一空白。我們回顧了基礎模型在主要醫學影像應用中的當前研究,重點關注分割、醫療報告生成、醫療問題和回答 (Q&A) 以及疾病診斷。這些領域之所以被強調,是因為與其他應用相比,它們已經看到相對成熟且大量的基礎模型。我們專注於探討醫學影像分析手稿中可信度的文獻。我們探討了為每個應用構建可信基礎模型的複雜挑戰,總結了當前關注點和增強可信度的策略。此外,我們探討了這些模型在革新患者護理方面的潛力。我們的分析強調了在醫學影像分析中朝著可信賴的人工智慧邁進的必要性,並倡導一種平衡的方法,既能促進創新,又能確保道德和公平的醫療保健服務。 + +##### **The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data** +2407.06206v1 by Ximing Wen, Rosina O. Weber, Anik Sen, Darryl Hannan, Steven C. Nesbit, Vincent Chan, Alberto Goffi, Michael Morris, John C. Hunninghake, Nicholas E. Villalobos, Edward Kim, Christopher J. MacLellan + +Point-of-Care Ultrasound (POCUS) is the practice of clinicians conducting and +interpreting ultrasound scans right at the patient's bedside. However, the +expertise needed to interpret these images is considerable and may not always +be present in emergency situations. This reality makes algorithms such as +machine learning classifiers extremely valuable to augment human decisions. +POCUS devices are becoming available at a reasonable cost in the size of a +mobile phone. The challenge of turning POCUS devices into life-saving tools is +that interpretation of ultrasound images requires specialist training and +experience. 
Unfortunately, the difficulty to obtain positive training images +represents an important obstacle to building efficient and accurate +classifiers. Hence, the problem we try to investigate is how to explore +strategies to increase accuracy of classifiers trained with scarce data. We +hypothesize that training with a few data instances may not suffice for +classifiers to generalize causing them to overfit. Our approach uses an +Explainable AI-Augmented approach to help the algorithm learn more from less +and potentially help the classifier better generalize. -##### **Exploring RWKV for Sentence Embeddings: Layer-wise Analysis and Baseline Comparison for Semantic Similarity** -2502.14620v1 by Xinghan Pan +摘要:床邊超音波 (POCUS) 是臨床醫師在患者床邊進行和解讀超音波掃描的實務。然而,解讀這些影像所需的專業知識相當可觀,而且在緊急情況下可能並非隨時具備。這種現實情況使得機器學習分類器等演算法對於加強人類決策變得極為有價值。POCUS 裝置正以合理成本推出,尺寸為手機大小。將 POCUS 裝置轉變為救生工具的挑戰在於,解讀超音波影像需要專門訓練和經驗。不幸的是,取得正向訓練影像的困難度代表著建置有效率且準確的分類器的一大障礙。因此,我們嘗試探討的問題是如何探索策略,以提高使用稀疏資料訓練的分類器的準確度。我們假設使用少數資料實例進行訓練可能不足以讓分類器概括,導致它們過度擬合。我們的做法使用可解釋 AI 增強方法,以協助演算法從較少的資料中學習更多,並潛在協助分類器更好地概括。 -This paper investigates the efficacy of RWKV, a novel language model -architecture known for its linear attention mechanism, for generating sentence -embeddings in a zero-shot setting. I conduct a layer-wise analysis to evaluate -the semantic similarity captured by embeddings from different hidden layers of -a pre-trained RWKV model. The performance is assessed on the Microsoft Research -Paraphrase Corpus (MRPC) dataset using Spearman correlation and compared -against a GloVe-based baseline. My results indicate that while RWKV embeddings -capture some semantic relatedness, they underperform compared to the GloVe -baseline in terms of Spearman correlation. I also analyze the inference time -and GPU memory usage, highlighting the computational trade-offs associated with -RWKV embeddings. 
The findings suggest that while RWKV offers potential -advantages in terms of linear scaling, its zero-shot sentence embedding quality -for semantic similarity tasks requires further investigation and potential -task-specific fine-tuning to match or exceed simpler baselines. +##### **Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach** +2407.00167v1 by Sai Krishna Revanth Vuruma, Dezhi Wu, Saborny Sen Gupta, Lucas Aust, Valerie Lookingbill, Wyatt Bellamy, Yang Ren, Erin Kasson, Li-Shiun Chen, Patricia Cavazos-Rehg, Dian Hu, Ming Huang -摘要:本文探討 RWKV 的效能,這是一種以線性注意力機制聞名的語言模型架構,可用於在零次學習設定中產生句子嵌入。我進行逐層分析,以評估預先訓練的 RWKV 模型中不同隱藏層的嵌入所擷取的語義相似性。效能評估使用 Microsoft Research Paraphrase Corpus (MRPC) 資料集,採用 Spearman 相關係數,並與基於 GloVe 的基準進行比較。我的結果顯示,雖然 RWKV 嵌入可以擷取一些語義相關性,但與 GloVe 基準相比,在 Spearman 相關係數方面表現不佳。我也分析了推論時間和 GPU 記憶體使用量,強調與 RWKV 嵌入相關的運算折衷。這些發現表明,雖然 RWKV 在線性縮放方面具有潛在優勢,但其在語義相似性任務中的零次學習句子嵌入品質需要進一步探討,並需要潛在的特定任務微調,才能達到或超越較簡單的基準。 +In recent years, the United States has witnessed a significant surge in the +popularity of vaping or e-cigarette use, leading to a notable rise in cases of +e-cigarette and vaping use-associated lung injury (EVALI) that caused +hospitalizations and fatalities during the EVALI outbreak in 2019, highlighting +the urgency to comprehend vaping behaviors and develop effective strategies for +cessation. Due to the ubiquity of social media platforms, over 4.7 billion +users worldwide use them for connectivity, communications, news, and +entertainment with a significant portion of the discourse related to health, +thereby establishing social media data as an invaluable organic data resource +for public health research. In this study, we extracted a sample dataset from +one vaping sub-community on Reddit to analyze users' quit-vaping intentions. 
+Leveraging OpenAI's latest large language model GPT-4 for sentence-level quit +vaping intention detection, this study compares the outcomes of this model +against layman and clinical expert annotations. Using different prompting +strategies such as zero-shot, one-shot, few-shot and chain-of-thought +prompting, we developed 8 prompts with varying levels of detail to explain the +task to GPT-4 and also evaluated the performance of the strategies against each +other. These preliminary findings emphasize the potential of GPT-4 in social +media data analysis, especially in identifying users' subtle intentions that +may elude human detection. -##### **Reward Models Identify Consistency, Not Causality** -2502.14619v1 by Yuhui Xu, Hanze Dong, Lei Wang, Caiming Xiong, Junnan Li +摘要:近年來,美國見證了電子煙或電子香菸使用率大幅激增,導致電子煙和電子煙使用相關肺損傷 (EVALI) 病例顯著增加,在 2019 年 EVALI 爆發期間造成住院和死亡,凸顯了理解電子煙行為和制定有效戒菸策略的迫切性。由於社群媒體平台的普及,全球超過 47 億使用者使用它們進行連結、溝通、新聞和娛樂,其中很大一部分與健康相關,因此將社群媒體資料建立為公共衛生研究中無價的有機資料資源。在本研究中,我們從 Reddit 上一個電子煙子社群中提取一個範例資料集,以分析使用者的戒電子煙意圖。利用 OpenAI 最新的大型語言模型 GPT-4 進行句子層級的戒電子煙意圖偵測,本研究比較了此模型的結果與外行人和臨床專家註解。使用不同的提示策略,例如零次學習、一次學習、少次學習和思考鏈提示,我們開發了 8 個提示,詳細程度不同,向 GPT-4 解釋任務,並評估這些策略彼此之間的效能。這些初步發現強調了 GPT-4 在社群媒體資料分析中的潛力,特別是在識別人類偵測可能無法察覺的使用者微妙意圖方面。 -Reward models (RMs) play a crucial role in aligning large language models -(LLMs) with human preferences and enhancing reasoning quality. Traditionally, -RMs are trained to rank candidate outputs based on their correctness and -coherence. However, in this work, we present several surprising findings that -challenge common assumptions about RM behavior. Our analysis reveals that -state-of-the-art reward models prioritize structural consistency over causal -correctness. Specifically, removing the problem statement has minimal impact on -reward scores, whereas altering numerical values or disrupting the reasoning -flow significantly affects RM outputs. 
Furthermore, RMs exhibit a strong -dependence on complete reasoning trajectories truncated or incomplete steps -lead to significant variations in reward assignments, indicating that RMs -primarily rely on learned reasoning patterns rather than explicit problem -comprehension. These findings hold across multiple architectures, datasets, and -tasks, leading to three key insights: (1) RMs primarily assess coherence rather -than true reasoning quality; (2) The role of explicit problem comprehension in -reward assignment is overstated; (3) Current RMs may be more effective at -ranking responses than verifying logical validity. Our results suggest a -fundamental limitation in existing reward modeling approaches, emphasizing the -need for a shift toward causality-aware reward models that go beyond -consistency-driven evaluation. +##### **Towards Compositional Interpretability for XAI** +2406.17583v1 by Sean Tull, Robin Lorenz, Stephen Clark, Ilyas Khan, Bob Coecke -摘要:獎勵模型 (RM) 在將大型語言模型 (LLM) 與人類偏好對齊並提升推理品質方面扮演至關重要的角色。傳統上,RM 會訓練來根據候選輸出的正確性和一致性進行排名。然而,在這項工作中,我們提出幾個令人驚訝的發現,挑戰了關於 RM 行為的常見假設。我們的分析顯示,最先進的獎勵模型優先考慮結構一致性,而不是因果正確性。具體來說,移除問題陳述對獎勵分數的影響很小,而改變數值或中斷推理流程則會顯著影響 RM 輸出。此外,RM 表現出對完整推理軌跡的強烈依賴性,截斷或不完整的步驟會導致獎勵分配產生重大變化,這表示 RM 主要依賴於學習到的推理模式,而不是明確的問題理解。這些發現適用於多種架構、資料集和任務,得出三個關鍵見解:(1) RM 主要評估一致性,而不是真正的推理品質;(2) 在獎勵分配中,明確問題理解的角色被誇大了;(3) 目前的 RM 在排名回應方面可能比驗證邏輯有效性更有效。我們的結果表明現有獎勵建模方法存在根本限制,強調需要轉向因果感知獎勵模型,超越以一致性為導向的評估。 +Artificial intelligence (AI) is currently based largely on black-box machine +learning models which lack interpretability. The field of eXplainable AI (XAI) +strives to address this major concern, being critical in high-stakes areas such +as the finance, legal and health sectors. + We present an approach to defining AI models and their interpretability based +on category theory. For this we employ the notion of a compositional model, +which sees a model in terms of formal string diagrams which capture its +abstract structure together with its concrete implementation. 
This +comprehensive view incorporates deterministic, probabilistic and quantum +models. We compare a wide range of AI models as compositional models, including +linear and rule-based models, (recurrent) neural networks, transformers, VAEs, +and causal and DisCoCirc models. + Next we give a definition of interpretation of a model in terms of its +compositional structure, demonstrating how to analyse the interpretability of a +model, and using this to clarify common themes in XAI. We find that what makes +the standard 'intrinsically interpretable' models so transparent is brought out +most clearly diagrammatically. This leads us to the more general notion of +compositionally-interpretable (CI) models, which additionally include, for +instance, causal, conceptual space, and DisCoCirc models. + We next demonstrate the explainability benefits of CI models. Firstly, their +compositional structure may allow the computation of other quantities of +interest, and may facilitate inference from the model to the modelled +phenomenon by matching its structure. Secondly, they allow for diagrammatic +explanations for their behaviour, based on influence constraints, diagram +surgery and rewrite explanations. Finally, we discuss many future directions +for the approach, raising the question of how to learn such meaningfully +structured models in practice. 
-##### **FIND: Fine-grained Information Density Guided Adaptive Retrieval-Augmented Generation for Disease Diagnosis** -2502.14614v1 by Mingyi Jia, Junwen Duan, Yan Song, Jianxin Wang +摘要:人工智慧(AI)目前在很大程度上依賴於缺乏可解釋性的黑盒機器學習模型。可解釋性人工智慧(XAI)領域致力於解決這個主要問題,這在金融、法律和健康等高風險領域至關重要。 +我們提出了一種基於範疇論定義 AI 模型及其可解釋性的方法。為此,我們採用組合模型的概念,它以形式弦圖的形式看待模型,這些弦圖捕獲了模型的抽象結構及其具體實現。這種綜合觀點包含了確定性、概率性和量子模型。我們將各種 AI 模型作為組合模型進行比較,包括線性和基於規則的模型、(遞迴)神經網路、Transformer、VAE,以及因果和 DisCoCirc 模型。 +接下來,我們根據模型的組合結構給出模型解釋的定義,展示如何分析模型的可解釋性,並使用它來澄清 XAI 中的常見主題。我們發現,讓標準的「內在可解釋」模型如此透明的原因在圖表中表現得最為清楚。這引導我們得出更一般的組合可解釋(CI)模型概念,它另外還包括因果、概念空間和 DisCoCirc 模型。 +接下來,我們展示了 CI 模型的可解釋性優勢。首先,它們的組合結構允許計算其他感興趣的量,並可能通過匹配模型的結構來促進從模型到被建模現象的推理。其次,它們允許對其行為進行圖解說明,這些說明基於影響約束、圖解手術和重寫說明。最後,我們討論了這種方法的許多未來方向,提出了如何在實踐中學習這種有意義的結構化模型的問題。 -Retrieval-Augmented Large Language Models (LLMs), which integrate external -knowledge into LLMs, have shown remarkable performance in various medical -domains, including clinical diagnosis. However, existing RAG methods struggle -to effectively assess task difficulty to make retrieval decisions, thereby -failing to meet the clinical requirements for balancing efficiency and -accuracy. So in this paper, we propose FIND (\textbf{F}ine-grained -\textbf{In}formation \textbf{D}ensity Guided Adaptive RAG), a novel framework -that improves the reliability of RAG in disease diagnosis scenarios. FIND -incorporates a fine-grained adaptive control module to determine whether -retrieval is necessary based on the information density of the input. By -optimizing the retrieval process and implementing a knowledge filtering module, -FIND ensures that the retrieval is better suited to clinical scenarios. -Experiments on three Chinese electronic medical record datasets demonstrate -that FIND significantly outperforms various baseline methods, highlighting its -effectiveness in clinical diagnosis tasks. 
+##### **Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods** +2406.12142v2 by Vincent Olesen, Nina Weng, Aasa Feragen, Eike Petersen -摘要:檢索增強大型語言模型 (LLM),將外部知識整合至 LLM,已於各種醫療領域展現出卓越效能,包括臨床診斷。然而,現有的 RAG 方法難以有效評估任務難度以做出檢索決策,因此無法滿足平衡效率和精確度的臨床需求。因此,我們在本文中提出 FIND(**F**ine-grained **In**formation **D**ensity Guided Adaptive RAG),一種新穎架構,可提升 RAG 在疾病診斷場景中的可靠性。FIND 整合一個細緻化的自適應控制模組,根據輸入的資訊密度判斷是否需要檢索。透過最佳化檢索程序並實作一個知識過濾模組,FIND 確保檢索更適合臨床場景。在三個中文電子病歷資料集上的實驗顯示,FIND 明顯優於各種基線方法,突顯其在臨床診斷任務中的有效性。 +Machine learning models have achieved high overall accuracy in medical image +analysis. However, performance disparities on specific patient groups pose +challenges to their clinical utility, safety, and fairness. This can affect +known patient groups - such as those based on sex, age, or disease subtype - as +well as previously unknown and unlabeled groups. Furthermore, the root cause of +such observed performance disparities is often challenging to uncover, +hindering mitigation efforts. In this paper, to address these issues, we +leverage Slice Discovery Methods (SDMs) to identify interpretable +underperforming subsets of data and formulate hypotheses regarding the cause of +observed performance disparities. We introduce a novel SDM and apply it in a +case study on the classification of pneumothorax and atelectasis from chest +x-rays. Our study demonstrates the effectiveness of SDMs in hypothesis +formulation and yields an explanation of previously observed but unexplained +performance disparities between male and female patients in widely used chest +X-ray datasets and models. Our findings indicate shortcut learning in both +classification tasks, through the presence of chest drains and ECG wires, +respectively. 
Sex-based differences in the prevalence of these shortcut +features appear to cause the observed classification performance gap, +representing a previously underappreciated interaction between shortcut +learning and model fairness analyses. -##### **Behavioral Analysis of Information Salience in Large Language Models** -2502.14613v1 by Jan Trienes, Jörg Schlötterer, Junyi Jessy Li, Christin Seifert +摘要:機器學習模型在醫學影像分析中已達到整體高準確度。然而,特定患者群體的效能差異對其臨床效用、安全性與公平性構成挑戰。這可能會影響已知的患者群體(例如基於性別、年齡或疾病亞型)以及先前未知且未標籤的群體。此外,此類觀察到的效能差異的根本原因通常難以發現,阻礙了緩解措施。在本文中,為了解決這些問題,我們利用切片發現方法 (SDM) 來識別可解釋的資料效能不佳子集,並針對觀察到的效能差異原因制定假設。我們引入一種新的 SDM,並在胸部 X 光片中肺炎和肺不張分類的案例研究中應用它。我們的研究證明了 SDM 在假設制定中的有效性,並對廣泛使用的胸部 X 光片資料集和模型中先前觀察到但無法解釋的男性和女性患者之間的效能差異提供了解釋。我們的發現表明,在分類任務中,透過胸腔引流管和心電圖導線的存在,存在捷徑學習。這些捷徑特徵的盛行率存在基於性別的差異,似乎會導致觀察到的分類效能差距,這代表捷徑學習和模型公平性分析之間先前未受到重視的交互作用。 -Large Language Models (LLMs) excel at text summarization, a task that -requires models to select content based on its importance. However, the exact -notion of salience that LLMs have internalized remains unclear. To bridge this -gap, we introduce an explainable framework to systematically derive and -investigate information salience in LLMs through their summarization behavior. -Using length-controlled summarization as a behavioral probe into the content -selection process, and tracing the answerability of Questions Under Discussion -throughout, we derive a proxy for how models prioritize information. Our -experiments on 13 models across four datasets reveal that LLMs have a nuanced, -hierarchical notion of salience, generally consistent across model families and -sizes. While models show highly consistent behavior and hence salience -patterns, this notion of salience cannot be accessed through introspection, and -only weakly correlates with human perceptions of information salience. 
+##### **Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health** +2406.07114v2 by Fatemeh Ebrahimzadeh, Ramin Safa -摘要:大型語言模型 (LLM) 在文字摘要方面表現出色,這項任務需要模型根據重要性來選擇內容。然而,LLM 內化的顯著性準確概念仍不清楚。為了彌補這個差距,我們引入了一個可解釋的架構,透過摘要行為系統性地推導和調查 LLM 中的資訊顯著性。使用長度控制摘要作為行為探測來探討內容選擇過程,並追蹤討論中問題的可回答性,我們推導出一個模型優先處理資訊的方式代理。我們針對四個資料集中的 13 個模型進行的實驗揭示,LLM 具有細緻入微、階層式的顯著性概念,通常在模型系列和大小之間保持一致。雖然模型表現出高度一致的行為,因此具有顯著性模式,但這個顯著性概念無法透過內省來存取,而且與人類對資訊顯著性的認知僅有微弱相關性。 +The concept of Metaverse has attracted a lot of attention in various fields +and one of its important applications is health and treatment. The Metaverse +has enormous potential to transform healthcare by changing patient care, +medical education, and the way teaching/learning and research are done. The +purpose of this research is to provide an introduction to the basic concepts +and fundamental technologies of the Metaverse. This paper examines the pros and +cons of the Metaverse in healthcare context and analyzes its potential from the +technology and AI perspective. In particular, the role of machine learning +methods is discussed; We will explain how machine learning algorithms can be +applied to the Metaverse generated data to gain better insights in healthcare +applications. Additionally, we examine the future visions of the Metaverse in +health delivery, by examining emerging technologies such as blockchain and also +addressing privacy concerns. The findings of this study contribute to a deeper +understanding of the applications of Metaverse in healthcare and its potential +to revolutionize the delivery of medical services. 
-##### **A Theory for Conditional Generative Modeling on Multiple Data Sources** -2502.14583v1 by Rongzhen Wang, Yan Zhang, Chenyu Zheng, Chongxuan Li, Guoqiang Wu +摘要:元宇宙的概念在各個領域都備受關注,其重要應用之一便是醫療保健。元宇宙有巨大的潛力透過改變病患照護、醫學教育,以及教學/學習和研究的方式來轉型醫療保健。本研究的目的是提供元宇宙基本概念和基礎技術的介紹。本文探討了元宇宙在醫療保健背景下的優缺點,並從技術和 AI 的角度分析其潛力。特別是,討論了機器學習方法的角色;我們將說明如何將機器學習演算法應用於元宇宙產生的資料,以獲得醫療保健應用方面的更佳見解。此外,我們透過探討區塊鏈等新興技術,並解決隱私問題,來探討元宇宙在醫療保健方面的未來願景。本研究的發現有助於更深入地了解元宇宙在醫療保健中的應用,以及其在醫療服務提供方面發揮革命性變革的潛力。 -The success of large generative models has driven a paradigm shift, -leveraging massive multi-source data to enhance model capabilities. However, -the interaction among these sources remains theoretically underexplored. This -paper takes the first step toward a rigorous analysis of multi-source training -in conditional generative modeling, where each condition represents a distinct -data source. Specifically, we establish a general distribution estimation error -bound in average total variation distance for conditional maximum likelihood -estimation based on the bracketing number. Our result shows that when source -distributions share certain similarities and the model is expressive enough, -multi-source training guarantees a sharper bound than single-source training. -We further instantiate the general theory on conditional Gaussian estimation -and deep generative models including autoregressive and flexible energy-based -models, by characterizing their bracketing numbers. The results highlight that -the number of sources and similarity among source distributions improve the -advantage of multi-source training. Simulations and real-world experiments -validate our theory. Code is available at: -\url{https://github.com/ML-GSAI/Multi-Source-GM}. 
+##### **AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI** +2406.06728v2 by K M Tawsik Jawad, Anusha Verma, Fathi Amsaad, Lamia Ashraf -摘要:大型生成模型的成功推動了範例轉移,利用大量多來源資料來增強模型功能。然而,這些來源之間的互動在理論上仍未得到充分探討。本文踏出了嚴謹分析條件生成模型中多來源訓練的第一步,其中每個條件代表一個不同的資料來源。具體來說,我們建立了一個基於括號數的條件最大似然估計的平均總變異距離中的通用分佈估計誤差界限。我們的結果表明,當來源分佈具有一定的相似性且模型具有足夠的表達力時,多來源訓練保證了比單來源訓練更嚴格的界限。我們進一步在條件高斯估計和深度生成模型(包括自迴歸和靈活的基於能量的模型)上例證了通用理論,通過表徵它們的括號數。結果強調了來源數和來源分佈之間的相似性提高了多來源訓練的優勢。模擬和真實世界的實驗驗證了我們的理論。程式碼可在以下網址取得:\url{https://github.com/ML-GSAI/Multi-Source-GM}。 +Chronic Kidney Disease (CKD) is one of the widespread Chronic diseases with +no known ultimo cure and high morbidity. Research demonstrates that progressive +Chronic Kidney Disease (CKD) is a heterogeneous disorder that significantly +impacts kidney structure and functions, eventually leading to kidney failure. +With the progression of time, chronic kidney disease has moved from a +life-threatening disease affecting few people to a common disorder of varying +severity. The goal of this research is to visualize dominating features, +feature scores, and values exhibited for early prognosis and detection of CKD +using ensemble learning and explainable AI. For that, an AI-driven predictive +analytics approach is proposed to aid clinical practitioners in prescribing +lifestyle modifications for individual patients to reduce the rate of +progression of this disease. Our dataset is collected on body vitals from +individuals with CKD and healthy subjects to develop our proposed AI-driven +solution accurately. In this regard, blood and urine test results are provided, +and ensemble tree-based machine-learning models are applied to predict unseen +cases of CKD. Our research findings are validated after lengthy consultations +with nephrologists. Our experiments and interpretation results are compared +with existing explainable AI applications in various healthcare domains, +including CKD. 
The comparison shows that our developed AI models, particularly +the Random Forest model, have identified more features as significant +contributors than XgBoost. Interpretability (I), which measures the ratio of +important to masked features, indicates that our XgBoost model achieved a +higher score, specifically a Fidelity of 98\%, in this metric and naturally in +the FII index compared to competing models. -##### **A Statistical Case Against Empirical Human-AI Alignment** -2502.14581v1 by Julian Rodemann, Esteban Garces Arias, Christoph Luther, Christoph Jansen, Thomas Augustin +摘要:慢性腎臟病 (CKD) 是一種廣泛的慢性疾病,目前尚未找到最終的治療方法,且發病率很高。研究表明,進行性慢性腎臟病 (CKD) 是一種異質性疾病,會顯著影響腎臟結構和功能,最終導致腎衰竭。隨著時間的推移,慢性腎臟病已從影響少數人的致命疾病演變成一種嚴重程度不一的常見疾病。本研究的目標是使用整體學習和可解釋的 AI 來視覺化支配性特徵、特徵分數和值,以進行 CKD 的早期預後和檢測。為此,提出了一種 AI 驅動的預測分析方法,以幫助臨床醫生為個別患者開具生活方式的修改建議,以降低此疾病的進展速度。我們的數據集是從 CKD 患者和健康受試者的身體生命徵象中收集的,以準確開發我們提出的 AI 驅動的解決方案。在這方面,提供了血液和尿液檢測結果,並應用基於集成樹的機器學習模型來預測未見的 CKD 病例。我們的研究結果在與腎臟科醫師進行長時間諮詢後得到驗證。我們的實驗和解釋結果與各種醫療保健領域中現有的可解釋 AI 應用進行了比較,包括 CKD。比較表明,我們開發的 AI 模型,特別是隨機森林模型,已經確定了比 XgBoost 更多的特徵作為顯著的貢獻者。可解釋性 (I) 衡量重要特徵與被遮蔽特徵的比率,表明我們的 XgBoost 模型在此指標中取得了更高的分數,特別是 98% 的保真度,並且在 FII 指數中自然高於競爭模型。 -Empirical human-AI alignment aims to make AI systems act in line with -observed human behavior. While noble in its goals, we argue that empirical -alignment can inadvertently introduce statistical biases that warrant caution. -This position paper thus advocates against naive empirical alignment, offering -prescriptive alignment and a posteriori empirical alignment as alternatives. We -substantiate our principled argument by tangible examples like human-centric -decoding of language models. 
+##### **Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook** +2406.05984v1 by Yusif Ibrahimov, Tarique Anwar, Tommy Yuan -摘要:經驗主義的人工智慧校準旨在使人工智慧系統根據觀察到的人類行為採取行動。儘管目標崇高,我們認為經驗主義校準可能會無意中引入需要謹慎對待的統計偏差。因此,本立場文件主張反對天真的經驗主義校準,提供規範性校準和後驗經驗主義校準作為替代方案。我們以具體的例子(例如以人為中心的語言模型解碼)來證明我們的原則性論點。 +Mental health constitutes a complex and pervasive global challenge, affecting +millions of lives and often leading to severe consequences. In this paper, we +conduct a thorough survey to explore the intersection of data science, +artificial intelligence, and mental healthcare, focusing on the recent +developments of mental disorder detection through online social media (OSM). A +significant portion of the population actively engages in OSM platforms, +creating a vast repository of personal data that holds immense potential for +mental health analytics. The paper navigates through traditional diagnostic +methods, state-of-the-art data- and AI-driven research studies, and the +emergence of explainable AI (XAI) models for mental healthcare. We review +state-of-the-art machine learning methods, particularly those based on modern +deep learning, while emphasising the need for explainability in healthcare AI +models. The experimental design section provides insights into prevalent +practices, including available datasets and evaluation approaches. We also +identify key issues and challenges in the field and propose promising future +research directions. As mental health decisions demand transparency, +interpretability, and ethical considerations, this paper contributes to the +ongoing discourse on advancing XAI in mental healthcare through social media. +The comprehensive overview presented here aims to guide researchers, +practitioners, and policymakers in developing the area of mental disorder +detection. 
-##### **ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification** -2502.14565v1 by Hyunseok Lee, Seunghyuk Oh, Jaehyung Kim, Jinwoo Shin, Jihoon Tack +摘要:心理健康構成了一項複雜且普遍的全球挑戰,影響了數百萬人的生活,並經常導致嚴重的後果。在本文中,我們進行了一項徹底的調查,以探索數據科學、人工智慧和心理保健的交集,重點關注通過線上社交媒體 (OSM) 進行心理疾病檢測的最新發展。很大一部分人口積極參與 OSM 平台,創造了一個龐大的人員資料庫,對心理健康分析具有巨大的潛力。本文探討了傳統的診斷方法、最先進的資料和 AI 驅動的研究,以及心理保健中可解釋 AI (XAI) 模型的出現。我們回顧了最先進的機器學習方法,特別是那些基於現代深度學習的方法,同時強調了醫療保健 AI 模型中可解釋性的必要性。實驗設計部分提供了對普遍做法的見解,包括可用的資料集和評估方法。我們還找出該領域的主要問題和挑戰,並提出了有希望的未來研究方向。由於心理健康決策需要透明度、可解釋性和道德考量,本文有助於推進心理保健中透過社交媒體推進 XAI 的持續討論。這裡提出的全面概述旨在引導研究人員、從業人員和政策制定者發展心理疾病檢測領域。 -Self-awareness, i.e., the ability to assess and correct one's own generation, -is a fundamental aspect of human intelligence, making its replication in large -language models (LLMs) an important yet challenging task. Previous works tackle -this by employing extensive reinforcement learning or rather relying on large -external verifiers. In this work, we propose Refine via Intrinsic -Self-Verification (ReVISE), an efficient and effective framework that enables -LLMs to self-correct their outputs through self-verification. The core idea of -ReVISE is to enable LLMs to verify their reasoning processes and continually -rethink reasoning trajectories based on its verification. We introduce a -structured curriculum based upon online preference learning to implement this -efficiently. Specifically, as ReVISE involves two challenging tasks (i.e., -self-verification and reasoning correction), we tackle each task sequentially -using curriculum learning, collecting both failed and successful reasoning -paths to construct preference pairs for efficient training. During inference, -our approach enjoys natural test-time scaling by integrating self-verification -and correction capabilities, further enhanced by our proposed confidence-aware -decoding mechanism. 
Our experiments on various reasoning tasks demonstrate that -ReVISE achieves efficient self-correction and significantly improves reasoning -performance. +##### **Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance** +2406.05746v1 by Zhan Zhang, Qin Zhang, Yang Jiao, Lin Lu, Lin Ma, Aihua Liu, Xiao Liu, Juan Zhao, Yajun Xue, Bing Wei, Mingxia Zhang, Ru Gao, Hong Zhao, Jie Lu, Fan Li, Yang Zhang, Yiming Wang, Lei Zhang, Fengwei Tian, Jie Hu, Xin Gou -摘要:自我覺察,亦即評估和修正自身產出的能力,是人類智慧的基本面向,使其能在大型語言模型 (LLM) 中複製,是一項重要且具挑戰性的任務。先前的研究透過採用廣泛的強化學習或依賴大型外部驗證器來解決這個問題。在這項研究中,我們提出透過內在自我驗證 (ReVISE) 進行精煉,一個有效率且有效的架構,使 LLM 能透過自我驗證來自我修正其產出。ReVISE 的核心概念是讓 LLM 能驗證其推理過程,並根據驗證結果持續重新思考推理軌跡。我們導入一個建構於線上偏好學習的結構化課程,以有效率地實作這項功能。具體來說,由於 ReVISE 涉及兩項具有挑戰性的任務(即自我驗證和推理修正),我們使用課程學習循序漸進地處理每一項任務,收集失敗和成功的推理路徑,以建構偏好對,進行有效率的訓練。在推論期間,我們的作法透過整合自我驗證和修正功能,享有自然的測試時間擴充,並進一步透過我們提出的具備信心感知的解碼機制進行強化。我們在各種推理任務上的實驗顯示,ReVISE 達到有效率的自我修正,並顯著提升推理效能。 +AI-aided clinical diagnosis is desired in medical care. Existing deep +learning models lack explainability and mainly focus on image analysis. The +recently developed Dynamic Uncertain Causality Graph (DUCG) approach is +causality-driven, explainable, and invariant across different application +scenarios, without problems of data collection, labeling, fitting, privacy, +bias, generalization, high cost and high energy consumption. Through close +collaboration between clinical experts and DUCG technicians, 46 DUCG models +covering 54 chief complaints were constructed. Over 1,000 diseases can be +diagnosed without triage. Before being applied in the real world, the 46 DUCG +models were retrospectively verified by third-party hospitals. The verified +diagnostic precisions were no less than 95%, in which the diagnostic precision +for every disease including uncommon ones was no less than 80%. After +verifications, the 46 DUCG models were applied in the real world in China.
Over +one million real diagnosis cases have been performed, with only 17 incorrect +diagnoses identified. Due to DUCG's transparency, the mistakes causing the +incorrect diagnoses were found and corrected. The diagnostic abilities of the +clinicians who applied DUCG frequently were improved significantly. Following +the introduction to the earlier presented DUCG methodology, the recommendation +algorithm for potential medical checks is presented and the key idea of DUCG is +extracted. -##### **Plan-over-Graph: Towards Parallelable LLM Agent Schedule** -2502.14563v1 by Shiqi Zhang, Xinbei Ma, Zouying Cao, Zhuosheng Zhang, Hai Zhao +摘要:醫療照護中需要 AI 輔助的臨床診斷。現有的深度學習模型缺乏可解釋性,並且主要專注於影像分析。最近開發的動態不確定因果關係圖 (DUCG) 方法是因果驅動的、可解釋的,並且在不同的應用場景中是不變的,沒有資料收集、標記、擬合、隱私、偏見、概化、高成本和高能耗的問題。通過臨床專家和 DUCG 技術人員之間的密切合作,構建了涵蓋 54 個主訴的 46 個 DUCG 模型。可以在沒有分流的情況下診斷出 1,000 多種疾病。在應用於實際世界之前,46 個 DUCG 模型已由第三方醫院回溯性驗證。驗證的診斷精度不低於 95%,其中包括罕見疾病在內的每種疾病的診斷精度不低於 80%。驗證後,46 個 DUCG 模型已在中國實際應用。已經執行了超過一百萬個真實診斷案例,僅發現 17 個不正確的診斷。由於 DUCG 的透明性,發現並糾正了導致不正確診斷的錯誤。頻繁應用 DUCG 的臨床醫生的診斷能力得到了顯著提高。在介紹了前面提出的 DUCG 方法論之後,提出了潛在健康檢查的推薦演算法,並提取了 DUCG 的關鍵思想。 -Large Language Models (LLMs) have demonstrated exceptional abilities in -reasoning for task planning. However, challenges remain under-explored for -parallel schedules. This paper introduces a novel paradigm, plan-over-graph, in -which the model first decomposes a real-life textual task into executable -subtasks and constructs an abstract task graph. The model then understands this -task graph as input and generates a plan for parallel execution. To enhance the -planning capability of complex, scalable graphs, we design an automated and -controllable pipeline to generate synthetic graphs and propose a two-stage -training scheme. Experimental results show that our plan-over-graph method -significantly improves task performance on both API-based LLMs and trainable -open-sourced LLMs. 
By normalizing complex tasks as graphs, our method naturally -supports parallel execution, demonstrating global efficiency. The code and data -are available at https://github.com/zsq259/Plan-over-Graph. +##### **Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability** +2406.12897v1 by Faseela Abdullakutty, Younes Akbari, Somaya Al-Maadeed, Ahmed Bouridane, Rifat Hamoudi -摘要:大型語言模型 (LLM) 已展現出在任務規劃推理方面的非凡能力。然而,對於並行時程表的挑戰仍未充分探討。本文介紹了一個新穎的範例,即圖形規劃,其中模型首先將現實生活中的文字任務分解為可執行的子任務,並建構一個抽象任務圖。然後,模型將此任務圖理解為輸入,並產生一個並行執行的計畫。為了增強複雜、可擴充圖形的規劃能力,我們設計了一個自動化且可控的管道來產生合成圖形,並提出了一個兩階段訓練方案。實驗結果表明,我們的圖形規劃方法顯著提升了基於 API 的 LLM 和可訓練的開源 LLM 的任務效能。透過將複雜任務標準化為圖形,我們的模型自然支援並行執行,展現出整體效率。程式碼和資料可在 https://github.com/zsq259/Plan-over-Graph 取得。 +It is imperative that breast cancer is detected precisely and timely to +improve patient outcomes. Diagnostic methodologies have traditionally relied on +unimodal approaches; however, medical data analytics is integrating diverse +data sources beyond conventional imaging. Using multi-modal techniques, +integrating both image and non-image data, marks a transformative advancement +in breast cancer diagnosis. The purpose of this review is to explore the +burgeoning field of multimodal techniques, particularly the fusion of +histopathology images with non-image data. Further, Explainable AI (XAI) will +be used to elucidate the decision-making processes of complex algorithms, +emphasizing the necessity of explainability in diagnostic processes. This +review utilizes multi-modal data and emphasizes explainability to enhance +diagnostic accuracy, clinician confidence, and patient engagement, ultimately +fostering more personalized treatment strategies for breast cancer, while also +identifying research gaps in multi-modality and explainability, guiding future +studies, and contributing to the strategic direction of the field. -##### **Can LLMs Predict Citation Intent? 
An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs** -2502.14561v1 by Paris Koloveas, Serafeim Chatzopoulos, Thanasis Vergoulis, Christos Tryfonopoulos +摘要:精確且及時地偵測乳癌對於改善患者預後至關重要。診斷方法傳統上依賴於單一模式方法;然而,醫療資料分析正在整合超越傳統影像的各種資料來源。使用整合影像和非影像資料的多模式技術,標誌著乳癌診斷的變革性進展。本篇綜述的目的是探討多模式技術的新興領域,特別是將組織病理學影像與非影像資料融合。此外,可解釋人工智慧 (XAI) 將用於闡明複雜演算法的決策過程,強調診斷過程中可解釋性的必要性。本綜述利用多模式資料並強調可解釋性,以提高診斷準確性、臨床醫師的信心和患者參與度,最終促進乳癌更個人化的治療策略,同時也找出多模式和可解釋性的研究差距,引導未來的研究,並為該領域的策略方向做出貢獻。 -This work investigates the ability of open Large Language Models (LLMs) to -predict citation intent through in-context learning and fine-tuning. Unlike -traditional approaches that rely on pre-trained models like SciBERT, which -require extensive domain-specific pretraining and specialized architectures, we -demonstrate that general-purpose LLMs can be adapted to this task with minimal -task-specific data. We evaluate twelve model variations across five prominent -open LLM families using zero, one, few, and many-shot prompting to assess -performance across scenarios. Our experimental study identifies the -top-performing model through extensive experimentation of in-context -learning-related parameters, which we fine-tune to further enhance task -performance. The results highlight the strengths and limitations of LLMs in -recognizing citation intents, providing valuable insights for model selection -and prompt engineering. Additionally, we make our end-to-end evaluation -framework and models openly available for future use. +##### **Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection** +2406.16908v3 by Dinuka Sandun Udayantha, Kavindu Weerasinghe, Nima Wickramasinghe, Akila Abeyratne, Kithmin Wickremasinghe, Jithangi Wanigasinghe, Anjula De Silva, Chamira U. S. 
Edussooriya -摘要:本研究探討開放式大型語言模型 (LLM) 透過情境學習和微調來預測引文意圖的能力。與依賴於預訓練模型(例如 SciBERT)的傳統方法不同,後者需要廣泛的特定領域預訓練和專業架構,我們證明了通用 LLM 可以使用最少的特定任務數據來適應此任務。我們使用零次、一次、少次和多次提示評估五個著名的開放式 LLM 家族中的十二個模型變體,以評估不同場景的效能。我們的實驗研究透過廣泛的實驗來識別情境學習相關參數中效能最佳的模型,我們微調這些參數以進一步增強任務效能。結果突顯了 LLM 在識別引文意圖方面的優點和限制,為模型選擇和提示工程提供了有價值的見解。此外,我們將端到端評估架構和模型公開供未來使用。 +The neonatal period is the most vulnerable time for the development of +seizures. Seizures in the immature brain lead to detrimental consequences and +therefore require early diagnosis. The gold standard for neonatal seizure +detection currently relies on continuous video-EEG monitoring, which involves +recording multi-channel electroencephalogram (EEG) alongside real-time video +monitoring within a neonatal intensive care unit (NICU). However, video-EEG +monitoring technology requires clinical expertise and is often limited to +technologically advanced and resourceful settings. Cost-effective new +techniques could help the medical fraternity make an accurate diagnosis and +advocate treatment without delay. In this work, a novel explainable deep +learning model to automate the neonatal seizure detection process with a +reduced EEG montage is proposed, which employs convolutional nets, graph +attention layers, and fully connected layers. Beyond its ability to detect +seizures in real-time with a reduced montage, this model offers the unique +advantage of real-time interpretability. By evaluating the performance on the +Zenodo dataset with 10-fold cross-validation, the presented model achieves an +absolute improvement of 8.31% and 42.86% in area under curve (AUC) and recall, +respectively.
-##### **Less is More: Improving LLM Alignment via Preference Data Selection** -2502.14560v1 by Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, Xiangnan He +摘要:新生兒期是大腦發育最脆弱的時期,容易出現癲癇發作。大腦發育不成熟時出現癲癇發作會造成不良後果,因此需要及早診斷。目前新生兒癲癇發作的黃金標準依賴於連續的視訊腦電圖 (EEG) 監測;其中包括在新生兒加護病房 (NICU) 內同時進行多頻道腦電圖 (EEG) 記錄和即時視訊監控。然而,視訊腦電圖監控技術需要臨床專業知識,而且通常僅限於技術先進且資源豐富的環境。具成本效益的新技術可以幫助醫療界準確診斷並立即提倡治療。在這項工作中,提出了一個新穎的可解釋深度學習模型,以自動化新生兒癲癇發作偵測過程,並採用減少的腦電圖裝置,其中採用了卷積神經網路、圖形注意力層和全連接層。除了能夠使用減少的裝置即時偵測癲癇發作外,此模型還提供了即時可解釋性的獨特優勢。透過在 Zenodo 資料集上使用 10 倍交叉驗證評估效能,所提出的模型在曲線下面積 (AUC) 和召回率方面分別達到了 8.31% 和 42.86% 的絕對改善。 -Direct Preference Optimization (DPO) has emerged as a promising approach for -aligning large language models with human preferences. While prior work mainly -extends DPO from the aspect of the objective function, we instead improve DPO -from the largely overlooked but critical aspect of data selection. -Specifically, we address the issue of parameter shrinkage caused by noisy data -by proposing a novel margin-maximization principle for dataset curation in DPO -training. To accurately estimate margins for data selection, we propose a -dual-margin guided approach that considers both external reward margins and -implicit DPO reward margins. Extensive experiments demonstrate that our method -reduces computational cost dramatically while improving performance. -Remarkably, by using just 10\% of the Ultrafeedback dataset, our approach -achieves 3\% to 8\% improvements across various Llama and Mistral series models -on the AlpacaEval 2.0 benchmark. Furthermore, our approach seamlessly extends -to iterative DPO, yielding a roughly 3\% improvement with 25\% online data, -while further reducing training time. These results highlight the potential of -data selection strategies for advancing preference optimization. 
+##### **Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques** +2406.00532v1 by Samita Bai, Sidra Nasir, Rizwan Ahmed Khan, Sheeraz Arif, Alexandre Meyer, Hubert Konik -摘要:直接偏好最佳化 (DPO) 已成為一種有希望的方法,可將大型語言模型與人類偏好保持一致。雖然先前的研究主要從目標函數的角度延伸 DPO,但我們反而從資料選擇這個極易被忽略但至關重要的角度改進 DPO。 -具體來說,我們透過提出一個用於 DPO 訓練中資料集整理的新邊際最大化原則,來解決由雜訊資料造成的參數收縮問題。為了準確估計資料選擇的邊際,我們提出一個雙邊際引導方法,它同時考慮外部獎勵邊際和隱含 DPO 獎勵邊際。大規模的實驗證明,我們的這種方法大幅降低了運算成本,同時改善了效能。 -值得注意的是,我們的這種方法僅使用 Ultrafeedback 資料集的 10%,便在 AlpacaEval 2.0 基準上,在各種 Llama 和 Mistral 系列模型中取得了 3% 到 8% 的改進。此外,我們的這種方法可以無縫地延伸到迭代 DPO,在使用 25% 線上資料的情況下產生了大約 3% 的改進,同時進一步減少了訓練時間。這些結果突顯了資料選擇策略在推進偏好最佳化方面的潛力。 +Breast cancer (BC) stands as one of the most common malignancies affecting +women worldwide, necessitating advancements in diagnostic methodologies for +better clinical outcomes. This article provides a comprehensive exploration of +the application of Explainable Artificial Intelligence (XAI) techniques in the +detection and diagnosis of breast cancer. As Artificial Intelligence (AI) +technologies continue to permeate the healthcare sector, particularly in +oncology, the need for transparent and interpretable models becomes imperative +to enhance clinical decision-making and patient care. This review discusses the +integration of various XAI approaches, such as SHAP, LIME, Grad-CAM, and +others, with machine learning and deep learning models utilized in breast +cancer detection and classification. By investigating the modalities of breast +cancer datasets, including mammograms, ultrasounds and their processing with +AI, the paper highlights how XAI can lead to more accurate diagnoses and +personalized treatment plans. It also examines the challenges in implementing +these techniques and the importance of developing standardized metrics for +evaluating XAI's effectiveness in clinical settings. 
Through detailed analysis +and discussion, this article aims to highlight the potential of XAI in bridging +the gap between complex AI models and practical healthcare applications, +thereby fostering trust and understanding among medical professionals and +improving patient outcomes. -##### **FUIA: Model Inversion Attack against Federated Unlearning** -2502.14558v1 by Lei Zhou, Youwen Zhu +摘要:乳癌 (BC) 是影響全球女性最常見的惡性腫瘤之一,因此需要進步的診斷方法,以改善臨床結果。本文全面探討了可解釋人工智慧 (XAI) 技術在乳癌偵測和診斷中的應用。隨著人工智慧 (AI) 技術持續滲透醫療保健領域,特別是在腫瘤學中,透明且可解釋的模型需求變得勢在必行,以增強臨床決策制定和患者照護。此篇評論探討了各種 XAI 方法的整合,例如 SHAP、LIME、Grad-CAM 等,以及用於乳癌偵測和分類的機器學習和深度學習模型。透過探討乳癌資料集的模式,包括乳房攝影、超音波及其在 AI 中的處理,本文重點說明 XAI 如何能導致更準確的診斷和個人化治療計畫。它也探討了實施這些技術的挑戰,以及制定標準化評量指標以評估 XAI 在臨床環境中的有效性的重要性。透過詳細的分析和討論,本文旨在強調 XAI 在縮小複雜 AI 模型與實務醫療保健應用之間差距的潛力,進而促進醫療專業人員之間的信任與理解,並改善患者的結果。 -With the introduction of regulations related to the ``right to be forgotten", -federated learning (FL) is facing new privacy compliance challenges. To address -these challenges, researchers have proposed federated unlearning (FU). However, -existing FU research has primarily focused on improving the efficiency of -unlearning, with less attention paid to the potential privacy vulnerabilities -inherent in these methods. To address this gap, we draw inspiration from -gradient inversion attacks in FL and propose the federated unlearning inversion -attack (FUIA). The FUIA is specifically designed for the three types of FU -(sample unlearning, client unlearning, and class unlearning), aiming to provide -a comprehensive analysis of the privacy leakage risks associated with FU. In -FUIA, the server acts as an honest-but-curious attacker, recording and -exploiting the model differences before and after unlearning to expose the -features and labels of forgotten data. FUIA significantly leaks the privacy of -forgotten data and can target all types of FU. 
This attack contradicts the goal -of FU to eliminate specific data influence, instead exploiting its -vulnerabilities to recover forgotten data and expose its privacy flaws. -Extensive experimental results show that FUIA can effectively reveal the -private information of forgotten data. To mitigate this privacy leakage, we -also explore two potential defense methods, although these come at the cost of -reduced unlearning effectiveness and the usability of the unlearned model. +##### **Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition** +2406.01624v2 by Alaa Nfissi, Wassim Bouachir, Nizar Bouguila, Brian Mishara -摘要:隨著「被遺忘權」相關法規的推出, -聯盟學習 (FL) 面臨新的隱私合規挑戰。為了應對 -這些挑戰,研究人員提出了聯盟取消學習 (FU)。然而, -現有的 FU 研究主要集中在提高取消學習的效率,較少關注這些方法中固有的潛在隱私漏洞。為了解決這個差距,我們從 -FL 中的梯度反演攻擊中汲取靈感,並提出聯盟取消學習反演 -攻擊 (FUIA)。FUIA 專門設計用於三種類型的 FU -(樣本取消學習、客戶端取消學習和類別取消學習),旨在提供 -對與 FU 相關的隱私洩露風險的全面分析。在 -FUIA 中,伺服器充當誠實但好奇的攻擊者,記錄並 -利用取消學習前後的模型差異來揭露遺忘資料的功能和標籤。FUIA 大幅洩露遺忘資料的隱私,並且可以針對所有類型的 FU。此攻擊與 FU 消除特定資料影響的目標相矛盾,而是利用其 -漏洞來恢復遺忘資料並揭露其隱私缺陷。廣泛的實驗結果表明 FUIA 可以有效揭露遺忘資料的私人資訊。為了減輕這種隱私洩露,我們 -還探索了兩種潛在的防禦方法,儘管這些方法以降低取消學習的有效性和已取消學習模型的可用性為代價。 +Speech emotion recognition (SER) has gained significant attention due to its +several application fields, such as mental health, education, and +human-computer interaction. However, the accuracy of SER systems is hindered by +high-dimensional feature sets that may contain irrelevant and redundant +information. To overcome this challenge, this study proposes an iterative +feature boosting approach for SER that emphasizes feature relevance and +explainability to enhance machine learning model performance. Our approach +involves meticulous feature selection and analysis to build efficient SER +systems. In addressing our main problem through model explainability, we employ +a feature evaluation loop with Shapley values to iteratively refine feature +sets. 
This process strikes a balance between model performance and +transparency, which enables a comprehensive understanding of the model's +predictions. The proposed approach offers several advantages, including the +identification and removal of irrelevant and redundant features, leading to a +more effective model. Additionally, it promotes explainability, facilitating +comprehension of the model's predictions and the identification of crucial +features for emotion determination. The effectiveness of the proposed method is +validated on the SER benchmarks of the Toronto emotional speech set (TESS), +Berlin Database of Emotional Speech (EMO-DB), Ryerson Audio-Visual Database of +Emotional Speech and Song (RAVDESS), and Surrey Audio-Visual Expressed Emotion +(SAVEE) datasets, outperforming state-of-the-art methods. To the best of our +knowledge, this is the first work to incorporate model explainability into an +SER framework. The source code of this paper is publicly available via this +https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition. 
-##### **Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling** -2502.14553v1 by Eric Egli, Matteo Manica, Jannis Born +摘要:語音情緒辨識 (SER) 由於其在心理健康、教育和人機互動等多個應用領域而備受關注。然而,SER 系統的準確性受到高維特徵集的阻礙,這些特徵集可能包含不相關和冗餘的資訊。為了克服這個挑戰,本研究提出了一種用於 SER 的迭代特徵提升方法,該方法強調特徵相關性和可解釋性,以增強機器學習模型的效能。我們的做法涉及仔細的特徵選擇和分析,以建立高效的 SER 系統。為了透過模型可解釋性解決我們的核心問題,我們採用了具有 Shapley 值的特徵評估迴圈,以反覆改善特徵集。這個過程在模型效能和透明度之間取得平衡,這使得我們能夠全面了解模型的預測。所提出的方法提供了多項優點,包括識別和移除不相關和冗餘的特徵,從而建立更有效的模型。此外,它促進了可解釋性,有助於理解模型的預測以及識別情緒決定的關鍵特徵。所提出的方法的有效性已在多倫多情緒語音集 (TESS)、柏林情緒語音資料庫 (EMO-DB)、賴爾森音訊視覺情緒語音和歌曲資料庫 (RAVDESS) 和薩里音訊視覺表達情緒 (SAVEE) 資料集的 SER 基準上得到驗證,其效能優於現有方法。據我們所知,這是第一個將模型可解釋性納入 SER 架構的研究。本文的原始碼可透過此連結公開取得:https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition。 -Bytes form the basis of the digital world and thus are a promising building -block for multimodal foundation models. Recently, Byte Language Models (BLMs) -have emerged to overcome tokenization, yet the excessive length of bytestreams -requires new architectural paradigms. Therefore, we present the Multiscale Byte -Language Model (MBLM), a model-agnostic hierarchical decoder stack that allows -training with context windows of $5$M bytes on single GPU in full model -precision. We thoroughly examine MBLM's performance with Transformer and Mamba -blocks on both unimodal and multimodal tasks. Our experiments demonstrate that -hybrid architectures are efficient in handling extremely long byte sequences -during training while achieving near-linear generational efficiency. To the -best of our knowledge, we present the first evaluation of BLMs on visual Q\&A -tasks and find that, despite serializing images and the absence of an encoder, -a MBLM with pure next token prediction can match custom CNN-LSTM architectures -with designated classification heads. 
We show that MBLMs exhibit strong -adaptability in integrating diverse data representations, including pixel and -image filestream bytes, underlining their potential toward omnimodal foundation -models. Source code is publicly available at: -https://github.com/ai4sd/multiscale-byte-lm +##### **The Explanation Necessity for Healthcare AI** +2406.00216v1 by Michail Mamalakis, Héloïse de Vareilles, Graham Murray, Pietro Lio, John Suckling + +Explainability is often critical to the acceptable implementation of +artificial intelligence (AI). Nowhere is this more important than healthcare +where decision-making directly impacts patients and trust in AI systems is +essential. This trust is often built on the explanations and interpretations +the AI provides. Despite significant advancements in AI interpretability, there +remains the need for clear guidelines on when and to what extent explanations +are necessary in the medical context. We propose a novel categorization system +with four distinct classes of explanation necessity, guiding the level of +explanation required: patient or sample (local) level, cohort or dataset +(global) level, or both levels. We introduce a mathematical formulation that +distinguishes these categories and offers a practical framework for researchers +to determine the necessity and depth of explanations required in medical AI +applications. Three key factors are considered: the robustness of the +evaluation protocol, the variability of expert observations, and the +representation dimensionality of the application. In this perspective, we +address the question: When does an AI medical application need to be explained, +and at what level of detail? 
-摘要:位元組構成數位世界的基礎,因此是多模態基礎模型的一個有前途的建構模組。最近,位元組語言模型 (BLM) 已應運而生,以克服標記化,但位元組串流的過長需要新的架構範例。因此,我們提出多尺度位元組語言模型 (MBLM),這是一個與模型無關的分層解碼器堆疊,允許在單一 GPU 上以完整的模型精度訓練 500 萬位元組的內容視窗。我們徹底檢驗了 MBLM 在單模態和多模態任務上使用 Transformer 和 Mamba 區塊的效能。我們的實驗證明,混合架構在處理訓練期間極長的位元組序列時很有效率,同時達到近乎線性的生成效率。據我們所知,我們提出在視覺問答任務上對 BLM 的首次評估,並發現,儘管序列化影像且沒有編碼器,但具有純粹下一個標記預測的 MBLM 可以匹配具有指定分類標頭的客製化 CNN-LSTM 架構。我們表明,MBLM 在整合各種資料表示形式方面表現出強大的適應性,包括像素和影像檔案串流位元組,強調它們朝向全模態基礎模型的潛力。原始碼已公開於: -https://github.com/ai4sd/multiscale-byte-lm +摘要:可解释性通常对于人工智能 (AI) 的可接受实施至关重要。在医疗保健领域,这一点尤为重要,因为决策直接影响患者,并且对 AI 系统的信任至关重要。这种信任通常建立在 AI 提供的解释和诠释之上。尽管 AI 可解释性取得了重大进展,但仍然需要明确的指导方针,说明在医疗环境中何时以及在多大程度上需要解释。我们提出了一种新颖的分类系统,该系统具有四种不同的解释必要性类别,指导所需的解释级别:患者或样本(局部)级别、队列或数据集(全局)级别,或两个级别。我们引入了一个数学公式,该公式区分了这些类别,并为研究人员提供了一个实用框架,以确定医疗 AI 应用中所需的解释的必要性和深度。考虑了三个关键因素:评估协议的稳健性、专家观察的可变性以及应用程序的表示维数。从这个角度来看,我们解决了这个问题:AI 医疗应用何时需要解释,以及需要解释到何种程度? -##### **Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks** -2502.14546v1 by Maya Bechler-Speicher, Ben Finkelshtein, Fabrizio Frasca, Luis Müller, Jan Tönshoff, Antoine Siraudin, Viktor Zaverkin, Michael M. Bronstein, Mathias Niepert, Bryan Perozzi, Mikhail Galkin, Christopher Morris +##### **Interdisciplinary Expertise to Advance Equitable Explainable AI** +2406.18563v1 by Chloe R. Bennett, Heather Cole-Lewis, Stephanie Farquhar, Naama Haamel, Boris Babenko, Oran Lang, Mat Fleck, Ilana Traynis, Charles Lau, Ivor Horn, Courtney Lyles -While machine learning on graphs has demonstrated promise in drug design and -molecular property prediction, significant benchmarking challenges hinder its -further progress and relevance. Current benchmarking practices often lack focus -on transformative, real-world applications, favoring narrow domains like -two-dimensional molecular graphs over broader, impactful areas such as -combinatorial optimization, relational databases, or chip design. 
Additionally, -many benchmark datasets poorly represent the underlying data, leading to -inadequate abstractions and misaligned use cases. Fragmented evaluations and an -excessive focus on accuracy further exacerbate these issues, incentivizing -overfitting rather than fostering generalizable insights. These limitations -have prevented the development of truly useful graph foundation models. This -position paper calls for a paradigm shift toward more meaningful benchmarks, -rigorous evaluation protocols, and stronger collaboration with domain experts -to drive impactful and reliable advances in graph learning research, unlocking -the potential of graph learning. +The field of artificial intelligence (AI) is rapidly influencing health and +healthcare, but bias and poor performance persist for populations who face +widespread structural oppression. Previous work has clearly outlined the need +for more rigorous attention to data representativeness and model performance to +advance equity and reduce bias. However, there is an opportunity to also +improve the explainability of AI by leveraging best practices of social +epidemiology and health equity to help us develop hypotheses for associations +found. In this paper, we focus on explainable AI (XAI) and describe a framework +for interdisciplinary expert panel review to discuss and critically assess AI +model explanations from multiple perspectives and identify areas of bias and +directions for future research. We emphasize the importance of the +interdisciplinary expert panel to produce more accurate, equitable +interpretations which are historically and contextually informed. +Interdisciplinary panel discussions can help reduce bias, identify potential +confounders, and identify opportunities for additional research where there are +gaps in the literature. In turn, these insights can suggest opportunities for +AI model improvement.
-摘要:儘管圖形上的機器學習在藥物設計和分子屬性預測方面已展現潛力,但顯著的基準挑戰阻礙了其進一步進展和相關性。目前的基準實務往往缺乏對轉型性、真實世界應用的關注,偏好於狹窄的領域,例如二維分子圖形,而不是組合最佳化、關係資料庫或晶片設計等更廣泛、更有影響力的領域。此外,許多基準資料集無法充分表示基礎資料,導致抽象化不充分和使用案例錯位。支離破碎的評估和過度關注準確性進一步加劇了這些問題,激勵過度擬合,而不是培養可概括的見解。這些限制阻礙了真正有用的圖形基礎模型的開發。這篇立場文件呼籲將範例轉變為更有意義的基準、嚴格的評估協定,以及與領域專家的更強大合作,以推動圖形學習研究中具有影響力和可靠性的進展,釋放圖形學習的潛力。 +摘要:人工智慧 (AI) 領域正快速影響著健康與醫療保健,但對於面臨廣泛結構性壓迫的人群來說,偏見和不良表現依然存在。先前的研究已清楚說明,需要更嚴格地注意資料代表性和模型效能,以促進公平性並減少偏見。然而,我們有機會透過運用社會流行病學和健康公平的最佳實務,來改善 AI 的可解釋性,以幫助我們針對發現的關聯性,發展假設。在本文中,我們專注於可解釋 AI (XAI),並描述一個跨領域專家小組審查架構,以從多重觀點討論和批判性評估 AI 模型的解釋,並找出偏見領域和未來研究的方向。我們強調跨領域專家小組對於產生更準確、公平的詮釋至關重要,而這些詮釋是根據歷史和脈絡而來的。跨領域小組討論有助於減少偏見、找出潛在的混淆因素,並在文獻中有缺口時找出額外研究的機會。反過來,這些見解可以建議 AI 模型改進的機會。 -##### **LLM-based User Profile Management for Recommender System** -2502.14541v1 by Seunghwan Bang, Hwanjun Song +##### **"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts** +2407.11978v1 by Hubert D. Zając, Jorge M. N. Ribeiro, Silvia Ingala, Simona Gentile, Ruth Wanjohi, Samuel N. Gitau, Jonathan F. Carlsen, Michael B. Nielsen, Tariq O. Andersen -The rapid advancement of Large Language Models (LLMs) has opened new -opportunities in recommender systems by enabling zero-shot recommendation -without conventional training. Despite their potential, most existing works -rely solely on users' purchase histories, leaving significant room for -improvement by incorporating user-generated textual data, such as reviews and -product descriptions. Addressing this gap, we propose PURE, a novel LLM-based -recommendation framework that builds and maintains evolving user profiles by -systematically extracting and summarizing key information from user reviews. -PURE consists of three core components: a Review Extractor for identifying user -preferences and key product features, a Profile Updater for refining and -updating user profiles, and a Recommender for generating personalized -recommendations using the most current profile. 
To evaluate PURE, we introduce -a continuous sequential recommendation task that reflects real-world scenarios -by adding reviews over time and updating predictions incrementally. Our -experimental results on Amazon datasets demonstrate that PURE outperforms -existing LLM-based methods, effectively leveraging long-term user information -while managing token limitations. +Artificial Intelligence (AI) systems repeatedly match or outperform radiologists in +lab experiments. However, real-world implementations of radiological AI-based +systems are found to provide little to no clinical value. This paper explores +how to design AI for clinical usefulness in different contexts. We conducted 19 +design sessions and design interventions with 13 radiologists from 7 clinical +sites in Denmark and Kenya, based on three iterations of a functional AI-based +prototype. Ten sociotechnical dependencies were identified as crucial for the +design of AI in radiology. We conceptualised four technical dimensions that +must be configured to the intended clinical context of use: AI functionality, +AI medical focus, AI decision threshold, and AI Explainability. We present four +design recommendations on how to address dependencies pertaining to the medical +knowledge, clinic type, user expertise level, patient context, and user +situation that condition the configuration of these technical dimensions.
-摘要:大型語言模型 (LLM) 的快速進步為推薦系統開啟了新的機會,它能實現零次學習推薦,而無需傳統訓練。儘管有潛力,但現有的大部分工作僅依賴於使用者的購買記錄,透過納入使用者產生的文字資料,例如評論和產品說明,仍有很大的改進空間。針對此差距,我們提出 PURE,一個新穎的基於 LLM 的推薦架構,透過系統性地從使用者評論中提取和總結關鍵資訊,建立並維護不斷演進的使用者檔案。PURE 由三個核心組成部分組成:一個評論萃取器,用於識別使用者的喜好和產品主要功能;一個檔案更新器,用於精煉和更新使用者檔案;一個推薦器,用於使用最新的檔案產生個人化推薦。為了評估 PURE,我們引入一個連續順序推薦任務,透過隨著時間新增評論和遞增更新預測,反映真實世界的場景。我們在 Amazon 資料集上的實驗結果證明,PURE 優於現有的基於 LLM 的方法,在管理符號限制的同時,有效地利用長期使用者資訊。 +摘要:人工智慧(AI)在實驗室實驗中不斷地與放射科醫師匹敵或表現得更出色。然而,發現放射科 AI 為基礎系統的實際執行幾乎沒有提供臨床價值。本文探討如何為 AI 設計在不同情境中臨床上的效用。我們根據功能性 AI 為基礎原型的三次迭代,在丹麥和肯亞的 7 個臨床場域與 13 位放射科醫師進行了 19 次設計會議和設計介入。十個社會技術依賴關係被認為對於放射科中 AI 的設計至關重要。我們概念化了四個技術面向,必須根據預期的臨床使用情境進行設定:AI 功能、AI 醫療重點、AI 決策門檻,以及 AI 可解釋性。我們提出四項設計建議,說明如何處理與醫療知識、診所類型、使用者專業知識等級、患者情境,以及影響這些技術面向設定的使用者情境相關的依賴關係。 -##### **LoRA-GGPO: Mitigating Double Descent in LoRA Fine-Tuning via Gradient-Guided Perturbation Optimization** -2502.14538v1 by Yupeng Chang, Chenlu Guo, Yi Chang, Yuan Wu +##### **Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making** +2405.16424v1 by Min Hun Lee, Silvana Xin Yi Choo, Shamala D/O Thilarajah -Large Language Models (LLMs) have achieved remarkable success in natural -language processing, but their full fine-tuning remains resource-intensive. -Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation -(LoRA), have emerged as a practical solution by approximating parameter updates -with low-rank matrices. However, LoRA often exhibits a "double descent" -phenomenon during fine-tuning, where model performance degrades due to -overfitting and limited expressiveness caused by low-rank constraints. To -address this issue, we propose LoRA-GGPO (Gradient-Guided Perturbation -Optimization), a novel method that leverages gradient and weight norms to -generate targeted perturbations. By optimizing the sharpness of the loss -landscape, LoRA-GGPO guides the model toward flatter minima, mitigating the -double descent problem and improving generalization. 
Extensive experiments on -natural language understanding (NLU) and generation (NLG) tasks demonstrate -that LoRA-GGPO outperforms LoRA and its state-of-the-art variants. Furthermore, -extended experiments specifically designed to analyze the double descent -phenomenon confirm that LoRA-GGPO effectively alleviates this issue, producing -more robust and generalizable models. Our work provides a robust and efficient -solution for fine-tuning LLMs, with broad applicability in real-world -scenarios. The code is available at https://github.com/llm172/LoRA-GGPO. +With advanced AI/ML, there has been growing research on explainable AI (XAI) +and studies on how humans interact with AI and XAI for effective human-AI +collaborative decision-making. However, we still have a lack of understanding +of how AI systems and XAI should be first presented to users without technical +backgrounds. In this paper, we present the findings of semi-structured +interviews with health professionals (n=12) and students (n=4) majoring in +medicine and health to study how to improve onboarding with AI and XAI. For the +interviews, we built upon human-AI interaction guidelines to create onboarding +materials of an AI system for stroke rehabilitation assessment and AI +explanations and introduce them to the participants. Our findings reveal that +beyond presenting traditional performance metrics on AI, participants desired +benchmark information, the practical benefits of AI, and interaction trials to +better contextualize AI performance, and refine the objectives and performance +of AI. Based on these findings, we highlight directions for improving +onboarding with AI and XAI and human-AI collaborative decision-making. 
-摘要:大型語言模型 (LLM) 在自然語言處理方面取得了顯著的成功,但它們的完全微調仍然需要大量資源。參數高效微調 (PEFT) 方法(例如低秩適應 (LoRA))已成為一種實用的解決方案,它通過低秩矩陣近似參數更新。然而,LoRA 在微調過程中經常表現出「雙重下降」現象,其中模型性能會因過度擬合和低秩約束導致的表達能力有限而下降。為了解決這個問題,我們提出了 LoRA-GGPO(梯度引導擾動優化),這是一種利用梯度和權重範數來產生目標擾動的新方法。通過優化損失函數曲面的陡度,LoRA-GGPO 引導模型朝向更平坦的最小值,從而減輕雙重下降問題並改善泛化能力。在自然語言理解 (NLU) 和生成 (NLG) 任務中進行的廣泛實驗表明,LoRA-GGPO 優於 LoRA 及其最先進的變體。此外,專門設計用於分析雙重下降現象的延伸實驗證實,LoRA-GGPO 有效地緩解了這個問題,產生了更強大且更具泛化能力的模型。我們的研究為微調 LLM 提供了一個強大且高效的解決方案,在現實世界場景中具有廣泛的適用性。代碼可在 https://github.com/llm172/LoRA-GGPO 獲得。 +摘要:隨著先進的 AI/ML,對可解釋 AI (XAI) 的研究不斷增加,以及關於人類如何與 AI 和 XAI 互動以進行有效的人工智慧協作決策制定。然而,我們仍然缺乏對 AI 系統和 XAI 應如何首先呈現給沒有技術背景的用戶的了解。在本文中,我們展示了與醫療專業人員 (n=12) 和主修醫學和健康的學生 (n=4) 進行半結構化訪談的結果,以研究如何改善 AI 和 XAI 的入門。對於訪談,我們建立在人機互動準則之上,為中風康復評估和 AI 解釋的 AI 系統創建入門材料,並將它們介紹給參與者。我們的研究結果表明,除了呈現傳統的 AI 性能指標外,參與者還希望基准信息、AI 的實際好處以及交互試驗,以更好地將 AI 性能情境化,並完善 AI 的目標和性能。根據這些發現,我們強調了改進 AI 和 XAI 以及人機協作決策制定的入門方向。 -##### **CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models** -2502.14529v1 by Zhenhong Zhou, Zherui Li, Jie Zhang, Yuanhe Zhang, Kun Wang, Yang Liu, Qing Guo +##### **Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach** +2405.17502v1 by Ziming Liu, Longjian Liu, Robert E. Heidel, Xiaopeng Zhao -Large Language Model-based Multi-Agent Systems (LLM-MASs) have demonstrated -remarkable real-world capabilities, effectively collaborating to complete -complex tasks. While these systems are designed with safety mechanisms, such as -rejecting harmful instructions through alignment, their security remains -largely unexplored. This gap leaves LLM-MASs vulnerable to targeted -disruptions. In this paper, we introduce Contagious Recursive Blocking Attacks -(Corba), a novel and simple yet highly effective attack that disrupts -interactions between agents within an LLM-MAS. 
Corba leverages two key -properties: its contagious nature allows it to propagate across arbitrary -network topologies, while its recursive property enables sustained depletion of -computational resources. Notably, these blocking attacks often involve -seemingly benign instructions, making them particularly challenging to mitigate -using conventional alignment methods. We evaluate Corba on two widely-used -LLM-MASs, namely, AutoGen and Camel across various topologies and commercial -models. Additionally, we conduct more extensive experiments in open-ended -interactive LLM-MASs, demonstrating the effectiveness of Corba in complex -topology structures and open-source models. Our code is available at: -https://github.com/zhrli324/Corba. +This article uses machine learning (ML) and explainable artificial +intelligence (XAI) techniques to investigate the relationship between +nutritional status and mortality rates associated with Alzheimer's disease (AD). +The Third National Health and Nutrition Examination Survey (NHANES III) +database is employed for analysis. The random forest model is selected as the +base model for XAI analysis, and the Shapley Additive Explanations (SHAP) +method is used to assess feature importance. The results highlight significant +nutritional factors such as serum vitamin B12 and glycated hemoglobin. The +study demonstrates the effectiveness of random forests in predicting AD +mortality compared to other diseases. This research provides insights into the +impact of nutrition on AD and contributes to a deeper understanding of disease +progression. 
-摘要:基於大型語言模型的多主體系統(LLM-MAS)已展現出卓越的真實世界能力,有效地協作以完成複雜任務。儘管這些系統設計有安全機制,例如透過對齊拒絕有害指令,但其安全性仍未得到充分探討。此一缺口讓 LLM-MAS 易受針對性的破壞。在本文中,我們介紹了傳染性遞迴封鎖攻擊(Corba),這是一種新穎且簡單但極為有效的攻擊,會破壞 LLM-MAS 中主體之間的互動。Corba 利用了兩個關鍵特性:其傳染性使其能夠在任意網路拓撲中傳播,而其遞迴特性則能持續耗盡運算資源。值得注意的是,這些封鎖攻擊通常涉及看似良性的指令,這使得使用傳統對齊方法來減輕攻擊特別具有挑戰性。我們在兩個廣泛使用的 LLM-MAS,即 AutoGen 和 Camel 上評估了 Corba,涵蓋了各種拓撲和商業模型。此外,我們在開放式互動 LLM-MAS 中進行了更廣泛的實驗,證明了 Corba 在複雜拓撲結構和開源模型中的有效性。我們的程式碼可在以下網址取得:https://github.com/zhrli324/Corba。 +摘要:本文使用機器學習 (ML) 和可解釋人工智慧 (XAI) 技術來探討營養狀況與阿茲海默症 (AD) 相關的死亡率之間的關係。採用第三次全國健康與營養檢查調查 (NHANES III) 資料庫進行分析。選擇隨機森林模型作為 XAI 分析的基礎模型,並使用 Shapley Additive Explanations (SHAP) 方法來評估特徵重要性。結果突顯了重要的營養因素,例如血清維生素 B12 和糖化血紅蛋白。該研究證明了隨機森林在預測 AD 死亡率方面相較於其他疾病的有效性。本研究提供了營養對 AD 的影響的見解,並有助於更深入地了解疾病的進展。 -##### **Small Graph Is All You Need: DeepStateGNN for Scalable Traffic Forecasting** -2502.14525v1 by Yannick Wölker, Arash Hajisafi, Cyrus Shahabi, Matthias Renz +##### **Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone** +2407.11974v1 by Catalina Gomez, Ruolin Wang, Katharina Breininger, Corinne Casey, Chris Bradley, Mitchell Pavlak, Alex Pham, Jithin Yohannan, Mathias Unberath -We propose a novel Graph Neural Network (GNN) model, named DeepStateGNN, for -analyzing traffic data, demonstrating its efficacy in two critical tasks: -forecasting and reconstruction. Unlike typical GNN methods that treat each -traffic sensor as an individual graph node, DeepStateGNN clusters sensors into -higher-level graph nodes, dubbed Deep State Nodes, based on various similarity -criteria, resulting in a fixed number of nodes in a Deep State graph. The term -"Deep State" nodes is a play on words, referencing hidden networks of power -that, like these nodes, secretly govern traffic independently of visible -sensors. 
These Deep State Nodes are defined by several similarity factors, -including spatial proximity (e.g., sensors located nearby in the road network), -functional similarity (e.g., sensors on similar types of freeways), and -behavioral similarity under specific conditions (e.g., traffic behavior during -rain). This clustering approach allows for dynamic and adaptive node grouping, -as sensors can belong to multiple clusters and clusters may evolve over time. -Our experimental results show that DeepStateGNN offers superior scalability and -faster training, while also delivering more accurate results than competitors. -It effectively handles large-scale sensor networks, outperforming other methods -in both traffic forecasting and reconstruction accuracy. +Primary care providers are vital for initial triage and referrals to +specialty care. In glaucoma, asymptomatic and fast progression can lead to +vision loss, necessitating timely referrals to specialists. However, primary +eye care providers may not identify urgent cases, potentially delaying care. +Artificial Intelligence (AI) offering explanations could enhance their referral +decisions. We investigate how various AI explanations help providers +distinguish between patients needing immediate or non-urgent specialist +referrals. We built explainable AI algorithms to predict glaucoma surgery needs +from routine eyecare data as a proxy for identifying high-risk patients. We +incorporated intrinsic and post-hoc explainability and conducted an online +study with optometrists to assess human-AI team performance, measuring referral +accuracy and analyzing interactions with AI, including agreement rates, task +time, and user experience perceptions. AI support enhanced referral accuracy +among 87 participants (59.9%/50.8% with/without AI), though Human-AI teams +underperformed compared to AI alone. Participants believed they included AI +advice more when using the intrinsic model, and perceived it more useful and +promising. 
Without explanations, deviations from AI recommendations increased. +AI support did not increase workload, confidence, and trust, but reduced +challenges. On a separate test set, our black-box and intrinsic models achieved +an accuracy of 77% and 71%, respectively, in predicting surgical outcomes. We +identify opportunities of human-AI teaming for glaucoma management in primary +eye care, noting that while AI enhances referral accuracy, it also shows a +performance gap compared to AI alone, even with explanations. Human involvement +remains essential in medical decision making, underscoring the need for future +research to optimize collaboration, ensuring positive experiences and safe AI +use. -摘要:我們提出一個名為 DeepStateGNN 的新穎圖形神經網路 (GNN) 模型,用於分析交通數據,並展示其在兩個關鍵任務中的效能:預測和重建。與將每個交通感測器視為個別圖形節點的典型 GNN 方法不同,DeepStateGNN 會根據各種相似性準則將感測器群集到較高層級的圖形節點中,稱為 Deep State 節點,這會在 Deep State 圖形中產生固定數量的節點。「Deep State」節點這個術語是文字遊戲,指的是隱藏的權力網路,就像這些節點一樣,秘密地獨立於可見感測器管理交通。這些 Deep State 節點由幾個相似性因素定義,包括空間接近性(例如,位於道路網路中附近的感測器)、功能相似性(例如,位於類似類型高速公路上的感測器)以及特定條件下的行為相似性(例如,雨中的交通行為)。這種群集方法允許動態和自適應節點分組,因為感測器可以屬於多個群集,而且群集可能會隨著時間演變。我們的實驗結果顯示,DeepStateGNN 提供了卓越的可擴充性和更快的訓練速度,同時也比競爭對手提供了更準確的結果。它有效地處理了大規模感測器網路,在交通預測和重建準確度方面都優於其他方法。 +摘要:初級保健提供者對於最初的分流和轉診到專科照護至關重要。在青光眼的情況下,無症狀且快速惡化可能導致視力喪失,因此需要及時轉診給專家。然而,初級眼科保健提供者可能無法識別緊急情況,可能會延誤照護。提供解釋的人工智慧 (AI) 可以加強他們的轉診決策。我們研究各種 AI 解釋如何幫助提供者區分需要立即或非緊急專科轉診的患者。我們建立了解釋性 AI 演算法,以從例行眼科護理資料預測青光眼手術需求,作為識別高風險患者的代理。我們納入了內在和事後解釋性,並與驗光師進行了一項線上研究,以評估人機團隊的表現,衡量轉診準確度並分析與 AI 的互動,包括同意率、任務時間和使用者體驗感知。在 87 名參與者中,AI 支援提高了轉診準確度(使用 AI/未使用的比例為 59.9%/50.8%),儘管人機團隊的表現不如單獨使用 AI。參與者認為他們在使用內在模型時更多地納入了 AI 建議,並認為它更有用且更有希望。沒有解釋,AI 建議的偏差會增加。AI 支援並未增加工作量、信心和信任,但減少了挑戰。在一個單獨的測試集中,我們的黑盒子和內在模型在預測手術結果方面分別達到了 77% 和 71% 的準確度。我們找出在初級眼科保健中,人機團隊合作管理青光眼的機會,並注意到雖然 AI 提高了轉診準確度,但即使有解釋,它也顯示出與單獨使用 AI 相比的效能差距。人類參與在醫療決策中仍然至關重要,這強調了未來研究優化協作、確保正面經驗和安全使用 AI 的必要性。 -##### **Generative adversarial networks vs large language models: a comparative study on synthetic tabular data generation** -2502.14523v1 by Austin A. 
Barr, Robert Rozman, Eddie Guo +##### **Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery** +2406.18552v1 by Yingying Fang, Zihao Jin, Xiaodan Xing, Simon Walsh, Guang Yang -We propose a new framework for zero-shot generation of synthetic tabular -data. Using the large language model (LLM) GPT-4o and plain-language prompting, -we demonstrate the ability to generate high-fidelity tabular data without -task-specific fine-tuning or access to real-world data (RWD) for pre-training. -To benchmark GPT-4o, we compared the fidelity and privacy of LLM-generated -synthetic data against data generated with the conditional tabular generative -adversarial network (CTGAN), across three open-access datasets: Iris, Fish -Measurements, and Real Estate Valuation. Despite the zero-shot approach, GPT-4o -outperformed CTGAN in preserving means, 95% confidence intervals, bivariate -correlations, and data privacy of RWD, even at amplified sample sizes. Notably, -correlations between parameters were consistently preserved with appropriate -direction and strength. However, refinement is necessary to better retain -distributional characteristics. These findings highlight the potential of LLMs -in tabular data synthesis, offering an accessible alternative to generative -adversarial networks and variational autoencoders. +In medical imaging, particularly in early disease detection and prognosis +tasks, discerning the rationale behind an AI model's predictions is crucial for +evaluating the reliability of its decisions. Conventional explanation methods +face challenges in identifying discernible decisive features in medical image +classifications, where discriminative features are subtle or not immediately +apparent. To bridge this gap, we propose an explainable model that is equipped +with both decision reasoning and feature identification capabilities. 
Our +approach not only detects influential image patterns but also uncovers the +decisive features that drive the model's final predictions. By implementing our +method, we can efficiently identify and visualise class-specific features +leveraged by the data-driven model, providing insights into the decision-making +processes of deep learning models. We validated our model in the demanding +realm of medical prognosis task, demonstrating its efficacy and potential in +enhancing the reliability of AI in healthcare and in discovering new knowledge +in diseases where prognostic understanding is limited. + +摘要:在醫學影像中,特別是在早期疾病檢測和預後任務中,辨別 AI 模型預測背後的原理對於評估其決策的可靠性至關重要。傳統的解釋方法在識別醫學影像分類中可識別的決定性特徵時面臨挑戰,其中區別性特徵很微妙或並不明顯。為了彌合這一差距,我們提出了一個可解釋的模型,該模型具備決策推理和特徵識別能力。我們的做法不僅檢測有影響力的影像模式,還揭示了推動模型最終預測的決定性特徵。通過實施我們的模型,我們可以有效識別和視覺化由數據驅動模型利用的類特定特徵,從而深入了解深度學習模型的決策過程。我們在要求嚴格的醫學預後任務領域驗證了我們的模型,展示了其在提高 AI 在醫療保健中的可靠性和發現預後理解受限疾病的新知識方面的功效和潛力。 + +##### **The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach** +2405.13099v1 by Mohsen Jozani, Jason A. Williams, Ahmed Aleroud, Sarbottam Bhagat + +This study explores the relationship between informational support seeking +questions, responses, and helpfulness ratings in online health communities. We +created a labeled data set of question-response pairs and developed multimodal +machine learning and deep learning models to reliably predict informational +support questions and responses. We employed explainable AI to reveal the +emotions embedded in informational support exchanges, demonstrating the +importance of emotion in providing informational support. This complex +interplay between emotional and informational support has not been previously +researched. The study refines social support theory and lays the groundwork for +the development of user decision aids. Further implications are discussed. 
-摘要:我們提出一個新的架構,用於合成表格資料的零次學習產生。利用大型語言模型 (LLM) GPT-4o 和自然語言提示,我們證明了在沒有特定任務微調或取得真實世界資料 (RWD) 進行預訓練的情況下,產生高保真表格資料的能力。為了對 GPT-4o 進行基準測試,我們比較了 LLM 生成的合成資料與使用條件表格生成對抗網路 (CTGAN) 生成的資料在保真度和隱私性方面的表現,比較對象是三個開放取用的資料集:鳶尾花、魚類測量和房地產估價。儘管採用零次學習方法,GPT-4o 在保留平均值、95% 信賴區間、二元關聯和 RWD 的資料隱私方面都優於 CTGAN,即使在擴增的樣本大小下也是如此。值得注意的是,參數之間的關聯始終保持適當的方向和強度。然而,需要進行改進以更好地保留分佈特徵。這些發現突顯了 LLM 在表格資料合成中的潛力,為生成對抗網路和變異自動編碼器提供了可行的替代方案。 +摘要:本研究探討線上健康社群中尋求資訊支持的問題、回應,以及有幫助的評分之間的關係。我們建立了一組標記的問答配對資料集,並開發了多模態機器學習和深度學習模型,以可靠地預測資訊支持問題和回應。我們採用可解釋的 AI 來揭示資訊支持交流中蘊含的情緒,證明情緒在提供資訊支持中的重要性。這種情緒支持和資訊支持之間的複雜交互作用以前並未被研究過。本研究改進了社會支持理論,並為使用者決策輔助工具的開發奠定了基礎。討論了進一步的影響。 -##### **MultiSlav: Using Cross-Lingual Knowledge Transfer to Combat the Curse of Multilinguality** -2502.14509v1 by Artur Kot, Mikołaj Koszowski, Wojciech Chojnowski, Mieszko Rutkowski, Artur Nowakowski, Kamil Guttmann, Mikołaj Pokrywka +##### **ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education** +2405.10645v1 by Harris Bin Munawar, Nikolaos Misirlis -Does multilingual Neural Machine Translation (NMT) lead to The Curse of the -Multlinguality or provides the Cross-lingual Knowledge Transfer within a -language family? In this study, we explore multiple approaches for extending -the available data-regime in NMT and we prove cross-lingual benefits even in -0-shot translation regime for low-resource languages. With this paper, we -provide state-of-the-art open-source NMT models for translating between -selected Slavic languages. We released our models on the HuggingFace Hub -(https://hf.co/collections/allegro/multislav-6793d6b6419e5963e759a683) under -the CC BY 4.0 license. Slavic language family comprises morphologically rich -Central and Eastern European languages. Although counting hundreds of millions -of native speakers, Slavic Neural Machine Translation is under-studied in our -opinion. 
Recently, most NMT research focuses either on: high-resource languages -like English, Spanish, and German - in WMT23 General Translation Task 7 out of -8 task directions are from or to English; massively multilingual models -covering multiple language groups; or evaluation techniques. +In the era of exponential technology growth, one unexpected guest has claimed +a seat in classrooms worldwide, Artificial Intelligence. Generative AI, such as +ChatGPT, promises a revolution in education, yet it arrives with a double-edged +sword. Its potential for personalized learning is offset by issues of cheating, +inaccuracies, and educators struggling to incorporate it effectively into their +lesson design. We are standing on the brink of this educational frontier, and +it is clear that we need to navigate this terrain with a lot of care. This is a +major challenge that could undermine the integrity and value of our educational +process. So, how can we turn these challenges into opportunities? When used +inappropriately, AI tools can become the perfect tool for the cut copy paste +mentality, and quickly begin to corrode critical thinking, creativity, and deep +understanding, the most important skills in our rapidly changing world. +Teachers feel that they are not equipped to leverage this technology, widening +the digital divide among educators and institutions. Addressing these concerns +calls for an in depth research approach. We will employ empirical research, +drawing on the Technology Acceptance Model, to assess the attitudes toward +generative AI among educators and students. Understanding their perceptions, +usage patterns, and hurdles is the first crucial step in creating an effective +solution. 
The present study will be used as a process manual for future +researchers to apply, running their own data, based on the steps explained here -摘要:多語言神經機器翻譯 (NMT) 是否會導致多語言的詛咒,或在語言家族中提供跨語言知識轉移?在這項研究中,我們探討了多種擴展 NMT 中可用資料範圍的方法,並證明了即使在低資源語言的零次學習翻譯中也有跨語言的優點。透過這篇論文,我們提供了最先進的開源 NMT 模型,用於翻譯選定的斯拉夫語。我們在 HuggingFace Hub (https://hf.co/collections/allegro/multislav-6793d6b6419e5963e759a683) 下根據 CC BY 4.0 授權發布我們的模型。斯拉夫語系包含形態豐富的中歐和東歐語言。儘管擁有數億母語人士,但我們認為斯拉夫神經機器翻譯的研究不足。最近,大多數 NMT 研究都專注於:高資源語言,例如英語、西班牙語和德語 - 在 WMT23 一般翻譯任務中,8 個任務方向中有 7 個來自英語或翻譯成英語;涵蓋多個語言群組的大規模多語言模型;或評估技術。 +摘要:在科技飛速發展的時代,一位意外的訪客已在全球教室中佔有一席之地,那就是人工智慧。生成式 AI,例如 ChatGPT,承諾在教育領域掀起一場革命,但它卻是一把雙面刃。它在個人化學習方面的潛力,卻因作弊、不準確以及教育工作者難以將其有效融入教學設計等問題而抵銷。我們正站在這教育前沿的邊緣,顯然我們需要非常小心地探索這片領域。這是一個重大的挑戰,可能會損害我們教育過程的完整性和價值。那麼,我們如何將這些挑戰轉化為機遇?當不適當地使用時,AI 工具可能會成為複製貼上心態的完美工具,並迅速腐蝕批判性思維、創造力和深入理解,這些都是我們快速變化的世界中最重要的技能。教師們覺得他們沒有能力利用這項技術,這擴大了教育工作者和機構之間的數位鴻溝。解決這些問題需要深入的研究方法。我們將採用實證研究,借鑑技術接受模型,來評估教育工作者和學生對生成式 AI 的態度。了解他們的看法、使用模式和障礙是創造有效解決方案的第一個關鍵步驟。本研究將作為未來研究人員應用的流程手冊,根據此處說明的步驟運行他們自己的數據 -##### **Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases** -2502.14507v1 by Rena Gao, Xuetong Wu, Tatsuki Kuribayashi, Mingrui Ye, Siya Qi, Carsten Roever, Yuanxing Liu, Zheng Yuan, Jey Han Lau +##### **Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data** +2405.07590v1 by Camelia Oprea, Mike Grüne, Mateusz Buglowski, Lena Olivier, Thorsten Orlikowsky, Stefan Kowalewski, Mark Schoberer, André Stollenwerk -This study evaluates Large Language Models' (LLMs) ability to simulate -non-native-like English use observed in human second language (L2) learners -interfered with by their native first language (L1). In dialogue-based -interviews, we prompt LLMs to mimic L2 English learners with specific L1s -(e.g., Japanese, Thai, Urdu) across seven languages, comparing their outputs to -real L2 learner data. 
Our analysis examines L1-driven linguistic biases, such -as reference word usage and avoidance behaviors, using information-theoretic -and distributional density measures. Results show that modern LLMs (e.g., -Qwen2.5, LLAMA3.3, DeepseekV3, GPT-4o) replicate L1-dependent patterns observed -in human L2 data, with distinct influences from various languages (e.g., -Japanese, Korean, and Mandarin significantly affect tense agreement, and Urdu -influences noun-verb collocations). Our results reveal the potential of LLMs -for L2 dialogue generation and evaluation for future educational applications. +With the digitalization of health care systems, artificial intelligence +becomes more present in medicine. Especially machine learning shows great +potential for complex tasks such as time series classification, usually at the +cost of transparency and comprehensibility. This leads to a lack of trust by +humans and thus hinders its active usage. Explainable artificial intelligence +tries to close this gap by providing insight into the decision-making process, +the actual usefulness of its different methods is however unclear. This paper +proposes a user study based evaluation of the explanation method Grad-CAM with +application to a neural network for the classification of breaths in time +series neonatal ventilation data. We present the perceived usefulness of the +explainability method by different stakeholders, exposing the difficulty to +achieve actual transparency and the wish for more in-depth explanations by many +of the participants. 
-摘要:本研究評估大型語言模型 (LLM) 模擬非母語英語使用者的能力,這些使用者會受到母語 (L1) 干擾,而母語是第二語言 (L2) 學習者。在基於對話的訪談中,我們提示 LLM 模仿具有特定 L1(例如日語、泰語、烏爾都語)的 L2 英語學習者,並比較七種語言的輸出與真實的 L2 學習者資料。我們的分析使用資訊理論和分佈密度測量來檢視 L1 驅動的語言偏差,例如參考詞使用和避免行為。結果顯示,現代 LLM(例如 Qwen2.5、LLAMA3.3、DeepseekV3、GPT-4o)複製了在人類 L2 資料中觀察到的 L1 相依模式,並受到各種語言的明顯影響(例如,日語、韓語和普通話顯著影響時態一致性,而烏爾都語影響名詞動詞搭配)。我們的結果揭示了 LLM 在 L2 對話產生和評估方面的潛力,可供未來教育應用使用。 +摘要:隨著醫療保健系統的數位化,人工智慧在醫學領域中變得更加普及。特別是機器學習在時間序列分類等複雜任務中展現出極大的潛力,但通常是以透明度和可理解性為代價。這導致人類缺乏信任,從而阻礙了其積極使用。可解釋的人工智慧試圖通過提供對決策過程的洞察來彌補這一差距,但其不同方法的實際效用尚不清楚。本文提出了一個基於使用者研究的評估,其中包含了 Grad-CAM 解釋方法,並將其應用於神經網路以分類時間序列新生兒呼吸數據中的呼吸。我們展示了不同利益相關者對可解釋性方法的感知效用,揭示了實現實際透明度的難度,以及許多參與者希望獲得更深入的解釋。 -##### **PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models** -2502.14504v1 by Yu Meng, Kaiyuan Li, Chenran Huang, Chen Gao, Xinlei Chen, Yong Li, Xiaoping Zhang +##### **XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare** +2405.06270v3 by Fatemeh Nazary, Yashar Deldjoo, Tommaso Di Noia, Eugenio di Sciascio -Large Vision-Language Models (LVLMs) have demonstrated remarkable -capabilities across a range of multimodal tasks. However, their inference -efficiency is constrained by the large number of visual tokens processed during -decoding. To address this challenge, we propose Per-Layer Per-Head Vision Token -Pruning (PLPHP), a two-level fine-grained pruning method including Layer-Level -Retention Rate Allocation and Head-Level Vision Token Pruning. Motivated by the -Vision Token Re-attention phenomenon across decoder layers, we dynamically -adjust token retention rates layer by layer. Layers that exhibit stronger -attention to visual information preserve more vision tokens, while layers with -lower vision attention are aggressively pruned. Furthermore, PLPHP applies -pruning at the attention head level, enabling different heads within the same -layer to independently retain critical context. 
Experiments on multiple -benchmarks demonstrate that PLPHP delivers an 18% faster decoding speed and -reduces the Key-Value Cache (KV Cache) size by over 50%, all at the cost of -0.46% average performance drop, while also achieving notable performance -improvements in multi-image tasks. These results highlight the effectiveness of -fine-grained token pruning and contribute to advancing the efficiency and -scalability of LVLMs. Our source code will be made publicly available. +The integration of Large Language Models (LLMs) into healthcare diagnostics +offers a promising avenue for clinical decision-making. This study outlines the +development of a novel method for zero-shot/few-shot in-context learning (ICL) +by integrating medical domain knowledge using a multi-layered structured +prompt. We also explore the efficacy of two communication styles between the +user and LLMs: the Numerical Conversational (NC) style, which processes data +incrementally, and the Natural Language Single-Turn (NL-ST) style, which +employs long narrative prompts. + Our study systematically evaluates the diagnostic accuracy and risk factors, +including gender bias and false negative rates, using a dataset of 920 patient +records in various few-shot scenarios. Results indicate that traditional +clinical machine learning (ML) models generally outperform LLMs in zero-shot +and few-shot settings. However, the performance gap narrows significantly when +employing few-shot examples alongside effective explainable AI (XAI) methods as +sources of domain knowledge. Moreover, with sufficient time and an increased +number of examples, the conversational style (NC) nearly matches the +performance of ML models. Most notably, LLMs demonstrate comparable or superior +cost-sensitive accuracy relative to ML models. + This research confirms that, with appropriate domain knowledge and tailored +communication strategies, LLMs can significantly enhance diagnostic processes. 
+The findings highlight the importance of optimizing the number of training +examples and communication styles to improve accuracy and reduce biases in LLM +applications. -摘要:大型視覺語言模型 (LVLMs) 已在各種多模態任務中展現出非凡的能力。然而,其推理效率受到解碼過程中處理的大量視覺符號的限制。為了應對這一挑戰,我們提出逐層逐頭視覺符號剪枝 (PLPHP),這是一種包括層級保留率分配和頭級視覺符號剪枝的兩級細粒度剪枝方法。受解碼器層中視覺符號重新關注現象的啟發,我們動態地逐層調整符號保留率。對視覺資訊表現出更強關注力的層保留更多視覺符號,而視覺關注力較低的層則被積極剪枝。此外,PLPHP 在關注頭級別應用剪枝,使同一層中的不同頭部可以獨立保留關鍵上下文。在多個基準測試上的實驗表明,PLPHP 的解碼速度提高了 18%,且將鍵值快取 (KV 快取) 大小減少了 50% 以上,而代價僅為平均效能下降 0.46%,同時還在多影像任務中實現了顯著的效能提升。這些結果突顯了細粒度符號剪枝的有效性,並有助於提升 LVLMs 的效率和可擴充性。我們的原始碼將公開提供。 +摘要:大型語言模型 (LLM) 與醫療診斷整合 +為臨床決策提供了一個有前景的途徑。本研究概述了一種新穎方法的開發,用於零次學習/少量學習情境學習 (ICL),方法是使用多層結構化提示整合醫療領域知識。我們還探討了使用者與 LLM 之間兩種溝通方式的功效:數值對話 (NC) 方式,它會逐步處理資料,以及自然語言單回合 (NL-ST) 方式,它會使用長篇敘事提示。 +我們的研究系統性地評估了診斷準確性和風險因子,包括性別偏見和假陰性率,使用了一個包含 920 個患者記錄的資料集,採用各種少量學習情境。結果表明,傳統的臨床機器學習 (ML) 模型通常在零次學習和少量學習設定中表現優於 LLM。然而,當使用少量學習範例以及有效的可解釋 AI (XAI) 方法作為領域知識來源時,效能差距會顯著縮小。此外,隨著時間充足和範例數量增加,對話方式 (NC) 幾乎可以媲美 ML 模型的效能。最值得注意的是,LLM 相對於 ML 模型展現出相當或更佳的成本敏感準確度。 +本研究證實,透過適當的領域知識和量身打造的溝通策略,LLM 可以顯著增強診斷程序。這些發現突顯了最佳化訓練範例數量和溝通方式的重要性,以提高準確度並減少 LLM 應用中的偏差。 -##### **How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?** -2502.14502v1 by Sergey Pletenev, Maria Marina, Daniil Moskovskiy, Vasily Konovalov, Pavel Braslavski, Alexander Panchenko, Mikhail Salnikov +##### **To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems** +2405.05766v1 by Miquel Miró-Nicolau, Gabriel Moyà-Alcover, Antoni Jaume-i-Capó, Manuel González-Hidalgo, Maria Gemma Sempere Campello, Juan Antonio Palmer Sancho -The performance of Large Language Models (LLMs) on many tasks is greatly -limited by the knowledge learned during pre-training and stored in the model's -parameters. Low-rank adaptation (LoRA) is a popular and efficient training -technique for updating or domain-specific adaptation of LLMs. 
In this study, we -investigate how new facts can be incorporated into the LLM using LoRA without -compromising the previously learned knowledge. We fine-tuned -Llama-3.1-8B-instruct using LoRA with varying amounts of new knowledge. Our -experiments have shown that the best results are obtained when the training -data contains a mixture of known and new facts. However, this approach is still -potentially harmful because the model's performance on external -question-answering benchmarks declines after such fine-tuning. When the -training data is biased towards certain entities, the model tends to regress to -few overrepresented answers. In addition, we found that the model becomes more -confident and refuses to provide an answer in only few cases. These findings -highlight the potential pitfalls of LoRA-based LLM updates and underscore the -importance of training data composition and tuning parameters to balance new -knowledge integration and general model capabilities. +The increasing reliance on Deep Learning models, combined with their inherent +lack of transparency, has spurred the development of a novel field of study +known as eXplainable AI (XAI) methods. These methods seek to enhance the trust +of end-users in automated systems by providing insights into the rationale +behind their decisions. This paper presents a novel approach for measuring user +trust in XAI systems, allowing their refinement. Our proposed metric combines +both performance metrics and trust indicators from an objective perspective. To +validate this novel methodology, we conducted a case study in a realistic +medical scenario: the usage of XAI system for the detection of pneumonia from +x-ray images. 
-摘要:大型語言模型 (LLM) 在許多任務上的表現受到預訓練期間學到的知識和儲存在模型參數中的知識的極大限制。低階適應 (LoRA) 是一種流行且有效的訓練技術,用於更新或 LLM 的特定領域適應。在這項研究中,我們探討如何使用 LoRA 將新事實納入 LLM,同時不損害先前學到的知識。我們使用不同數量的知識微調 Llama-3.1-8B-instruct。我們的實驗表明,當訓練資料包含已知和新事實的混合時,會獲得最佳結果。然而,這種方法仍然具有潛在的危害性,因為模型在外部問答基準上的表現會在這種微調後下降。當訓練資料偏向於某些實體時,模型傾向於回歸到少數過度表示的答案。此外,我們發現模型變得更有信心,並且在極少數情況下拒絕提供答案。這些發現突顯了基於 LoRA 的 LLM 更新的潛在缺點,並強調了訓練資料組成和調整參數以平衡新知識整合和一般模型能力的重要性。 +摘要:隨著對深度學習模型依賴性的增加,加上其固有的透明度不足,促使一個新的研究領域發展,稱為可解釋 AI (XAI) 方法。這些方法旨在透過深入了解決策背後的原理,來提升最終使用者對自動化系統的信賴。本文提出了一種衡量使用者對 XAI 系統信賴度的新穎方法,允許對其進行改進。我們提出的指標結合了客觀觀點下的效能指標和信賴指標。為了驗證這個新穎的方法,我們在一個真實的醫療場景中進行了一個案例研究:使用 XAI 系統從 X 光影像中偵測肺炎。 -##### **Towards a Perspectivist Turn in Argument Quality Assessment** -2502.14501v1 by Julia Romberg, Maximilian Maurer, Henning Wachsmuth, Gabriella Lapesa +##### **Region-specific Risk Quantification for Interpretable Prognosis of COVID-19** +2405.02815v1 by Zhusi Zhong, Jie Li, Zhuoqi Ma, Scott Collins, Harrison Bai, Paul Zhang, Terrance Healey, Xinbo Gao, Michael K. Atalay, Zhicheng Jiao -The assessment of argument quality depends on well-established logical, -rhetorical, and dialectical properties that are unavoidably subjective: -multiple valid assessments may exist, there is no unequivocal ground truth. -This aligns with recent paths in machine learning, which embrace the -co-existence of different perspectives. However, this potential remains largely -unexplored in NLP research on argument quality. One crucial reason seems to be -the yet unexplored availability of suitable datasets. We fill this gap by -conducting a systematic review of argument quality datasets. We assign them to -a multi-layered categorization targeting two aspects: (a) What has been -annotated: we collect the quality dimensions covered in datasets and -consolidate them in an overarching taxonomy, increasing dataset comparability -and interoperability. 
(b) Who annotated: we survey what information is given -about annotators, enabling perspectivist research and grounding our -recommendations for future actions. To this end, we discuss datasets suitable -for developing perspectivist models (i.e., those containing individual, -non-aggregated annotations), and we showcase the importance of a controlled -selection of annotators in a pilot study. +The COVID-19 pandemic has strained global public health, necessitating +accurate diagnosis and intervention to control disease spread and reduce +mortality rates. This paper introduces an interpretable deep survival +prediction model designed specifically for improved understanding and trust in +COVID-19 prognosis using chest X-ray (CXR) images. By integrating a large-scale +pretrained image encoder, Risk-specific Grad-CAM, and anatomical region +detection techniques, our approach produces regional interpretable outcomes +that effectively capture essential disease features while focusing on rare but +critical abnormal regions. Our model's predictive results provide enhanced +clarity and transparency through risk area localization, enabling clinicians to +make informed decisions regarding COVID-19 diagnosis with better understanding +of prognostic insights. We evaluate the proposed method on a multi-center +survival dataset and demonstrate its effectiveness via quantitative and +qualitative assessments, achieving superior C-indexes (0.764 and 0.727) and +time-dependent AUCs (0.799 and 0.691). These results suggest that our +explainable deep survival prediction model surpasses traditional survival +analysis methods in risk prediction, improving interpretability for clinical +decision making and enhancing AI system trustworthiness. 
-摘要:論證品質的評估取決於根深蒂固的邏輯、修辭和辯證屬性,這些屬性難免具有主觀性:可能存在多種有效的評估,沒有明確的真實依據。這與機器學習中最近的途徑一致,這些途徑接受了不同觀點的共存。然而,這種潛力在論證品質的 NLP 研究中仍然很大程度上未被探索。一個關鍵原因似乎是尚未探索合適的資料集的可用性。我們通過對論證品質資料集進行系統性回顧來填補這一空白。我們將它們分配到一個多層次分類,針對兩個方面:(a) 已註釋的內容:我們收集資料集中涵蓋的品質維度,並將它們整合到一個總體分類法中,提高資料集的可比性和互操作性。(b) 誰做了註釋:我們調查了關於註釋者的哪些資訊,使觀點主義研究成為可能,並為我們對未來行動的建議奠定基礎。為此,我們討論了適合開發觀點主義模型的資料集(即那些包含個別、非聚合註釋的資料集),並在試驗研究中展示了受控選擇註釋者的重要性。 +摘要:COVID-19 疫情對全球公共衛生造成壓力,必須進行準確的診斷和干預,以控制疾病傳播並降低死亡率。本文介紹了一個可解釋的深度生存預測模型,專門設計用於透過胸部 X 光 (CXR) 影像改善對 COVID-19 預後的理解和信賴。透過整合大規模預訓練影像編碼器、風險特定 Grad-CAM 和解剖區域偵測技術,我們的做法產生區域可解釋的結果,有效捕捉必要的疾病特徵,同時專注於罕見但關鍵的異常區域。我們的模型預測結果透過風險區域定位提供增強的清晰度和透明度,讓臨床醫生能夠在更了解預後見解的情況下,就 COVID-19 診斷做出明智的決策。我們在多中心生存資料集上評估所提出的方法,並透過量化和質化評估證明其有效性,達到優異的 C 指數(0.764 和 0.727)和時間相關 AUC(0.799 和 0.691)。這些結果表明,我們可解釋的深度生存預測模型在風險預測方面超越傳統的生存分析方法,提升臨床決策的解釋性,並增強 AI 系統的信賴度。 -##### **MLGym: A New Framework and Benchmark for Advancing AI Research Agents** -2502.14499v1 by Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, Roberta Raileanu +##### **Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics** +2405.02334v2 by Francesco Prinzi, Carmelo Militello, Calogero Zarcaro, Tommaso Vincenzo Bartolotta, Salvatore Gaglio, Salvatore Vitabile -We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for -evaluating and developing LLM agents on AI research tasks. This is the first -Gym environment for machine learning (ML) tasks, enabling research on -reinforcement learning (RL) algorithms for training such agents. MLGym-bench -consists of 13 diverse and open-ended AI research tasks from diverse domains -such as computer vision, natural language processing, reinforcement learning, -and game theory. 
Solving these tasks requires real-world AI research skills -such as generating new ideas and hypotheses, creating and processing data, -implementing ML methods, training models, running experiments, analyzing the -results, and iterating through this process to improve on a given task. We -evaluate a number of frontier large language models (LLMs) on our benchmarks -such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 -Pro. Our MLGym framework makes it easy to add new tasks, integrate and evaluate -models or agents, generate synthetic data at scale, as well as develop new -learning algorithms for training agents on AI research tasks. We find that -current frontier models can improve on the given baselines, usually by finding -better hyperparameters, but do not generate novel hypotheses, algorithms, -architectures, or substantial improvements. We open-source our framework and -benchmark to facilitate future research in advancing the AI research -capabilities of LLM agents. +In recent years, machine learning-based clinical decision support systems +(CDSS) have played a key role in the analysis of several medical conditions. +Despite their promising capabilities, the lack of transparency in AI models +poses significant challenges, particularly in medical contexts where +reliability is a mandatory aspect. However, it appears that explainability is +inversely proportional to accuracy. For this reason, achieving transparency +without compromising predictive accuracy remains a key challenge. This paper +presents a novel method, namely Rad4XCNN, to enhance the predictive power of +CNN-derived features with the inherent interpretability of radiomic features. +Rad4XCNN diverges from conventional methods based on saliency maps, by +associating intelligible meaning to CNN-derived features by means of Radiomics, +offering new perspectives on explanation methods beyond visualization maps. 
+Using a breast cancer classification task as a case study, we evaluated +Rad4XCNN on ultrasound imaging datasets, including an online dataset and two +in-house datasets for internal and external validation. Some key results are: +i) CNN-derived features guarantee more robust accuracy when compared against +ViT-derived and radiomic features; ii) conventional visualization map methods +for explanation present several pitfalls; iii) Rad4XCNN does not sacrifice +model accuracy for their explainability; iv) Rad4XCNN provides a global +explanation enabling the physician to extract global insights and findings. Our +method can mitigate some concerns related to the explainability-accuracy +trade-off. This study highlighted the importance of proposing new methods for +model explanation without affecting their accuracy. + +摘要:近年来,基于机器学习的临床决策支持系统 (CDSS) 在多种疾病的分析中扮演了关键角色。尽管它们具有广阔的前景,但 AI 模型缺乏透明度,尤其在医疗领域,可靠性是强制性方面,这带来了重大挑战。然而,解释性似乎与准确性成反比。因此,在不影响预测准确性的情况下实现透明度仍然是一个关键挑战。本文提出了一种新方法,即 Rad4XCNN,以通过放射组学的内在可解释性来增强 CNN 衍生特征的预测能力。Rad4XCNN 通过放射组学将可理解的含义与 CNN 衍生特征关联起来,从而偏离了基于显着性图的传统方法,为超越可视化图的解释方法提供了新的视角。使用乳腺癌分类任务作为案例研究,我们在超声成像数据集上评估了 Rad4XCNN,包括一个在线数据集和两个用于内部和外部验证的内部数据集。一些关键结果是:i) 与 ViT 衍生和放射组学特征相比,CNN 衍生特征保证了更稳健的准确性;ii) 用于解释的传统可视化图方法存在一些缺陷;iii) Rad4XCNN 不会为了可解释性而牺牲模型准确性;iv) Rad4XCNN 提供全局解释,使医生能够提取全局见解和发现。我们的方法可以减轻一些与可解释性-准确性权衡相关的担忧。本研究强调了提出新方法来解释模型而不影响其准确性的重要性。 + +##### **Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability** +2404.16957v1 by Yunfei Ge, Quanyan Zhu + +The pervasive integration of Artificial Intelligence (AI) has introduced +complex challenges in the responsibility and accountability in the event of +incidents involving AI-enabled systems. The interconnectivity of these systems, +ethical concerns of AI-induced incidents, coupled with uncertainties in AI +technology and the absence of corresponding regulations, have made traditional +responsibility attribution challenging. 
To this end, this work proposes a +Computational Reflective Equilibrium (CRE) approach to establish a coherent and +ethically acceptable responsibility attribution framework for all stakeholders. +The computational approach provides a structured analysis that overcomes the +limitations of conceptual approaches in dealing with dynamic and multifaceted +scenarios, showcasing the framework's explainability, coherence, and adaptivity +properties in the responsibility attribution process. We examine the pivotal +role of the initial activation level associated with claims in equilibrium +computation. Using an AI-assisted medical decision-support system as a case +study, we illustrate how different initializations lead to diverse +responsibility distributions. The framework offers valuable insights into +accountability in AI-induced incidents, facilitating the development of a +sustainable and resilient system through continuous monitoring, revision, and +reflection. -摘要:我們推出 Meta MLGym 和 MLGym-Bench,一個用於評估和開發 AI 研究任務中 LLM 代理的新架構和基準。這是第一個用於機器學習 (ML) 任務的 Gym 環境,可針對訓練此類代理的強化學習 (RL) 演算法進行研究。MLGym-bench 包含 13 項來自不同領域的開放式 AI 研究任務,例如電腦視覺、自然語言處理、強化學習和博弈論。解決這些任務需要實際的 AI 研究技能,例如產生新想法和假設、建立和處理資料、實作 ML 方法、訓練模型、執行實驗、分析結果,並透過此流程反覆運算來改善特定任務。我們在基準上評估許多前沿大型語言模型 (LLM),例如 Claude-3.5-Sonnet、Llama-3.1 405B、GPT-4o、o1-preview 和 Gemini-1.5 Pro。我們的 MLGym 架構讓新增任務、整合和評估模型或代理、大規模產生合成資料,以及開發新的學習演算法以訓練 AI 研究任務中的代理變得容易。我們發現目前的邊界模型可以改善既定的基準,通常是透過尋找更好的超參數,但不會產生新穎的假設、演算法、架構或實質性的改進。我們開放原始碼架構和基準,以促進未來在提升 LLM 代理的 AI 研究能力方面的研究。 +摘要:隨著人工智慧 (AI) 的普及整合,在涉及 AI 驅動系統的事故中,責任和義務歸屬產生了複雜的挑戰。這些系統的互連性、AI 引發事故的倫理問題,加上 AI 技術的不確定性和缺乏相應法規,使得傳統責任歸屬面臨挑戰。為此,本研究提出了一種計算反思均衡 (CRE) 方法,以建立一個連貫且在倫理上可接受的責任歸屬架構,適用於所有利害關係人。計算方法提供了結構化的分析,克服了概念方法在處理動態且多面向情境時的限制,展示了該架構在責任歸屬過程中具備的可解釋性、連貫性和適應性。我們探討了與均衡計算中索賠相關的初始啟動層級的關鍵作用。我們以 AI 輔助醫療決策支援系統為案例研究,說明不同的初始化如何導致不同的責任分配。該架構提供了對 AI 引發事故中問責制的寶貴見解,透過持續監控、修訂和反思,促進了永續且有韌性的系統發展。 -##### **Stories that (are) Move(d by) Markets: A Causal Exploration of Market Shocks and Semantic Shifts across Different 
Partisan Groups** -2502.14497v1 by Felix Drinkall, Stefan Zohren, Michael McMahon, Janet B. Pierrehumbert +##### **Explainable AI for Fair Sepsis Mortality Predictive Model** +2404.13139v1 by Chia-Hsuan Chang, Xiaoyang Wang, Christopher C. Yang -Macroeconomic fluctuations and the narratives that shape them form a mutually -reinforcing cycle: public discourse can spur behavioural changes leading to -economic shifts, which then result in changes in the stories that propagate. We -show that shifts in semantic embedding space can be causally linked to -financial market shocks -- deviations from the expected market behaviour. -Furthermore, we show how partisanship can influence the predictive power of -text for market fluctuations and shape reactions to those same shocks. We also -provide some evidence that text-based signals are particularly salient during -unexpected events such as COVID-19, highlighting the value of language data as -an exogenous variable in economic forecasting. Our findings underscore the -bidirectional relationship between news outlets and market shocks, offering a -novel empirical approach to studying their effect on each other. +Artificial intelligence supports healthcare professionals with predictive +modeling, greatly transforming clinical decision-making. This study addresses +the crucial need for fairness and explainability in AI applications within +healthcare to ensure equitable outcomes across diverse patient demographics. By +focusing on the predictive modeling of sepsis-related mortality, we propose a +method that learns a performance-optimized predictive model and then employs +the transfer learning process to produce a model with better fairness. Our +method also introduces a novel permutation-based feature importance algorithm +aiming at elucidating the contribution of each feature in enhancing fairness on +predictions. 
Unlike existing explainability methods concentrating on explaining +feature contribution to predictive performance, our proposed method uniquely +bridges the gap in understanding how each feature contributes to fairness. This +advancement is pivotal, given sepsis's significant mortality rate and its role +in one-third of hospital deaths. Our method not only aids in identifying and +mitigating biases within the predictive model but also fosters trust among +healthcare stakeholders by improving the transparency and fairness of model +predictions, thereby contributing to more equitable and trustworthy healthcare +delivery. -摘要:宏觀經濟波動與形塑它們的敘事形成一個相互強化的循環:公共論述可能激發導致經濟變化的行為改變,進而導致宣傳故事的改變。我們表明,語義嵌入空間的轉變可能與金融市場震盪(與預期的市場行為的偏差)有因果關係。此外,我們展示了黨派立場如何影響文字對市場波動的預測能力,以及如何形塑對這些震盪的反應。我們還提供了一些證據,證明在 COVID-19 等意外事件期間,基於文字的信號特別顯著,突顯了語言資料在經濟預測中作為外生變數的價值。我們的研究結果強調了新聞媒體與市場震盪之間的雙向關係,提供了一種研究它們對彼此影響的新穎實證方法。 +摘要:人工智慧透過預測模型協助醫療專業人員,大幅轉變了臨床決策制定。本研究探討了在醫療保健中使用人工智慧應用程式時公平性和可解釋性的關鍵需求,以確保在不同的患者人口統計資料中獲得公平的結果。透過專注於敗血症相關死亡率的預測模型,我們提出了一種方法,該方法會學習一個效能最佳化的預測模型,然後採用轉移學習過程來產生一個具有更好公平性的模型。我們的模型還引入了一種新穎的基於排列的特徵重要性演算法,旨在闡明每個特徵在增強預測公平性方面的貢獻。與現有的可解釋性方法專注於解釋特徵對預測效能的貢獻不同,我們提出的方法獨特地彌補了理解每個特徵如何有助於公平性的差距。這項進展至關重要,因為敗血症的死亡率很高,且在三分之一的醫院死亡中扮演著角色。我們的模型不僅有助於識別和減輕預測模型中的偏差,還能透過提高模型預測的透明度和公平性來培養醫療保健利益相關者之間的信任,進而有助於提供更公平且值得信賴的醫療保健服務。 -##### **Enhancing Language Multi-Agent Learning with Multi-Agent Credit Re-Assignment for Interactive Environment Generalization** -2502.14496v1 by Zhitao He, Zijun Liu, Peng Li, May Fung, Ming Yan, Ji Zhang, Fei Huang, Yang Liu +##### **Multi Class Depression Detection Through Tweets using Artificial Intelligence** +2404.13104v1 by Muhammad Osama Nusrat, Waseem Shahzad, Saad Ahmed Jamal -LLM-based agents have made significant advancements in interactive -environments, such as mobile operations and web browsing, and other domains -beyond computer using. 
Current multi-agent systems universally excel in -performance, compared to single agents, but struggle with generalization across -environments due to predefined roles and inadequate strategies for generalizing -language agents. The challenge of achieving both strong performance and good -generalization has hindered the progress of multi-agent systems for interactive -environments. To address these issues, we propose CollabUIAgents, a multi-agent -reinforcement learning framework with a novel multi-agent credit re-assignment -(CR) strategy, assigning process rewards with LLMs rather than -environment-specific rewards and learning with synthesized preference data, in -order to foster generalizable, collaborative behaviors among the role-free -agents' policies. Empirical results show that our framework improves both -performance and cross-environment generalizability of multi-agent systems. -Moreover, our 7B-parameter system achieves results on par with or exceed strong -closed-source models, and the LLM that guides the CR. We also provide insights -in using granular CR rewards effectively for environment generalization, and -accommodating trained LLMs in multi-agent systems. +Depression is a significant issue nowadays. As per the World Health +Organization (WHO), in 2023, over 280 million individuals are grappling with +depression. This is a huge number; if not taken seriously, these numbers will +increase rapidly. About 4.89 billion individuals are social media users. People +express their feelings and emotions on platforms like Twitter, Facebook, +Reddit, Instagram, etc. These platforms contain valuable information which can +be used for research purposes. Considerable research has been conducted across +various social media platforms. However, certain limitations persist in these +endeavors. Particularly, previous studies were only focused on detecting +depression and the intensity of depression in tweets. Also, there existed +inaccuracies in dataset labeling. 
In this research work, five types of +depression (Bipolar, major, psychotic, atypical, and postpartum) were predicted +using tweets from the Twitter database based on lexicon labeling. Explainable +AI was used to provide reasoning by highlighting the parts of tweets that +represent type of depression. Bidirectional Encoder Representations from +Transformers (BERT) was used for feature extraction and training. Machine +learning and deep learning methodologies were used to train the model. The BERT +model presented the most promising results, achieving an overall accuracy of +0.96. -摘要:基於 LLM 的代理在互動式環境中取得重大進展,例如行動運算和網頁瀏覽,以及電腦使用以外的其他領域。與單一代理相比,目前的 Multi-Agent 系統在效能上普遍表現出色,但由於預先定義的角色和不適當的語言代理概化策略,導致難以跨環境概化。在互動式環境中,同時達成強大效能和良好概化的挑戰,阻礙了 Multi-Agent 系統的進展。為了解決這些問題,我們提出 CollabUIAgents,這是一個 Multi-Agent 強化學習架構,具備創新的 Multi-Agent 信用重新分配 (CR) 策略,使用 LLM 而不是特定於環境的獎勵來分配程序獎勵,並透過綜合偏好資料進行學習,以促進無角色代理政策之間可概化的協作行為。經驗結果顯示,我們的架構同時改善了 Multi-Agent 系統的效能和跨環境概化能力。此外,我們的 7B 參數系統在效能上與強大的閉源模型和引導 CR 的 LLM 相當或超越它們。我們也提供見解,說明如何有效地使用細粒化的 CR 獎勵來進行環境概化,以及如何在 Multi-Agent 系統中容納受過訓練的 LLM。 +摘要:現今,憂鬱症是一個重要的議題。根據世界衛生組織 (WHO) 的資料,在 2023 年,超過 2.8 億人正在與憂鬱症搏鬥。這是一個龐大的數字;如果不認真看待,這些數字將會快速增加。大約有 48.9 億人是社群媒體使用者。人們在 Twitter、Facebook、Reddit、Instagram 等平台上表達自己的感受和情緒。這些平台包含有價值的資訊,可用於研究目的。已經在各種社群媒體平台上進行了大量的研究。然而,這些努力仍存在某些限制。特別是,先前的研究僅專注於偵測推文中的憂鬱症和憂鬱症的強度。此外,資料集標籤中存在不準確的情況。在這項研究工作中,使用基於詞彙標籤的 Twitter 資料庫中的推文預測了五種類型的憂鬱症(雙極型、重度、精神病型、非典型和產後)。可解釋的 AI 用於透過強調代表憂鬱症類型的推文部分來提供推理。從 Transformers(BERT)中提取的雙向編碼器表示用於特徵提取和訓練。機器學習和深度學習方法用於訓練模型。BERT 模型呈現出最有希望的結果,達到 0.96 的整體準確度。 -##### **StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following** -2502.14494v1 by Jinnan Li, Jinzhe Li, Yue Wang, Yi Chang, Yuan Wu +##### **COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images** +2404.12832v2 by Dmytro Shvetsov, Joonas Ariva, Marharyta Domnich, Raul Vicente, Dmytro Fishman -Multi-turn instruction following capability constitutes a core competency of -large language models (LLMs) 
in real-world applications. Existing evaluation -benchmarks predominantly focus on fine-grained constraint satisfaction and -domain-specific capability assessment, yet overlook the crucial structural -dependency between dialogue turns that distinguishes multi-turn from -single-turn interactions. This structural dependency not only reflects user -intent but also establishes a second dimension for instruction following -evaluation beyond constraint satisfaction. To address this gap, we propose -StructFlowBench, a multi-turn instruction following benchmark with structural -flow modeling. The benchmark innovatively defines a structural flow framework -comprising six fundamental inter-turn relationships, which not only introduces -novel structural constraints for model evaluation but also serves as generation -parameters for creating customized dialogue flows tailored to specific -scenarios. Adopting established LLM-based automatic evaluation methodologies, -we conduct systematic evaluations of 13 leading open-source and closed-source -LLMs. Experimental results reveal significant deficiencies in current models' -comprehension of multi-turn dialogue structures. The code is available at -\url{https://github.com/MLGroupJLU/StructFlowBench}. +Deep learning is dramatically transforming the field of medical imaging and +radiology, enabling the identification of pathologies in medical images, +including computed tomography (CT) and X-ray scans. However, the performance of +deep learning models, particularly in segmentation tasks, is often limited by +the need for extensive annotated datasets. To address this challenge, the +capabilities of weakly supervised semantic segmentation are explored through +the lens of Explainable AI and the generation of counterfactual explanations. +The scope of this research is development of a novel counterfactual inpainting +approach (COIN) that flips the predicted classification label from abnormal to +normal by using a generative model. 
For instance, if the classifier deems an +input medical image X as abnormal, indicating the presence of a pathology, the +generative model aims to inpaint the abnormal region, thus reversing the +classifier's original prediction label. The approach enables us to produce +precise segmentations for pathologies without depending on pre-existing +segmentation masks. Crucially, image-level labels are utilized, which are +substantially easier to acquire than creating detailed segmentation masks. The +effectiveness of the method is demonstrated by segmenting synthetic targets and +actual kidney tumors from CT images acquired from Tartu University Hospital in +Estonia. The findings indicate that COIN greatly surpasses established +attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an +alternative counterfactual explanation method introduced by Singla et al. This +evidence suggests that COIN is a promising approach for semantic segmentation +of tumors in CT images, and presents a step forward in making deep learning +applications more accessible and effective in healthcare, where annotated data +is scarce. 
-摘要:多輪指令遵循能力構成大型語言模型 (LLM) 在現實世界應用中的核心能力。現有的評估基準主要專注於細粒度的約束滿足和特定領域的能力評估,卻忽略了多輪與單輪互動之間區別的關鍵結構依賴性。這種結構依賴性不僅反映了使用者的意圖,也為指令遵循評估建立了超越約束滿足的第二個維度。為了解決這個差距,我們提出了 StructFlowBench,一個具有結構流建模的多輪指令遵循基準。該基準創新地定義了一個結構流框架,包含六個基本的回合間關係,這不僅引入了模型評估的新結構約束,還可用作生成參數,用於創建針對特定場景定制的對話流。採用已建立的基於 LLM 的自動評估方法,我們對 13 個領先的開源和閉源 LLM 進行了系統評估。實驗結果揭示了當前模型在理解多輪對話結構方面存在顯著缺陷。程式碼可在 \url{https://github.com/MLGroupJLU/StructFlowBench} 取得。 +摘要:深度学习正大幅轉變醫學影像和放射線學領域,能辨識醫學影像中的病理,包括電腦斷層掃描 (CT) 和 X 光掃描。然而,深度學習模型的效能,特別是在分割任務中,常常受到廣泛註解資料集需求的限制。為了應對此挑戰,透過可解釋 AI 和反事實解釋的產生,探索弱監督語意分割的能力。本研究的範圍是開發一種新的反事實內插方法 (COIN),該方法使用生成模型將預測的分類標籤從異常翻轉為正常。例如,如果分類器將輸入的醫學影像 X 視為異常,表示存在病理,則生成模型旨在內插異常區域,從而逆轉分類器的原始預測標籤。此方法使我們能夠產生病理的精確分割,而無需依賴於預先存在的分割遮罩。至關重要的是,利用影像層級標籤,這比建立詳細的分割遮罩容易取得。該方法的有效性透過分割合成目標和從愛沙尼亞塔爾圖大學醫院取得的 CT 影像中的實際腎臟腫瘤來證明。研究結果表明,COIN 遠遠超過已建立的歸因方法,例如 RISE、ScoreCAM 和 LayerCAM,以及 Singla 等人提出的另一種反事實解釋方法。此證據表明,COIN 是一種很有前途的 CT 影像中腫瘤語意分割方法,並在醫療保健中讓深度學習應用更易於取得和更有效率邁進一步,其中註解資料很稀少。 -##### **Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk** -2502.14491v1 by Elija Perrier +##### **Hybrid Intelligence for Digital Humanities** +2406.15374v1 by Victor de Boer, Lise Stork -Evaluating AI safety requires statistically rigorous methods and risk metrics -for understanding how the use of AI affects aggregated risk. However, much AI -safety literature focuses upon risks arising from AI models in isolation, -lacking consideration of how modular use of AI affects risk distribution of -workflow components or overall risk metrics. There is also a lack of -statistical grounding enabling sensitisation of risk models in the presence of -absence of AI to estimate causal contributions of AI. This is in part due to -the dearth of AI impact data upon which to fit distributions. In this work, we -address these gaps in two ways. First, we demonstrate how scenario modelling -(grounded in established statistical techniques such as Markov chains, copulas -and Monte Carlo simulation) can be used to model AI risk holistically. 
Second, -we show how lookalike distributions from phenomena analogous to AI can be used -to estimate AI impacts in the absence of directly observable data. We -demonstrate the utility of our methods for benchmarking cumulative AI risk via -risk analysis of a logistic scenario simulations. +In this paper, we explore the synergies between Digital Humanities (DH) as a +discipline and Hybrid Intelligence (HI) as a research paradigm. In DH research, +the use of digital methods and specifically that of Artificial Intelligence is +subject to a set of requirements and constraints. We argue that these are +well-supported by the capabilities and goals of HI. Our contribution includes +the identification of five such DH requirements: Successful AI systems need to +be able to 1) collaborate with the (human) scholar; 2) support data criticism; +3) support tool criticism; 4) be aware of and cater to various perspectives and +5) support distant and close reading. We take the CARE principles of Hybrid +Intelligence (collaborative, adaptive, responsible and explainable) as +theoretical framework and map these to the DH requirements. In this mapping, we +include example research projects. We finally address how insights from DH can +be applied to HI and discuss open challenges for the combination of the two +disciplines. 
-摘要:評估 AI 安全性需要嚴格的統計方法和風險指標,以了解 AI 的使用如何影響累積風險。然而,許多 AI 安全性文獻著重於 AI 模型孤立產生的風險,缺乏考量 AI 的模組化使用如何影響工作流程組件的風險分佈或整體風險指標。在有或沒有 AI 的情況下,統計基礎也缺乏讓風險模型敏感化的能力,以估計 AI 的因果關係貢獻。這部分是因為缺乏 AI 影響資料來擬合分佈。在這項研究中,我們以兩種方式解決這些差距。首先,我們展示情境建模(建立在已建立的統計技術上,例如馬可夫鏈、copula 和蒙地卡羅模擬)如何用於整體建模 AI 風險。其次,我們展示如何使用類似於 AI 現象的相似分佈來估計在沒有直接可觀察資料的情況下 AI 的影響。我們透過後勤情境模擬的風險分析,展示了我們的方法對於評量累積 AI 風險的效用。 +摘要:在本文中,我們探討數位人文學科 (DH) 作為一門學科與混合智能 (HI) 作為一個研究典範之間的協同作用。在 DH 研究中,數位方法的使用,特別是人工智慧的使用,受到一系列要求和限制。我們認為這些要求和限制獲得 HI 的能力和目標的充分支持。我們的貢獻包括找出五個這樣的 DH 要求:成功的 AI 系統需要能夠 1) 與(人類)學者合作;2) 支援資料批評;3) 支援工具批評;4) 察覺並迎合各種觀點;5) 支援遠距和近距離閱讀。我們將混合智能的 CARE 原則(協作、適應、負責和可解釋)作為理論架構,並將這些原則對應到 DH 要求。在此對應中,我們納入範例研究專案。最後,我們探討如何將 DH 的見解應用於 HI,並討論結合這兩個學科的開放挑戰。 -##### **Temporal Misalignment and Probabilistic Neurons** -2502.14487v1 by Velibor Bojković, Xiaofeng Wu, Bin Gu +##### **Ethical Framework for Responsible Foundational Models in Medical Imaging** +2406.11868v1 by Abhijit Das, Debesh Jha, Jasmer Sanjotra, Onkar Susladkar, Suramyaa Sarkar, Ashish Rauniyar, Nikhil Tomar, Vanshali Sharma, Ulas Bagci -Spiking Neural Networks (SNNs) offer a more energy-efficient alternative to -Artificial Neural Networks (ANNs) by mimicking biological neural principles, -establishing them as a promising approach to mitigate the increasing energy -demands of large-scale neural models. However, fully harnessing the -capabilities of SNNs remains challenging due to their discrete signal -processing and temporal dynamics. ANN-SNN conversion has emerged as a practical -approach, enabling SNNs to achieve competitive performance on complex machine -learning tasks. In this work, we identify a phenomenon in the ANN-SNN -conversion framework, termed temporal misalignment, in which random spike -rearrangement across SNN layers leads to performance improvements. Based on -this observation, we introduce biologically plausible two-phase probabilistic -(TPP) spiking neurons, further enhancing the conversion process. 
We demonstrate -the advantages of our proposed method both theoretically and empirically -through comprehensive experiments on CIFAR-10/100, CIFAR10-DVS, and ImageNet -across a variety of architectures, achieving state-of-the-art results. +Foundational models (FMs) have tremendous potential to revolutionize medical +imaging. However, their deployment in real-world clinical settings demands +extensive ethical considerations. This paper aims to highlight the ethical +concerns related to FMs and propose a framework to guide their responsible +development and implementation within medicine. We meticulously examine ethical +issues such as privacy of patient data, bias mitigation, algorithmic +transparency, explainability and accountability. The proposed framework is +designed to prioritize patient welfare, mitigate potential risks, and foster +trust in AI-assisted healthcare. -摘要:脈衝神經網路 (SNN) 模仿生物神經原理,提供了一種比人工神經網路 (ANN) 更省能的替代方案,確立了它們作為緩解大型神經模型日益增長能耗需求的一種有前途的方法。然而,由於 SNN 的離散訊號處理和時間動態,要充分利用 SNN 的功能仍然具有挑戰性。ANN-SNN 轉換已經成為一種實用的方法,使 SNN 能夠在複雜機器學習任務中實現競爭性能。在這項工作中,我們在 ANN-SNN 轉換框架中發現了一種現象,稱為時間錯位,其中隨機脈衝在 SNN 層之間重新排列會導致性能提升。基於這一觀察,我們引入了生物學上合理的兩階段機率 (TPP) 脈衝神經元,進一步增強了轉換過程。我們通過在 CIFAR-10/100、CIFAR10-DVS 和 ImageNet 上對各種架構進行綜合實驗,從理論和經驗上證明了我們提出的方法的優點,取得了最先進的結果。 +摘要:基礎模型 (FM) 具有徹底改變醫學影像的巨大潛力。然而,它們在現實世界臨床環境中的部署需要廣泛的倫理考量。本文旨在強調與 FM 相關的倫理問題,並提出一個框架來指導它們在醫學中的負責任開發和實施。我們仔細審查了倫理問題,例如患者數據隱私、偏差緩解、演算法透明度、可解釋性和問責制。所提出的框架旨在優先考慮患者福利、減輕潛在風險,並培養對 AI 輔助醫療保健的信任。 -##### **How Jailbreak Defenses Work and Ensemble? 
A Mechanistic Investigation** -2502.14486v1 by Zhuohang Long, Siyuan Wang, Shujun Liu, Yuhang Lai, Xuanjing Huang, Zhongyu Wei +##### **Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis** +2404.07239v1 by Milad Yousefi, Shadi Farabi Maleki, Ali Jafarizadeh, Mahya Ahmadpour Youshanlui, Aida Jafari, Siamak Pedrammehr, Roohallah Alizadehsani, Ryszard Tadeusiewicz, Pawel Plawiak -Jailbreak attacks, where harmful prompts bypass generative models' built-in -safety, raise serious concerns about model vulnerability. While many defense -methods have been proposed, the trade-offs between safety and helpfulness, and -their application to Large Vision-Language Models (LVLMs), are not well -understood. This paper systematically examines jailbreak defenses by reframing -the standard generation task as a binary classification problem to assess model -refusal tendencies for both harmful and benign queries. We identify two key -defense mechanisms: safety shift, which increases refusal rates across all -queries, and harmfulness discrimination, which improves the model's ability to -distinguish between harmful and benign inputs. Using these mechanisms, we -develop two ensemble defense strategies-inter-mechanism ensembles and -intra-mechanism ensembles-to balance safety and helpfulness. Experiments on the -MM-SafetyBench and MOSSBench datasets with LLaVA-1.5 models show that these -strategies effectively improve model safety or optimize the trade-off between -safety and helpfulness. +Thyroid cancer is an increasing global health concern that requires advanced +diagnostic methods. The application of AI and radiomics to thyroid cancer +diagnosis is examined in this review. A review of multiple databases was +conducted in compliance with PRISMA guidelines until October 2023. A +combination of keywords led to the discovery of an English academic publication +on thyroid cancer and related subjects. 
267 papers were returned from the +original search after 109 duplicates were removed. Relevant studies were +selected according to predetermined criteria after 124 articles were eliminated +based on an examination of their abstract and title. After the comprehensive +analysis, an additional six studies were excluded. Among the 28 included +studies, radiomics analysis, which incorporates ultrasound (US) images, +demonstrated its effectiveness in diagnosing thyroid cancer. Various results +were noted, some of the studies presenting new strategies that outperformed the +status quo. The literature has emphasized various challenges faced by AI +models, including interpretability issues, dataset constraints, and operator +dependence. The synthesized findings of the 28 included studies mentioned the +need for standardization efforts and prospective multicenter studies to address +these concerns. Furthermore, approaches to overcome these obstacles were +identified, such as advances in explainable AI technology and personalized +medicine techniques. The review focuses on how AI and radiomics could transform +the diagnosis and treatment of thyroid cancer. Despite challenges, future +research on multidisciplinary cooperation, clinical applicability validation, +and algorithm improvement holds the potential to improve patient outcomes and +diagnostic precision in the treatment of thyroid cancer. 
-摘要:越獄攻擊,其中有害提示繞過生成模型內建的安全機制,引發了對模型漏洞的嚴重疑慮。雖然已提出許多防禦方法,但安全性與有益性之間的取捨,以及它們在大型視覺語言模型 (LVLMs) 中的應用,尚未得到充分理解。本文透過將標準生成任務重新定義為二元分類問題,系統性地檢視越獄防禦,以評估模型對有害和良性查詢的拒絕傾向。我們找出兩種關鍵的防禦機制:安全轉移,這會提高所有查詢的拒絕率,以及危害區分,這會提升模型區分有害和良性輸入的能力。使用這些機制,我們開發出兩種整體防禦策略,機制間整體和機制內整體,以平衡安全性與有益性。在使用 LLaVA-1.5 模型的 MM-SafetyBench 和 MOSSBench 資料集上進行的實驗顯示,這些策略有效地提升了模型安全性,或最佳化了安全性與有益性之間的取捨。 +摘要:甲狀腺癌是一種日益嚴重的全球健康問題,需要先進的診斷方法。本篇評論探討了人工智能與放射特徵分析在甲狀腺癌診斷中的應用。在符合 PRISMA 指南的情況下,對多個資料庫進行了回顧,直到 2023 年 10 月。通過結合關鍵字,發現了一篇關於甲狀腺癌和相關主題的英文學術出版物。在移除 109 篇重複文獻後,原始搜尋共回傳 267 篇論文。在根據預先確定的標準,淘汰了 124 篇文章的摘要和標題後,選出了相關研究。在進行全面分析後,額外排除了六項研究。在納入的 28 項研究中,結合超音波 (US) 影像的放射特徵分析,證明了其在診斷甲狀腺癌方面的有效性。研究結果不一,有些研究提出了優於現狀的新策略。文獻強調了人工智能模型面臨的各種挑戰,包括可解釋性問題、資料集限制和操作員依賴性。28 項納入研究的綜合發現提到,需要標準化工作和前瞻性多中心研究來解決這些問題。此外,還確定了克服這些障礙的方法,例如可解釋人工智能技術和個人化醫療技術的進步。本篇評論重點探討了人工智能和放射特徵分析如何轉變甲狀腺癌的診斷和治療。儘管存在挑戰,但未來對多學科合作、臨床適用性驗證和演算法改進的研究,仍有潛力改善甲狀腺癌治療中的患者預後和診斷精準度。 -##### **NLoRA: Nyström-Initiated Low-Rank Adaptation for Large Language Models** -2502.14482v1 by Chenlu Guo, Yuan Wu, Yi Chang +##### **Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI** +2404.04686v1 by Taminul Islam, Md. Alif Sheakh, Mst. Sazia Tahosin, Most. Hasna Hena, Shopnil Akash, Yousef A. Bin Jardan, Gezahign Fentahun Wondmie, Hiba-Allah Nafidi, Mohammed Bourhia -Parameter-efficient fine-tuning (PEFT) is essential for adapting large -language models (LLMs), with low-rank adaptation (LoRA) being the most popular -approach. However, LoRA suffers from slow convergence, and some recent LoRA -variants, such as PiSSA, primarily rely on Singular Value Decomposition (SVD) -for initialization, leading to expensive computation. To mitigate these -problems, we use the Nystr\"om method, which follows a three-matrix -manipulation. We first introduce StructuredLoRA (SLoRA), which investigates -adding a small intermediate matrix between the low-rank matrices A and B. 
-Secondly, we propose Nystr\"omLoRA (NLoRA), which leverages Nystr\"om-based -initialization for SLoRA to improve its effectiveness and efficiency. Finally, -we propose IntermediateTune (IntTune), which explores fine-tuning exclusively -on the intermediate matrix of NLoRA to further boost LLM efficiency. We -evaluate our methods on five natural language generation (NLG) tasks and eight -natural language understanding (NLU) tasks. On GSM8K, SLoRA and NLoRA achieve -accuracies of 56.48% and 57.70%, surpassing LoRA by 33.52% and 36.41%, with -only 3.67 million additional trainable parameters. IntTune improves average NLG -performance over LoRA by 7.45% while using only 1.25% of its parameters. These -results demonstrate the efficiency and effectiveness of our approach in -enhancing model performance with minimal parameter overhead. +Breast cancer has rapidly increased in prevalence in recent years, making it +one of the leading causes of mortality worldwide. Among all cancers, it is by +far the most common. Diagnosing this illness manually requires significant time +and expertise. Since detecting breast cancer is a time-consuming process, +preventing its further spread can be aided by creating machine-based forecasts. +Machine learning and Explainable AI are crucial in classification as they not +only provide accurate predictions but also offer insights into how the model +arrives at its decisions, aiding in the understanding and trustworthiness of +the classification results. In this study, we evaluate and compare the +classification accuracy, precision, recall, and F-1 scores of five different +machine learning methods using a primary dataset (500 patients from Dhaka +Medical College Hospital). Five different supervised machine learning +techniques, including decision tree, random forest, logistic regression, naive +bayes, and XGBoost, have been used to achieve optimal results on our dataset. 
+Additionally, this study applied SHAP analysis to the XGBoost model to +interpret the model's predictions and understand the impact of each feature on +the model's output. We compared the accuracy with which several algorithms +classified the data, as well as contrasted with other literature in this field. +After final evaluation, this study found that XGBoost achieved the best model +accuracy, which is 97%. -摘要:參數高效微調 (PEFT) 對於調整大型語言模型 (LLM) 至關重要,其中低秩調整 (LoRA) 是最受歡迎的方法。然而,LoRA 存在收斂速度慢的問題,而一些最近的 LoRA 變體,例如 PiSSA,主要依賴奇異值分解 (SVD) 進行初始化,導致運算成本高昂。為了減輕這些問題,我們使用了 Nystr\"om 方法,它遵循三矩陣操作。我們首先介紹 StructuredLoRA (SLoRA),它研究在低秩矩陣 A 和 B 之間添加一個小的中間矩陣。其次,我們提出了 Nystr\"omLoRA (NLoRA),它利用基於 Nystr\"om 的初始化方法為 SLoRA 提升其有效性和效率。最後,我們提出了 IntermediateTune (IntTune),它探討了僅對 NLoRA 的中間矩陣進行微調,以進一步提升 LLM 效率。我們在五項自然語言生成 (NLG) 任務和八項自然語言理解 (NLU) 任務上評估了我們的這些方法。在 GSM8K 上,SLoRA 和 NLoRA 分別達到了 56.48% 和 57.70% 的準確率,比 LoRA 高出 33.52% 和 36.41%,而僅增加了 367 萬個可訓練參數。IntTune 在僅使用 LoRA 1.25% 的參數的情況下,將平均 NLG 效能提升了 7.45%。這些結果證明了我們的方法在以最少的參數開銷提升模型效能方面的效率和有效性。 +摘要:近年來,乳癌的盛行率迅速增加,使其成為全球主要的死亡原因之一。在所有癌症中,乳癌迄今為止是最常見的。手動診斷此疾病需要大量的時間和專業知識。由於乳癌的檢測過程耗時,因此透過建立機器學習模型來預測,有助於防止其進一步擴散。機器學習和可解釋 AI 在分類中至關重要,因為它們不僅可以提供準確的預測,還可以深入了解模型如何做出決策,有助於理解和信賴分類結果。在此研究中,我們評估並比較了五種不同的機器學習方法的分類準確度、精確度、召回率和 F1 分數,使用了一個主要的資料集(達卡醫學院醫院的 500 名患者)。五種不同的監督式機器學習技術,包括決策樹、隨機森林、邏輯迴歸、朴素貝氏和 XGBoost,已用於在我們的資料集上取得最佳結果。此外,本研究將 SHAP 分析應用於 XGBoost 模型,以解釋模型的預測並了解每個特徵對模型輸出的影響。我們比較了幾種演算法對資料進行分類的準確度,並與該領域的其他文獻進行對比。在最後評估後,本研究發現 XGBoost 達到了最佳的模型準確度,為 97%。 -##### **Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression** -2502.14477v1 by Haoyu Wang, Tong Teng, Tianyu Guo, An Xiao, Duyu Tang, Hanting Chen, Yunhe Wang +##### **Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI** +2404.03892v3 by Maryam Ahmed, Tooba Bibi, Rizwan Ahmed Khan, Sidra Nasir -Handling long-context sequences efficiently remains a significant challenge -in large language models 
(LLMs). Existing methods for token selection in -sequence extrapolation either employ a permanent eviction strategy or select -tokens by chunk, which may lead to the loss of critical information. We propose -Efficient Selective Attention (ESA), a novel approach that extends context -length by efficiently selecting the most critical tokens at the token level to -compute attention. ESA reduces the computational complexity of token selection -by compressing query and key vectors into lower-dimensional representations. We -evaluate ESA on long sequence benchmarks with maximum lengths up to 256k using -open-source LLMs with context lengths of 8k and 32k. ESA outperforms other -selective attention methods, especially in tasks requiring the retrieval of -multiple pieces of information, achieving comparable performance to -full-attention extrapolation methods across various tasks, with superior -results in certain tasks. +The Deep learning (DL) models for diagnosing breast cancer from mammographic +images often operate as "black boxes", making it difficult for healthcare +professionals to trust and understand their decision-making processes. The +study presents an integrated framework combining Convolutional Neural Networks +(CNNs) and Explainable Artificial Intelligence (XAI) for the enhanced diagnosis +of breast cancer using the CBIS-DDSM dataset. The methodology encompasses an +elaborate data preprocessing pipeline and advanced data augmentation techniques +to counteract dataset limitations and transfer learning using pre-trained +networks such as VGG-16, Inception-V3 and ResNet was employed. A focal point of +our study is the evaluation of XAI's effectiveness in interpreting model +predictions, highlighted by utilizing the Hausdorff measure to assess the +alignment between AI-generated explanations and expert annotations +quantitatively. This approach is critical for XAI in promoting trustworthiness +and ethical fairness in AI-assisted diagnostics. 
The findings from our research +illustrate the effective collaboration between CNNs and XAI in advancing +diagnostic methods for breast cancer, thereby facilitating a more seamless +integration of advanced AI technologies within clinical settings. By enhancing +the interpretability of AI driven decisions, this work lays the groundwork for +improved collaboration between AI systems and medical practitioners, ultimately +enriching patient care. Furthermore, the implications of our research extended +well beyond the current methodologies. It encourages further research into how +to combine multimodal data and improve AI explanations to meet the needs of +clinical practice. -摘要:在大型語言模型 (LLM) 中,有效處理長語境序列仍然是一項重大挑戰。現有的序列外推標記選擇方法採用永久驅逐策略或按塊選擇標記,這可能會導致關鍵資訊遺失。我們提出高效選擇性注意 (ESA),這是一種新穎的方法,它透過在標記層級有效選擇最關鍵的標記來計算注意,從而延伸語境長度。ESA 透過將查詢和關鍵向量壓縮成較低維度的表示,來降低標記選擇的運算複雜度。我們使用開放原始碼 LLM,在語境長度為 8k 和 32k 的情況下,對長序列基準進行評估,最大長度達 256k。ESA 的表現優於其他選擇性注意方法,特別是在需要擷取多條資訊的任務中,在各種任務中達到與全注意外推方法相當的效能,並且在某些任務中獲得更佳的結果。 +摘要:深度學習 (DL) 用於從乳房攝影術影像診斷乳癌的模型通常以「黑盒子」方式運作,這使得醫療保健專業人員難以信任和理解其決策過程。本研究提出一個整合架構,結合卷積神經網路 (CNN) 和可解釋人工智慧 (XAI),以使用 CBIS-DDSM 資料集增強乳癌的診斷。方法包含一個精細的資料前處理管線和進階資料擴充技術,以對抗資料集限制,並採用預先訓練的網路(例如 VGG-16、Inception-V3 和 ResNet)進行遷移學習。我們研究的重點是評估 XAI 在解釋模型預測中的有效性,重點利用豪斯多夫測度量化評估 AI 生成的解釋和專家註解之間的一致性。這種方法對於 XAI 在促進 AI 輔助診斷中的可信度和倫理公平性至關重要。我們研究的發現說明了 CNN 和 XAI 在推進乳癌診斷方法中的有效協作,從而促進了先進 AI 技術在臨床環境中的更順暢整合。透過增強 AI 驅動決策的可解釋性,這項工作為 AI 系統和醫療從業人員之間的改善協作奠定了基礎,最終豐富了患者照護。此外,我們研究的影響遠遠超出了目前的技術。它鼓勵進一步研究如何結合多模式資料並改善 AI 解釋,以滿足臨床實務的需求。 -##### **Argument-Based Comparative Question Answering Evaluation Benchmark** -2502.14476v1 by Irina Nikishina, Saba Anwar, Nikolay Dolgov, Maria Manina, Daria Ignatenko, Viktor Moskvoretskii, Artem Shelmanov, Tim Baldwin, Chris Biemann +##### **Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives** +2404.00320v2 by Xingrui Gu, Zhixuan Wang, Irisa Jin, Zekun Wu -In this paper, we aim to solve the problems standing in the 
way of automatic -comparative question answering. To this end, we propose an evaluation framework -to assess the quality of comparative question answering summaries. We formulate -15 criteria for assessing comparative answers created using manual annotation -and annotation from 6 large language models and two comparative question -asnwering datasets. We perform our tests using several LLMs and manual -annotation under different settings and demonstrate the constituency of both -evaluations. Our results demonstrate that the Llama-3 70B Instruct model -demonstrates the best results for summary evaluation, while GPT-4 is the best -for answering comparative questions. All used data, code, and evaluation -results are publicly -available\footnote{\url{https://anonymous.4open.science/r/cqa-evaluation-benchmark-4561/README.md}}. +This research presents a novel multimodal data fusion methodology for pain +behavior recognition, integrating statistical correlation analysis with +human-centered insights. Our approach introduces two key innovations: 1) +integrating data-driven statistical relevance weights into the fusion strategy +to effectively utilize complementary information from heterogeneous modalities, +and 2) incorporating human-centric movement characteristics into multimodal +representation learning for detailed modeling of pain behaviors. Validated +across various deep learning architectures, our method demonstrates superior +performance and broad applicability. We propose a customizable framework that +aligns each modality with a suitable classifier based on statistical +significance, advancing personalized and effective multimodal fusion. +Furthermore, our methodology provides explainable analysis of multimodal data, +contributing to interpretable and explainable AI in healthcare. 
By highlighting +the importance of data diversity and modality-specific representations, we +enhance traditional fusion techniques and set new standards for recognizing +complex pain behaviors. Our findings have significant implications for +promoting patient-centered healthcare interventions and supporting explainable +clinical decision-making. -摘要:在本文中,我們旨在解決阻礙自動比較性問題解答的難題。為此,我們提出一個評估框架,用於評估比較性問題解答摘要的品質。我們制定了 15 項準則,用於評估使用手動標註和來自 6 個大型語言模型和兩個比較性問題解答資料集的標註所建立的比較性答案。我們在不同的設定下使用幾個 LLM 和手動標註執行測試,並展示兩種評估的組成。我們的結果表明,Llama-3 70B Instruct 模型在摘要評估中表現最佳,而 GPT-4 在回答比較性問題方面表現最佳。所有使用過的資料、程式碼和評估結果均公開可用\footnote{\url{https://anonymous.4open.science/r/cqa-evaluation-benchmark-4561/README.md}}。 +摘要:本研究提出了一種創新的多模態數據融合方法,用於疼痛行為識別,將統計相關分析與以人為中心的見解相結合。我們的做法引入了兩項關鍵創新:1) 將數據驅動的統計相關權重整合到融合策略中,以有效利用來自異質模態的補充信息,以及 2) 將以人為中心的運動特徵納入多模態表示學習中,以詳細建模疼痛行為。我們的模型在各種深度學習架構中得到驗證,展示了卓越的性能和廣泛的適用性。我們提出了一個可自定義的框架,根據統計顯著性將每個模態與合適的分類器對齊,推進個性化和有效的多模態融合。此外,我們的模型提供對多模態數據的可解釋分析,有助於醫療保健中的可解釋和可解釋 AI。通過強調數據多樣性和模態特定表示的重要性,我們增強了傳統的融合技術,並為識別複雜的疼痛行為設定了新的標準。我們的發現對促進以患者為中心的醫療保健干預和支持可解釋的臨床決策制定具有重要意義。 -##### **Enhancing Smart Environments with Context-Aware Chatbots using Large Language Models** -2502.14469v1 by Aurora Polo-Rodríguez, Laura Fiorini, Erika Rovini, Filippo Cavallo, Javier Medina-Quero +##### **Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach** +2403.17873v1 by Andrea Ferrario, Alberto Termine, Alessandro Facchini -This work presents a novel architecture for context-aware interactions within -smart environments, leveraging Large Language Models (LLMs) to enhance user -experiences. Our system integrates user location data obtained through UWB tags -and sensor-equipped smart homes with real-time human activity recognition (HAR) -to provide a comprehensive understanding of user context. 
This contextual -information is then fed to an LLM-powered chatbot, enabling it to generate -personalised interactions and recommendations based on the user's current -activity and environment. This approach moves beyond traditional static chatbot -interactions by dynamically adapting to the user's real-time situation. A case -study conducted from a real-world dataset demonstrates the feasibility and -effectiveness of our proposed architecture, showcasing its potential to create -more intuitive and helpful interactions within smart homes. The results -highlight the significant benefits of integrating LLM with real-time activity -and location data to deliver personalised and contextually relevant user -experiences. +Human-centered explainable AI (HCXAI) advocates for the integration of social +aspects into AI explanations. Central to the HCXAI discourse is the Social +Transparency (ST) framework, which aims to make the socio-organizational +context of AI systems accessible to their users. In this work, we suggest +extending the ST framework to address the risks of social misattributions in +Large Language Models (LLMs), particularly in sensitive areas like mental +health. In fact LLMs, which are remarkably capable of simulating roles and +personas, may lead to mismatches between designers' intentions and users' +perceptions of social attributes, risking to promote emotional manipulation and +dangerous behaviors, cases of epistemic injustice, and unwarranted trust. To +address these issues, we propose enhancing the ST framework with a fifth +'W-question' to clarify the specific social attributions assigned to LLMs by +its designers and users. This addition aims to bridge the gap between LLM +capabilities and user perceptions, promoting the ethically responsible +development and use of LLM-based technology. 
-摘要:本研究提出了一種創新的架構,用於在智慧環境中進行情境感知互動,利用大型語言模型 (LLM) 來提升使用者體驗。我們的系統整合了透過超寬頻標籤取得的使用者位置資料,以及配備感測器的智慧家庭,並具備即時人類活動辨識 (HAR),以全面了解使用者的情境。接著,將這些情境資訊輸入 LLM 驅動的聊天機器人,讓它能根據使用者的當前活動和環境產生個人化的互動和建議。這種方法超越了傳統的靜態聊天機器人互動,能動態地適應使用者的即時狀況。從真實世界資料集進行的案例研究,展示了我們提出的架構的可行性和有效性,突顯出它在智慧家庭中創造更直覺且有用的互動的潛力。結果突顯了將 LLM 與即時活動和位置資料整合,以提供個人化且與情境相關的使用者體驗的顯著優點。 +摘要:以人为本的可解释 AI (HCXAI) 倡导将社会层面整合到 AI 解释中。HCXAI 话语的核心是社会透明度 (ST) 框架,其目标是让 AI 系统的社会组织背景对用户来说是可理解的。在这项工作中,我们建议扩展 ST 框架以解决大型语言模型 (LLM) 中社会错误归因的风险,尤其是在心理健康等敏感领域。事实上,LLM 能够出色地模拟角色和人格,这可能导致设计者的意图和用户对社会属性的认知之间出现错配,从而有风险促进情绪操纵和危险行为、认知不公正和不合理的信任。为了解决这些问题,我们建议用第五个“W 问题”来增强 ST 框架,以明确设计者和用户赋予 LLM 的具体社会属性。此补充旨在弥合 LLM 能力和用户认知之间的差距,促进基于 LLM 的技术在道德上负责任地开发和使用。 -##### **Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing** -2502.14458v1 by Aviv Bick, Tobias Katsch, Nimit Sohoni, Arjun Desai, Albert Gu +##### **Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification** +2403.18871v1 by Han Yuan, Chuan Hong, Pengtao Jiang, Gangming Zhao, Nguyen Tuan Anh Tran, Xinxing Xu, Yet Yen Yan, Nan Liu -We introduce Llamba, a family of efficient recurrent language models -distilled from Llama-3.x into the Mamba architecture. The series includes -Llamba-1B, Llamba-3B, and Llamba-8B, which achieve higher inference throughput -and handle significantly larger batch sizes than Transformer-based models while -maintaining comparable benchmark performance. Furthermore, Llamba demonstrates -the effectiveness of cross-architecture distillation using MOHAWK (Bick et al., -2024), achieving these results with less than 0.1% of the training data -typically used for models of similar size. To take full advantage of their -efficiency, we provide an optimized implementation of Llamba for -resource-constrained devices such as smartphones and edge platforms, offering a -practical and memory-efficient alternative to Transformers. 
Overall, Llamba -improves the tradeoff between speed, memory efficiency, and performance, making -high-quality language models more accessible. +Background: Pneumothorax is an acute thoracic disease caused by abnormal air +collection between the lungs and chest wall. To address the opaqueness often +associated with deep learning (DL) models, explainable artificial intelligence +(XAI) methods have been introduced to outline regions related to pneumothorax +diagnoses made by DL models. However, these explanations sometimes diverge from +actual lesion areas, highlighting the need for further improvement. Method: We +propose a template-guided approach to incorporate the clinical knowledge of +pneumothorax into model explanations generated by XAI methods, thereby +enhancing the quality of these explanations. Utilizing one lesion delineation +created by radiologists, our approach first generates a template that +represents potential areas of pneumothorax occurrence. This template is then +superimposed on model explanations to filter out extraneous explanations that +fall outside the template's boundaries. To validate its efficacy, we carried +out a comparative analysis of three XAI methods with and without our template +guidance when explaining two DL models in two real-world datasets. Results: The +proposed approach consistently improved baseline XAI methods across twelve +benchmark scenarios built on three XAI methods, two DL models, and two +datasets. The average incremental percentages, calculated by the performance +improvements over the baseline performance, were 97.8% in Intersection over +Union (IoU) and 94.1% in Dice Similarity Coefficient (DSC) when comparing model +explanations and ground-truth lesion areas. Conclusions: In the context of +pneumothorax diagnoses, we proposed a template-guided approach for improving AI +explanations. 
We anticipate that our template guidance will forge a fresh +approach to elucidating AI models by integrating clinical domain expertise. -摘要:我們推出 Llamba,一種高效的遞迴語言模型家族,從 Llama-3.x 萃取到 Mamba 架構中。該系列包含 Llamba-1B、Llamba-3B 和 Llamba-8B,它們比基於 Transformer 的模型實現更高的推理吞吐量,並處理顯著更大的批次大小,同時保持可比較的基準效能。此外,Llamba 證明了使用 MOHAWK(Bick 等人,2024 年)進行跨架構萃取的有效性,在訓練資料不到類似大小模型通常使用的 0.1% 的情況下實現了這些結果。為了充分利用其效率,我們為 Llamba 提供了針對資源受限裝置(例如智慧型手機和邊緣平台)的最佳化實作,提供實用且記憶體效率高的 Transformer 替代方案。總體而言,Llamba 改善了速度、記憶體效率和效能之間的權衡,讓高品質語言模型更易於取得。 +摘要:背景:氣胸是一種因肺部與胸壁之間異常集氣所引起的急性胸腔疾病。為了解決深度學習(DL)模型經常伴隨的不透明性,可解釋人工智慧(XAI)方法已被引入,用於概述與 DL 模型做出的氣胸診斷相關的區域。然而,這些解釋有時會與實際病灶區域有所出入,突顯出進一步改進的必要性。方法:我們提出了一種模板引導式方法,將氣胸的臨床知識納入 XAI 方法產生的模型解釋中,從而提升這些解釋的品質。利用放射科醫師建立的病灶描繪,我們的做法首先產生一個模板,用於表示氣胸可能發生的區域。然後將此模板疊加在模型解釋上,以篩選出超出模板邊界的無關解釋。為了驗證其效力,我們對三種 XAI 方法進行了比較分析,在兩個真實世界資料集中解釋兩個 DL 模型時,分別採用和不採用我們的模板引導。結果:所提出的方法在建立於三種 XAI 方法、兩個 DL 模型和兩個資料集的十二種基準情境中,始終改善了基準 XAI 方法。在比較模型解釋和真實病灶區域時,透過基準效能的效能改進計算出的平均增量百分比為交集比(IoU)的 97.8% 和骰子相似性係數(DSC)的 94.1%。結論:在氣胸診斷的背景下,我們提出了一種模板引導式方法,用於改善 AI 解釋。我們預期我們的模板引導將透過整合臨床領域專業知識,為闡明 AI 模型建立一種新方法。 -##### **Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization** -2502.14456v1 by Ran Ding, Ziyu Zhang, Ying Zhu, Ziqian Kong, Peilan Xu +##### **Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures** +2403.01580v1 by Séamus Lankford -To enhance tourists' experiences and immersion, this paper proposes a -narrative-driven travel planning framework called NarrativeGuide, which -generates a geoculturally-grounded narrative script for travelers, offering a -novel, role-playing experience for their journey. In the initial stage, -NarrativeGuide constructs a knowledge graph for attractions within a city, then -configures the worldview, character setting, and exposition based on the -knowledge graph. 
Using this foundation, the knowledge graph is combined to -generate an independent scene unit for each attraction. During the itinerary -planning stage, NarrativeGuide models narrative-driven travel planning as an -optimization problem, utilizing a genetic algorithm (GA) to refine the -itinerary. Before evaluating the candidate itinerary, transition scripts are -generated for each pair of adjacent attractions, which, along with the scene -units, form a complete script. The weighted sum of script coherence, travel -time, and attraction scores is then used as the fitness value to update the -candidate solution set. Experimental results across four cities, i.e., Nanjing -and Yangzhou in China, Paris in France, and Berlin in Germany, demonstrate -significant improvements in narrative coherence and cultural fit, alongside a -notable reduction in travel time and an increase in the quality of visited -attractions. Our study highlights that incorporating external evolutionary -optimization effectively addresses the limitations of large language models in -travel planning.Our codes are available at -https://github.com/Evan01225/Narrative-Driven-Travel-Planning. +In the current machine translation (MT) landscape, the Transformer +architecture stands out as the gold standard, especially for high-resource +language pairs. This research delves into its efficacy for low-resource +language pairs including both the English$\leftrightarrow$Irish and +English$\leftrightarrow$Marathi language pairs. Notably, the study identifies +the optimal hyperparameters and subword model type to significantly improve the +translation quality of Transformer models for low-resource language pairs. + The scarcity of parallel datasets for low-resource languages can hinder MT +development. To address this, gaHealth was developed, the first bilingual +corpus of health data for the Irish language. 
Focusing on the health domain, +models developed using this in-domain dataset exhibited very significant +improvements in BLEU score when compared with models from the LoResMT2021 +Shared Task. A subsequent human evaluation using the multidimensional quality +metrics error taxonomy showcased the superior performance of the Transformer +system in reducing both accuracy and fluency errors compared to an RNN-based +counterpart. + Furthermore, this thesis introduces adaptNMT and adaptMLLM, two open-source +applications streamlined for the development, fine-tuning, and deployment of +neural machine translation models. These tools considerably simplify the setup +and evaluation process, making MT more accessible to both developers and +translators. Notably, adaptNMT, grounded in the OpenNMT ecosystem, promotes +eco-friendly natural language processing research by highlighting the +environmental footprint of model development. Fine-tuning of MLLMs by adaptMLLM +demonstrated advancements in translation performance for two low-resource +language pairs: English$\leftrightarrow$Irish and +English$\leftrightarrow$Marathi, compared to baselines from the LoResMT2021 +Shared Task. 
-摘要:為了增強遊客的體驗和沉浸感,本文提出了一個名為 NarrativeGuide 的敘事驅動旅遊規劃框架,它會為旅客產生一個以地理文化為基礎的敘事腳本,為他們的旅程提供一個新穎的角色扮演體驗。在初始階段,NarrativeGuide 會為城市內的景點建立一個知識圖譜,然後根據知識圖譜配置世界觀、角色設定和說明。利用這個基礎,知識圖譜會與每個景點結合,為其產生一個獨立的場景單元。在行程規劃階段,NarrativeGuide 將敘事驅動的旅遊規劃建模為一個最佳化問題,利用遺傳演算法 (GA) 來優化行程。在評估候選行程之前,會為每對相鄰景點產生過場腳本,這些腳本會與場景單元一起形成一個完整的腳本。接著,將腳本連貫性、旅遊時間和景點分數的加權和用作適應值,以更新候選解集。在四個城市(即中國的南京和揚州、法國的巴黎和德國的柏林)進行的實驗結果顯示,敘事連貫性和文化契合度都有顯著的提升,同時旅遊時間大幅減少,且所參觀景點的品質也提升了。我們的研究強調,納入外部演化最佳化能有效解決大型語言模型在旅遊規劃中的限制。我們的程式碼可在 https://github.com/Evan01225/Narrative-Driven-Travel-Planning 取得。 +摘要:在當前機器翻譯 (MT) 領域中,Transformer 架構脫穎而出,成為黃金標準,特別是對於高資源語言對。本研究探討其對低資源語言對的效能,包括英語↔愛爾蘭語和英語↔馬拉地語語言對。值得注意的是,本研究識別出最佳超參數和子詞模型類型,以顯著提高 Transformer 模型對低資源語言對的翻譯品質。 +低資源語言的平行資料集的稀缺會阻礙 MT 的發展。為了解決這個問題,開發了 gaHealth,這是愛爾蘭語的第一個雙語健康資料語料庫。專注於健康領域,使用此域內資料集開發的模型在 BLEU 得分方面表現出非常顯著的進步,與 LoResMT2021 共享任務中的模型相比。隨後使用多維品質指標錯誤分類法進行的人工評估顯示,與基於 RNN 的對應模型相比,Transformer 系統在減少準確性和流暢性錯誤方面表現出優異的性能。 +此外,本論文介紹了 adaptNMT 和 adaptMLLM,這兩個開源應用程式簡化了神經機器翻譯模型的開發、微調和部署。這些工具大幅簡化了設定和評估流程,讓 MT 更容易讓開發人員和翻譯人員使用。值得注意的是,adaptNMT 以 OpenNMT 生態系統為基礎,通過強調模型開發的環境足跡來促進生態友好的自然語言處理研究。與 LoResMT2021 共享任務中的基準相比,adaptMLLM 對 MLLM 的微調證明了英語↔愛爾蘭語和英語↔馬拉地語這兩個低資源語言對的翻譯性能進步。 -##### **Optimal word order for non-causal text generation with Large Language Models: the Spanish case** -2502.14451v1 by Andrea Busto-Castiñeira, Silvia García-Méndez, Francisco de Arriba-Pérez, Francisco J. González-Castaño +##### **Cause and Effect: Can Large Language Models Truly Understand Causality?** +2402.18139v3 by Swagata Ashwani, Kshiteesh Hegde, Nishith Reddy Mannuru, Mayank Jindal, Dushyant Singh Sengar, Krishna Chaitanya Rao Kathala, Dishant Banga, Vinija Jain, Aman Chadha -Natural Language Generation (NLG) popularity has increased owing to the -progress in Large Language Models (LLMs), with zero-shot inference -capabilities. 
However, most neural systems utilize decoder-only causal -(unidirectional) transformer models, which are effective for English but may -reduce the richness of languages with less strict word order, subject omission, -or different relative clause attachment preferences. This is the first work -that analytically addresses optimal text generation order for non-causal -language models. We present a novel Viterbi algorithm-based methodology for -maximum likelihood word order estimation. We analyze the non-causal -most-likelihood order probability for NLG in Spanish and, then, the probability -of generating the same phrases with Spanish causal NLG. This comparative -analysis reveals that causal NLG prefers English-like SVO structures. We also -analyze the relationship between optimal generation order and causal -left-to-right generation order using Spearman's rank correlation. Our results -demonstrate that the ideal order predicted by the maximum likelihood estimator -is not closely related to the causal order and may be influenced by the -syntactic structure of the target sentence. +With the rise of Large Language Models(LLMs), it has become crucial to +understand their capabilities and limitations in deciphering and explaining the +complex web of causal relationships that language entails. Current methods use +either explicit or implicit causal reasoning, yet there is a strong need for a +unified approach combining both to tackle a wide array of causal relationships +more effectively. This research proposes a novel architecture called Context +Aware Reasoning Enhancement with Counterfactual Analysis(CARE CA) framework to +enhance causal reasoning and explainability. The proposed framework +incorporates an explicit causal detection module with ConceptNet and +counterfactual statements, as well as implicit causal detection through LLMs. +Our framework goes one step further with a layer of counterfactual explanations +to accentuate LLMs understanding of causality. 
The knowledge from ConceptNet +enhances the performance of multiple causal reasoning tasks such as causal +discovery, causal identification and counterfactual reasoning. The +counterfactual sentences add explicit knowledge of the not caused by scenarios. +By combining these powerful modules, our model aims to provide a deeper +understanding of causal relationships, enabling enhanced interpretability. +Evaluation of benchmark datasets shows improved performance across all metrics, +such as accuracy, precision, recall, and F1 scores. We also introduce +CausalNet, a new dataset accompanied by our code, to facilitate further +research in this domain. -摘要:自然語言生成 (NLG) 的普及歸功於大型語言模型 (LLM) 的進步,以及零次學習推論能力。然而,大多數神經系統使用僅解碼器因果 (單向) Transformer模型,這對英語很有效,但可能會減少語序較不嚴謹、省略主詞或相對從句附加偏好不同的語言的豐富性。這是第一個針對非因果語言模型分析性地解決最佳文字生成順序的研究。我們提出了一種基於維特比演算法的新方法,用於最大似然詞序估計。我們分析了西班牙語 NLG 的非因果最大似然順序機率,然後分析了使用西班牙語因果 NLG 生成相同短語的機率。這種比較分析顯示,因果 NLG 偏好英語式的 SVO 結構。我們還使用 Spearman 等級相關性分析最佳生成順序和因果從左到右生成順序之間的關係。我們的結果表明,最大似然估計器預測的理想順序與因果順序沒有密切關係,並且可能會受到目標句子的語法結構影響。 +摘要:隨著大型語言模型 (LLM) 的興起,了解它們在解碼和解釋語言所蘊含的複雜因果關係網路中的能力和限制變得至關重要。目前的技術使用明確或隱含的因果推理,但強烈需要一種統一的方法,結合兩者以更有效地處理廣泛的因果關係。本研究提出了一種稱為情境感知推理增強與反事實分析 (CARE CA) 框架的新架構,以增強因果推理和可解釋性。提出的框架結合了使用 ConceptNet 和反事實陳述的明確因果檢測模組,以及透過 LLM 進行的隱含因果檢測。我們的框架更進一步,加入一層反事實解釋,以強調 LLM 對因果關係的理解。來自 ConceptNet 的知識增強了多項因果推理任務的執行,例如因果發現、因果識別和反事實推理。反事實句加入了未由情境造成的明確知識。透過結合這些強大的模組,我們的模型旨在提供對因果關係更深入的理解,實現增強的可解釋性。基準資料集的評估顯示在所有指標(例如準確度、精確度、召回率和 F1 分數)上都有所提升。我們還引入了 CausalNet,一個新的資料集,並附上了我們的程式碼,以促進在這個領域的進一步研究。 +##### **Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina** +2402.18600v1 by Yasin Sadeghi Bazargani, Majid Mirzaei, Navid Sobhi, Mirsaeed Abdollahi, Ali Jafarizadeh, Siamak Pedrammehr, Roohallah Alizadehsani, Ru San Tan, Sheikh Mohammed Shariful Islam, U. Rajendra Acharya -### Knowledge Graphs +Diabetes mellitus (DM) predisposes patients to vascular complications. +Retinal images and vasculature reflect the body's micro- and macrovascular +health. 
They can be used to diagnose DM complications, including diabetic +retinopathy (DR), neuropathy, nephropathy, and atherosclerotic cardiovascular +disease, as well as forecast the risk of cardiovascular events. Artificial +intelligence (AI)-enabled systems developed for high-throughput detection of DR +using digitized retinal images have become clinically adopted. Beyond DR +screening, AI integration also holds immense potential to address challenges +associated with the holistic care of the patient with DM. In this work, we aim +to comprehensively review the literature for studies on AI applications based +on retinal images related to DM diagnosis, prognostication, and management. We +will describe the findings of holistic AI-assisted diabetes care, including but +not limited to DR screening, and discuss barriers to implementing such systems, +including issues concerning ethics, data privacy, equitable access, and +explainability. With the ability to evaluate the patient's health status vis a +vis DM complication as well as risk prognostication of future cardiovascular +complications, AI-assisted retinal image analysis has the potential to become a +central tool for modern personalized medicine in patients with DM. 
+ +摘要:糖尿病(DM)使患者容易出現血管併發症。 +視網膜影像和血管反映身體的微血管和巨血管健康狀況。它們可用於診斷糖尿病併發症,包括糖尿病視網膜病變(DR)、神經病變、腎病和動脈粥樣硬化性心血管疾病,以及預測心血管事件的風險。為使用數位化視網膜影像進行高通量 DR 檢測而開發的人工智慧(AI)啟用系統已在臨床採用。除了 DR 篩檢外,AI 整合也具有巨大的潛力來應對與糖尿病患者整體照護相關的挑戰。在這項工作中,我們旨在全面回顧基於視網膜影像的 AI 應用相關研究的文獻,這些研究與糖尿病的診斷、預後和管理有關。我們將描述整體 AI 輔助糖尿病照護的發現,包括但不限於 DR 篩檢,並討論實施此類系統的障礙,包括與倫理、資料隱私、公平存取和可解釋性有關的問題。透過評估患者的健康狀況,同時考量糖尿病併發症以及未來心血管併發症的風險預後,AI 輔助視網膜影像分析有潛力成為糖尿病患者現代化個人化醫療的中心工具。 + + +### Medical |Publish Date|Title|Authors|Homepage|Code| | :---: | :---: | :---: | :---: | :---: | -|**2025-02-20**|**GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks**|Jianwen Luo et.al.|[2502.14848v1](http://arxiv.org/abs/2502.14848v1)|null| -|**2025-02-20**|**From RAG to Memory: Non-Parametric Continual Learning for Large Language Models**|Bernal Jiménez Gutiérrez et.al.|[2502.14802v1](http://arxiv.org/abs/2502.14802v1)|[link](https://github.com/osu-nlp-group/hipporag)| -|**2025-02-20**|**Plan-over-Graph: Towards Parallelable LLM Agent Schedule**|Shiqi Zhang et.al.|[2502.14563v1](http://arxiv.org/abs/2502.14563v1)|null| -|**2025-02-20**|**Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization**|Ran Ding et.al.|[2502.14456v1](http://arxiv.org/abs/2502.14456v1)|null| -|**2025-02-20**|**Fact or Guesswork? 
Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment**|Jiaxi Li et.al.|[2502.14275v1](http://arxiv.org/abs/2502.14275v1)|null| -|**2025-02-20**|**Mitigating Lost-in-Retrieval Problems in Retrieval Augmented Multi-Hop Question Answering**|Rongzhi Zhu et.al.|[2502.14245v1](http://arxiv.org/abs/2502.14245v1)|null| -|**2025-02-20**|**NLP-AKG: Few-Shot Construction of NLP Academic Knowledge Graph Based on LLM**|Jiayin Lan et.al.|[2502.14192v1](http://arxiv.org/abs/2502.14192v1)|null| -|**2025-02-19**|**Object-centric Binding in Contrastive Language-Image Pretraining**|Rim Assouel et.al.|[2502.14113v1](http://arxiv.org/abs/2502.14113v1)|null| +|**2025-02-20**|**FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis**|Fadillah Maani et.al.|[2502.14807v1](http://arxiv.org/abs/2502.14807v1)|null| +|**2025-02-20**|**Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning**|Juraj Vladika et.al.|[2502.14765v1](http://arxiv.org/abs/2502.14765v1)|null| +|**2025-02-20**|**MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders**|Maya Varma et.al.|[2502.14753v1](http://arxiv.org/abs/2502.14753v1)|null| +|**2025-02-20**|**Data-Constrained Synthesis of Training Data for De-Identification**|Thomas Vakili et.al.|[2502.14677v1](http://arxiv.org/abs/2502.14677v1)|null| +|**2025-02-20**|**ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation**|Angxiao Yue et.al.|[2502.14637v1](http://arxiv.org/abs/2502.14637v1)|[link](https://github.com/AngxiaoYue/ReQFlow)| +|**2025-02-20**|**MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models**|Shrey Pandit et.al.|[2502.14302v1](http://arxiv.org/abs/2502.14302v1)|null| +|**2025-02-20**|**EyeBench: A Call for More Rigorous Evaluation of Retinal Image Enhancement**|Wenhui Zhu et.al.|[2502.14260v1](http://arxiv.org/abs/2502.14260v1)|null| 
|**2025-02-19**|**Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning**|Cole Gawin et.al.|[2502.14086v1](http://arxiv.org/abs/2502.14086v1)|null| -|**2025-02-19**|**Neurosymbolic artificial intelligence via large language models and coherence-driven inference**|Steve Huntsman et.al.|[2502.13953v1](http://arxiv.org/abs/2502.13953v1)|null| -|**2025-02-19**|**Complex Ontology Matching with Large Language Model Embeddings**|Guilherme Sousa et.al.|[2502.13619v1](http://arxiv.org/abs/2502.13619v1)|null| -|**2025-02-19**|**Are Large Language Models In-Context Graph Learners?**|Jintang Li et.al.|[2502.13562v1](http://arxiv.org/abs/2502.13562v1)|null| +|**2025-02-19**|**Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging**|Shansong Wang et.al.|[2502.14064v1](http://arxiv.org/abs/2502.14064v1)|null| +|**2025-02-19**|**VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare**|Anudeex Shetty et.al.|[2502.13775v1](http://arxiv.org/abs/2502.13775v1)|null| +|**2025-02-19**|**PeerQA: A Scientific Question Answering Dataset from Peer Reviews**|Tim Baumgärtner et.al.|[2502.13668v1](http://arxiv.org/abs/2502.13668v1)|[link](https://github.com/ukplab/peerqa)| |**2025-02-19**|**Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs**|Yushi Feng et.al.|[2502.13555v1](http://arxiv.org/abs/2502.13555v1)|[link](https://github.com/ys-feng/DemoGraph)| -|**2025-02-19**|**PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference**|Burc Gokden et.al.|[2502.13502v1](http://arxiv.org/abs/2502.13502v1)|[link](https://github.com/burcgokden/PLDR-LLM-with-KVG-cache)| -|**2025-02-19**|**Explore-Construct-Filter: An Automated Framework for Rich and Reliable API Knowledge Graph Construction**|Yanbang Sun et.al.|[2502.13412v1](http://arxiv.org/abs/2502.13412v1)|null| -|**2025-02-19**|**Reducing Hallucinations in Language Model-based SPARQL Query 
Generation Using Post-Generation Memory Retrieval**|Aditya Sharma et.al.|[2502.13369v1](http://arxiv.org/abs/2502.13369v1)|null| -|**2025-02-19**|**Craw4LLM: Efficient Web Crawling for LLM Pretraining**|Shi Yu et.al.|[2502.13347v1](http://arxiv.org/abs/2502.13347v1)|[link](https://github.com/cxcscmu/crawl4llm)| -|**2025-02-18**|**K-Paths: Reasoning over Graph Paths for Drug Repurposing and Drug Interaction Prediction**|Tassallah Abdullahi et.al.|[2502.13344v1](http://arxiv.org/abs/2502.13344v1)|[link](https://github.com/rsinghlab/K-Paths)| -|**2025-02-18**|**Grounding LLM Reasoning with Knowledge Graphs**|Alfonso Amayuelas et.al.|[2502.13247v1](http://arxiv.org/abs/2502.13247v1)|null| -|**2025-02-18**|**Learning to Defer for Causal Discovery with Imperfect Experts**|Oscar Clivio et.al.|[2502.13132v1](http://arxiv.org/abs/2502.13132v1)|null| -|**2025-02-18**|**Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks**|Markus J. Buehler et.al.|[2502.13025v1](http://arxiv.org/abs/2502.13025v1)|null| -|**2025-02-18**|**Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge**|Mohammad Reza Rezaei et.al.|[2502.13010v1](http://arxiv.org/abs/2502.13010v1)|null| -|**2025-02-18**|**R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs**|Sumin Jo et.al.|[2502.12767v1](http://arxiv.org/abs/2502.12767v1)|null| -|**2025-02-18**|**Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research**|Xiang Liu et.al.|[2502.12669v1](http://arxiv.org/abs/2502.12669v1)|null| -|**2025-02-18**|**G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation**|Yuhan Li et.al.|[2502.12586v1](http://arxiv.org/abs/2502.12586v1)|[link](https://github.com/yuhan1i/g-refer)| -|**2025-02-17**|**A-MEM: Agentic Memory for LLM Agents**|Wujiang Xu et.al.|[2502.12110v1](http://arxiv.org/abs/2502.12110v1)|null| -|**2025-02-17**|**KnowPath: 
Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs**|Qi Zhao et.al.|[2502.12029v1](http://arxiv.org/abs/2502.12029v1)|null| -|**2025-02-17**|**Atom of Thoughts for Markov LLM Test-Time Scaling**|Fengwei Teng et.al.|[2502.12018v1](http://arxiv.org/abs/2502.12018v1)|null| -|**2025-02-17**|**Generating Text from Uniform Meaning Representation**|Emma Markle et.al.|[2502.11973v1](http://arxiv.org/abs/2502.11973v1)|[link](https://github.com/acnlplab/umr-text-gen)| -|**2025-02-17**|**GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs**|Yi Fang et.al.|[2502.11925v1](http://arxiv.org/abs/2502.11925v1)|null| -|**2025-02-17**|**Exploring LLM-based Student Simulation for Metacognitive Cultivation**|Haoxuan Li et.al.|[2502.11678v1](http://arxiv.org/abs/2502.11678v1)|null| -|**2025-02-17**|**Ontology-Guided Reverse Thinking Makes Large Language Models Stronger on Knowledge Graph Question Answering**|Runxuan Liu et.al.|[2502.11491v1](http://arxiv.org/abs/2502.11491v1)|null| -|**2025-02-17**|**GLTW: Joint Improved Graph Transformer and LLM via Three-Word Language for Knowledge Graph Completion**|Kangyang Luo et.al.|[2502.11471v1](http://arxiv.org/abs/2502.11471v1)|null| -|**2025-02-16**|**Large Language-Geometry Model: When LLM meets Equivariance**|Zongzhao Li et.al.|[2502.11149v2](http://arxiv.org/abs/2502.11149v2)|null| -|**2025-02-16**|**Beyond Pairwise: Global Zero-shot Temporal Graph Generation**|Alon Eirew et.al.|[2502.11114v1](http://arxiv.org/abs/2502.11114v1)|null| +|**2025-02-19**|**MobileViM: A Light-weight and Dimension-independent Vision Mamba for 3D Medical Image Analysis**|Wei Dai et.al.|[2502.13524v1](http://arxiv.org/abs/2502.13524v1)|[link](https://github.com/anthonyweidai/MobileViM_3D)| +|**2025-02-19**|**Unlocking Multimodal Integration in EHRs: A Prompt Learning Framework for Language and Time Series Fusion**|Shuai Niu et.al.|[2502.13509v1](http://arxiv.org/abs/2502.13509v1)|null| 
+|**2025-02-19**|**Towards a perturbation-based explanation for medical AI as differentiable programs**|Takeshi Abe et.al.|[2502.14001v1](http://arxiv.org/abs/2502.14001v1)|null| +|**2025-02-19**|**RGAR: Recurrence Generation-augmented Retrieval for Factual-aware Medical Question Answering**|Sichu Liang et.al.|[2502.13361v1](http://arxiv.org/abs/2502.13361v1)|null| +|**2025-02-18**|**Adjust for Trust: Mitigating Trust-Induced Inappropriate Reliance on AI Assistance**|Tejas Srinivasan et.al.|[2502.13321v1](http://arxiv.org/abs/2502.13321v1)|null| +|**2025-02-18**|**Prediction of Clinical Complication Onset using Neural Point Processes**|Sachini Weerasekara et.al.|[2502.13290v1](http://arxiv.org/abs/2502.13290v1)|null| +|**2025-02-18**|**SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?**|Yucheng Shi et.al.|[2502.13233v1](http://arxiv.org/abs/2502.13233v1)|null| +|**2025-02-18**|**Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions**|Taedong Yun et.al.|[2502.13135v1](http://arxiv.org/abs/2502.13135v1)|null| +|**2025-02-18**|**Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization**|Priyaranjan Pattnayak et.al.|[2502.13108v1](http://arxiv.org/abs/2502.13108v1)|null| +|**2025-02-18**|**Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection**|Athira J Jacob et.al.|[2502.12948v1](http://arxiv.org/abs/2502.12948v1)|null| +|**2025-02-18**|**Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models**|Rubing Li et.al.|[2502.12825v2](http://arxiv.org/abs/2502.12825v2)|null| +|**2025-02-18**|**LLM Safety for Children**|Prasanjit Rath et.al.|[2502.12552v1](http://arxiv.org/abs/2502.12552v1)|[link](https://github.com/Avenge-PRC777/LLM-Safety-For-Children-Code)| 
+|**2025-02-17**|**Classifiers of Data Sharing Statements in Clinical Trial Records**|Saber Jelodari Mamaghani et.al.|[2502.12362v1](http://arxiv.org/abs/2502.12362v1)|null| +|**2025-02-17**|**Relational Norms for Human-AI Cooperation**|Brian D. Earp et.al.|[2502.12102v1](http://arxiv.org/abs/2502.12102v1)|null| +|**2025-02-17**|**Deep Spatio-Temporal Neural Network for Air Quality Reanalysis**|Ammar Kheder et.al.|[2502.11941v1](http://arxiv.org/abs/2502.11941v1)|[link](https://github.com/AmmarKheder/AQ-Net)| +|**2025-02-17**|**Proactive Depot Discovery: A Generative Framework for Flexible Location-Routing**|Site Qu et.al.|[2502.11715v1](http://arxiv.org/abs/2502.11715v1)|null| +|**2025-02-17**|**LLM Agents Making Agent Tools**|Georg Wölflein et.al.|[2502.11705v1](http://arxiv.org/abs/2502.11705v1)|null| +|**2025-02-17**|**MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression**|Linjie Mu et.al.|[2502.11651v1](http://arxiv.org/abs/2502.11651v1)|[link](https://github.com/linjiemu/mmxu)| +|**2025-02-17**|**A Survey of Personalized Large Language Models: Progress and Future Directions**|Jiahong Liu et.al.|[2502.11528v1](http://arxiv.org/abs/2502.11528v1)|null| +|**2025-02-17**|**Variable-frame CNNLSTM for Breast Nodule Classification using Ultrasound Videos**|Xiangxiang Cui et.al.|[2502.11481v1](http://arxiv.org/abs/2502.11481v1)|null| +|**2025-02-17**|**Leveraging Labelled Data Knowledge: A Cooperative Rectification Learning Network for Semi-supervised 3D Medical Image Segmentation**|Yanyan Wang et.al.|[2502.11456v1](http://arxiv.org/abs/2502.11456v1)|[link](https://github.com/Yaan-Wang/CRLN)| +|**2025-02-16**|**A Survey of LLM-based Agents in Medicine: How far are we from Baymax?**|Wenxuan Wang et.al.|[2502.11211v1](http://arxiv.org/abs/2502.11211v1)|null| +|**2025-02-16**|**RT-DEMT: A hybrid real-time acupoint detection model combining mamba and transformer**|Shilong Yang 
et.al.|[2502.11179v1](http://arxiv.org/abs/2502.11179v1)|[link](https://github.com/sohyu1/rt-demt)| |**2025-02-16**|**Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications**|Alexandru Lecu et.al.|[2502.11108v1](http://arxiv.org/abs/2502.11108v1)|[link](https://github.com/alexlecu/llmkgraph)| -|**2025-02-16**|**Beyond Similarity: A Gradient-based Graph Method for Instruction Tuning Data Selection**|Yang Zhao et.al.|[2502.11062v1](http://arxiv.org/abs/2502.11062v1)|null| -|**2025-02-16**|**CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models**|Yuefei Chen et.al.|[2502.11008v1](http://arxiv.org/abs/2502.11008v1)|null| -|**2025-02-16**|**RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation**|Pengcheng Jiang et.al.|[2502.10996v1](http://arxiv.org/abs/2502.10996v1)|[link](https://github.com/pat-jj/Retrieval-And-Structure)| -|**2025-02-15**|**Developing Conversational Speech Systems for Robots to Detect Speech Biomarkers of Cognition in People Living with Dementia**|Rohith Perumandla et.al.|[2502.10896v1](http://arxiv.org/abs/2502.10896v1)|null| -|**2025-02-15**|**Evaluating improvements on using Large Language Models (LLMs) for property extraction in the Open Research Knowledge Graph (ORKG)**|Sandra Schaftner et.al.|[2502.10768v1](http://arxiv.org/abs/2502.10768v1)|null| -|**2025-02-15**|**K-Edit: Language Model Editing with Contextual Knowledge Awareness**|Elan Markowitz et.al.|[2502.10626v1](http://arxiv.org/abs/2502.10626v1)|null| +|**2025-02-16**|**Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration**|Xianbing Zhao et.al.|[2502.12204v1](http://arxiv.org/abs/2502.12204v1)|null| +|**2025-02-16**|**CL-MFAP: A Contrastive Learning-Based Multimodal Foundation Model for Molecular Property Prediction and Antibiotic Screening**|Gen Zhou 
et.al.|[2502.11001v1](http://arxiv.org/abs/2502.11001v1)|[link](https://github.com/clmfap/clmfap)| +|**2025-02-15**|**Automatic Quality Assessment of First Trimester Crown-Rump-Length Ultrasound Images**|Sevim Cengiz et.al.|[2502.10908v1](http://arxiv.org/abs/2502.10908v1)|null| +|**2025-02-15**|**Breaking Down the Hierarchy: A New Approach to Leukemia Classification**|Ibraheem Hamdi et.al.|[2502.10899v1](http://arxiv.org/abs/2502.10899v1)|null| +|**2025-02-15**|**An Empirical Analysis of Uncertainty in Large Language Model Evaluations**|Qiujie Xie et.al.|[2502.10709v1](http://arxiv.org/abs/2502.10709v1)|[link](https://github.com/hasakixie123/llm-evaluator-uncertainty)| +|**2025-02-15**|**Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model**|Jiarui Jin et.al.|[2502.10707v1](http://arxiv.org/abs/2502.10707v1)|[link](https://github.com/pkudigitalhealth/heartlang)| +|**2025-02-15**|**Self-Explaining Hypergraph Neural Networks for Diagnosis Prediction**|Leisheng Yu et.al.|[2502.10689v1](http://arxiv.org/abs/2502.10689v1)|null| |**2025-02-15**|**ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis**|Xueshen Li et.al.|[2502.10620v1](http://arxiv.org/abs/2502.10620v1)|null| -|**2025-02-14**|**GraphiT: Efficient Node Classification on Text-Attributed Graphs with Prompt Optimized LLMs**|Shima Khoshraftar et.al.|[2502.10522v1](http://arxiv.org/abs/2502.10522v1)|null| -|**2025-02-14**|**Do Large Language Models Reason Causally Like Us? Even Better?**|Hanna M. 
Dettki et.al.|[2502.10215v1](http://arxiv.org/abs/2502.10215v1)|null| -|**2025-02-14**|**Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages**|Daniil Gurgurov et.al.|[2502.10140v1](http://arxiv.org/abs/2502.10140v1)|null| -|**2025-02-14**|**Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models**|Chenrui Tie et.al.|[2502.10090v1](http://arxiv.org/abs/2502.10090v1)|null| -|**2025-02-14**|**Decision Information Meets Large Language Models: The Future of Explainable Operations Research**|Yansen Zhang et.al.|[2502.09994v1](http://arxiv.org/abs/2502.09994v1)|null| -|**2025-02-14**|**KGGen: Extracting Knowledge Graphs from Plain Text with Language Models**|Belinda Mo et.al.|[2502.09956v1](http://arxiv.org/abs/2502.09956v1)|null| -|**2025-02-14**|**ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation**|Shu Wang et.al.|[2502.09891v1](http://arxiv.org/abs/2502.09891v1)|null| -|**2025-02-13**|**Visual Graph Question Answering with ASP and LLMs for Language Parsing**|Jakob Johannes Bauer et.al.|[2502.09211v1](http://arxiv.org/abs/2502.09211v1)|null| +|**2025-02-15**|**Optimizing CNN Architectures for Advanced Thoracic Disease Classification**|Tejas Mirthipati et.al.|[2502.10614v1](http://arxiv.org/abs/2502.10614v1)|null| +|**2025-02-14**|**PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation**|Faruk Ahmed et.al.|[2502.10536v1](http://arxiv.org/abs/2502.10536v1)|null| +|**2025-02-14**|**Tempo: Helping Data Scientists and Domain Experts Collaboratively Specify Predictive Modeling Tasks**|Venkatesh Sivaraman et.al.|[2502.10526v2](http://arxiv.org/abs/2502.10526v2)|null| +|**2025-02-14**|**A Robust Attack: Displacement Backdoor Attack**|Yong Li et.al.|[2502.10490v1](http://arxiv.org/abs/2502.10490v1)|null| +|**2025-02-14**|**3D ReX: Causal Explanations in 3D Neuroimaging 
Classification**|Melane Navaratnarajah et.al.|[2502.12181v1](http://arxiv.org/abs/2502.12181v1)|null| +|**2025-02-14**|**Analyzing Patient Daily Movement Behavior Dynamics Using Two-Stage Encoding Model**|Jin Cui et.al.|[2502.09947v1](http://arxiv.org/abs/2502.09947v1)|null| +|**2025-02-14**|**TransGUNet: Transformer Meets Graph-based Skip Connection for Medical Image Segmentation**|Ju-Hyeon Nam et.al.|[2502.09931v1](http://arxiv.org/abs/2502.09931v1)|null| +|**2025-02-14**|**Video2Policy: Scaling up Manipulation Tasks in Simulation through Internet Videos**|Weirui Ye et.al.|[2502.09886v1](http://arxiv.org/abs/2502.09886v1)|null| +|**2025-02-14**|**HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation**|Tianwei Lin et.al.|[2502.09838v2](http://arxiv.org/abs/2502.09838v2)|[link](https://github.com/dcdmllm/healthgpt)| +|**2025-02-13**|**Incentivize without Bonus: Provably Efficient Model-based Online Multi-agent RL for Markov Games**|Tong Yang et.al.|[2502.09780v1](http://arxiv.org/abs/2502.09780v1)|null| +|**2025-02-13**|**The AI-Therapist Duo: Exploring the Potential of Human-AI Collaboration in Personalized Art Therapy for PICS Intervention**|Bereket A. Yilma et.al.|[2502.09757v1](http://arxiv.org/abs/2502.09757v1)|null| +|**2025-02-13**|**A CNN Approach to Automated Detection and Classification of Brain Tumors**|Md. Zahid Hasan et.al.|[2502.09731v1](http://arxiv.org/abs/2502.09731v1)|null| +|**2025-02-13**|**Evaluating GPT's Capability in Identifying Stages of Cognitive Impairment from Electronic Health Data**|Yu Leng et.al.|[2502.09715v1](http://arxiv.org/abs/2502.09715v1)|null| +|**2025-02-13**|**Metamorphic Testing for Pose Estimation Systems**|Matias Duran et.al.|[2502.09460v1](http://arxiv.org/abs/2502.09460v1)|null| +|**2025-02-13**|**Towards Virtual Clinical Trials of Radiology AI with Conditional Generative Modeling**|Benjamin D. 
Killeen et.al.|[2502.09688v1](http://arxiv.org/abs/2502.09688v1)|null| +|**2025-02-13**|**Mind What You Ask For: Emotional and Rational Faces of Persuasion by Large Language Models**|Wiktoria Mieleszczenko-Kowszewicz et.al.|[2502.09687v1](http://arxiv.org/abs/2502.09687v1)|null| +|**2025-02-13**|**The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**|Danni Feng et.al.|[2502.09247v1](http://arxiv.org/abs/2502.09247v1)|null| +|**2025-02-13**|**From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**|Lukas Buess et.al.|[2502.09242v1](http://arxiv.org/abs/2502.09242v1)|null| +|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null| +|**2025-02-13**|**Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**|Sanskar Sehgal et.al.|[2502.09204v1](http://arxiv.org/abs/2502.09204v1)|null| +|**2025-02-13**|**Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**|Jin Cui et.al.|[2502.09173v1](http://arxiv.org/abs/2502.09173v1)|null| +|**2025-02-13**|**TastepepAI, An artificial intelligence platform for taste peptide de novo design**|Jianda Yue et.al.|[2502.12167v1](http://arxiv.org/abs/2502.12167v1)|null| +|**2025-02-12**|**HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**|Valentina Vadori et.al.|[2502.08754v1](http://arxiv.org/abs/2502.08754v1)|[link](https://github.com/Vadori/CytoArk)| +|**2025-02-12**|**Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion**|Lemuel Puglisi et.al.|[2502.08560v1](http://arxiv.org/abs/2502.08560v1)|[link](https://github.com/lemuelpuglisi/brlp)| 
|**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null| -|**2025-02-12**|**Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy**|Ruizhan Xue et.al.|[2502.08353v1](http://arxiv.org/abs/2502.08353v1)|null| -|**2025-02-12**|**Graph Foundation Models for Recommendation: A Comprehensive Survey**|Bin Wu et.al.|[2502.08346v3](http://arxiv.org/abs/2502.08346v3)|null| -|**2025-02-12**|**Self-Evaluation for Job-Shop Scheduling**|Imanol Echeverria et.al.|[2502.08684v1](http://arxiv.org/abs/2502.08684v1)|null| -|**2025-02-12**|**Improving Existing Optimization Algorithms with LLMs**|Camilo Chacón Sartori et.al.|[2502.08298v1](http://arxiv.org/abs/2502.08298v1)|null| -|**2025-02-12**|**LLM4GNAS: A Large Language Model Based Toolkit for Graph Neural Architecture Search**|Yang Gao et.al.|[2502.10459v1](http://arxiv.org/abs/2502.10459v1)|null| -|**2025-02-12**|**ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning**|Vy Vo et.al.|[2502.08148v1](http://arxiv.org/abs/2502.08148v1)|null| -|**2025-02-12**|**Neuro-Conceptual Artificial Intelligence: Integrating OPM with Deep Learning to Enhance Question Answering Quality**|Xin Kang et.al.|[2502.09658v1](http://arxiv.org/abs/2502.09658v1)|null| -|**2025-02-12**|**GCoT: Chain-of-Thought Prompt Learning for Graphs**|Xingtong Yu et.al.|[2502.08092v1](http://arxiv.org/abs/2502.08092v1)|null| -|**2025-02-12**|**Linking Cryptoasset Attribution Tags to Knowledge Graph Entities: An LLM-based Approach**|Régnier Avice et.al.|[2502.10453v1](http://arxiv.org/abs/2502.10453v1)|[link](https://github.com/ravice234/cryptoasset-attribution-tag-linker)| -|**2025-02-11**|**Deep Semantic Graph Learning via LLM based Node Enhancement**|Chuanqi Shi et.al.|[2502.07982v1](http://arxiv.org/abs/2502.07982v1)|null| -|**2025-02-10**|**Cardiverse: Harnessing LLMs for Novel Card Game Prototyping**|Danrui 
Li et.al.|[2502.07128v1](http://arxiv.org/abs/2502.07128v1)|null| -|**2025-02-10**|**GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural Processing Units**|Arghadip Das et.al.|[2502.06921v2](http://arxiv.org/abs/2502.06921v2)|[link](https://github.com/arghadippurdue/GraNNite)| -|**2025-02-10**|**Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language**|Zhiqiang Zhong et.al.|[2502.06634v1](http://arxiv.org/abs/2502.06634v1)|null| -|**2025-02-10**|**KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment**|Yuxing Lu et.al.|[2502.06472v1](http://arxiv.org/abs/2502.06472v1)|[link](https://github.com/YuxingLu613/KARMA)| -|**2025-02-10**|**RoToR: Towards More Reliable Responses for Order-Invariant Inputs**|Soyoung Yoon et.al.|[2502.08662v1](http://arxiv.org/abs/2502.08662v1)|null| -|**2025-02-10**|**K-ON: Stacking Knowledge On the Head Layer of Large Language Model**|Lingbing Guo et.al.|[2502.06257v1](http://arxiv.org/abs/2502.06257v1)|null| -|**2025-02-10**|**LegalViz: Legal Text Visualization by Text To Diagram Generation**|Eri Onami et.al.|[2502.06147v2](http://arxiv.org/abs/2502.06147v2)|null| -|**2025-02-09**|**Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs**|Han Meng et.al.|[2502.06075v1](http://arxiv.org/abs/2502.06075v1)|null| -|**2025-02-09**|**LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification**|Shubham Kumar Nigam et.al.|[2502.05836v1](http://arxiv.org/abs/2502.05836v1)|null| -|**2025-02-08**|**LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning**|Hanqing Yang et.al.|[2502.05453v1](http://arxiv.org/abs/2502.05453v1)|null| -|**2025-02-08**|**SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation**|Xingtong Yu 
et.al.|[2502.05424v1](http://arxiv.org/abs/2502.05424v1)|null| -|**2025-02-08**|**Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints**|Ali Al-Lawati et.al.|[2502.05414v1](http://arxiv.org/abs/2502.05414v1)|null| -|**2025-02-08**|**Knowledge Graph-Guided Retrieval Augmented Generation**|Xiangrong Zhu et.al.|[2502.06864v1](http://arxiv.org/abs/2502.06864v1)|[link](https://github.com/nju-websoft/KG2RAG)| -|**2025-02-07**|**Can Large Language Models Understand Intermediate Representations?**|Hailong Jiang et.al.|[2502.06854v1](http://arxiv.org/abs/2502.06854v1)|null| -|**2025-02-07**|**GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?**|Yang Zhou et.al.|[2502.05252v1](http://arxiv.org/abs/2502.05252v1)|[link](https://github.com/Infini-AI-Lab/gsm_infinite)| -|**2025-02-07**|**Causality can systematically address the monsters under the bench(marks)**|Felix Leeb et.al.|[2502.05085v1](http://arxiv.org/abs/2502.05085v1)|null| -|**2025-02-07**|**Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures**|Tushar Pandey et.al.|[2502.05078v1](http://arxiv.org/abs/2502.05078v1)|[link](https://github.com/AgnostiqHQ/multi-agent-llm)| -|**2025-02-07**|**Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics**|Hussam Ghanem et.al.|[2502.05239v1](http://arxiv.org/abs/2502.05239v1)|null| -|**2025-02-07**|**Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research**|Junde Wu et.al.|[2502.04644v1](http://arxiv.org/abs/2502.04644v1)|[link](https://github.com/theworldofagents/agentic-reasoning)| -|**2025-02-07**|**Position-aware Automatic Circuit Discovery**|Tal Haklay et.al.|[2502.04577v1](http://arxiv.org/abs/2502.04577v1)|[link](https://github.com/technion-cs-nlp/peap)| -|**2025-02-06**|**Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems**|Shangbin Feng 
et.al.|[2502.04510v1](http://arxiv.org/abs/2502.04510v1)|null| -|**2025-02-06**|**MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**|Xuejiao Zhao et.al.|[2502.04413v1](http://arxiv.org/abs/2502.04413v1)|[link](https://github.com/snowteam2023/medrag)| -|**2025-02-06**|**Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering**|Longquan Jiang et.al.|[2502.03992v1](http://arxiv.org/abs/2502.03992v1)|[link](https://github.com/longquanjiang/ontoscprompt)| -|**2025-02-06**|**Multimodal Medical Code Tokenizer**|Xiaorui Su et.al.|[2502.04397v2](http://arxiv.org/abs/2502.04397v2)|null| -|**2025-02-06**|**Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents**|Chenyang Shao et.al.|[2502.04392v1](http://arxiv.org/abs/2502.04392v1)|null| -|**2025-02-06**|**Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models**|Rui Cai et.al.|[2502.03715v1](http://arxiv.org/abs/2502.03715v1)|null| -|**2025-02-05**|**A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)**|Yiye Chen et.al.|[2502.03450v1](http://arxiv.org/abs/2502.03450v1)|null| -|**2025-02-05**|**SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**|Ben Liu et.al.|[2502.03283v2](http://arxiv.org/abs/2502.03283v2)|null| -|**2025-02-05**|**Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**|Daniil Laptev et.al.|[2502.03032v2](http://arxiv.org/abs/2502.03032v2)|null| -|**2025-02-05**|**A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs**|Bradley P. 
Allen et.al.|[2502.02896v1](http://arxiv.org/abs/2502.02896v1)|null| -|**2025-02-05**|**Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization**|Chanhui Lee et.al.|[2502.02810v1](http://arxiv.org/abs/2502.02810v1)|null| -|**2025-02-05**|**Leveraging the true depth of LLMs**|Ramón Calvo González et.al.|[2502.02790v1](http://arxiv.org/abs/2502.02790v1)|null| -|**2025-02-04**|**Modular Training of Neural Networks aids Interpretability**|Satvik Golechha et.al.|[2502.02470v2](http://arxiv.org/abs/2502.02470v2)|null| -|**2025-02-04**|**Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs**|Sagnik Mukherjee et.al.|[2502.02362v3](http://arxiv.org/abs/2502.02362v3)|null| -|**2025-02-04**|**AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement**|Shivam Singh et.al.|[2502.02067v1](http://arxiv.org/abs/2502.02067v1)|[link](https://github.com/sssshivvvv/adaptbot)| -|**2025-02-03**|**On Bob Dylan: A Computational Perspective**|Prashant Garg et.al.|[2502.01772v1](http://arxiv.org/abs/2502.01772v1)|null| -|**2025-02-03**|**VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos**|Xubin Ren et.al.|[2502.01549v1](http://arxiv.org/abs/2502.01549v1)|null| -|**2025-02-03**|**Transformers trained on proteins can learn to attend to Euclidean distance**|Isaac Ellmen et.al.|[2502.01533v1](http://arxiv.org/abs/2502.01533v1)|[link](https://github.com/Ellmen/attending-to-distance)| +|**2025-02-12**|**EEG Artifact Detection and Correction with Deep Autoencoders**|David Aquilué-Llorens et.al.|[2502.08686v1](http://arxiv.org/abs/2502.08686v1)|null| +|**2025-02-12**|**SycEval: Evaluating LLM Sycophancy**|Aaron Fanous et.al.|[2502.08177v1](http://arxiv.org/abs/2502.08177v1)|null| +|**2025-02-12**|**Cancer Vaccine Adjuvant Name Recognition from Biomedical Literature using Large Language Models**|Hasin Rehana 
et.al.|[2502.09659v1](http://arxiv.org/abs/2502.09659v1)|null| +|**2025-02-11**|**Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?**|Hye Sun Yun et.al.|[2502.07963v1](http://arxiv.org/abs/2502.07963v1)|null| +|**2025-02-11**|**An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating**|Mohammad Ali Labbaf Khaniki et.al.|[2502.07755v1](http://arxiv.org/abs/2502.07755v1)|null| +|**2025-02-11**|**Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension**|Wenbo Gong et.al.|[2502.07752v2](http://arxiv.org/abs/2502.07752v2)|null| +|**2025-02-11**|**The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation**|Raman Dutt et.al.|[2502.07516v2](http://arxiv.org/abs/2502.07516v2)|[link](https://github.com/Raman1121/diffusion_memorization)| +|**2025-02-11**|**KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level**|Ruining Deng et.al.|[2502.07288v1](http://arxiv.org/abs/2502.07288v1)|[link](https://github.com/agaldran/kpis)| +|**2025-02-11**|**Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**|Jiaying Lu et.al.|[2502.07158v2](http://arxiv.org/abs/2502.07158v2)|null| +|**2025-02-11**|**Explaining 3D Computed Tomography Classifiers with Counterfactuals**|Joseph Paul Cohen et.al.|[2502.07156v1](http://arxiv.org/abs/2502.07156v1)|[link](https://github.com/ieee8023/ct-counterfactuals)| +|**2025-02-10**|**Interactive Data Harmonization with LLM Agents**|Aécio Santos et.al.|[2502.07132v1](http://arxiv.org/abs/2502.07132v1)|null| +|**2025-02-10**|**Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML**|Mohammad Amir Salari et.al.|[2502.07026v1](http://arxiv.org/abs/2502.07026v1)|null| +|**2025-02-10**|**AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate 
Statements**|Adriana Eufrosiana Bora et.al.|[2502.07022v1](http://arxiv.org/abs/2502.07022v1)|null| +|**2025-02-10**|**Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium**|Amin Adibi et.al.|[2502.06693v1](http://arxiv.org/abs/2502.06693v1)|null| +|**2025-02-10**|**Automatic Evaluation of Healthcare LLMs Beyond Question-Answering**|Anna Arias-Duart et.al.|[2502.06666v1](http://arxiv.org/abs/2502.06666v1)|null| +|**2025-02-10**|**Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging**|Mohammed Abdul Hafeez Khan et.al.|[2502.06632v1](http://arxiv.org/abs/2502.06632v1)|null| +|**2025-02-10**|**Illegal Waste Detection in Remote Sensing Images: A Case Study**|Federico Gibellini et.al.|[2502.06607v2](http://arxiv.org/abs/2502.06607v2)|null| +|**2025-02-10**|**FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model**|Anna Tegon et.al.|[2502.06438v1](http://arxiv.org/abs/2502.06438v1)|null| +|**2025-02-10**|**Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?**|Qingshan Hou et.al.|[2502.06289v1](http://arxiv.org/abs/2502.06289v1)|null| +|**2025-02-10**|**Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning**|Liuqing Chen et.al.|[2502.06134v1](http://arxiv.org/abs/2502.06134v1)|null| +|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null| +|**2025-02-10**|**Can ChatGPT Diagnose Alzheimer's Disease?**|Quoc-Toan Nguyen et.al.|[2502.06907v1](http://arxiv.org/abs/2502.06907v1)|null| +|**2025-02-09**|**Protecting Intellectual Property of EEG-based Neural Networks with Watermarking**|Ahmed Abdelaziz 
et.al.|[2502.05931v1](http://arxiv.org/abs/2502.05931v1)|[link](https://github.com/Prog-Jacob/watermarking-eeg-models)| +|**2025-02-09**|**Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models**|Shiyu Teng et.al.|[2502.05879v1](http://arxiv.org/abs/2502.05879v1)|null| +|**2025-02-09**|**LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison**|Gabriele De Vito et.al.|[2502.06890v1](http://arxiv.org/abs/2502.06890v1)|null| +|**2025-02-09**|**Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)**|Lokesh Koli et.al.|[2502.07815v1](http://arxiv.org/abs/2502.07815v1)|null| +|**2025-02-09**|**WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch**|Ying Lei et.al.|[2502.05783v1](http://arxiv.org/abs/2502.05783v1)|null| +|**2025-02-09**|**RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care**|Ziqi Yang et.al.|[2502.05740v1](http://arxiv.org/abs/2502.05740v1)|null| +|**2025-02-08**|**4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis**|An Zhao et.al.|[2502.05713v1](http://arxiv.org/abs/2502.05713v1)|null| +|**2025-02-08**|**KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy**|Hyunjong Kim et.al.|[2502.05651v1](http://arxiv.org/abs/2502.05651v1)|null| +|**2025-02-08**|**ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. 
A Case Study on Clinical Reports**|Aynur Guluzade et.al.|[2502.05638v1](http://arxiv.org/abs/2502.05638v1)|[link](https://gitlab.cc-asp.fraunhofer.de/health-open/elmtex)| #### Abstracts -##### **GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks** -2502.14848v1 by Jianwen Luo, Yiming Huang, Jinxiang Meng, Fangyu Lei, Shizhu He, Xiao Liu, Shanshan Jiang, Bin Dong, Jun Zhao, Kang Liu - -Large Language Models (LLMs) have shown great promise in tool-making, yet -existing frameworks often struggle to efficiently construct reliable toolsets -and are limited to single-task settings. To address these challenges, we -propose GATE (Graph-based Adaptive Tool Evolution), an adaptive framework that -dynamically constructs and evolves a hierarchical graph of reusable tools -across multiple scenarios. We evaluate GATE on open-ended tasks (Minecraft), -agent-based tasks (TextCraft, DABench), and code generation tasks (MATH, Date, -TabMWP). Our results show that GATE achieves up to 4.3x faster milestone -completion in Minecraft compared to the previous SOTA, and provides an average -improvement of 9.23% over existing tool-making methods in code generation tasks -and 10.03% in agent tasks. GATE demonstrates the power of adaptive evolution, -balancing tool quantity, complexity, and functionality while maintaining high -efficiency. Code and data are available at -\url{https://github.com/ayanami2003/GATE}. 
- -摘要:大型語言模型 (LLM) 在工具製作方面展現出極大的潛力,然而現有的框架經常難以有效地建構可靠的工具組,並且僅限於單一任務設定。為了應對這些挑戰,我們提出了 GATE(基於圖形的自適應工具演化),這是一個自適應框架,可跨多個場景動態建構和演化可重複使用的工具階層圖。我們在開放式任務(Minecraft)、基於代理的任務(TextCraft、DABench)和程式碼生成任務(MATH、Date、TabMWP)上評估了 GATE。我們的結果顯示,與先前的 SOTA 相比,GATE 在 Minecraft 中實現了高達 4.3 倍的里程碑完成速度,並且在程式碼生成任務中提供了比現有工具製作方法平均提升 9.23%,在代理任務中提升了 10.03%。GATE 展示了自適應演化的力量,在保持高效率的同時,平衡了工具數量、複雜性和功能性。程式碼和資料可在 \url{https://github.com/ayanami2003/GATE} 取得。 - -##### **From RAG to Memory: Non-Parametric Continual Learning for Large Language Models** -2502.14802v1 by Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, Yu Su +##### **FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis** +2502.14807v1 by Fadillah Maani, Numan Saeed, Tausifa Saleem, Zaid Farooq, Hussain Alasmawi, Werner Diehl, Ameera Mohammad, Gareth Waring, Saudabi Valappi, Leanne Bricker, Mohammad Yaqub -Our ability to continuously acquire, organize, and leverage knowledge is a -key feature of human intelligence that AI systems must approximate to unlock -their full potential. Given the challenges in continual learning with large -language models (LLMs), retrieval-augmented generation (RAG) has become the -dominant way to introduce new information. However, its reliance on vector -retrieval hinders its ability to mimic the dynamic and interconnected nature of -human long-term memory. Recent RAG approaches augment vector embeddings with -various structures like knowledge graphs to address some of these gaps, namely -sense-making and associativity. However, their performance on more basic -factual memory tasks drops considerably below standard RAG. We address this -unintended deterioration and propose HippoRAG 2, a framework that outperforms -standard RAG comprehensively on factual, sense-making, and associative memory -tasks. HippoRAG 2 builds upon the Personalized PageRank algorithm used in -HippoRAG and enhances it with deeper passage integration and more effective -online use of an LLM. 
This combination pushes this RAG system closer to the -effectiveness of human long-term memory, achieving a 7% improvement in -associative memory tasks over the state-of-the-art embedding model while also -exhibiting superior factual knowledge and sense-making memory capabilities. -This work paves the way for non-parametric continual learning for LLMs. Our -code and data will be released at https://github.com/OSU-NLP-Group/HippoRAG. +Foundation models are becoming increasingly effective in the medical domain, +offering pre-trained models on large datasets that can be readily adapted for +downstream tasks. Despite progress, fetal ultrasound images remain a +challenging domain for foundation models due to their inherent complexity, +often requiring substantial additional training and facing limitations due to +the scarcity of paired multimodal data. To overcome these challenges, here we +introduce FetalCLIP, a vision-language foundation model capable of generating +universal representation of fetal ultrasound images. FetalCLIP was pre-trained +using a multimodal learning approach on a diverse dataset of 210,035 fetal +ultrasound images paired with text. This represents the largest paired dataset +of its kind used for foundation model development to date. This unique training +approach allows FetalCLIP to effectively learn the intricate anatomical +features present in fetal ultrasound images, resulting in robust +representations that can be used for a variety of downstream applications. In +extensive benchmarking across a range of key fetal ultrasound applications, +including classification, gestational age estimation, congenital heart defect +(CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all +baselines while demonstrating remarkable generalizability and strong +performance even with limited labeled data. We plan to release the FetalCLIP +model publicly for the benefit of the broader scientific community. 
-摘要:我們持續獲取、組織和利用知識的能力是人類智慧的一項關鍵特徵,而人工智慧系統必須近似於此才能發揮其全部潛力。由於大型語言模型 (LLM) 持續學習的挑戰,檢索增強生成 (RAG) 已成為引入新資訊的主流方式。然而,它依賴向量檢索阻礙了它模擬人類長期記憶的動態和相互連結的本質。最近的 RAG 方法用各種結構(如知識圖譜)增強向量嵌入,以解決其中一些差距,即意義建構和聯想性。然而,它們在更基本的實際記憶任務上的表現遠低於標準 RAG。我們解決了這種意外的惡化,並提出了 HippoRAG 2,這是一個在實際、意義建構和聯想記憶任務上全面優於標準 RAG 的框架。HippoRAG 2 建立在 HippoRAG 中使用的 Personalized PageRank 演算法之上,並透過更深入的段落整合和更有效的 LLM 線上使用來增強它。這種組合將此 RAG 系統推向更接近人類長期記憶的效能,在聯想記憶任務上比最先進的嵌入模型提升了 7%,同時也展現出優異的實際知識和意義建構記憶能力。這項工作為 LLM 的非參數持續學習鋪平了道路。我們的程式碼和資料將在 https://github.com/OSU-NLP-Group/HippoRAG 上發布。 +摘要:基礎模型在醫療領域正變得越來越有效, +提供在大型資料集上預先訓練的模型,可輕鬆適應 +下游任務。儘管有進展,但胎兒超音波影像仍然是 +基礎模型的挑戰領域,因為它們固有的複雜性, +通常需要大量的額外訓練,並且由於配對多模態數據的稀缺而面臨限制。為了克服這些挑戰,我們在此 +介紹 FetalCLIP,一種能夠產生 +胎兒超音波影像通用表示的視覺語言基礎模型。FetalCLIP 使用多模態學習方法在包含 210,035 張胎兒 +超音波影像與文字配對的多樣化資料集上進行預訓練。這代表迄今為止用於基礎模型開發的最大配對資料集。這種獨特的訓練 +方法使 FetalCLIP 能夠有效地學習胎兒超音波影像中存在的複雜解剖特徵,從而產生強大的 +表示,可應用於各種下游應用。在涵蓋一系列關鍵胎兒超音波應用(包括分類、胎齡估算、先天性心臟缺陷 +(CHD) 偵測和胎兒結構分割)的廣泛基準測試中,FetalCLIP 在展現出卓越的泛化能力和強勁的 +效能,即使標記資料有限,也優於所有基準。我們計畫公開發布 FetalCLIP 模型,造福廣大的科學界。 -##### **Plan-over-Graph: Towards Parallelable LLM Agent Schedule** -2502.14563v1 by Shiqi Zhang, Xinbei Ma, Zouying Cao, Zhuosheng Zhang, Hai Zhao +##### **Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning** +2502.14765v1 by Juraj Vladika, Ivana Hacajová, Florian Matthes -Large Language Models (LLMs) have demonstrated exceptional abilities in -reasoning for task planning. However, challenges remain under-explored for -parallel schedules. This paper introduces a novel paradigm, plan-over-graph, in -which the model first decomposes a real-life textual task into executable -subtasks and constructs an abstract task graph. The model then understands this -task graph as input and generates a plan for parallel execution. To enhance the -planning capability of complex, scalable graphs, we design an automated and -controllable pipeline to generate synthetic graphs and propose a two-stage -training scheme. 
Experimental results show that our plan-over-graph method -significantly improves task performance on both API-based LLMs and trainable -open-sourced LLMs. By normalizing complex tasks as graphs, our method naturally -supports parallel execution, demonstrating global efficiency. The code and data -are available at https://github.com/zsq259/Plan-over-Graph. +Fact verification (FV) aims to assess the veracity of a claim based on +relevant evidence. The traditional approach for automated FV includes a +three-part pipeline relying on short evidence snippets and encoder-only +inference models. More recent approaches leverage the multi-turn nature of LLMs +to address FV as a step-by-step problem where questions inquiring additional +context are generated and answered until there is enough information to make a +decision. This iterative method makes the verification process rational and +explainable. While these methods have been tested for encyclopedic claims, +exploration on domain-specific and realistic claims is missing. In this work, +we apply an iterative FV system on three medical fact-checking datasets and +evaluate it with multiple settings, including different LLMs, external web +search, and structured reasoning using logic predicates. We demonstrate +improvements in the final performance over traditional approaches and the high +potential of step-by-step FV systems for domain-specific claims. 
-摘要:大型語言模型 (LLM) 已展現出在任務規劃推理方面的非凡能力。然而,對於並行時程表的挑戰仍未充分探討。本文介紹了一個新穎的範例,即圖形規劃,其中模型首先將現實生活中的文字任務分解為可執行的子任務,並建構一個抽象任務圖。然後,模型將此任務圖理解為輸入,並產生一個並行執行的計畫。為了增強複雜、可擴充圖形的規劃能力,我們設計了一個自動化且可控的管道來產生合成圖形,並提出了一個兩階段訓練方案。實驗結果表明,我們的圖形規劃方法顯著提升了基於 API 的 LLM 和可訓練的開源 LLM 的任務效能。透過將複雜任務標準化為圖形,我們的模型自然支援並行執行,展現出整體效率。程式碼和資料可在 https://github.com/zsq259/Plan-over-Graph 取得。 +摘要:事實驗證 (FV) 旨在根據相關證據評估主張的真實性。自動化 FV 的傳統方法包括依賴於短證據片段和僅編碼器推論模型的三部分管道。最近的方法利用 LLM 的多輪特性,將 FV 視為一個逐步問題,其中會產生問題來詢問額外背景並回答,直到有足夠的資訊可以做出決定。這種迭代方法使驗證過程合理且可解釋。雖然這些方法已針對百科全書式主張進行測試,但缺乏對特定領域和現實主張的探討。在這項工作中,我們在三個醫學事實查核資料集上應用了一個迭代 FV 系統,並使用多種設定對其進行評估,包括不同的 LLM、外部網路搜尋和使用邏輯謂詞的結構化推理。我們展示了傳統方法的最終效能改進,以及逐步 FV 系統對特定領域主張的高潛力。 -##### **Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization** -2502.14456v1 by Ran Ding, Ziyu Zhang, Ying Zhu, Ziqian Kong, Peilan Xu +##### **MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders** +2502.14753v1 by Maya Varma, Ashwin Kumar, Rogier van der Sluijs, Sophie Ostmeier, Louis Blankemeier, Pierre Chambon, Christian Bluethgen, Jip Prince, Curtis Langlotz, Akshay Chaudhari -To enhance tourists' experiences and immersion, this paper proposes a -narrative-driven travel planning framework called NarrativeGuide, which -generates a geoculturally-grounded narrative script for travelers, offering a -novel, role-playing experience for their journey. In the initial stage, -NarrativeGuide constructs a knowledge graph for attractions within a city, then -configures the worldview, character setting, and exposition based on the -knowledge graph. Using this foundation, the knowledge graph is combined to -generate an independent scene unit for each attraction. During the itinerary -planning stage, NarrativeGuide models narrative-driven travel planning as an -optimization problem, utilizing a genetic algorithm (GA) to refine the -itinerary. 
Before evaluating the candidate itinerary, transition scripts are -generated for each pair of adjacent attractions, which, along with the scene -units, form a complete script. The weighted sum of script coherence, travel -time, and attraction scores is then used as the fitness value to update the -candidate solution set. Experimental results across four cities, i.e., Nanjing -and Yangzhou in China, Paris in France, and Berlin in Germany, demonstrate -significant improvements in narrative coherence and cultural fit, alongside a -notable reduction in travel time and an increase in the quality of visited -attractions. Our study highlights that incorporating external evolutionary -optimization effectively addresses the limitations of large language models in -travel planning. Our code is available at -https://github.com/Evan01225/Narrative-Driven-Travel-Planning. +Medical images are acquired at high resolutions with large fields of view in +order to capture fine-grained features necessary for clinical decision-making. +Consequently, training deep learning models on medical images can incur large +computational costs. In this work, we address the challenge of downsizing +medical images in order to improve downstream computational efficiency while +preserving clinically-relevant features. We introduce MedVAE, a family of six +large-scale 2D and 3D autoencoders capable of encoding medical images as +downsized latent representations and decoding latent representations back to +high-resolution images. We train MedVAE autoencoders using a novel two-stage +training approach with 1,052,730 medical images. 
Across diverse tasks obtained +from 20 medical image datasets, we demonstrate that (1) utilizing MedVAE latent +representations in place of high-resolution images when training downstream +models can lead to efficiency benefits (up to 70x improvement in throughput) +while simultaneously preserving clinically-relevant features and (2) MedVAE can +decode latent representations back to high-resolution images with high +fidelity. Our work demonstrates that large-scale, generalizable autoencoders +can help address critical efficiency challenges in the medical domain. Our code +is available at https://github.com/StanfordMIMI/MedVAE. -摘要:為了增強遊客的體驗和沉浸感,本文提出了一個名為 NarrativeGuide 的敘事驅動旅遊規劃框架,它會為旅客產生一個以地理文化為基礎的敘事腳本,為他們的旅程提供一個新穎的角色扮演體驗。在初始階段,NarrativeGuide 會為城市內的景點建立一個知識圖譜,然後根據知識圖譜配置世界觀、角色設定和說明。利用這個基礎,知識圖譜會與每個景點結合,為其產生一個獨立的場景單元。在行程規劃階段,NarrativeGuide 將敘事驅動的旅遊規劃建模為一個最佳化問題,利用遺傳演算法 (GA) 來優化行程。在評估候選行程之前,會為每對相鄰景點產生過場腳本,這些腳本會與場景單元一起形成一個完整的腳本。接著,將腳本連貫性、旅遊時間和景點分數的加權和用作適應值,以更新候選解集。在四個城市(即中國的南京和揚州、法國的巴黎和德國的柏林)進行的實驗結果顯示,敘事連貫性和文化契合度都有顯著的提升,同時旅遊時間大幅減少,且所參觀景點的品質也提升了。我們的研究強調,納入外部演化最佳化能有效解決大型語言模型在旅遊規劃中的限制。我們的程式碼可在 https://github.com/Evan01225/Narrative-Driven-Travel-Planning 取得。 +摘要:医学影像以高解析度和广阔的视野获取,以便捕捉临床决策所需的细微特征。因此,在医学影像上训练深度学习模型可能会产生巨大的计算成本。在这项工作中,我们解决了缩小医学影像以提高下游计算效率同时保留临床相关特征的挑战。我们介绍了 MedVAE,这是一个由六个大型 2D 和 3D 自动编码器组成的系列,能够将医学影像编码为缩小的潜在表示,并将潜在表示解码回高分辨率影像。我们使用一种新颖的两阶段训练方法,利用 1,052,730 张医学影像来训练 MedVAE 自动编码器。在从 20 个医学影像数据集获得的不同任务中,我们证明了 (1) 在训练下游模型时,利用 MedVAE 潜在表示代替高分辨率影像可以带来效率优势(吞吐量提高高达 70 倍),同时保留临床相关特征;(2) MedVAE 可以将潜在表示解码回高分辨率影像,且保真度高。我们的工作表明,大规模、可推广的自动编码器可以帮助解决医学领域的重大效率挑战。我们的代码可在 https://github.com/StanfordMIMI/MedVAE 获得。 -##### **Fact or Guesswork? 
Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment** -2502.14275v1 by Jiaxi Li, Yiwei Wang, Kai Zhang, Yujun Cai, Bryan Hooi, Nanyun Peng, Kai-Wei Chang, Jin Lu +##### **Data-Constrained Synthesis of Training Data for De-Identification** +2502.14677v1 by Thomas Vakili, Aron Henriksson, Hercules Dalianis -Large language models (LLMs) have been widely adopted in various downstream -task domains. However, their ability to directly recall and apply factual -medical knowledge remains under-explored. Most existing medical QA benchmarks -assess complex reasoning or multi-hop inference, making it difficult to isolate -LLMs' inherent medical knowledge from their reasoning capabilities. Given the -high-stakes nature of medical applications, where incorrect information can -have critical consequences, it is essential to evaluate how well LLMs encode, -retain, and recall fundamental medical facts. - To bridge this gap, we introduce the Medical Knowledge Judgment, a dataset -specifically designed to measure LLMs' one-hop factual medical knowledge. MKJ -is constructed from the Unified Medical Language System (UMLS), a large-scale -repository of standardized biomedical vocabularies and knowledge graphs. We -frame knowledge assessment as a binary judgment task, requiring LLMs to verify -the correctness of medical statements extracted from reliable and structured -knowledge sources. - Our experiments reveal that LLMs struggle with factual medical knowledge -retention, exhibiting significant performance variance across different -semantic categories, particularly for rare medical conditions. Furthermore, -LLMs show poor calibration, often being overconfident in incorrect answers. To -mitigate these issues, we explore retrieval-augmented generation, demonstrating -its effectiveness in improving factual accuracy and reducing uncertainty in -medical decision-making. 
+Many sensitive domains -- such as the clinical domain -- lack widely +available datasets due to privacy risks. The increasing generative capabilities +of large language models (LLMs) have made synthetic datasets a viable path +forward. In this study, we domain-adapt LLMs to the clinical domain and +generate synthetic clinical texts that are machine-annotated with tags for +personally identifiable information using capable encoder-based NER models. The +synthetic corpora are then used to train synthetic NER models. The results show +that training NER models using synthetic corpora incurs only a small drop in +predictive performance. The limits of this process are investigated in a +systematic ablation study -- using both Swedish and Spanish data. Our analysis +shows that smaller datasets can be sufficient for domain-adapting LLMs for data +synthesis. Instead, the effectiveness of this process is almost entirely +contingent on the performance of the machine-annotating NER models trained +using the original data. 
-摘要:大型語言模型 (LLM) 已廣泛應用於各種下游 -任務領域。然而,它們直接回憶和應用事實 -醫學知識的能力仍未得到充分探索。大多數現有的醫療問答基準 -評估複雜推理或多跳躍推論,這使得難以將 -LLM 內在的醫學知識從其推理能力中分離出來。鑑於 -醫療應用具有高風險,其中不正確的資訊可能會 -造成嚴重後果,因此評估 LLM 編碼、 -保留和回憶基本醫學事實的能力至關重要。 -為了彌合這一差距,我們引入了醫學知識判斷,這是一個專門設計用於測量 LLM 的一跳事實醫學知識的數據集。MKJ -是由統一醫學語言系統 (UMLS) 構建的,UMLS 是標準化生物醫學詞彙和知識圖譜的大型庫。我們 -將知識評估構建為二元判斷任務,要求 LLM 驗證從可靠且結構化的 -知識來源中提取的醫學陳述的正確性。 -我們的實驗表明,LLM 難以保留事實醫學知識,在不同的 -語義類別中表現出顯著的性能差異,特別是對於罕見的醫療狀況。此外, -LLM 表現出校準不佳,通常對不正確的答案過於自信。為了 -減輕這些問題,我們探索了檢索增強生成,證明了其在提高事實準確性和降低不確定性方面的有效性 -在醫療決策制定中。 +摘要:許多敏感領域(例如臨床領域)由於隱私風險而缺乏廣泛可用的資料集。大型語言模型 (LLM) 不斷增強的生成能力已使合成資料集成為可行的途徑。在這項研究中,我們將領域適應 LLM 應用於臨床領域,並生成使用具備編碼器功能的 NER 模型以個人可識別資訊標籤進行機器標註的合成臨床文本。然後使用合成語料庫來訓練合成 NER 模型。結果顯示,使用合成語料庫訓練 NER 模型僅會導致預測效能略微下降。在系統消融研究中調查此程序的限制,同時使用瑞典語和西班牙語資料。我們的分析顯示,較小的資料集足以用於領域適應 LLM 以進行資料合成。相反地,此程序的有效性幾乎完全取決於使用原始資料訓練的機器標註 NER 模型的效能。 -##### **Mitigating Lost-in-Retrieval Problems in Retrieval Augmented Multi-Hop Question Answering** -2502.14245v1 by Rongzhi Zhu, Xiangyu Liu, Zequn Sun, Yiwei Wang, Wei Hu +##### **ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation** +2502.14637v1 by Angxiao Yue, Zichong Wang, Hongteng Xu -In this paper, we identify a critical problem, "lost-in-retrieval", in -retrieval-augmented multi-hop question answering (QA): the key entities are -missed in LLMs' sub-question decomposition. "Lost-in-retrieval" significantly -degrades the retrieval performance, which disrupts the reasoning chain and -leads to the incorrect answers. To resolve this problem, we propose a -progressive retrieval and rewriting method, namely ChainRAG, which sequentially -handles each sub-question by completing missing key entities and retrieving -relevant sentences from a sentence graph for answer generation. Each step in -our retrieval and rewriting process builds upon the previous one, creating a -seamless chain that leads to accurate retrieval and answers. 
Finally, all -retrieved sentences and sub-question answers are integrated to generate a -comprehensive answer to the original question. We evaluate ChainRAG on three -multi-hop QA datasets$\unicode{x2013}$MuSiQue, 2Wiki, and -HotpotQA$\unicode{x2013}$using three large language models: GPT4o-mini, -Qwen2.5-72B, and GLM-4-Plus. Empirical results demonstrate that ChainRAG -consistently outperforms baselines in both effectiveness and efficiency. +Protein backbone generation plays a central role in de novo protein design +and is significant for many biological and medical applications. Although +diffusion and flow-based generative models provide potential solutions to this +challenging task, they often generate proteins with undesired designability and +suffer computational inefficiency. In this study, we propose a novel rectified +quaternion flow (ReQFlow) matching method for fast and high-quality protein +backbone generation. In particular, our method generates a local translation +and a 3D rotation from random noise for each residue in a protein chain, which +represents each 3D rotation as a unit quaternion and constructs its flow by +spherical linear interpolation (SLERP) in an exponential format. We train the +model by quaternion flow (QFlow) matching with guaranteed numerical stability +and rectify the QFlow model to accelerate its inference and improve the +designability of generated protein backbones, leading to the proposed ReQFlow +model. Experiments show that ReQFlow achieves state-of-the-art performance in +protein backbone generation while requiring much fewer sampling steps and +significantly less inference time (e.g., being 37x faster than RFDiffusion and +62x faster than Genie2 when generating a backbone of length 300), demonstrating +its effectiveness and efficiency. The code is available at +https://github.com/AngxiaoYue/ReQFlow. 
-摘要:在本文中,我們在檢索增強的多跳問答 (QA) 中發現了一個關鍵問題「檢索中遺失」,關鍵實體遺失在 LLM 的子問題分解中。「檢索中遺失」顯著降低檢索效能,這會中斷推理鏈並導致錯誤的答案。為了解決此問題,我們提出了一種漸進式檢索和重寫方法,即 ChainRAG,它通過完成遺失的關鍵實體並從句子圖中檢索相關句子來順序處理每個子問題以產生答案。我們檢索和重寫過程中每一步都建立在前一步之上,創造了一個無縫的鏈,導致準確的檢索和答案。最後,所有檢索到的句子和子問題答案都整合起來,以產生對原始問題的全面答案。我們在三個多跳問答資料集$\unicode{x2013}$MuSiQue、2Wiki 和 HotpotQA$\unicode{x2013}$上評估 ChainRAG,使用三個大型語言模型:GPT4o-mini、Qwen2.5-72B 和 GLM-4-Plus。實證結果表明,ChainRAG 在有效性和效率方面都持續優於基準。 +摘要:蛋白骨架生成在從頭蛋白質設計中扮演核心角色,且對於許多生物和醫學應用來說意義重大。儘管擴散和基於流的生成模型提供了解決此項挑戰性任務的潛在方案,但它們經常生成具有不受歡迎的可設計性的蛋白質,且遭受運算效率不彰之苦。在本研究中,我們提出了一種新穎的修正四元數流 (ReQFlow) 匹配方法,用於快速且高品質的蛋白質骨架生成。特別是,我們的模型會為蛋白質鏈中的每個殘基從隨機雜訊中生成一個局部平移和一個 3D 旋轉,將每個 3D 旋轉表示為單位四元數,並以指數格式透過球面線性插值 (SLERP) 建構其流。我們透過四元數流 (QFlow) 匹配訓練模型,並保證數值穩定性,並修正 QFlow 模型以加速其推論並改善生成蛋白質骨架的可設計性,進而提出建議的 ReQFlow 模型。實驗顯示,ReQFlow 在蛋白質骨架生成中達成最先進的效能,同時所需採樣步驟少得多,且推論時間大幅減少(例如,在生成長度為 300 的骨架時比 RFDiffusion 快 37 倍,比 Genie2 快 62 倍),證明其有效性和效率。程式碼可在 https://github.com/AngxiaoYue/ReQFlow 取得。 -##### **NLP-AKG: Few-Shot Construction of NLP Academic Knowledge Graph Based on LLM** -2502.14192v1 by Jiayin Lan, Jiaqi Li, Baoxin Wang, Ming Liu, Dayong Wu, Shijin Wang, Bing Qin +##### **MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models** +2502.14302v1 by Shrey Pandit, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, Ying Ding -Large language models (LLMs) have been widely applied in question answering -over scientific research papers. To enhance the professionalism and accuracy of -responses, many studies employ external knowledge augmentation. However, -existing structures of external knowledge in scientific literature often focus -solely on either paper entities or domain concepts, neglecting the intrinsic -connections between papers through shared domain concepts. This results in less -comprehensive and specific answers when addressing questions that combine -papers and concepts. 
To address this, we propose a novel knowledge graph -framework that captures deep conceptual relations between academic papers, -constructing a relational network via intra-paper semantic elements and -inter-paper citation relations. Using a few-shot knowledge graph construction -method based on LLM, we develop NLP-AKG, an academic knowledge graph for the -NLP domain, by extracting 620,353 entities and 2,271,584 relations from 60,826 -papers in ACL Anthology. Based on this, we propose a 'sub-graph community -summary' method and validate its effectiveness on three NLP scientific -literature question answering datasets. +Advancements in Large Language Models (LLMs) and their increasing use in +medical question-answering necessitate rigorous evaluation of their +reliability. A critical challenge lies in hallucination, where models generate +plausible yet factually incorrect outputs. In the medical domain, this poses +serious risks to patient safety and clinical decision-making. To address this, +we introduce MedHallu, the first benchmark specifically designed for medical +hallucination detection. MedHallu comprises 10,000 high-quality question-answer +pairs derived from PubMedQA, with hallucinated answers systematically generated +through a controlled pipeline. Our experiments show that state-of-the-art LLMs, +including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, +struggle with this binary hallucination detection task, with the best model +achieving an F1 score as low as 0.625 for detecting "hard" category +hallucinations. Using bidirectional entailment clustering, we show that +harder-to-detect hallucinations are semantically closer to ground truth. +Through experiments, we also show incorporating domain-specific knowledge and +introducing a "not sure" category as one of the answer categories improves the +precision and F1 scores by up to 38% relative to baselines. 
-摘要:大型语言模型 (LLM) 已广泛应用于科学研究论文的问答中。为了提高响应的专业性和准确性,许多研究采用外部知识增强。然而,科学文献中现有外部知识的结构通常仅关注论文实体或领域概念,而忽略了论文之间通过共享领域概念而形成的内在联系。这导致在解决结合论文和概念的问题时,答案不够全面和具体。为了解决这个问题,我们提出了一种新颖的知识图谱框架,该框架捕获了学术论文之间的深层概念关系,通过论文内部语义元素和论文之间的引用关系构建关系网络。我们使用基于 LLM 的少量知识图谱构建方法,从 ACL Anthology 中的 60,826 篇论文中提取了 620,353 个实体和 2,271,584 个关系,开发了 NLP 领域的学术知识图谱 NLP-AKG。在此基础上,我们提出了一种“子图社区摘要”方法,并在三个 NLP 科学文献问答数据集上验证了其有效性。 +摘要:大型語言模型 (LLM) 的進步及其在醫療問答中的使用日益增加,因此需要嚴格評估其可靠性。一個關鍵的挑戰在於幻覺,模型會產生看似合理但事實上不正確的輸出。在醫療領域,這對患者安全和臨床決策構成嚴重風險。為了解決此問題,我們推出了 MedHallu,這是第一個專門設計用於檢測醫療幻覺的基準。MedHallu 包含 10,000 個從 PubMedQA 衍生的高品質問答對,並透過受控管道系統性地產生幻覺答案。我們的實驗顯示,包括 GPT-4o、Llama-3.1 和經過醫學微調的 UltraMedical 在內的最新 LLM 難以執行這個二元幻覺檢測任務,最佳模型在檢測「困難」類別幻覺時達到的 F1 分數低至 0.625。使用雙向蘊涵聚類,我們表明較難檢測的幻覺在語義上更接近真實。透過實驗,我們還表明,納入特定領域的知識並將「不確定」類別作為其中一個答案類別,可以將精確度和 F1 分數相對於基線提高多達 38%。 -##### **Object-centric Binding in Contrastive Language-Image Pretraining** -2502.14113v1 by Rim Assouel, Pietro Astolfi, Florian Bordes, Michal Drozdzal, Adriana Romero-Soriano +##### **EyeBench: A Call for More Rigorous Evaluation of Retinal Image Enhancement** +2502.14260v1 by Wenhui Zhu, Xuanzhao Dong, Xin Li, Yujian Xiong, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Zhangsihao Yang, Yi Su, Oana Dumitrascu, Yalin Wang -Recent advances in vision language models (VLM) have been driven by -contrastive models such as CLIP, which learn to associate visual information -with their corresponding text descriptions. However, these models have -limitations in understanding complex compositional scenes involving multiple -objects and their spatial relationships. To address these challenges, we -propose a novel approach that diverges from commonly used strategies, which -rely on the design of hard-negative augmentations. Instead, our work focuses on -integrating inductive biases into pre-trained CLIP-like models to improve their -compositional understanding without using any additional hard-negatives. 
To -that end, we introduce a binding module that connects a scene graph, derived -from a text description, with a slot-structured image representation, -facilitating a structured similarity assessment between the two modalities. We -also leverage relationships as text-conditioned visual constraints, thereby -capturing the intricate interactions between objects and their contextual -relationships more effectively. Our resulting model not only enhances the -performance of CLIP-based models in multi-object compositional understanding -but also paves the way towards more accurate and sample-efficient image-text -matching of complex scenes. +Over the past decade, generative models have achieved significant success in +enhancing fundus images. However, the evaluation of these models still +presents a considerable challenge. A comprehensive evaluation benchmark for +fundus image enhancement is indispensable for three main reasons: 1) The +existing denoising metrics (e.g., PSNR, SSIM) are hard to extend to +downstream real-world clinical research (e.g., vessel morphology consistency). +2) There is a lack of comprehensive evaluation for both paired and unpaired +enhancement methods, along with the need for expert protocols to accurately +assess clinical value. 3) An ideal evaluation system should provide insights to +inform future developments of fundus image enhancement. To this end, we propose +a novel comprehensive benchmark, EyeBench, to provide insights that align +enhancement models with clinical needs, offering a foundation for future work +to improve the clinical relevance and applicability of generative models for +fundus image enhancement. EyeBench has three appealing properties: 1) +multi-dimensional clinical alignment downstream evaluation: In addition to +evaluating the enhancement task, we provide several clinically significant +downstream tasks for fundus images, including vessel segmentation, DR grading, +denoising generalization, and lesion segmentation. 
2) Medical expert-guided +evaluation design: We introduce a novel dataset that promotes comprehensive and +fair comparisons between paired and unpaired methods and includes a manual +evaluation protocol by medical experts. 3) Valuable insights: Our benchmark +study provides a comprehensive and rigorous evaluation of existing methods +across different downstream tasks, assisting medical experts in making informed +choices. Additionally, we offer further analysis of the challenges faced by +existing methods. The code is available at +\url{https://github.com/Retinal-Research/EyeBench} -摘要:最近视觉语言模型 (VLM) 的进步是由对比模型(例如 CLIP)推动的,该模型学习将视觉信息与其对应的文本描述联系起来。然而,这些模型在理解涉及多个对象及其空间关系的复杂组合场景方面存在局限性。为了应对这些挑战,我们提出了一种新颖的方法,它偏离了常用的策略,即依赖于硬负增强设计。相反,我们的工作重点是将归纳偏差集成到预训练的类似 CLIP 的模型中,以提高其组合理解能力,而无需使用任何其他硬否定。为此,我们引入了一个绑定模块,它将从文本描述中派生的场景图与槽结构图像表示连接起来,从而促进了两种模式之间的结构化相似性评估。我们还利用关系作为文本条件的视觉约束,从而更有效地捕捉对象及其上下文关系之间的复杂交互。我们由此产生的模型不仅增强了基于 CLIP 的模型在多对象组合理解中的性能,而且还为复杂场景的更准确和样本高效的图像文本匹配铺平了道路。 +摘要:在過去的十年中,生成模型在增強眼底影像方面取得了顯著的成功。然而,這些模型的評估仍然是一個相當大的挑戰。一個全面的眼底影像增強評估基準對於三個主要原因是不可或缺的:1) 現有的去噪指標(例如 PSNR、SSIM)很難擴展到下游的真實世界臨床研究(例如血管形態一致性)。2) 缺乏對配對和非配對增強方法的全面評估,以及需要專家協議來準確評估臨床價值。3) 一個理想的評估系統應該提供見解,以告知眼底影像增強的未來發展。為此,我們提出了一個新的綜合基準 EyeBench,以提供見解,將增強模型與臨床需求相結合,為未來的研究奠定基礎,以提高生成模型在眼底影像增強方面的臨床相關性和適用性。EyeBench 有三個吸引人的特性:1) 多維臨床對齊下游評估:除了評估增強任務外,我們還為眼底影像提供了幾個臨床上重要的下游任務,包括血管分割、DR 分級、去噪泛化和病灶分割。2) 醫學專家指導的評估設計:我們引入了一個新的數據集,以促進對配對和非配對方法的全面和公平比較,並包括由醫學專家進行的手動評估協議。3) 有價值的見解:我們的基準研究提供了對現有方法在不同下游任務中的全面且嚴格的評估,協助醫學專家做出明智的選擇。此外,我們還進一步分析了現有方法面臨的挑戰。程式碼可在 \url{https://github.com/Retinal-Research/EyeBench} 獲得 ##### **Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning** 2502.14086v1 by Cole Gawin, Yidan Sun, Mayank Kejriwal @@ -7975,58 +7904,81 @@ selective retrieval, for obtaining better performance. 
摘要:大型語言模型 (LLM) 在生成類人文本和解決中等複雜度推理任務方面取得了顯著的成果,例如問答和數學問題解決。然而,它們在需要更深層認知技能的任務中的能力,例如常識理解和抽象推理,仍然處於探索不足的階段。在本文中,我們使用 ConceptNet 知識圖系統地評估了 LLM 中的抽象常識推理。我們提出了兩種提示方法:指導提示,其中模型根據提供的定義預測合理的語義關係,以及少次提示,其中模型使用示例作為指導來識別關係。我們使用 gpt-4o-mini 模型進行的實驗表明,在指導提示中,在對多個關係進行排名時獲得了一致的性能,但在模型僅限於預測一個關係時大幅下降。在少次提示中,模型在從五個關係中選擇而不是從完整集合中選擇時,其準確性顯著提高,儘管對某些關係存在顯著偏差。這些結果表明,與人類層面的理解相比,即使在商業使用的 LLM 中,抽象常識推理能力仍然存在顯著差距。然而,這些發現也強調了基於選擇性檢索的仔細提示工程的希望,以獲得更好的性能。 -##### **Neurosymbolic artificial intelligence via large language models and coherence-driven inference** -2502.13953v1 by Steve Huntsman, Jewell Thomas +##### **Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging** +2502.14064v1 by Shansong Wang, Mojtaba Safari, Qiang Li, Chih-Wei Chang, Richard LJ Qiu, Justin Roper, David S. Yu, Xiaofeng Yang -We devise an algorithm to generate sets of propositions that objectively -instantiate graphs that support coherence-driven inference. We then benchmark -the ability of large language models (LLMs) to reconstruct coherence graphs -from (a straightforward transformation of) propositions expressed in natural -language, with promising results from a single prompt to models optimized for -reasoning. Combining coherence-driven inference with consistency evaluations by -neural models may advance the state of the art in machine cognition. +Vision foundation models (VFMs) are pre-trained on extensive image datasets +to learn general representations for diverse types of data. These models can +subsequently be fine-tuned for specific downstream tasks, significantly +boosting performance across a broad range of applications. However, existing +vision foundation models that claim to be applicable to various radiology tasks +are mostly pre-trained on 3D computed tomography (CT), which benefits from the +availability of extensive 3D CT databases. 
Significant differences between CT +and magnetic resonance imaging (MRI) in imaging principles, signal +characteristics, and data distribution may hinder their practical performance +and versatility in MRI-specific applications. Here, we propose Triad, a vision +foundation model for 3D MRI. Triad adopts a widely used autoencoder +architecture to learn robust representations from 131,170 3D MRI volumes and +uses organ-independent imaging descriptions to constrain the semantic +distribution of the visual modality. The above pre-training dataset is called +Triad-131K, which is currently the largest 3D MRI pre-training dataset. We +evaluate Triad across three tasks, namely, organ/tumor segmentation, +organ/cancer classification, and medical image registration, in two data +modalities (within-domain and out-of-domain) settings using 25 downstream +datasets. By initializing models with Triad's pre-trained weights, nnUNet-Triad +improves segmentation performance by 6.88% compared to nnUNet-Scratch across 17 +datasets. Swin-B-Triad achieves a 3.97% improvement over Swin-B-Scratch in +classification tasks across five datasets. SwinUNETR-Triad improves by 4.00% +compared to SwinUNETR-Scratch in registration tasks across two datasets. Our +study demonstrates that pre-training can maximize performance when the data +modalities and organs of upstream and downstream tasks are consistent. 
-摘要:我們設計一種演算法,用來產生命題集合,以客觀地實例化支援連貫性驅動推論的圖形。接著,我們基準化大型語言模型 (LLM) 從以自然語言表達的命題(經過直接轉換)重建連貫性圖形的能力,結果顯示,單一提示就能從最佳化用於推理的模型中獲得有希望的結果。將連貫性驅動推論與神經模型的一致性評估結合起來,可能會提升機器認知的現有技術。 +摘要:視覺基礎模型 (VFM) 在廣泛的影像資料集上進行預訓練,以學習各種資料類型的通用表示。這些模型隨後可以針對特定的下游任務進行微調,大幅提升各種應用程式的效能。然而,現有的視覺基礎模型聲稱適用於各種放射學任務,但大多是針對 3D 電腦斷層攝影 (CT) 進行預訓練,這得利於廣泛的 3D CT 資料庫。CT 和磁振造影 (MRI) 在影像原理、訊號特性和資料分佈上的顯著差異,可能會阻礙其在 MRI 特定應用中的實際效能和多功能性。在此,我們提出 Triad,一個適用於 3D MRI 的視覺基礎模型。Triad 採用廣泛使用的自動編碼器架構,從 131,170 個 3D MRI 體積中學習穩健的表示,並使用與器官無關的影像描述來約束視覺模式的語義分佈。上述預訓練資料集稱為 Triad-131K,目前是最大的 3D MRI 預訓練資料集。我們在三個任務中評估 Triad,即器官/腫瘤分割、器官/癌症分類和醫學影像配準,在兩個資料模式(域內和域外)設定中使用 25 個下游資料集。透過使用 Triad 的預訓練權重初始化模型,nnUNet-Triad 在 17 個資料集中的分割效能比 nnUNet-Scratch 提升了 6.88%。Swin-B-Triad 在五個資料集的分類任務中,比 Swin-B-Scratch 提升了 3.97%。SwinUNETR-Triad 在兩個資料集的配準任務中,比 SwinUNETR-Scratch 提升了 4.00%。我們的研究證明,當上游和下游任務的資料模式和器官一致時,預訓練可以最大化效能。 -##### **Complex Ontology Matching with Large Language Model Embeddings** -2502.13619v1 by Guilherme Sousa, Rinaldo Lima, Cassia Trojahn +##### **VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare** +2502.13775v1 by Anudeex Shetty, Amin Beheshti, Mark Dras, Usman Naseem -Ontology, and more broadly, Knowledge Graph Matching is a challenging task in -which expressiveness has not been fully addressed. Despite the increasing use -of embeddings and language models for this task, approaches for generating -expressive correspondences still do not take full advantage of these models, in -particular, large language models (LLMs). This paper proposes to integrate LLMs -into an approach for generating expressive correspondences based on alignment -need and ABox-based relation discovery. The generation of correspondences is -performed by matching similar surroundings of instance sub-graphs. The -integration of LLMs results in different architectural modifications, including -label similarity, sub-graph matching, and entity matching. The performance word -embeddings, sentence embeddings, and LLM-based embeddings, was compared. 
The -results demonstrate that integrating LLMs surpasses all other models, enhancing -the baseline version of the approach with a 45\% increase in F-measure. +Alignment techniques have become central to ensuring that Large Language +Models (LLMs) generate outputs consistent with human values. However, existing +alignment paradigms often model an averaged or monolithic preference, failing +to account for the diversity of perspectives across cultures, demographics, and +communities. This limitation is particularly critical in health-related +scenarios, where plurality is essential due to the influence of culture, +religion, personal values, and conflicting opinions. Despite progress in +pluralistic alignment, no prior work has focused on health, likely due to the +unavailability of publicly available datasets. To address this gap, we +introduce VITAL, a new benchmark dataset comprising 13.1K value-laden +situations and 5.4K multiple-choice questions focused on health, designed to +assess and benchmark pluralistic alignment methodologies. Through extensive +evaluation of eight LLMs of varying sizes, we demonstrate that existing +pluralistic alignment techniques fall short in effectively accommodating +diverse healthcare beliefs, underscoring the need for tailored AI alignment in +specific domains. This work highlights the limitations of current approaches +and lays the groundwork for developing health-specific alignment solutions. 
-摘要:本体论,更广泛地说,知识图谱匹配是一项具有挑战性的任务,其中表达力尚未得到充分解决。尽管越来越多地使用嵌入和语言模型来完成此任务,但生成表达性对应关系的方法仍然没有充分利用这些模型,特别是大型语言模型 (LLM)。本文提出将 LLM 集成到一种基于对齐需求和基于 ABox 的关系发现来生成表达性对应关系的方法中。对应关系的生成是通过匹配实例子图的相似周围环境来执行的。LLM 的集成导致了不同的架构修改,包括标签相似性、子图匹配和实体匹配。比较了单词嵌入、句子嵌入和基于 LLM 的嵌入的性能。结果表明,集成 LLM 超越了所有其他模型,通过 F-measure 提高了 45% 的基准版本的方法。 +摘要:對齊技術已成為確保大型語言模型 (LLM) 產生與人類價值觀一致的輸出的核心。然而,現有的對齊範例通常會建模平均或單一的偏好,無法考量跨文化、人口統計和社群的不同觀點。此限制在與健康相關的場景中特別重要,因為在這種場景中,由於文化、宗教、個人價值觀和相互衝突的意見的影響,多元性是必要的。儘管多元對齊已取得進展,但沒有任何先前的工作專注於健康,這可能是因為缺乏公開可用的資料集。為了解決此差距,我們引入了 VITAL,這是一個新的基準資料集,包含 13.1K 個價值觀念的情境和 5.4K 個選擇題,專注於健康,旨在評估和基準多元對齊方法。透過對八個不同規模的 LLM 進行廣泛評估,我們證明現有的多元對齊技術無法有效適應不同的醫療保健信念,這強調了在特定領域中需要量身打造的 AI 對齊。這項工作突顯了當前方法的限制,並為開發特定於健康的對齊解決方案奠定了基礎。 -##### **Are Large Language Models In-Context Graph Learners?** -2502.13562v1 by Jintang Li, Ruofan Wu, Yuchang Zhu, Huizhe Zhang, Liang Chen, Zibin Zheng +##### **PeerQA: A Scientific Question Answering Dataset from Peer Reviews** +2502.13668v1 by Tim Baumgärtner, Ted Briscoe, Iryna Gurevych -Large language models (LLMs) have demonstrated remarkable in-context -reasoning capabilities across a wide range of tasks, particularly with -unstructured inputs such as language or images. However, LLMs struggle to -handle structured data, such as graphs, due to their lack of understanding of -non-Euclidean structures. As a result, without additional fine-tuning, their -performance significantly lags behind that of graph neural networks (GNNs) in -graph learning tasks. In this paper, we show that learning on graph data can be -conceptualized as a retrieval-augmented generation (RAG) process, where -specific instances (e.g., nodes or edges) act as queries, and the graph itself -serves as the retrieved context. Building on this insight, we propose a series -of RAG frameworks to enhance the in-context learning capabilities of LLMs for -graph learning tasks. 
Comprehensive evaluations demonstrate that our proposed -RAG frameworks significantly improve LLM performance on graph-based tasks, -particularly in scenarios where a pretrained LLM must be used without -modification or accessed via an API. +We present PeerQA, a real-world, scientific, document-level Question +Answering (QA) dataset. PeerQA questions have been sourced from peer reviews, +which contain questions that reviewers raised while thoroughly examining the +scientific article. Answers have been annotated by the original authors of each +paper. The dataset contains 579 QA pairs from 208 academic articles, with a +majority from ML and NLP, as well as a subset from other scientific communities +such as Geoscience and Public Health. PeerQA supports three critical tasks for +developing practical QA systems: evidence retrieval, unanswerable question +classification, and answer generation. We provide a detailed analysis of the +collected dataset and conduct experiments establishing baseline systems for all +three tasks. Our experiments and analyses reveal the need for +decontextualization in document-level retrieval, where we find that even simple +decontextualization approaches consistently improve retrieval performance +across architectures. On answer generation, PeerQA serves as a challenging +benchmark for long-context modeling, as the papers have an average size of 12k +tokens. Our code and data are available at https://github.com/UKPLab/peerqa. 
-摘要:大型語言模型 (LLM) 在廣泛的任務中展示了非凡的語境推理能力,特別是對於語言或影像等非結構化輸入。然而,LLM 難以處理結構化資料,例如圖形,因為它們無法理解非歐幾何結構。因此,在沒有額外微調的情況下,它們在圖形學習任務中的表現遠遠落後於圖形神經網路 (GNN)。在本文中,我們展示了在圖形資料上學習可以被概念化為檢索增強生成 (RAG) 過程,其中特定實例(例如,節點或邊)充當查詢,而圖形本身則作為檢索的語境。基於這個見解,我們提出了一系列 RAG 架構,以增強 LLM 在圖形學習任務中的語境學習能力。全面的評估表明,我們提出的 RAG 架構顯著提升了 LLM 在基於圖形的任務上的表現,特別是在預訓練的 LLM 必須在不修改或透過 API 存取的情況下使用的場景中。 +摘要:我們提出 PeerQA,一個真實世界、科學的、文件層級的問答 (QA) 資料集。PeerQA 問題來自於同行評審,其中包含審查者在徹底審查科學文章時提出的問題。答案是由每篇論文的原始作者註解的。此資料集包含來自 208 篇學術文章的 579 個 QA 對,其中大部分來自 ML 和 NLP,以及其他科學社群(例如地球科學和公共衛生)的子集。PeerQA 支援開發實用 QA 系統的三項重要任務:證據檢索、無解答問題分類和答案產生。我們提供收集到的資料集的詳細分析,並進行實驗,為所有三項任務建立基準系統。我們的實驗和分析揭示了在文件層級檢索中去脈絡化的必要性,我們發現即使是簡單的去脈絡化方法也能持續改善跨架構的檢索效能。在答案產生方面,PeerQA 是一個用於長脈絡建模的具挑戰性基準,因為論文的平均大小為 12k 個符號。我們的程式碼和資料可於 https://github.com/UKPLab/peerqa 取得。 ##### **Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs** 2502.13555v1 by Yushi Feng, Tsai Hor Chan, Guosheng Yin, Lequan Yu @@ -8056,550 +8008,583 @@ knowledge, leading to enhanced predictive performance and interpretability. 
摘要:由於圖表資料的稀少性和雜訊,資料擴充對於圖表表示學習來說是必要的。現有的擴充方法大多忽略了從資料集中繼承的背景資訊,因為它們僅依賴於圖表的結構進行擴充。儘管一些大型語言模型 (LLM) 基於圖表學習方法獲得成功,但它們大多是白盒,需要存取開放式 LLM 的權重或潛在特徵,由於現有的 LLM 主要基於商業考量而封閉原始碼,因此難以讓所有人都能使用。為了克服這些限制,我們提出了一個黑盒背景驅動圖表資料擴充方法,在 LLM 的指導下——DemoGraph。利用文字提示作為與背景相關的資訊,我們讓 LLM 產生知識圖譜 (KG),這讓我們能夠從文字輸出中擷取結構化互動。然後,我們設計了一個動態合併模式,在訓練期間將 LLM 產生的 KG 隨機整合到原始圖表中。為了控制擴充圖表的稀疏性,我們進一步設計了一個粒度感知提示策略和一個指令微調模組,它可以根據資料集的不同粒度層級無縫產生文字提示。在各種圖表學習任務上的大量實驗驗證了我們的方法比現有的圖表資料擴充方法更有效。值得注意的是,我們的做法在涉及電子健康記錄 (EHR) 的場景中表現出色,這驗證了它對上下文知識的最大利用,從而提高了預測效能和可解釋性。 -##### **PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference** -2502.13502v1 by Burc Gokden +##### **MobileViM: A Light-weight and Dimension-independent Vision Mamba for 3D Medical Image Analysis** +2502.13524v1 by Wei Dai, Steven Wang, Jun Liu -We show that Large Language Model from Power Law Decoder Representations -(PLDR-LLM) is a foundational model whose deductive outputs are invariant -tensors up to a small perturbation. PLDR-LLM learns a singularity condition for -the deductive outputs that enable the once-inferred energy-curvature tensor -$\mathbf{G}_{LM}$ to replace the deep neural network of power law graph -attention (PLGA) generating the deductive outputs at inference. We demonstrate -that a cache for $\mathbf{G}_{LM}$ (G-cache) and KV-cache can be implemented in -a straightforward manner to improve the inference time. The invariance and -generalizable nature of deductive outputs is at a very high fidelity where -deductive outputs have same RMSE and determinant values up to 15 decimal places -after caching, and zero-shot benchmark scores remain unchanged. Ablation -studies show that learned deductive outputs have distinct loss and accuracy -characteristics from models pretrained with transferred, randomly initialized -or identity tensors as a constant tensor operator and an LLM with scaled-dot -product attention (SDPA) is a special case of PLDR-LLM where $\mathbf{G}_{LM}$ -is predefined as identity. 
The observed invariance characteristic introduces a -novel asymmetry between training and inference phases with caching. We outline -observed common characteristics of the deductive outputs for the learned -singularity condition. We provide an implementation of a training and inference -framework for PLDR-LLM with KV-cache and G-cache. +Efficient evaluation of three-dimensional (3D) medical images is crucial for +diagnostic and therapeutic practices in healthcare. Recent years have seen a +substantial uptake in applying deep learning and computer vision to analyse and +interpret medical images. Traditional approaches, such as convolutional neural +networks (CNNs) and vision transformers (ViTs), face significant computational +challenges, prompting the need for architectural advancements. Recent efforts +have led to the introduction of novel architectures like the ``Mamba'' model as +alternative solutions to traditional CNNs or ViTs. The Mamba model excels in +the linear processing of one-dimensional data with low computational demands. +However, Mamba's potential for 3D medical image analysis remains underexplored +and could face significant computational challenges as the dimension increases. +This manuscript presents MobileViM, a streamlined architecture for efficient +segmentation of 3D medical images. In the MobileViM network, we invent a new +dimension-independent mechanism and a dual-direction traversing approach to +incorporate with a vision-Mamba-based framework. MobileViM also features a +cross-scale bridging technique to improve efficiency and accuracy across +various medical imaging modalities. With these enhancements, MobileViM achieves +segmentation speeds exceeding 90 frames per second (FPS) on a single graphics +processing unit (i.e., NVIDIA RTX 4090). This performance is over 24 FPS faster +than the state-of-the-art deep learning models for processing 3D images with +the same computational resources. 
In addition, experimental evaluations +demonstrate that MobileViM delivers superior performance, with Dice similarity +scores reaching 92.72%, 86.69%, 80.46%, and 77.43% for PENGWIN, BraTS2024, +ATLAS, and Toothfairy2 datasets, respectively, which significantly surpasses +existing models. -摘要:我們展示了來自冪律解碼器表示 (PLDR-LLM) 的大型語言模型是一個基礎模型,其演繹輸出是直到一個小擾動的不變張量。PLDR-LLM 學習演繹輸出的奇異條件,使曾經推斷出的能量曲率張量 $\mathbf{G}_{LM}$ 能夠取代產生演繹輸出的冪律圖注意力 (PLGA) 深度神經網路,進行推論。我們證明了 $\mathbf{G}_{LM}$ 快取 (G 快取) 和 KV 快取能夠以一種直接的方式實作,以改善推論時間。演繹輸出的不變性和可概化性質具有非常高的保真度,其中演繹輸出在快取後具有相同的 RMSE 和行列式值,直到小數點後 15 位,且零次學習基準分數保持不變。消融研究表明,學習的演繹輸出具有與使用轉移、隨機初始化或恆等張量作為常數張量算子和具有縮放點積注意力的 LLM 預先訓練的模型不同的損失和準確性特徵,並且 $\mathbf{G}_{LM}$ 被預先定義為恆等的 PLDR-LLM 的一個特例,其中 $\mathbf{G}_{LM}$ 被預先定義為恆等。觀察到的不變特徵引入了訓練和推論階段之間一個新的不對稱性,並帶有快取。我們概述了學習的奇異條件演繹輸出的觀察到的共同特徵。我們提供了一個具有 KV 快取和 G 快取的 PLDR-LLM 訓練和推論框架的實作。 +摘要:有效評估三維 (3D) 醫學影像對於醫療保健中的診斷和治療實務至關重要。近年來,將深度學習和電腦視覺應用於分析和詮釋醫學影像的應用大幅增加。傳統方法,例如卷積神經網路 (CNN) 和視覺Transformer (ViT),面臨重大的運算挑戰,促使需要架構上的進步。最近的努力已導致引進創新的架構,例如「Mamba」模型,作為傳統 CNN 或 ViT 的替代解決方案。Mamba 模型擅長以低運算需求進行一維資料的線性處理。然而,Mamba 在 3D 醫學影像分析方面的潛力仍未被充分探索,並且隨著維度的增加可能會面臨重大的運算挑戰。本手稿提出 MobileViM,這是一種簡化的架構,可有效分割 3D 醫學影像。在 MobileViM 網路中,我們發明了一種新的與維度無關的機制和雙向遍歷方法,以與基於視覺 Mamba 的架構結合。MobileViM 還具備跨尺度橋接技術,以提高各種醫學影像模式的效率和準確性。透過這些增強功能,MobileViM 在單一顯示卡 (即 NVIDIA RTX 4090) 上達到了每秒超過 90 幀 (FPS) 的分割速度。此效能比現有最先進的深度學習模型快了超過 24 FPS,這些模型使用相同的運算資源處理 3D 影像。此外,實驗評估證明 MobileViM 提供了卓越的效能,Dice 相似性評分對於 PENGWIN、BraTS2024、ATLAS 和 Toothfairy2 資料集分別達到 92.72%、86.69%、80.46% 和 77.43%,顯著超越現有模型。 -##### **Explore-Construct-Filter: An Automated Framework for Rich and Reliable API Knowledge Graph Construction** -2502.13412v1 by Yanbang Sun, Qing Huang, Xiaoxue Ren, Zhenchang Xing, Xiaohong Li, Junjie Wang +##### **Unlocking Multimodal Integration in EHRs: A Prompt Learning Framework for Language and Time Series Fusion** +2502.13509v1 by Shuai Niu, Jing Ma, Hongzhan Lin, Liang Bai, Zhihua Wang, Wei Bi, Yida Xu, Guo Li, Xian Yang -The API Knowledge Graph (API KG) is a structured network that 
models API -entities and their relations, providing essential semantic insights for tasks -such as API recommendation, code generation, and API misuse detection. However, -constructing a knowledge-rich and reliable API KG presents several challenges. -Existing schema-based methods rely heavily on manual annotations to design KG -schemas, leading to excessive manual overhead. On the other hand, schema-free -methods, due to the lack of schema guidance, are prone to introducing noise, -reducing the KG's reliability. To address these issues, we propose the -Explore-Construct-Filter framework, an automated approach for API KG -construction based on large language models (LLMs). This framework consists of -three key modules: 1) KG exploration: LLMs simulate the workflow of annotators -to automatically design a schema with comprehensive type triples, minimizing -human intervention; 2) KG construction: Guided by the schema, LLMs extract -instance triples to construct a rich yet unreliable API KG; 3) KG filtering: -Removing invalid type triples and suspicious instance triples to construct a -rich and reliable API KG. Experimental results demonstrate that our method -surpasses the state-of-the-art method, achieving a 25.2% improvement in F1 -score. Moreover, the Explore-Construct-Filter framework proves effective, with -the KG exploration module increasing KG richness by 133.6% and the KG filtering -module improving reliability by 26.6%. Finally, cross-model experiments confirm -the generalizability of our framework. +Large language models (LLMs) have shown remarkable performance in +vision-language tasks, but their application in the medical field remains +underexplored, particularly for integrating structured time series data with +unstructured clinical notes. In clinical practice, dynamic time series data +such as lab test results capture critical temporal patterns, while clinical +notes provide rich semantic context. 
Merging these modalities is challenging +due to the inherent differences between continuous signals and discrete text. +To bridge this gap, we introduce ProMedTS, a novel self-supervised multimodal +framework that employs prompt-guided learning to unify these heterogeneous data +types. Our approach leverages lightweight anomaly detection to generate anomaly +captions that serve as prompts, guiding the encoding of raw time series data +into informative embeddings. These embeddings are aligned with textual +representations in a shared latent space, preserving fine-grained temporal +nuances alongside semantic insights. Furthermore, our framework incorporates +tailored self-supervised objectives to enhance both intra- and inter-modal +alignment. We evaluate ProMedTS on disease diagnosis tasks using real-world +datasets, and the results demonstrate that our method consistently outperforms +state-of-the-art approaches. -摘要:API 知識圖譜 (API KG) 是一個結構化網路,用於建模 API 實體及其關係,提供基本語義見解,以執行 API 建議、程式碼產生和 API 誤用偵測等任務。然而,建構一個知識豐富且可靠的 API KG 會產生若干挑戰。現有的基於架構的方法嚴重依賴手動註解來設計 KG 架構,導致過度的手動開銷。另一方面,由於缺乏架構指導,無架構的方法容易引入雜訊,降低 KG 的可靠性。為了解決這些問題,我們提出了探索建構過濾架構,這是一種基於大型語言模型 (LLM) 的自動化 API KG 建構方法。此架構包含三個關鍵模組:1) KG 探索:LLM 模擬註解者的工作流程,自動設計具有完整類型三元組的架構,將人為干預降至最低;2) KG 建構:在架構的指導下,LLM 提取實例三元組來建構豐富但不可靠的 API KG;3) KG 過濾:移除無效的類型三元組和可疑的實例三元組,以建構豐富且可靠的 API KG。實驗結果表明,我們的方法優於最先進的方法,在 F1 分數上提高了 25.2%。此外,探索建構過濾架構被證明是有效的,其中 KG 探索模組將 KG 豐富度提高了 133.6%,而 KG 過濾模組將可靠性提高了 26.6%。最後,跨模型實驗證實了我們架構的泛化性。 +摘要:大型語言模型(LLM)在視覺語言任務中表現出色,但其在醫療領域的應用仍未得到充分探索,特別是在將結構化時間序列數據與非結構化臨床筆記整合方面。在臨床實務中,動態時間序列數據(例如實驗室檢驗結果)會擷取關鍵的時間模式,而臨床筆記則提供豐富的語意脈絡。由於連續訊號與離散文字之間的固有差異,合併這些方式具有挑戰性。為了彌補這個差距,我們引入了 ProMedTS,這是一個新穎的自監督多模態框架,採用提示引導學習來統一這些異質化的數據類型。我們的做法利用輕量級異常偵測來產生異常標題,作為提示,引導將原始時間序列數據編碼成資訊性的嵌入。這些嵌入與共享潛在空間中的文字表示對齊,同時保留細微的時間差異和語意見解。此外,我們的框架納入了客製化的自監督目標,以增強模態內和模態間對齊。我們在疾病診斷任務中使用真實世界的數據集評估 ProMedTS,結果表明,我們的模型始終優於最先進的方法。 -##### **Reducing Hallucinations in Language Model-based SPARQL Query Generation Using Post-Generation Memory Retrieval** -2502.13369v1 by 
Aditya Sharma, Luis Lara, Amal Zouaq, Christopher J. Pal +##### **Towards a perturbation-based explanation for medical AI as differentiable programs** +2502.14001v1 by Takeshi Abe, Yoshiyuki Asai -The ability to generate SPARQL queries from natural language questions is -crucial for ensuring efficient and accurate retrieval of structured data from -knowledge graphs (KG). While large language models (LLMs) have been widely -adopted for SPARQL query generation, they are often susceptible to -hallucinations and out-of-distribution errors when producing KG elements like -Uniform Resource Identifiers (URIs) based on internal parametric knowledge. -This often results in content that appears plausible but is factually -incorrect, posing significant challenges for their use in real-world -information retrieval (IR) applications. This has led to increased research -aimed at detecting and mitigating such errors. In this paper, we introduce PGMR -(Post-Generation Memory Retrieval), a modular framework that incorporates a -non-parametric memory module to retrieve KG elements and enhance LLM-based -SPARQL query generation. Our experimental results indicate that PGMR -consistently delivers strong performance across diverse datasets, data -distributions, and LLMs. Notably, PGMR significantly mitigates URI -hallucinations, nearly eliminating the problem in several scenarios. +Recent advancement in machine learning algorithms reaches a point where +medical devices can be equipped with artificial intelligence (AI) models for +diagnostic support and routine automation in clinical settings. In medicine and +healthcare, there is a particular demand for sufficient and objective +explainability of the outcome generated by AI models. However, AI models are +generally considered as black boxes due to their complexity, and the +computational process leading to their response is often opaque. 
Although +several methods have been proposed to explain the behavior of models by +evaluating the importance of each feature in discrimination and prediction, +they may suffer from biases and opacities arising from the scale and sampling +protocol of the dataset used for training or testing. To overcome the +shortcomings of existing methods, we explore an alternative approach to provide +an objective explanation of AI models that can be defined independently of the +learning process and does not require additional data. As a preliminary study +for this direction of research, this work examines the numerical availability of +the Jacobian matrix of deep learning models, which measures how stably a model +responds to small perturbations added to the input. The indicator, if +available, is calculated from a trained AI model for a given target input. +This is a first step towards a perturbation-based explanation, which will +assist medical practitioners in understanding and interpreting the response of +the AI model in its clinical application. 
-摘要:從自然語言問題中產生 SPARQL 查詢的能力對於確保從知識圖譜 (KG) 中有效率且準確地擷取結構化資料至關重要。儘管大型語言模型 (LLM) 已廣泛用於 SPARQL 查詢產生,但它們在根據內部參數化知識產生像統一資源識別碼 (URI) 等 KG 元素時,通常容易出現幻覺和分布外錯誤。這通常會導致內容看似合理,但事實上並不正確,對其在真實世界資訊檢索 (IR) 應用中的使用構成重大挑戰。這導致針對偵測和減輕此類錯誤的研究增加。在本文中,我們介紹 PGMR(後產生記憶體檢索),這是一個模組化架構,它結合了一個非參數記憶體模組來檢索 KG 元素並增強基於 LLM 的 SPARQL 查詢產生。我們的實驗結果表明,PGMR 在不同的資料集、資料分佈和 LLM 中始終提供強大的效能。值得注意的是,PGMR 大幅減輕了 URI 幻覺,在許多情況下幾乎消除了問題。 +摘要:機器學習演算法的最新進展已達到一個階段,醫療裝置可以配備人工智慧 (AI) 模型,以在臨床環境中提供診斷支援和例行自動化。在醫學和保健領域,對於 AI 模型產生的結果有足夠且客觀的可解釋性有特別的需求。然而,由於 AI 模型的複雜性,它們通常被視為黑盒子,而導致其反應的運算過程通常是不透明的。儘管已經提出多種方法來解釋模型的行為,方法是評估每個特徵在判別和預測中的重要性,但它們可能會受到訓練或測試所用資料集的規模和抽樣協定的偏差和不透明性的影響。為了克服現有方法的缺點,我們探索一種替代方法,以提供 AI 模型的客觀解釋,這種方法可以獨立於學習過程定義,而且不需要額外的資料。作為這個研究方向的初步研究,這項工作探討了深度學習模型的雅可比矩陣的數值可用性,它衡量了模型對輸入中新增的小擾動的穩定反應程度。如果可用,指標會從訓練好的 AI 模型計算得出,以取得給定的目標輸入。這是基於擾動的解釋的第一步,它將協助醫療從業人員了解和詮釋 AI 模型在其臨床應用中的反應。 -##### **Craw4LLM: Efficient Web Crawling for LLM Pretraining** -2502.13347v1 by Shi Yu, Zhiyuan Liu, Chenyan Xiong +##### **RGAR: Recurrence Generation-augmented Retrieval for Factual-aware Medical Question Answering** +2502.13361v1 by Sichu Liang, Linhai Zhang, Hongyu Zhu, Wenwen Wang, Yulan He, Deyu Zhou -Web crawl is a main source of large language models' (LLMs) pretraining data, -but the majority of crawled web pages are discarded in pretraining due to low -data quality. This paper presents Crawl4LLM, an efficient web crawling method -that explores the web graph based on the preference of LLM pretraining. -Specifically, it leverages the influence of a webpage in LLM pretraining as the -priority score of the web crawler's scheduler, replacing the standard graph -connectivity based priority. Our experiments on a web graph containing 900 -million webpages from a commercial search engine's index demonstrate the -efficiency of Crawl4LLM in obtaining high-quality pretraining data. 
With just -21% URLs crawled, LLMs pretrained on Crawl4LLM data reach the same downstream -performances of previous crawls, significantly reducing the crawling waste and -alleviating the burdens on websites. Our code is publicly available at -https://github.com/cxcscmu/Crawl4LLM. +Medical question answering requires extensive access to specialized +conceptual knowledge. The current paradigm, Retrieval-Augmented Generation +(RAG), acquires expert medical knowledge through large-scale corpus +retrieval and uses this knowledge to guide a general-purpose large language +model (LLM) for generating answers. However, existing retrieval approaches +often overlook the importance of factual knowledge, which limits the relevance +of retrieved conceptual knowledge and restricts its applicability in real-world +scenarios, such as clinical decision-making based on Electronic Health Records +(EHRs). This paper introduces RGAR, a recurrence generation-augmented retrieval +framework that retrieves both relevant factual and conceptual knowledge from +dual sources (i.e., EHRs and the corpus), allowing them to interact and refine +one another. Through extensive evaluation across three factual-aware medical +question answering benchmarks, RGAR establishes a new state-of-the-art +performance among medical RAG systems. Notably, the Llama-3.1-8B-Instruct model +with RGAR surpasses the considerably larger, RAG-enhanced GPT-3.5. Our findings +demonstrate the benefit of extracting factual knowledge for retrieval, which +consistently yields improved generation quality. 
-摘要:網路爬蟲是大型語言模型 (LLM) 預訓練資料的主要來源, -但大多數已爬取的網頁在預訓練中會因為資料品質低落而被捨棄。 -本文提出 Crawl4LLM,這是一種有效率的網路爬取方法, -它會根據 LLM 預訓練的偏好來探索網路圖。 -具體來說,它利用網頁在 LLM 預訓練中的影響力作為網路爬蟲排程器的優先分數, -取代標準的圖形連線優先順序。 -我們在一個包含來自商業搜尋引擎索引的 9 億個網頁的網路圖上進行的實驗, -證明了 Crawl4LLM 在取得高品質預訓練資料方面的效率。 -只爬取了 21% 的網址,以 Crawl4LLM 資料預訓練的 LLM 就達到了先前爬取的相同下游效能, -大幅減少了爬取浪費,並減輕了對網站的負擔。 -我們的程式碼已公開於 https://github.com/cxcscmu/Crawl4LLM。 +摘要:醫療問題解答需要大量取得專業概念知識。目前的典範,檢索增強生成(RAG),透過大規模語料庫檢索取得專業醫療知識,並使用此知識引導通用大型語言模型(LLM)來產生答案。然而,現有的檢索方法經常忽略事實知識的重要性,這會限制檢索到的概念知識的相關性,並限制其在現實世界情境中的適用性,例如基於電子健康記錄(EHR)的臨床決策制定。本文介紹 RGAR,一個遞迴生成增強檢索架構,從雙重來源(即 EHR 和語料庫)檢索相關的事實和概念知識,讓它們互動並互相精煉。透過在三個事實感知醫療問題解答基準上進行廣泛評估,RGAR 在醫療 RAG 系統中建立了新的最先進效能。值得注意的是,採用 RGAR 的 Llama-3.1-8B-Instruct 模型超越了規模大得多的 RAG 增強型 GPT-3.5。我們的研究結果證明了提取事實知識以進行檢索的好處,這會持續產生改善的生成品質。 -##### **K-Paths: Reasoning over Graph Paths for Drug Repurposing and Drug Interaction Prediction** -2502.13344v1 by Tassallah Abdullahi, Ioanna Gemou, Nihal V. Nayak, Ghulam Murtaza, Stephen H. Bach, Carsten Eickhoff, Ritambhara Singh +##### **Adjust for Trust: Mitigating Trust-Induced Inappropriate Reliance on AI Assistance** +2502.13321v1 by Tejas Srinivasan, Jesse Thomason -Drug discovery is a complex and time-intensive process that requires -identifying and validating new therapeutic candidates. Computational approaches -using large-scale biomedical knowledge graphs (KGs) offer a promising solution -to accelerate this process. However, extracting meaningful insights from -large-scale KGs remains challenging due to the complexity of graph traversal. -Existing subgraph-based methods are tailored to graph neural networks (GNNs), -making them incompatible with other models, such as large language models -(LLMs). We introduce K-Paths, a retrieval framework that extracts structured, -diverse, and biologically meaningful paths from KGs. Integrating these paths -enables LLMs and GNNs to effectively predict unobserved drug-drug and -drug-disease interactions. 
Unlike traditional path-ranking approaches, K-Paths -retrieves and transforms paths into a structured format that LLMs can directly -process, facilitating explainable reasoning. K-Paths employs a diversity-aware -adaptation of Yen's algorithm to retrieve the K shortest loopless paths between -entities in an interaction query, prioritizing biologically relevant and -diverse relationships. Our experiments on benchmark datasets show that K-Paths -improves the zero-shot performance of Llama 8.1B's F1-score by 12.45 points on -drug repurposing and 13.42 points on interaction severity prediction. We also -show that Llama 70B achieves F1-score gains of 6.18 and 8.46 points, -respectively. K-Paths also improves the supervised training efficiency of -EmerGNN, a state-of-the-art GNN, by reducing KG size by 90% while maintaining -strong predictive performance. Beyond its scalability and efficiency, K-Paths -uniquely bridges the gap between KGs and LLMs, providing explainable rationales -for predicted interactions. These capabilities show that K-Paths is a valuable -tool for efficient data-driven drug discovery. +Trust biases how users rely on AI recommendations in AI-assisted +decision-making tasks, with low and high levels of trust resulting in increased +under- and over-reliance, respectively. We propose that AI assistants should +adapt their behavior through trust-adaptive interventions to mitigate such +inappropriate reliance. For instance, when user trust is low, providing an +explanation can elicit more careful consideration of the assistant's advice by +the user. In two decision-making scenarios -- laypeople answering science +questions and doctors making medical diagnoses -- we find that providing +supporting and counter-explanations during moments of low and high trust, +respectively, yields up to 38% reduction in inappropriate reliance and 20% +improvement in decision accuracy. 
We are similarly able to reduce over-reliance +by adaptively inserting forced pauses to promote deliberation. Our results +highlight how AI adaptation to user trust facilitates appropriate reliance, +presenting exciting avenues for improving human-AI collaboration. -摘要:藥物發現是一個複雜且耗時的過程,需要識別和驗證新的治療候選藥物。使用大型生物醫學知識圖譜 (KG) 的計算方法提供了一個有希望的解決方案來加速這個過程。然而,由於圖形遍歷的複雜性,從大型 KG 中提取有意義的見解仍然具有挑戰性。現有的子圖方法是針對圖神經網路 (GNN) 量身打造的,這使得它們與其他模型(例如大型語言模型 (LLM))不兼容。我們介紹了 K-Paths,這是一個檢索框架,它從 KG 中提取結構化、多樣化且具有生物意義的路徑。整合這些路徑使 LLM 和 GNN 能夠有效預測未觀察到的藥物-藥物和藥物-疾病交互。與傳統的路徑排序方法不同,K-Paths 檢索路徑並將其轉換為 LLM 可以直接處理的結構化格式,從而促進可解釋的推理。K-Paths 採用了 Yen 演算法的多樣性感知適應,以檢索交互查詢中實體之間的 K 個最短無環路徑,優先考慮生物相關且多樣化的關係。我們在基準資料集上的實驗表明,K-Paths 將 Llama 8.1B 的 F1 分數在藥物再利用上提高了 12.45 分,在交互嚴重性預測上提高了 13.42 分。我們還表明,Llama 70B 分別獲得了 6.18 分和 8.46 分的 F1 分數增益。K-Paths 還提高了最先進的 GNN EmerGNN 的監督訓練效率,同時將 KG 大小減少了 90%,同時保持強大的預測性能。除了其可擴展性和效率之外,K-Paths 獨特地彌合了 KG 和 LLM 之間的差距,為預測的交互提供了可解釋的依據。這些功能表明,K-Paths 是用於高效資料驅動藥物發現的寶貴工具。 +摘要:信任偏見影響使用者在 AI 輔助決策任務中如何依賴 AI 建議,信任程度低和高分別導致依賴不足和過度依賴。我們建議 AI 助理應透過信任適應式干預調整其行為,以減輕這種不適當的依賴。例如,當使用者信任度低時,提供解釋可以引發使用者更仔細地考慮助理的建議。在兩種決策情境中——外行人回答科學問題和醫生進行醫療診斷——我們發現,分別在信任度低和高的時刻提供支持性和反向解釋,可以將不適當的依賴降低多達 38%,並將決策準確性提高 20%。我們同樣能夠透過適應性地插入強制暫停來促進審議,以減少過度依賴。我們的結果強調 AI 如何適應使用者信任以促進適當的依賴,為改善人機協作提供了令人興奮的途徑。 -##### **Grounding LLM Reasoning with Knowledge Graphs** -2502.13247v1 by Alfonso Amayuelas, Joy Sain, Simerjot Kaur, Charese Smiley +##### **Prediction of Clinical Complication Onset using Neural Point Processes** +2502.13290v1 by Sachini Weerasekara, Sagar Kamarthi, Jacqueline Isaacs -Knowledge Graphs (KGs) are valuable tools for representing relationships -between entities in a structured format. Traditionally, these knowledge bases -are queried to extract specific information. However, question-answering (QA) -over such KGs poses a challenge due to the intrinsic complexity of natural -language compared to the structured format and the size of these graphs. 
-Despite these challenges, the structured nature of KGs can provide a solid -foundation for grounding the outputs of Large Language Models (LLMs), offering -organizations increased reliability and control. - Recent advancements in LLMs have introduced reasoning methods at inference -time to improve their performance and maximize their capabilities. In this -work, we propose integrating these reasoning strategies with KGs to anchor -every step or "thought" of the reasoning chains in KG data. Specifically, we -evaluate both agentic and automated search methods across several reasoning -strategies, including Chain-of-Thought (CoT), Tree-of-Thought (ToT), and -Graph-of-Thought (GoT), using GRBench, a benchmark dataset for graph reasoning -with domain-specific graphs. Our experiments demonstrate that this approach -consistently outperforms baseline models, highlighting the benefits of -grounding LLM reasoning processes in structured KG data. +Predicting medical events in advance within critical care settings is +paramount for patient outcomes and resource management. Utilizing predictive +models, healthcare providers can anticipate issues such as cardiac arrest, +sepsis, or respiratory failure before they manifest. Recently, there has been a +surge in research focusing on forecasting adverse medical event onsets prior to +clinical manifestation using machine learning. However, while these models +provide temporal prognostic predictions for the occurrence of a specific +adverse event of interest within defined time intervals, their interpretability +often remains a challenge. In this work, we explore the applicability of neural +temporal point processes in the context of adverse event onset prediction, with +the aim of explaining clinical pathways and providing interpretable insights. +Our experiments span six state-of-the-art neural point processes and six +critical care datasets, each focusing on the onset of distinct adverse events. 
+This work represents a novel application class of neural temporal point +processes in event prediction. -摘要:知識圖譜 (KG) 是以結構化格式表示實體之間關係的寶貴工具。傳統上,這些知識庫會被查詢以萃取特定資訊。然而,由於自然語言與結構化格式之間的內在複雜性,以及這些圖譜的規模,在這些 KG 上進行問答 (QA) 會構成挑戰。儘管有這些挑戰,KG 的結構化特性可以為大型語言模型 (LLM) 的輸出提供穩固的基礎,為組織提供更高的可靠性和控制力。 -LLM 的最新進展在推論時間引入了推理方法,以提升其效能並最大化其能力。在這項工作中,我們建議將這些推理策略與 KG 整合,以將推理鏈的每一步或「思考」錨定在 KG 資料中。具體來說,我們在多種推理策略中評估代理和自動化搜尋方法,包括思考鏈 (CoT)、思考樹 (ToT) 和思考圖 (GoT),使用 GRBench,這是一個針對圖形推理的基準資料集,其中包含特定領域的圖形。我們的實驗證明,這種方法始終優於基準模型,突顯了將 LLM 推理過程建立在結構化 KG 資料中的好處。 +摘要:在重症監護環境中預先預測醫療事件對於患者的預後和資源管理至關重要。利用預測模型,醫療保健提供者可以在心臟驟停、敗血症或呼吸衰竭等問題發生之前預測到這些問題。最近,專注於在臨床表現之前使用機器學習預測不良醫療事件發生的研究激增。然而,儘管這些模型為特定不良事件在定義的時間間隔內發生提供了時間預後預測,但它們的可解釋性仍然是一個挑戰。在這項工作中,我們探討了神經時間點過程在不良事件發作預測中的適用性,目的是解釋臨床途徑並提供可解釋的見解。我們的實驗涵蓋了六種最先進的神經點過程和六個重症監護資料集,每個資料集都專注於不同不良事件的發作。這項工作代表了神經時間點過程在事件預測中的一種新的應用類別。 -##### **Learning to Defer for Causal Discovery with Imperfect Experts** -2502.13132v1 by Oscar Clivio, Divyat Mahajan, Perouz Taslakian, Sara Magliacane, Ioannis Mitliagkas, Valentina Zantedeschi, Alexandre Drouin +##### **SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?** +2502.13233v1 by Yucheng Shi, Tianze Yang, Canyu Chen, Quanzheng Li, Tianming Liu, Xiang Li, Ninghao Liu -Integrating expert knowledge, e.g. from large language models, into causal -discovery algorithms can be challenging when the knowledge is not guaranteed to -be correct. Expert recommendations may contradict data-driven results, and -their reliability can vary significantly depending on the domain or specific -query. Existing methods based on soft constraints or inconsistencies in -predicted causal relationships fail to account for these variations in -expertise. To remedy this, we propose L2D-CD, a method for gauging the -correctness of expert recommendations and optimally combining them with -data-driven causal discovery results. 
By adapting learning-to-defer (L2D) -algorithms for pairwise causal discovery (CD), we learn a deferral function -that selects whether to rely on classical causal discovery methods using -numerical data or expert recommendations based on textual meta-data. We -evaluate L2D-CD on the canonical T\"ubingen pairs dataset and demonstrate its -superior performance compared to both the causal discovery method and the -expert used in isolation. Moreover, our approach identifies domains where the -expert's performance is strong or weak. Finally, we outline a strategy for -generalizing this approach to causal discovery on graphs with more than two -variables, paving the way for further research in this area. +Large Language Models (LLMs) have shown remarkable capabilities in general +domains but often struggle with tasks requiring specialized knowledge. +Conventional Retrieval-Augmented Generation (RAG) techniques typically retrieve +external information from static knowledge bases, which can be outdated or +incomplete, missing fine-grained clinical details essential for accurate +medical question answering. In this work, we propose SearchRAG, a novel +framework that overcomes these limitations by leveraging real-time search +engines. Our method employs synthetic query generation to convert complex +medical questions into search-engine-friendly queries and utilizes +uncertainty-based knowledge selection to filter and incorporate the most +relevant and informative medical knowledge into the LLM's input. Experimental +results demonstrate that our method significantly improves response accuracy in +medical question answering tasks, particularly for complex questions requiring +detailed and up-to-date knowledge. 
-摘要:整合专家知識,例如從大型語言模型中整合到因果發現演算法中,當知識無法保證正確時會很有挑戰性。專家建議可能會與資料驅動的結果相矛盾,而且他們的可靠性可能會根據領域或特定查詢而有顯著差異。現有的基於軟約束或預測因果關係中不一致的方法無法說明專業知識中的這些變化。為了補救這一點,我們提出了 L2D-CD,一種用於評估專家建議的正確性並將其與資料驅動的因果發現結果最佳結合的方法。透過調整學習延遲 (L2D) 演算法以進行成對因果發現 (CD),我們學習了一個延遲函數,用於選擇依賴使用數值資料的傳統因果發現方法或基於文字元資料的專家建議。我們在經典的 T\"ubingen 對資料集上評估 L2D-CD,並證明其與單獨使用的因果發現方法和專家相比具有優越的效能。此外,我們的做法識別出專家表現強或弱的領域。最後,我們概述了一種將此方法推廣到具有兩個以上變數的圖表上進行因果發現的策略,為此領域的進一步研究鋪平了道路。 +摘要:大型語言模型 (LLM) 在一般領域展現出驚人的能力,但經常在需要專業知識的任務中掙扎。 +傳統的檢索增強生成 (RAG) 技術通常從靜態知識庫中檢索外部資訊,這些資訊可能過時或不完整,缺少準確回答醫療問題所需的細微臨床細節。在這項工作中,我們提出 SearchRAG,這是一種新穎的架構,透過利用即時搜尋引擎克服這些限制。我們的模型採用合成查詢生成,將複雜的醫療問題轉換成搜尋引擎友善的查詢,並利用基於不確定性的知識選擇來過濾和納入 LLM 輸入中最相關且最有資訊的醫療知識。實驗結果證明,我們的模型顯著改善了醫療問題回答任務中的回應準確度,特別是需要詳細且最新的知識的複雜問題。 -##### **Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks** -2502.13025v1 by Markus J. Buehler +##### **Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions** +2502.13135v1 by Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić -We present an agentic, autonomous graph expansion framework that iteratively -structures and refines knowledge in situ. Unlike conventional knowledge graph -construction methods relying on static extraction or single-pass learning, our -approach couples a reasoning-native large language model with a continually -updated graph representation. At each step, the system actively generates new -concepts and relationships, merges them into a global graph, and formulates -subsequent prompts based on its evolving structure. Through this -feedback-driven loop, the model organizes information into a scale-free network -characterized by hub formation, stable modularity, and bridging nodes that link -disparate knowledge clusters. 
Over hundreds of iterations, new nodes and edges -continue to appear without saturating, while centrality measures and shortest -path distributions evolve to yield increasingly distributed connectivity. Our -analysis reveals emergent patterns, such as the rise of highly connected 'hub' -concepts and the shifting influence of 'bridge' nodes, indicating that agentic, -self-reinforcing graph construction can yield open-ended, coherent knowledge -structures. Applied to materials design problems, we present compositional -reasoning experiments by extracting node-specific and synergy-level principles -to foster genuinely novel knowledge synthesis, yielding cross-domain ideas that -transcend rote summarization and strengthen the framework's potential for -open-ended scientific discovery. We discuss other applications in scientific -discovery and outline future directions for enhancing scalability and -interpretability. +We present an end-to-end framework for generating synthetic users for +evaluating interactive agents designed to encourage positive behavior changes, +such as in health and lifestyle coaching. The synthetic users are grounded in +health and lifestyle conditions, specifically sleep and diabetes management in +this study, to ensure realistic interactions with the health coaching agent. +Synthetic users are created in two stages: first, structured data are generated +grounded in real-world health and lifestyle factors in addition to basic +demographics and behavioral attributes; second, full profiles of the synthetic +users are developed conditioned on the structured data. Interactions between +synthetic users and the coaching agent are simulated using generative +agent-based models such as Concordia, or directly by prompting a language +model. 
Using two independently-developed agents for sleep and diabetes coaching +as case studies, the validity of this framework is demonstrated by analyzing +the coaching agent's understanding of the synthetic users' needs and +challenges. Finally, through multiple blinded evaluations of user-coach +interactions by human experts, we demonstrate that our synthetic users with +health and behavioral attributes more accurately portray real human users with +the same attributes, compared to generic synthetic users not grounded in such +attributes. The proposed framework lays the foundation for efficient +development of conversational agents through extensive, realistic, and grounded +simulated interactions. -摘要:我們提出一個能動的、自主的圖形擴展框架,它反覆地建構和精煉原位知識。與依賴靜態提取或單次學習的傳統知識圖形建構方法不同,我們的做法將一個推理原生的大語言模型與一個持續更新的圖形表示結合起來。在每一步中,系統主動產生新的概念和關係,將它們合併到一個全域圖形中,並根據其不斷演化的結構制定後續提示。透過這個回饋驅動的迴圈,模型將資訊組織成一個無標度網路,其特徵是樞紐形成、穩定的模組化以及連結不同知識群集的橋接節點。在數百次反覆運算中,新的節點和邊緣會持續出現,而不會飽和,同時中心性測量和最短路徑分佈會演化為產生越來越分散的連通性。我們的分析揭示了新興模式,例如高度連接的「樞紐」概念的興起和「橋樑」節點影響力的轉移,這表明能動的、自我強化的圖形建構可以產生開放式、連貫的知識結構。應用於材料設計問題,我們提出組合推理實驗,透過提取特定於節點的原則和協同效應層級原則,以促進真正新穎的知識綜合,產生超越死背式摘要並強化框架在開放式科學發現中潛力的跨領域想法。我們討論了在科學發現中的其他應用,並概述了增強可擴充性和可解釋性的未來方向。 +摘要:我們提供了一個端到端的架構,用於為評估互動式代理生成合成使用者,這些代理旨在鼓勵正向行為改變,例如健康和生活方式指導。合成使用者以健康和生活方式狀況為基礎,特別是本研究中的睡眠和糖尿病管理,以確保與健康指導代理的互動具有真實性。合成使用者分兩個階段建立:首先,除了基本人口統計資料和行為屬性外,還會產生以現實世界的健康和生活方式因素為基礎的結構化資料;其次,會根據結構化資料開發合成使用者的完整個人資料。合成使用者和指導代理之間的互動是使用生成式基於代理的模型(例如 Concordia)模擬的,或者直接通過提示語言模型來模擬。使用兩個獨立開發的睡眠和糖尿病指導代理作為案例研究,通過分析指導代理對合成使用者需求和挑戰的理解,證明了此架構的有效性。最後,通過人類專家對使用者指導互動進行多重盲測評估,我們證明了與未以這些屬性為基礎的通用合成使用者相比,具有健康和行為屬性的合成使用者更準確地描繪了具有相同屬性的真實人類使用者。所提出的架構為通過廣泛、真實且有根據的模擬互動,為對話代理的有效開發奠定了基礎。 -##### **Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge** -2502.13010v1 by Mohammad Reza Rezaei, Reza Saadati Fard, Jayson Parker, Rahul G. 
Krishnan, Milad Lankarany +##### **Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization** +2502.13108v1 by Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Amit Agarwal, Bhargava Kumar, Srikant Panda, Tejaswini Kumar -Large Language Models (LLMs) have significantly advanced medical -question-answering by leveraging extensive clinical data and medical -literature. However, the rapid evolution of medical knowledge and the -labor-intensive process of manually updating domain-specific resources pose -challenges to the reliability of these systems. To address this, we introduce -Adaptive Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates -the construction and continuous updating of medical knowledge graphs, -integrates reasoning, and retrieves current external evidence, such as PubMed -and WikiSearch. By dynamically linking new findings and complex medical -concepts, AMG-RAG not only improves accuracy but also enhances interpretability -in medical queries. - Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness -of AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of -66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to -100 times larger. Notably, these improvements are achieved without increasing -computational overhead, highlighting the critical role of automated knowledge -graph generation and external evidence retrieval in delivering up-to-date, -trustworthy medical insights. +Clinical Question Answering (CQA) plays a crucial role in medical +decision-making, enabling physicians to extract relevant information from +Electronic Medical Records (EMRs). 
While transformer-based models such as BERT, +BioBERT, and ClinicalBERT have demonstrated state-of-the-art performance in +CQA, existing models lack the ability to categorize extracted answers, which is +critical for structured retrieval, content filtering, and medical decision +support. + To address this limitation, we introduce a Multi-Task Learning (MTL) +framework that jointly trains CQA models for both answer extraction and medical +categorization. In addition to predicting answer spans, our model classifies +responses into five standardized medical categories: Diagnosis, Medication, +Symptoms, Procedure, and Lab Reports. This categorization enables more +structured and interpretable outputs, making clinical QA models more useful in +real-world healthcare settings. + We evaluate our approach on emrQA, a large-scale dataset for medical question +answering. Results show that MTL improves F1-score by 2.2% compared to standard +fine-tuning, while achieving 90.7% accuracy in answer categorization. These +findings suggest that MTL not only enhances CQA performance but also introduces +an effective mechanism for categorization and structured medical information +retrieval. 
-摘要:大型語言模型 (LLM) 透過利用廣泛的臨床資料和醫學文獻,大幅提升了醫療問題解答的進步。然而,醫療知識的快速演進和手動更新特定領域資源的繁複程序,對這些系統的可靠性構成挑戰。為了解決這個問題,我們引入了適應性醫療圖表 RAG (AMG-RAG),這是一個自動化建構和持續更新醫療知識圖表的綜合架構,整合推理並擷取 PubMed 和 WikiSearch 等最新的外部證據。透過動態連結新的發現和複雜的醫療概念,AMG-RAG 不僅提升了準確性,也增強了醫療查詢的可解釋性。在 MEDQA 和 MEDMCQA 基準上的評量證明了 AMG-RAG 的有效性,在 MEDQA 上達到了 74.1% 的 F1 分數,在 MEDMCQA 上達到了 66.34% 的準確度,優於其他同類模型以及那些大 10 到 100 倍的模型。值得注意的是,這些改進是在不增加運算負擔的情況下實現的,突顯了自動化知識圖表生成和外部證據擷取在提供最新、可信賴的醫療見解中扮演的重要角色。 +摘要:臨床問答 (CQA) 在醫療決策中扮演著至關重要的角色,讓醫師能夠從電子病歷 (EMR) 中擷取相關資訊。儘管 BERT、BioBERT 和 ClinicalBERT 等基於轉換器的模型已在 CQA 中展現出最先進的效能,但現有的模型缺乏分類擷取答案的能力,這對於結構化檢索、內容過濾和醫療決策支援至關重要。 + 為了解決這個限制,我們引進了一個多任務學習 (MTL) 架構,它同時訓練 CQA 模型用於答案擷取和醫療分類。除了預測答案範圍,我們的模型將回應分類為五個標準化醫療類別:診斷、藥物、症狀、程序和實驗室報告。這種分類能產生更結構化且易於理解的輸出,讓臨床問答模型在真實世界的醫療保健環境中更實用。 + 我們在 emrQA 上評估我們的做法,emrQA 是用於醫療問題解答的大規模資料集。結果顯示,與標準微調相比,MTL 將 F1 分數提高了 2.2%,同時在答案分類中達到 90.7% 的準確度。這些發現表明,MTL 不僅增強了 CQA 的效能,還引入了一種分類和結構化醫療資訊檢索的有效機制。 -##### **R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs** -2502.12767v1 by Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi +##### **Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection** +2502.12948v1 by Athira J Jacob, Puneet Sharma, Daniel Rueckert -Recent studies have combined Large Language Models (LLMs) with Knowledge -Graphs (KGs) to enhance reasoning, improving inference accuracy without -additional training while mitigating hallucination. However, existing -frameworks are often rigid, struggling to adapt to KG or task changes. They -also rely heavily on powerful LLMs for reliable (i.e., trustworthy) reasoning. -To address this, We introduce R2-KG, a plug-and-play, dual-agent framework that -separates reasoning into two roles: an Operator (a low-capacity LLM) that -gathers evidence and a Supervisor (a high-capacity LLM) that makes final -judgments. This design is cost-efficient for LLM inference while still -maintaining strong reasoning accuracy. 
Additionally, R2-KG employs an -Abstention mechanism, generating answers only when sufficient evidence is -collected from KG, which significantly enhances reliability. Experiments across -multiple KG-based reasoning tasks show that R2-KG consistently outperforms -baselines in both accuracy and reliability, regardless of the inherent -capability of LLMs used as the Operator. Further experiments reveal that the -single-agent version of R2-KG, equipped with a strict self-consistency -strategy, achieves significantly higher-than-baseline reliability while -reducing inference cost. However, it also leads to a higher abstention rate in -complex KGs. Our findings establish R2-KG as a flexible and cost-effective -solution for KG-based reasoning. It reduces reliance on high-capacity LLMs -while ensuring trustworthy inference. +Detection of hyperenhancement from cardiac LGE MRI images is a complex task +requiring significant clinical expertise. Although deep learning-based models +have shown promising results for the task, they require large amounts of data +with fine-grained annotations. Clinical reports generated for cardiac MR +studies contain rich, clinically relevant information, including the location, +extent and etiology of any scars present. Although recently developed +CLIP-based training enables pretraining models with image-text pairs, it +requires large amounts of data and further finetuning strategies on downstream +tasks. In this study, we use various strategies rooted in domain knowledge to +train a model for LGE detection solely using text from clinical reports, on a +relatively small clinical cohort of 965 patients. We improve performance +through the use of synthetic data augmentation, by systematically creating scar +images and associated text. In addition, we standardize the orientation of the +images in an anatomy-informed way to enable better alignment of spatial and +text features. 
We also use a captioning loss to enable fine-grained supervision +and explore the effect of pretraining of the vision encoder on performance. +Finally, ablation studies are carried out to elucidate the contributions of +each design component to the overall performance of the model. -摘要:最近的研究结合了大型语言模型 (LLM) 与知识图谱 (KG) 以增强推理,在不额外训练的情况下提高推理准确性,同时减轻幻觉。然而,现有的框架通常很僵化,难以适应知识图谱或任务的变化。它们还严重依赖强大的 LLM 来进行可靠(即值得信赖)的推理。为了解决这个问题,我们引入了 R2-KG,这是一个即插即用、双代理框架,它将推理分为两个角色:一个收集证据的操作员(低容量 LLM)和一个做出最终判断的监督员(高容量 LLM)。这种设计在 LLM 推理方面具有成本效益,同时仍保持强大的推理准确性。此外,R2-KG 采用弃权机制,仅在从知识图谱收集到足够证据时才生成答案,这显著提高了可靠性。跨多个基于知识图谱的推理任务的实验表明,R2-KG 在准确性和可靠性方面始终优于基线,而与用作操作员的 LLM 的固有能力无关。进一步的实验表明,R2-KG 的单代理版本配备了严格的自一致性策略,实现了明显高于基线的可靠性,同时降低了推理成本。然而,它也导致了复杂知识图谱中更高的弃权率。我们的发现将 R2-KG 确立为一种灵活且经济高效的基于知识图谱的推理解决方案。它减少了对高容量 LLM 的依赖,同时确保了可信的推理。 +摘要:從心臟 LGE MRI 影像偵測出過度增強是一項複雜的任務,需要顯著的臨床專業知識。儘管基於深度學習的模型已顯示出對這項任務有前景的結果,但它們需要大量具有細緻註解的資料。為心臟 MR 研究產生的臨床報告包含豐富且臨床上相關的資訊,包括任何疤痕的位置、範圍和病因。儘管最近開發的基於 CLIP 的訓練能使用影像文字對預訓練模型,但它需要大量資料和進一步微調下游任務的策略。在這項研究中,我們使用植基於領域知識的各種策略,僅使用來自臨床報告的文字,在一個相對較小的 965 名患者臨床群體中訓練一個 LGE 偵測模型。我們透過使用合成資料擴充來改善效能,系統性地建立疤痕影像和相關文字。此外,我們以解剖學告知的方式標準化影像方向,以使空間和文字特徵能更好地對齊。我們也使用標題損失來啟用細緻的監督,並探討視覺編碼器的預訓練對效能的影響。最後,進行消融研究以闡明每個設計元件對模型整體效能的貢獻。 -##### **Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research** -2502.12669v1 by Xiang Liu, Penglei Sun, Shuyan Chen, Longhan Zhang, Peijie Dong, Huajie You, Yongqi Zhang, Chang Yan, Xiaowen Chu, Tong-yi Zhang +##### **Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models** +2502.12825v2 by Rubing Li, João Sedoc, Arun Sundararajan -The rapid advancement of perovskite solar cells (PSCs) has led to an -exponential growth in research publications, creating an urgent need for -efficient knowledge management and reasoning systems in this domain. We present -a comprehensive knowledge-enhanced system for PSCs that integrates three key -components. 
First, we develop Perovskite-KG, a domain-specific knowledge graph -constructed from 1,517 research papers, containing 23,789 entities and 22,272 -relationships. Second, we create two complementary datasets: Perovskite-Chat, -comprising 55,101 high-quality question-answer pairs generated through a novel -multi-agent framework, and Perovskite-Reasoning, containing 2,217 carefully -curated materials science problems. Third, we introduce two specialized large -language models: Perovskite-Chat-LLM for domain-specific knowledge assistance -and Perovskite-Reasoning-LLM for scientific reasoning tasks. Experimental -results demonstrate that our system significantly outperforms existing models -in both domain-specific knowledge retrieval and scientific reasoning tasks, -providing researchers with effective tools for literature review, experimental -design, and complex problem-solving in PSC research. +When encountering increasingly frequent performance improvements or cost +reductions from a new large language model (LLM), developers of applications +leveraging LLMs must decide whether to take advantage of these improvements or +stay with older tried-and-tested models. Low perceived switching frictions can +lead to choices that do not consider more subtle behavior changes that the +transition may induce. Our experiments use a popular game-theoretic behavioral +economics model of trust to show stark differences in the trusting behavior of +OpenAI's and DeepSeek's models. We highlight a collapse in the economic trust +behavior of the o1-mini and o3-mini models as they reconcile profit-maximizing +and risk-seeking with future returns from trust, and contrast it with +DeepSeek's more sophisticated and profitable trusting behavior that stems from +an ability to incorporate deeper concepts like forward planning and +theory-of-mind. 
As LLMs form the basis for high-stakes commercial systems, our +results highlight the perils of relying on LLM performance benchmarks that are +too narrowly defined and suggest that careful analysis of their hidden fault +lines should be part of any organization's AI strategy. -摘要:由於 perovskite 太陽能電池 (PSC) 快速進展,導致研究出版物呈指數成長,迫切需要在這領域建立有效的知識管理和推理系統。我們提出一個結合三項關鍵元件的 PSC 全面知識增強系統。首先,我們開發出 Perovskite-KG,一個由 1,517 篇研究論文建構而成、包含 23,789 個實體和 22,272 個關係的領域特定知識圖譜。其次,我們建立兩個互補的資料集:Perovskite-Chat,包含透過一個新穎的多代理架構產生 55,101 個高品質問答配對;以及 Perovskite-Reasoning,包含 2,217 個仔細策展的材料科學問題。第三,我們推出兩個專門化大型語言模型:針對領域特定知識協助的 Perovskite-Chat-LLM,以及針對科學推理任務的 Perovskite-Reasoning-LLM。實驗結果顯示,我們的系統在領域特定知識擷取和科學推理任務上都明顯優於現有模型,為研究人員提供有效的工具,用於 PSC 研究中的文獻回顧、實驗設計和複雜問題解決。 +摘要:在遇到大型語言模型 (LLM) 頻頻帶來的效能提升或成本降低時,利用 LLM 的應用程式開發人員必須決定是否要利用這些提升,或繼續使用較舊且經過驗證的模型。低感知切換摩擦可能會導致選擇,而沒有考慮轉換可能引發的更細微行為變更。我們的實驗使用流行的博弈論行為經濟信任模型,以顯示 OpenAI 和 DeepSeek 模型在信任行為上的顯著差異。我們強調 o1-mini 和 o3-mini 模型的經濟信任行為崩潰,因為它們調和了利潤最大化和冒險,以及來自信任的未來回報,並將其與 DeepSeek 更複雜且有利可圖的信任行為進行對比,這種行為源於整合更深入的概念,例如前瞻性規劃和心智理論。由於 LLM 構成高風險商業系統的基礎,我們的結果突顯了依賴定義過於狹窄的 LLM 效能基準的危險,並建議仔細分析其隱藏的斷層線應該是任何組織 AI 策略的一部分。 -##### **G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation** -2502.12586v1 by Yuhan Li, Xinni Zhang, Linhao Luo, Heng Chang, Yuxiang Ren, Irwin King, Jia Li +##### **LLM Safety for Children** +2502.12552v1 by Prasanjit Rath, Hari Shrawgi, Parag Agrawal, Sandipan Dandapat -Explainable recommendation has demonstrated significant advantages in -informing users about the logic behind recommendations, thereby increasing -system transparency, effectiveness, and trustworthiness. To provide -personalized and interpretable explanations, existing works often combine the -generation capabilities of large language models (LLMs) with collaborative -filtering (CF) information. CF information extracted from the user-item -interaction graph captures the user behaviors and preferences, which is crucial -for providing informative explanations. 
However, due to the complexity of graph -structure, effectively extracting the CF information from graphs still remains -a challenge. Moreover, existing methods often struggle with the integration of -extracted CF information with LLMs due to its implicit representation and the -modality gap between graph structures and natural language explanations. To -address these challenges, we propose G-Refer, a framework using graph -retrieval-augmented large language models (LLMs) for explainable -recommendation. Specifically, we first employ a hybrid graph retrieval -mechanism to retrieve explicit CF signals from both structural and semantic -perspectives. The retrieved CF information is explicitly formulated as -human-understandable text by the proposed graph translation and accounts for -the explanations generated by LLMs. To bridge the modality gap, we introduce -knowledge pruning and retrieval-augmented fine-tuning to enhance the ability of -LLMs to process and utilize the retrieved CF information to generate -explanations. Extensive experiments show that G-Refer achieves superior -performance compared with existing methods in both explainability and -stability. Codes and data are available at https://github.com/Yuhan1i/G-Refer. +This paper analyzes the safety of Large Language Models (LLMs) in +interactions with children below age of 18 years. Despite the transformative +applications of LLMs in various aspects of children's lives such as education +and therapy, there remains a significant gap in understanding and mitigating +potential content harms specific to this demographic. The study acknowledges +the diverse nature of children often overlooked by standard safety evaluations +and proposes a comprehensive approach to evaluating LLM safety specifically for +children. We list down potential risks that children may encounter when using +LLM powered applications. 
Additionally we develop Child User Models that +reflect the varied personalities and interests of children informed by +literature in child care and psychology. These user models aim to bridge the +existing gap in child safety literature across various fields. We utilize Child +User Models to evaluate the safety of six state of the art LLMs. Our +observations reveal significant safety gaps in LLMs particularly in categories +harmful to children but not adults -摘要:可解釋建議已證明在告知使用者建議背後的邏輯方面具有顯著優點,從而提高系統透明度、有效性和可信度。為了提供個人化且可解釋的說明,現有作品通常結合大型語言模型 (LLM) 的生成能力與協同過濾 (CF) 資訊。從使用者項目互動圖形中提取的 CF 資訊會擷取使用者行為和偏好,這對於提供資訊性說明至關重要。然而,由於圖形結構的複雜性,從圖形中有效提取 CF 資訊仍然是一個挑戰。此外,現有方法通常難以將提取的 CF 資訊與 LLM 整合,因為其隱含表示和圖形結構與自然語言說明之間的模式差距。為了應對這些挑戰,我們提出 G-Refer,一個使用圖形檢索增強型大型語言模型 (LLM) 的可解釋建議架構。具體來說,我們首先採用混合圖形檢索機制,從結構和語義角度檢索明確的 CF 訊號。檢索到的 CF 資訊由建議的圖形翻譯明確表述為人類可以理解的文字,並說明 LLM 生成的解釋。為了彌合模式差距,我們引入了知識修剪和檢索增強微調,以增強 LLM 處理和利用檢索到的 CF 資訊以產生解釋的能力。廣泛的實驗表明,與現有方法相比,G-Refer 在可解釋性和穩定性方面都取得了卓越的效能。程式碼和資料可在 https://github.com/Yuhan1i/G-Refer 取得。 +摘要:本文分析了大型語言模型 (LLM) 在與 18 歲以下兒童互動時的安全性。儘管 LLM 在兒童生活的各個方面(例如教育和治療)都有轉變性的應用,但在了解和減輕對這個群體具體的潛在內容危害方面仍然存在顯著差距。研究承認兒童的多樣性,而標準安全評估通常會忽略這些多樣性,並提出了一種針對兒童評估 LLM 安全性的綜合方法。我們列出了兒童在使用由 LLM 提供動力的應用程式時可能遇到的潛在風險。此外,我們開發了兒童使用者模型,這些模型反映了兒童不同的個性特質和興趣,並參考了兒童照護和心理學的文獻。這些使用者模型旨在彌合不同領域兒童安全文獻中現有的差距。我們利用兒童使用者模型來評估六個最先進的 LLM 的安全性。我們的觀察結果揭示了 LLM 中的重大安全漏洞,特別是在對兒童有害但對成年人無害的類別中 -##### **A-MEM: Agentic Memory for LLM Agents** -2502.12110v1 by Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang +##### **Classifiers of Data Sharing Statements in Clinical Trial Records** +2502.12362v1 by Saber Jelodari Mamaghani, Cosima Strantz, Dennis Toddenroth -While large language model (LLM) agents can effectively use external tools -for complex real-world tasks, they require memory systems to leverage -historical experiences. Current memory systems enable basic storage and -retrieval but lack sophisticated memory organization, despite recent attempts -to incorporate graph databases. 
Moreover, these systems' fixed operations and -structures limit their adaptability across diverse tasks. To address this -limitation, this paper proposes a novel agentic memory system for LLM agents -that can dynamically organize memories in an agentic way. Following the basic -principles of the Zettelkasten method, we designed our memory system to create -interconnected knowledge networks through dynamic indexing and linking. When a -new memory is added, we generate a comprehensive note containing multiple -structured attributes, including contextual descriptions, keywords, and tags. -The system then analyzes historical memories to identify relevant connections, -establishing links where meaningful similarities exist. Additionally, this -process enables memory evolution - as new memories are integrated, they can -trigger updates to the contextual representations and attributes of existing -historical memories, allowing the memory network to continuously refine its -understanding. Our approach combines the structured organization principles of -Zettelkasten with the flexibility of agent-driven decision making, allowing for -more adaptive and context-aware memory management. Empirical experiments on six -foundation models show superior improvement against existing SOTA baselines. -The source code is available at https://github.com/WujiangXu/AgenticMemory. +Digital individual participant data (IPD) from clinical trials are +increasingly distributed for potential scientific reuse. The identification of +available IPD, however, requires interpretations of textual data-sharing +statements (DSS) in large databases. Recent advancements in computational +linguistics include pre-trained language models that promise to simplify the +implementation of effective classifiers based on textual inputs. 
In a subset of +5,000 textual DSS from ClinicalTrials.gov, we evaluate how well classifiers +based on domain-specific pre-trained language models reproduce original +availability categories as well as manually annotated labels. Typical metrics +indicate that classifiers that predicted manual annotations outperformed those +that learned to output the original availability categories. This suggests that +the textual DSS descriptions contain applicable information that the +availability categories do not, and that such classifiers could thus aid the +automatic identification of available IPD in large trial databases. -摘要:大型語言模型 (LLM) 代理雖然能有效地使用外部工具來執行複雜的真實世界任務,但它們需要記憶體系統來利用歷史經驗。目前的記憶體系統能進行基本的儲存和檢索,但缺乏精密的記憶體組織,儘管最近嘗試納入圖形資料庫。此外,這些系統固定的運作和結構限制了它們在不同任務中的適應性。為了解決這個限制,本文提出了一種新的代理記憶體系統,供 LLM 代理動態地以代理的方式組織記憶體。遵循 Zettelkasten 方法的基本原則,我們設計我們的記憶體系統,透過動態索引和連結來建立相互連結的知識網路。當加入新的記憶體時,我們會產生包含多個結構化屬性的綜合筆記,包括脈絡描述、關鍵字和標籤。然後,系統會分析歷史記憶體以找出相關連結,在有意義的相似性時建立連結。此外,這個程序能讓記憶體演化,因為當整合新的記憶體時,它們會觸發對現有歷史記憶體的脈絡表示和屬性的更新,讓記憶體網路能持續精進它的理解。我們的做法結合了 Zettelkasten 的結構化組織原則和代理驅動決策制定的靈活性,能進行更具適應性和脈絡感知的記憶體管理。在六個基礎模型上的經驗實驗顯示出比現有的 SOTA 基準線有顯著的進步。原始碼可以在 https://github.com/WujiangXu/AgenticMemory 找到。 +摘要:臨床試驗的數位個人參與者資料 (IPD) 愈來愈廣泛地用於潛在的科學再利用。然而,要找出可用的 IPD,需要對大型資料庫中的文字資料共享聲明 (DSS) 進行詮釋。計算語言學最近的進展包括預先訓練的語言模型,有望簡化根據文字輸入實作有效分類器的過程。在 ClinicalTrials.gov 中的 5,000 個文字 DSS 子集中,我們評估了基於特定領域預先訓練語言模型的分類器,在重現原始可用性類別以及手動註解標籤方面的表現。典型的指標顯示,預測手動註解的分類器優於學會輸出原始可用性類別的分類器。這表示文字 DSS 說明包含可用性類別所沒有的適用資訊,而且此類分類器因此有助於在大型試驗資料庫中自動找出可用的 IPD。 -##### **KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs** -2502.12029v1 by Qi Zhao, Hongyu Yang, Qi Song, Xinwei Yao, Xiangyang Li +##### **Relational Norms for Human-AI Cooperation** +2502.12102v1 by Brian D. Earp, Sebastian Porsdam Mann, Mateo Aboy, Edmond Awad, Monika Betzler, Marietjie Botes, Rachel Calcott, Mina Caraccio, Nick Chater, Mark Coeckelbergh, Mihaela Constantinescu, Hossein Dabbagh, Kate Devlin, Xiaojun Ding, Vilius Dranseika, Jim A. C. 
Everett, Ruiping Fan, Faisal Feroz, Kathryn B. Francis, Cindy Friedman, Orsolya Friedrich, Iason Gabriel, Ivar Hannikainen, Julie Hellmann, Arasj Khodadade Jahrome, Niranjan S. Janardhanan, Paul Jurcys, Andreas Kappes, Maryam Ali Khan, Gordon Kraft-Todd, Maximilian Kroner Dale, Simon M. Laham, Benjamin Lange, Muriel Leuenberger, Jonathan Lewis, Peng Liu, David M. Lyreskog, Matthijs Maas, John McMillan, Emilian Mihailov, Timo Minssen, Joshua Teperowski Monrad, Kathryn Muyskens, Simon Myers, Sven Nyholm, Alexa M. Owen, Anna Puzio, Christopher Register, Madeline G. Reinecke, Adam Safron, Henry Shevlin, Hayate Shimizu, Peter V. Treit, Cristina Voinea, Karen Yan, Anda Zahiu, Renwen Zhang, Hazem Zohny, Walter Sinnott-Armstrong, Ilina Singh, Julian Savulescu, Margaret S. Clark -Large language models (LLMs) have demonstrated remarkable capabilities in -various complex tasks, yet they still suffer from hallucinations. Introducing -external knowledge, such as knowledge graph, can enhance the LLMs' ability to -provide factual answers. LLMs have the ability to interactively explore -knowledge graphs. However, most approaches have been affected by insufficient -internal knowledge excavation in LLMs, limited generation of trustworthy -knowledge reasoning paths, and a vague integration between internal and -external knowledge. Therefore, we propose KnowPath, a knowledge-enhanced large -model framework driven by the collaboration of internal and external knowledge. -It relies on the internal knowledge of the LLM to guide the exploration of -interpretable directed subgraphs in external knowledge graphs, better -integrating the two knowledge sources for more accurate reasoning. Extensive -experiments on multiple real-world datasets confirm the superiority of -KnowPath. +How we should design and interact with social artificial intelligence depends +on the socio-relational role the AI is meant to emulate or occupy. 
In human +society, relationships such as teacher-student, parent-child, neighbors, +siblings, or employer-employee are governed by specific norms that prescribe or +proscribe cooperative functions including hierarchy, care, transaction, and +mating. These norms shape our judgments of what is appropriate for each +partner. For example, workplace norms may allow a boss to give orders to an +employee, but not vice versa, reflecting hierarchical and transactional +expectations. As AI agents and chatbots powered by large language models are +increasingly designed to serve roles analogous to human positions - such as +assistant, mental health provider, tutor, or romantic partner - it is +imperative to examine whether and how human relational norms should extend to +human-AI interactions. Our analysis explores how differences between AI systems +and humans, such as the absence of conscious experience and immunity to +fatigue, may affect an AI's capacity to fulfill relationship-specific functions +and adhere to corresponding norms. This analysis, which is a collaborative +effort by philosophers, psychologists, relationship scientists, ethicists, +legal experts, and AI researchers, carries important implications for AI +systems design, user behavior, and regulation. While we accept that AI systems +can offer significant benefits such as increased availability and consistency +in certain socio-relational roles, they also risk fostering unhealthy +dependencies or unrealistic expectations that could spill over into human-human +relationships. We propose that understanding and thoughtfully shaping (or +implementing) suitable human-AI relational norms will be crucial for ensuring +that human-AI interactions are ethical, trustworthy, and favorable to human +well-being. 
-摘要:大型語言模型 (LLM) 已在各種複雜任務中展現出卓越的能力,但仍會出現幻覺。引入外部知識(例如知識圖譜)可以增強 LLM 提供事實答案的能力。LLM 有能力互動式地探索知識圖譜。然而,大多數方法都受到 LLM 中內部知識挖掘不足、可信賴知識推理路徑生成受限,以及內部和外部知識之間的整合模糊的影響。因此,我們提出 KnowPath,這是一個由內部和外部知識的協作驅動的知識增強型大型模型框架。它依賴於 LLM 的內部知識來指導對外部知識圖譜中可解釋的有向子圖的探索,更好地整合兩個知識來源以進行更準確的推理。對多個真實世界資料集進行的大量實驗證實了 KnowPath 的優越性。 +摘要:我們應如何設計和與社交人工智慧互動,取決於人工智慧預期要模仿或扮演的社會關係角色。在人類社會中,師生、父母子女、鄰居、兄弟姐妹或雇主員工等關係受特定規範所支配,這些規範規定或禁止包括等級、照顧、交易和交配在內的合作功能。這些規範形塑我們對每個夥伴適當行為的判斷。例如,職場規範可能允許老闆對員工發號施令,但反之則不行,這反映了等級和交易的期望。隨著由大型語言模型驅動的人工智慧代理程式和聊天機器人日益被設計為服務類似於人類職位的角色,例如助理、心理健康提供者、導師或浪漫伴侶,審查人類關係規範是否以及如何延伸至人類與人工智慧的互動至關重要。我們的分析探討了人工智慧系統和人類之間的差異,例如缺乏意識體驗和對疲勞的免疫力,如何影響人工智慧履行特定關係功能和遵守相應規範的能力。這項分析是由哲學家、心理學家、關係科學家、倫理學家、法律專家和人工智慧研究人員共同合作的成果,對人工智慧系統設計、使用者行為和法規具有重要的意義。雖然我們接受人工智慧系統可以在某些社會關係角色中提供顯著的好處,例如增加可用性和一致性,但它們也可能助長不健康的依賴關係或不切實際的期望,這些期望可能會蔓延到人際關係中。我們提出,理解和深思熟慮地塑造(或實施)適當的人類與人工智慧關係規範,對於確保人類與人工智慧的互動具有倫理性、可信賴性和有利於人類福祉至關重要。 -##### **Atom of Thoughts for Markov LLM Test-Time Scaling** -2502.12018v1 by Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, Yuyu Luo +##### **Deep Spatio-Temporal Neural Network for Air Quality Reanalysis** +2502.11941v1 by Ammar Kheder, Benjamin Foreback, Lili Wang, Zhi-Song Liu, Michael Boy -Large Language Models (LLMs) achieve superior performance through -training-time scaling, and test-time scaling further enhances their -capabilities by conducting effective reasoning during inference. However, as -the scale of reasoning increases, existing test-time scaling methods suffer -from accumulated historical information, which not only wastes computational -resources but also interferes with effective reasoning. To address this issue, -we observe that complex reasoning progress is often achieved by solving a -sequence of independent subquestions, each being self-contained and verifiable. -These subquestions are essentially atomic questions, relying primarily on their -current state rather than accumulated history, similar to the memoryless -transitions in a Markov process. 
Based on this observation, we propose Atom of -Thoughts (AoT), where each state transition in the reasoning process consists -of decomposing the current question into a dependency-based directed acyclic -graph and contracting its subquestions, forming a new atomic question state. -This iterative decomposition-contraction process continues until reaching -directly solvable atomic questions, naturally realizing Markov transitions -between question states. Furthermore, these atomic questions can be seamlessly -integrated into existing test-time scaling methods, enabling AoT to serve as a -plug-in enhancement for improving reasoning capabilities. Experiments across -six benchmarks demonstrate the effectiveness of AoT both as a standalone -framework and a plug-in enhancement. Notably, on HotpotQA, when applied to -gpt-4o-mini, AoT achieves an 80.6% F1 score, surpassing o3-mini by 3.4% and -DeepSeek-R1 by 10.6%. The code will be available at -https://github.com/qixucen/atom. +Air quality prediction is key to mitigating health impacts and guiding +decisions, yet existing models tend to focus on temporal trends while +overlooking spatial generalization. We propose AQ-Net, a spatiotemporal +reanalysis model for both observed and unobserved stations in the near future. +AQ-Net utilizes the LSTM and multi-head attention for the temporal regression. +We also propose a cyclic encoding technique to ensure continuous time +representation. To learn fine-grained spatial air quality estimation, we +incorporate AQ-Net with the neural kNN to explore feature-based interpolation, +such that we can fill the spatial gaps given coarse observation stations. To +demonstrate the efficiency of our model for spatiotemporal reanalysis, we use +data from 2013-2017 collected in northern China for PM2.5 analysis. 
Extensive +experiments show that AQ-Net excels in air quality reanalysis, highlighting the +potential of hybrid spatio-temporal models to better capture environmental +dynamics, especially in urban areas where both spatial and temporal variability +are critical. -摘要:大型語言模型 (LLM) 透過訓練時間擴充來達成卓越的效能,而測試時間擴充透過在推論期間進行有效的推理,進一步提升其能力。然而,隨著推理規模的擴大,現有的測試時間擴充方法會受到累積的歷史資訊影響,這不僅會浪費運算資源,還會干擾有效的推理。為了解決這個問題,我們觀察到複雜的推理進程通常是透過解決一系列獨立的子問題來達成,每個子問題都是獨立且可驗證的。這些子問題本質上是原子問題,主要依賴於它們的當前狀態,而不是累積的歷史,類似於馬可夫過程中的無記憶轉換。基於這個觀察,我們提出了思想原子 (AoT),其中推理過程中每個狀態轉換都包含將當前問題分解為基於依賴關係的有向無環圖,並收縮其子問題,形成新的原子問題狀態。這個反覆的分解收縮過程會持續進行,直到達到可直接解決的原子問題,自然地實現問題狀態之間的馬可夫轉換。此外,這些原子問題可以無縫整合到現有的測試時間擴充方法中,讓 AoT 可以作為外掛程式強化功能,以改善推理能力。橫跨六個基準的實驗證明了 AoT 作為獨立架構和外掛程式強化的有效性。值得注意的是,在 HotpotQA 上,當應用於 gpt-4o-mini 時,AoT 達到了 80.6% 的 F1 分數,比 o3-mini 高出 3.4%,比 DeepSeek-R1 高出 10.6%。程式碼將在 https://github.com/qixucen/atom 上提供。 +摘要:空气品质预测是减轻健康影响和指导决策的关键,但现有的模型倾向于关注时间趋势,而忽略空间概化。我们提出了 AQ-Net,这是一种时空再分析模型,适用于近期内已观测和未观测到的站点。AQ-Net 利用 LSTM 和多头注意力进行时间回归。我们还提出了一种循环编码技术来确保时间表示的连续性。为了学习细粒度的空间空气质量估计,我们将 AQ-Net 与神经 kNN 结合起来,以探索基于特征的插值,以便我们能够填充给定粗略观测站的空间空白。为了展示我们的模型在时空再分析中的效率,我们使用了 2013-2017 年在中国北部收集的 PM2.5 分析数据。大量的实验表明,AQ-Net 在空气质量再分析中表现出色,突出了混合时空模型在更好地捕捉环境动态方面的潜力,尤其是在空间和时间变异性都很关键的城市地区。 -##### **Generating Text from Uniform Meaning Representation** -2502.11973v1 by Emma Markle, Reihaneh Iranmanesh, Shira Wein +##### **Proactive Depot Discovery: A Generative Framework for Flexible Location-Routing** +2502.11715v1 by Site Qu, Guoqiang Hu -Uniform Meaning Representation (UMR) is a recently developed graph-based -semantic representation, which expands on Abstract Meaning Representation (AMR) -in a number of ways, in particular through the inclusion of document-level -information and multilingual flexibility. In order to effectively adopt and -leverage UMR for downstream tasks, efforts must be placed toward developing a -UMR technological ecosystem. 
Though still limited amounts of UMR annotations -have been produced to date, in this work, we investigate the first approaches -to producing text from multilingual UMR graphs: (1) a pipeline conversion of -UMR to AMR, then using AMR-to-text generation models, (2) fine-tuning large -language models with UMR data, and (3) fine-tuning existing AMR-to-text -generation models with UMR data. Our best performing model achieves a -multilingual BERTscore of 0.825 for English and 0.882 for Chinese when compared -to the reference, which is a promising indication of the effectiveness of -fine-tuning approaches for UMR-to-text generation with even limited amounts of -UMR data. +The Location-Routing Problem (LRP), which combines the challenges of facility +(depot) locating and vehicle route planning, is critically constrained by the +reliance on predefined depot candidates, limiting the solution space and +potentially leading to suboptimal outcomes. Previous research on LRP without +predefined depots is scant and predominantly relies on heuristic algorithms +that iteratively attempt depot placements across a planar area. Such approaches +lack the ability to proactively generate depot locations that meet specific +geographic requirements, revealing a notable gap in current research landscape. +To bridge this gap, we propose a data-driven generative DRL framework, designed +to proactively generate depots for LRP without predefined depot candidates, +solely based on customer requests data which include geographic and demand +information. It can operate in two distinct modes: direct generation of exact +depot locations, and the creation of a multivariate Gaussian distribution for +flexible depots sampling. By extracting depots' geographic pattern from +customer requests data, our approach can dynamically respond to logistical +needs, identifying high-quality depot locations that further reduce total +routing costs compared to traditional methods. 
Extensive experiments +demonstrate that, for a same group of customer requests, compared with those +depots identified through random attempts, our framework can proactively +generate depots that lead to superior solution routes with lower routing cost. +The implications of our framework potentially extend into real-world +applications, particularly in emergency medical rescue and disaster relief +logistics, where rapid establishment and adjustment of depot locations are +paramount, showcasing its potential in addressing LRP for dynamic and +unpredictable environments. -摘要:統一語意表示 (UMR) 是一種最近開發的基於圖形的語意表示,它在許多方面擴展了抽象語意表示 (AMR),特別是透過納入文件層級資訊和多語言靈活性。為了有效採用和利用下游任務的 UMR,必須投入精力開發 UMR 技術生態系統。雖然到目前為止產生的 UMR 標註數量仍然有限,但在這項工作中,我們探討了從多語言 UMR 圖形產生文字的第一種方法:(1) 將 UMR 轉換為 AMR 的管道,然後使用 AMR 轉文字生成模型,(2) 使用 UMR 資料微調大型語言模型,以及 (3) 使用 UMR 資料微調現有的 AMR 轉文字生成模型。與參考相比,我們效能最好的模型在英文中達到 0.825 的多語言 BERT 分數,在中文中達到 0.882,這表示使用 UMR 資料進行 UMR 轉文字生成的微調方法具有良好的效果,即使 UMR 資料數量有限。 +摘要:地點路線問題(LRP)結合了設施(倉庫)定位和車輛路線規劃的挑戰,嚴重受到預先定義的倉庫候選限制,限制了解決方案空間,並可能導致次優結果。先前關於沒有預先定義倉庫的 LRP 研究很少,而且主要依賴於啟發式演算法,在平面區域中反覆嘗試倉庫配置。這種方法無法主動產生符合特定地理需求的倉庫位置,顯示了當前研究領域的顯著差距。為了彌補這個差距,我們提出一個資料驅動的生成式 DRL 架構,旨在主動為 LRP 產生倉庫,而無需預先定義的倉庫候選,僅根據包含地理和需求資訊的客戶要求資料。它可以在兩種不同的模式下運作:直接產生確切的倉庫位置,以及建立多元高斯分布以進行彈性倉庫抽樣。透過從客戶要求資料中提取倉庫的地理模式,我們的方法可以動態回應後勤需求,找出高品質的倉庫位置,進一步降低與傳統方法相比的總路線成本。廣泛的實驗證明,對於同一組客戶要求,與透過隨機嘗試識別的那些倉庫相比,我們的架構可以主動產生倉庫,並產生路線成本較低的優質解決方案路線。我們的架構的影響潛在地擴展到實際應用,特別是在緊急醫療救援和災害救災後勤方面,其中倉庫位置的快速建立和調整至關重要,展示了其在解決動態和不可預測環境的 LRP 中的潛力。 -##### **GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs** -2502.11925v1 by Yi Fang, Bowen Jin, Jiacheng Shen, Sirui Ding, Qiaoyu Tan, Jiawei Han +##### **LLM Agents Making Agent Tools** +2502.11705v1 by Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelović, Jakob Nikolas Kather -The rapid development of Multimodal Large Language Models (MLLMs) has enabled -the integration of multiple modalities, including texts and images, within the -large language model (LLM) framework. 
However, texts and images are usually -interconnected, forming a multimodal attributed graph (MMAG). It is -underexplored how MLLMs can incorporate the relational information -(\textit{i.e.}, graph structure) and semantic information (\textit{i.e.,} texts -and images) on such graphs for multimodal comprehension and generation. In this -paper, we propose GraphGPT-o, which supports omni-multimodal understanding and -creation on MMAGs. We first comprehensively study linearization variants to -transform semantic and structural information as input for MLLMs. Then, we -propose a hierarchical aligner that enables deep graph encoding, bridging the -gap between MMAGs and MLLMs. Finally, we explore the inference choices, -adapting MLLM to interleaved text and image generation in graph scenarios. -Extensive experiments on three datasets from different domains demonstrate the -effectiveness of our proposed method. Datasets and codes will be open-sourced -upon acceptance. +Tool use has turned large language models (LLMs) into powerful agents that +can perform complex multi-step tasks by dynamically utilising external software +components. However, these tools must be implemented in advance by human +developers, hindering the applicability of LLM agents in domains which demand +large numbers of highly specialised tools, like in life sciences and medicine. +Motivated by the growing trend of scientific studies accompanied by public code +repositories, we propose ToolMaker, a novel agentic framework that autonomously +transforms papers with code into LLM-compatible tools. Given a short task +description and a repository URL, ToolMaker autonomously installs required +dependencies and generates code to perform the task, using a closed-loop +self-correction mechanism to iteratively diagnose and rectify errors. 
To +evaluate our approach, we introduce a benchmark comprising 15 diverse and +complex computational tasks spanning both medical and non-medical domains with +over 100 unit tests to objectively assess tool correctness and robustness. +ToolMaker correctly implements 80% of the tasks, substantially outperforming +current state-of-the-art software engineering agents. ToolMaker therefore is a +step towards fully autonomous agent-based scientific workflows. -摘要:多模态大语言模型 (MLLM) 的快速发展,促进了文本和图像等多种模态在大型语言模型 (LLM) 框架内的整合。然而,文本和图像通常是相互关联的,形成多模态属性图 (MMAG)。对于 MLLM 如何整合此类图上的关系信息(即图结构)和语义信息(即文本和图像)以进行多模态理解和生成,目前仍未得到充分探索。在本文中,我们提出了 GraphGPT-o,它支持在 MMAG 上进行全方位多模态理解和创建。我们首先全面研究了线性化变体,以将语义和结构信息转换为 MLLM 的输入。然后,我们提出了一个分层对齐器,它支持深度图编码,弥合了 MMAG 和 MLLM 之间的差距。最后,我们探索了推理选择,使 MLLM 适应图场景中交错的文本和图像生成。来自不同领域的三组数据集上的大量实验表明了我们提出的方法的有效性。数据集和代码将在被接受后开源。 +摘要:工具使用已將大型語言模型 (LLM) 轉變為強大的代理,可透過動態使用外部軟體元件來執行複雜的多步驟任務。然而,這些工具必須事先由人類開發人員實作,這會阻礙 LLM 代理在需要大量高度專業化工具的領域(例如生命科學和醫學)中的應用性。受到伴隨公開程式碼儲存庫的科學研究趨勢所啟發,我們提出 ToolMaker,一個創新的代理架構,可自主地將帶有程式碼的論文轉換為相容於 LLM 的工具。給定簡短的任務描述和儲存庫網址,ToolMaker 會自主安裝所需的依賴項,並產生程式碼來執行任務,使用閉環自我修正機制來反覆診斷和糾正錯誤。為了評估我們的做法,我們引進一個包含 15 個不同且複雜的運算任務的基準,涵蓋醫療和非醫療領域,並包含超過 100 個單元測試,以客觀評估工具的正確性和穩健性。ToolMaker 正確實作了 80% 的任務,大幅優於目前的最新軟體工程代理。因此,ToolMaker 是邁向完全自主的基於代理的科學工作流程的一步。 -##### **Exploring LLM-based Student Simulation for Metacognitive Cultivation** -2502.11678v1 by Haoxuan Li, Jifan Yu, Xin Cong, Yang Dang, Yisi Zhan, Huiqin Liu, Zhiyuan Liu +##### **MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression** +2502.11651v1 by Linjie Mu, Zhongzhen Huang, Shengqian Qin, Yakun Zhu, Shaoting Zhang, Xiaofan Zhang -Metacognitive education plays a crucial role in cultivating students' -self-regulation and reflective thinking, providing essential support for those -with learning difficulties through academic advising. 
Simulating students with -insufficient learning capabilities using large language models offers a -promising approach to refining pedagogical methods without ethical concerns. -However, existing simulations often fail to authentically represent students' -learning struggles and face challenges in evaluation due to the lack of -reliable metrics and ethical constraints in data collection. To address these -issues, we propose a pipeline for automatically generating and filtering -high-quality simulated student agents. Our approach leverages a two-round -automated scoring system validated by human experts and employs a score -propagation module to obtain more consistent scores across the student graph. -Experimental results demonstrate that our pipeline efficiently identifies -high-quality student agents, and we discuss the traits that influence the -simulation's effectiveness. By simulating students with varying degrees of -learning difficulties, our work paves the way for broader applications in -personalized learning and educational assessment. +Large vision-language models (LVLMs) have shown great promise in medical +applications, particularly in visual question answering (MedVQA) and diagnosis +from medical images. However, existing datasets and models often fail to +consider critical aspects of medical diagnostics, such as the integration of +historical records and the analysis of disease progression over time. In this +paper, we introduce MMXU (Multimodal and MultiX-ray Understanding), a novel +dataset for MedVQA that focuses on identifying changes in specific regions +between two patient visits. Unlike previous datasets that primarily address +single-image questions, MMXU enables multi-image questions, incorporating both +current and historical patient data. We demonstrate the limitations of current +LVLMs in identifying disease progression on MMXU-\textit{test}, even those that +perform well on traditional benchmarks. 
To address this, we propose a +MedRecord-Augmented Generation (MAG) approach, incorporating both global and +regional historical records. Our experiments show that integrating historical +records significantly enhances diagnostic accuracy by at least 20\%, bridging +the gap between current LVLMs and human expert performance. Additionally, we +fine-tune models with MAG on MMXU-\textit{dev}, which demonstrates notable +improvements. We hope this work could illuminate the avenue of advancing the +use of LVLMs in medical diagnostics by emphasizing the importance of historical +context in interpreting medical images. Our dataset is released at +\href{https://github.com/linjiemu/MMXU}{https://github.com/linjiemu/MMXU}. -摘要:元認知教育在培養學生的自我調節和反思性思考中發揮著至關重要的作用,通過學術諮詢為有學習困難的人提供必要的支持。使用大型語言模型模擬學習能力不足的學生提供了一種有前途的方法,可以在沒有道德問題的情況下改進教學方法。然而,現有的模擬通常無法真實地反映學生的學習困難,並且由於缺乏可靠的指標和數據收集中的道德約束,在評估中面臨挑戰。為了解決這些問題,我們提出了一個自動生成和過濾高質量模擬學生代理的管道。我們的做法利用了由人類專家驗證的兩輪自動評分系統,並採用分數傳播模組來獲得跨學生圖表更一致的分數。實驗結果表明,我們的管道有效地識別了高質量的學生代理,並且我們討論了影響模擬效果的特質。通過模擬具有不同程度學習困難的學生,我們的研究為個性化學習和教育評估中的更廣泛應用鋪平了道路。 +摘要:大型視覺語言模型 (LVLMs) 已在醫療應用中展現出極大的潛力,特別是在視覺問答 (MedVQA) 和醫學影像診斷方面。然而,現有的資料集和模型常常無法考量醫療診斷的關鍵層面,例如病歷整合以及隨著時間推移對疾病進程的分析。在本文中,我們介紹 MMXU(多模態多 X 光理解),一個專注於識別兩次患者就診之間特定區域變化的 MedVQA 新資料集。與主要處理單一影像問題的先前資料集不同,MMXU 支援多影像問題,同時納入當前和病史患者資料。我們展示了現有 LVLMs 在 MMXU-\textit{test} 中識別疾病進程的限制,即使是在傳統基準測試中表現良好的 LVLMs 也是如此。為了解決這個問題,我們提出了一個病歷增強生成 (MAG) 方法,結合了全域和區域病史。我們的實驗顯示,整合病歷可顯著提升至少 20% 的診斷準確度,縮小了現有 LVLMs 和人類專家表現之間的差距。此外,我們在 MMXU-\textit{dev} 上微調帶有 MAG 的模型,這展示了顯著的進步。我們希望這項工作能透過強調病史脈絡在解讀醫學影像中的重要性,為推進 LVLMs 在醫療診斷中的應用開闢道路。我們的資料集已於\href{https://github.com/linjiemu/MMXU}{https://github.com/linjiemu/MMXU} 發布。 + +##### **A Survey of Personalized Large Language Models: Progress and Future Directions** +2502.11528v1 by Jiahong Liu, Zexuan Qiu, Zhongyang Li, Quanyu Dai, Jieming Zhu, Minda Hu, Menglin Yang, Irwin King + +Large Language Models (LLMs) excel in handling general knowledge tasks, yet +they struggle with user-specific personalization, such as 
understanding +individual emotions, writing styles, and preferences. Personalized Large +Language Models (PLLMs) tackle these challenges by leveraging individual user +data, such as user profiles, historical dialogues, content, and interactions, +to deliver responses that are contextually relevant and tailored to each user's +specific needs. This is a highly valuable research topic, as PLLMs can +significantly enhance user satisfaction and have broad applications in +conversational agents, recommendation systems, emotion recognition, medical +assistants, and more. This survey reviews recent advancements in PLLMs from +three technical perspectives: prompting for personalized context (input level), +finetuning for personalized adapters (model level), and alignment for +personalized preferences (objective level). To provide deeper insights, we also +discuss current limitations and outline several promising directions for future +research. Updated information about this survey can be found at the +https://github.com/JiahongLiu21/Awesome-Personalized-Large-Language-Models. 
+ +摘要:大型語言模型 (LLM) 在處理一般知識任務方面表現出色,但 +它們在使用者特定的個人化方面有困難,例如理解 +個別的情緒、寫作風格和偏好。個人化大型 +語言模型 (PLLM) 透過利用個別使用者的 +資料來解決這些挑戰,例如使用者個人資料、歷史對話、內容和互動, +提供在脈絡上相關且針對每個使用者的特定需求量身打造的回應。這是一個非常有價值的研究主題,因為 PLLM 可以 +顯著提升使用者滿意度,並在對話代理、推薦系統、情緒辨識、醫療 +助理等方面有廣泛的應用。這項調查從三個技術觀點回顧 PLLM 的最新進展:提示個人化脈絡(輸入層級)、微調個人化適配器(模型層級),以及對齊個人化偏好(目標層級)。為了提供更深入的見解,我們也 +討論目前的限制,並概述未來研究的幾個有希望的方向。這項調查的最新資訊可以在 +https://github.com/JiahongLiu21/Awesome-Personalized-Large-Language-Models 找到。 -##### **Ontology-Guided Reverse Thinking Makes Large Language Models Stronger on Knowledge Graph Question Answering** -2502.11491v1 by Runxuan Liu, Bei Luo, Jiaqi Li, Baoxin Wang, Ming Liu, Dayong Wu, Shijin Wang, Bing Qin +##### **Variable-frame CNNLSTM for Breast Nodule Classification using Ultrasound Videos** +2502.11481v1 by Xiangxiang Cui, Zhongyu Li, Xiayue Fan, Peng Huang, Ying Wang, Meng Yang, Shi Chang, Jihua Zhu -Large language models (LLMs) have shown remarkable capabilities in natural -language processing. However, in knowledge graph question answering tasks -(KGQA), there remains the issue of answering questions that require multi-hop -reasoning. Existing methods rely on entity vector matching, but the purpose of -the question is abstract and difficult to match with specific entities. As a -result, it is difficult to establish reasoning paths to the purpose, which -leads to information loss and redundancy. To address this issue, inspired by -human reverse thinking, we propose Ontology-Guided Reverse Thinking (ORT), a -novel framework that constructs reasoning paths from purposes back to -conditions. ORT operates in three key phases: (1) using LLM to extract purpose -labels and condition labels, (2) constructing label reasoning paths based on -the KG ontology, and (3) using the label reasoning paths to guide knowledge -retrieval. Experiments on the WebQSP and CWQ datasets show that ORT achieves -state-of-the-art performance and significantly enhances the capability of LLMs -for KGQA. 
+The intersection of medical imaging and artificial intelligence has become an +important research direction in intelligent medical treatment, particularly in +the analysis of medical images using deep learning for clinical diagnosis. +Despite the advances, existing keyframe classification methods lack extraction +of time series features, while ultrasonic video classification based on +three-dimensional convolution requires uniform frame numbers across patients, +resulting in poor feature extraction efficiency and model classification +performance. This study proposes a novel video classification method based on +CNN and LSTM, introducing NLP's long and short sentence processing scheme into +video classification for the first time. The method reduces CNN-extracted image +features to 1x512 dimension, followed by sorting and compressing feature +vectors for LSTM training. Specifically, feature vectors are sorted by patient +video frame numbers and populated with padding value 0 to form variable +batches, with invalid padding values compressed before LSTM training to +conserve computing resources. Experimental results demonstrate that our +variable-frame CNNLSTM method outperforms other approaches across all metrics, +showing improvements of 3-6% in F1 score and 1.5% in specificity compared to +keyframe methods. The variable-frame CNNLSTM also achieves better accuracy and +precision than equal-frame CNNLSTM. These findings validate the effectiveness +of our approach in classifying variable-frame ultrasound videos and suggest +potential applications in other medical imaging modalities. 
-摘要:大型語言模型 (LLM) 在自然語言處理中展現出卓越的能力。然而,在知識圖譜問答任務 (KGQA) 中,仍然存在需要多跳推理才能回答問題的問題。現有方法依賴於實體向量匹配,但問題的目的是抽象的,難以與特定實體匹配。因此,很難建立推理路徑來達成目的,這會導致資訊遺失和冗餘。為了解決這個問題,在人類逆向思維的啟發下,我們提出了基於本体的逆向思維 (ORT),這是一個創新的架構,可以從目的建構推理路徑,再回推到條件。ORT 運作在三個關鍵階段:(1) 使用 LLM 萃取目的標籤和條件標籤,(2) 基於 KG 本体建構標籤推理路徑,以及 (3) 使用標籤推理路徑來引導知識擷取。在 WebQSP 和 CWQ 資料集上的實驗顯示,ORT 達到了最先進的效能,並顯著增強了 LLM 對 KGQA 的能力。 +摘要:醫學影像與人工智慧的交叉領域已成為智慧醫療的重要研究方向,特別是在臨床診斷中使用深度學習分析醫學影像。儘管有進展,現有的關鍵影格分類方法缺乏時間序列特徵的提取,而基於三維卷積的超音波影片分類需要患者之間的均勻影格數,導致特徵提取效率差和模型分類效能不佳。本研究提出了一種基於 CNN 和 LSTM 的新影片分類方法,首次將 NLP 的長短句處理機制引入影片分類中。該方法將 CNN 提取的影像特徵縮減為 1x512 維度,然後對特徵向量進行排序和壓縮以進行 LSTM 訓練。具體來說,特徵向量按患者影片影格數排序,並填充 0 補齊值以形成可變批次,在 LSTM 訓練前壓縮無效的補齊值以節省運算資源。實驗結果表明,我們的可變影格 CNNLSTM 方法在所有指標上都優於其他方法,與關鍵影格方法相比,F1 分數提高了 3-6%,特異性提高了 1.5%。可變影格 CNNLSTM 也比等影格 CNNLSTM 達到了更好的準確度和精確度。這些發現驗證了我們的方法在分類可變影格超音波影片中的有效性,並表明在其他醫學影像模式中具有潛在的應用。 -##### **GLTW: Joint Improved Graph Transformer and LLM via Three-Word Language for Knowledge Graph Completion** -2502.11471v1 by Kangyang Luo, Yuzhuo Bai, Cheng Gao, Shuzheng Si, Yingli Shen, Zhu Liu, Zhitong Wang, Cunliang Kong, Wenhao Li, Yufei Huang, Ye Tian, Xuantang Xiong, Lei Han, Maosong Sun +##### **Leveraging Labelled Data Knowledge: A Cooperative Rectification Learning Network for Semi-supervised 3D Medical Image Segmentation** +2502.11456v1 by Yanyan Wang, Kechen Song, Yuyuan Liu, Shuai Ma, Yunhui Yan, Gustavo Carneiro -Knowledge Graph Completion (KGC), which aims to infer missing or incomplete -facts, is a crucial task for KGs. However, integrating the vital structural -information of KGs into Large Language Models (LLMs) and outputting predictions -deterministically remains challenging. To address this, we propose a new method -called GLTW, which encodes the structural information of KGs and merges it with -LLMs to enhance KGC performance. 
Specifically, we introduce an improved Graph -Transformer (iGT) that effectively encodes subgraphs with both local and global -structural information and inherits the characteristics of language model, -bypassing training from scratch. Also, we develop a subgraph-based -multi-classification training objective, using all entities within KG as -classification objects, to boost learning efficiency.Importantly, we combine -iGT with an LLM that takes KG language prompts as input.Our extensive -experiments on various KG datasets show that GLTW achieves significant -performance gains compared to SOTA baselines. +Semi-supervised 3D medical image segmentation aims to achieve accurate +segmentation using few labelled data and numerous unlabelled data. The main +challenge in the design of semi-supervised learning methods consists in the +effective use of the unlabelled data for training. A promising solution +consists of ensuring consistent predictions across different views of the data, +where the efficacy of this strategy depends on the accuracy of the +pseudo-labels generated by the model for this consistency learning strategy. In +this paper, we introduce a new methodology to produce high-quality +pseudo-labels for a consistency learning strategy to address semi-supervised 3D +medical image segmentation. The methodology has three important contributions. +The first contribution is the Cooperative Rectification Learning Network (CRLN) +that learns multiple prototypes per class to be used as external knowledge +priors to adaptively rectify pseudo-labels at the voxel level. The second +contribution consists of the Dynamic Interaction Module (DIM) to facilitate +pairwise and cross-class interactions between prototypes and multi-resolution +image features, enabling the production of accurate voxel-level clues for +pseudo-label rectification. 
The third contribution is the Cooperative Positive +Supervision (CPS), which optimises uncertain representations to align with +unassertive representations of their class distributions, improving the model's +accuracy in classifying uncertain regions. Extensive experiments on three +public 3D medical segmentation datasets demonstrate the effectiveness and +superiority of our semi-supervised learning method. -摘要:知識圖譜補全 (KGC) 旨在推論遺失或不完整的 -事實,是 KGs 的一項關鍵任務。然而,將 KGs 的重要結構 -資訊整合至大型語言模型 (LLM),並確定性地輸出預測結果,仍然是一項挑戰。為了解決這個問題,我們提出了一種新的方法,稱為 GLTW,它編碼了 KGs 的結構資訊,並將其與 LLM 合併,以增強 KGC 的效能。具體來說,我們引進了一個改良的圖形轉換器 (iGT),它能有效地編碼具有局部和全域結構資訊的子圖,並繼承語言模型的特徵,繞過從頭開始的訓練。此外,我們開發了一個基於子圖的多分類訓練目標,使用 KG 中的所有實體作為 -分類物件,以提升學習效率。重要的是,我們將 iGT 與一個將 KG 語言提示作為輸入的 LLM 結合起來。我們在各種 KG 資料集上進行的廣泛實驗顯示,與 SOTA 基準線相比,GLTW 獲得了顯著的效能提升。 +摘要:半监督 3D 医学影像分割旨在使用少量标记数据和大量未标记数据实现精确分割。半监督学习方法设计中的主要挑战在于有效使用未标记数据进行训练。一个有前景的解决方案是确保数据不同视图之间预测的一致性,其中此策略的有效性取决于模型为这种一致性学习策略生成的伪标签的准确性。在本文中,我们引入了一种新的方法来为一致性学习策略生成高质量的伪标签,以解决半监督 3D 医学图像分割问题。该方法有三个重要的贡献。第一个贡献是协作修正学习网络 (CRLN),它为每个类别学习多个原型,用作外部知识先验,以在体素级别自适应地修正伪标签。第二个贡献包括动态交互模块 (DIM),以促进原型和多分辨率图像特征之间的成对和跨类交互,从而能够生成用于伪标签修正的准确体素级线索。第三个贡献是协作正监督 (CPS),它优化不确定的表示以与其类分布的不确定表示保持一致,从而提高模型对不确定区域进行分类的准确性。在三个公共 3D 医学分割数据集上进行的大量实验表明了我们半监督学习方法的有效性和优越性。 -##### **Large Language-Geometry Model: When LLM meets Equivariance** -2502.11149v2 by Zongzhao Li, Jiacheng Cen, Bing Su, Wenbing Huang, Tingyang Xu, Yu Rong, Deli Zhao +##### **A Survey of LLM-based Agents in Medicine: How far are we from Baymax?** +2502.11211v1 by Wenxuan Wang, Zizhan Ma, Zheng Wang, Chenghan Wu, Wenting Chen, Xiang Li, Yixuan Yuan -Accurately predicting 3D structures and dynamics of physical systems is -crucial in scientific applications. Existing approaches that rely on geometric -Graph Neural Networks (GNNs) effectively enforce $\mathrm{E}(3)$-equivariance, -but they often fall in leveraging extensive broader information. 
While direct -application of Large Language Models (LLMs) can incorporate external knowledge, -they lack the capability for spatial reasoning with guaranteed equivariance. In -this paper, we propose EquiLLM, a novel framework for representing 3D physical -systems that seamlessly integrates E(3)-equivariance with LLM capabilities. -Specifically, EquiLLM comprises four key components: geometry-aware prompting, -an equivariant encoder, an LLM, and an equivariant adaptor. Essentially, the -LLM guided by the instructive prompt serves as a sophisticated invariant -feature processor, while 3D directional information is exclusively handled by -the equivariant encoder and adaptor modules. Experimental results demonstrate -that EquiLLM delivers significant improvements over previous methods across -molecular dynamics simulation, human motion simulation, and antibody design, -highlighting its promising generalizability. +Large Language Models (LLMs) are transforming healthcare through the +development of LLM-based agents that can understand, reason about, and assist +with medical tasks. This survey provides a comprehensive review of LLM-based +agents in medicine, examining their architectures, applications, and +challenges. We analyze the key components of medical agent systems, including +system profiles, clinical planning mechanisms, medical reasoning frameworks, +and external capacity enhancement. The survey covers major application +scenarios such as clinical decision support, medical documentation, training +simulations, and healthcare service optimization. We discuss evaluation +frameworks and metrics used to assess these agents' performance in healthcare +settings. While LLM-based agents show promise in enhancing healthcare delivery, +several challenges remain, including hallucination management, multimodal +integration, implementation barriers, and ethical considerations. 
The survey +concludes by highlighting future research directions, including advances in +medical reasoning inspired by recent developments in LLM architectures, +integration with physical systems, and improvements in training simulations. +This work provides researchers and practitioners with a structured overview of +the current state and future prospects of LLM-based agents in medicine. -摘要:準確預測物理系統的 3D 結構和動力學在科學應用中至關重要。現有依賴於幾何圖神經網路 (GNN) 的方法有效地強制執行了 $\mathrm{E}(3)$-等變性,但它們通常無法利用廣泛的更廣泛資訊。儘管大型語言模型 (LLM) 的直接應用可以納入外部知識,但它們缺乏保證等變性的空間推理能力。在本文中,我們提出了 EquiLLM,一個用於表示 3D 物理系統的新框架,它將 E(3)-等變性與 LLM 能力無縫整合。具體來說,EquiLLM 包含四個關鍵組成部分:感知幾何的提示、等變編碼器、LLM 和等變適配器。從本質上講,由指導性提示引導的 LLM 作為一個複雜的不變特徵處理器,而 3D 方向資訊則由等變編碼器和適配器模組獨家處理。實驗結果表明,EquiLLM 在分子動力學模擬、人類運動模擬和抗體設計方面比以前的方法有了顯著的改進,突顯了其有希望的泛化能力。 +摘要:大型語言模型 (LLM) 透過開發可理解、推理並協助醫療任務的 LLM 基礎代理人,轉變了醫療保健。本調查提供了 LLM 基礎代理人在醫學中的全面回顧,探討其架構、應用和挑戰。我們分析了醫療代理系統的主要組成部分,包括系統概況、臨床規劃機制、醫療推理架構和外部能力提升。本調查涵蓋了主要的應用場景,例如臨床決策支援、醫療文件、訓練模擬和醫療保健服務最佳化。我們討論了用於評估這些代理人在醫療保健環境中表現的評估架構和指標。雖然 LLM 基礎代理人顯示出在增強醫療保健提供方面的潛力,但仍有許多挑戰,包括幻覺管理、多模態整合、實施障礙和倫理考量。本調查最後強調了未來的研究方向,包括受 LLM 架構近期發展啟發的醫療推理進展、與物理系統的整合和訓練模擬的改進。這項工作為研究人員和從業人員提供了 LLM 基礎代理人在醫學中當前狀態和未來前景的結構化概觀。 -##### **Beyond Pairwise: Global Zero-shot Temporal Graph Generation** -2502.11114v1 by Alon Eirew, Kfir Bar, Ido Dagan +##### **RT-DEMT: A hybrid real-time acupoint detection model combining mamba and transformer** +2502.11179v1 by Shilong Yang, Qi Zang, Chulong Zhang, Lingfeng Huang, Yaoqin Xie -Temporal relation extraction (TRE) is a fundamental task in natural language -processing (NLP) that involves identifying the temporal relationships between -events in a document. Despite the advances in large language models (LLMs), -their application to TRE remains limited. Most existing approaches rely on -pairwise classification, in which event pairs are considered individually, -leading to computational inefficiency and a lack of global consistency in the -resulting temporal graph. 
In this work, we propose a novel zero-shot method for -TRE that generates a document's complete temporal graph at once, then applies -transitive constraints optimization to refine predictions and enforce temporal -consistency across relations. Additionally, we introduce OmniTemp, a new -dataset with complete annotations for all pairs of targeted events within a -document. Through experiments and analyses, we demonstrate that our method -significantly outperforms existing zero-shot approaches while achieving -competitive performance with supervised models. +Traditional Chinese acupuncture methods often face controversy in clinical +practice due to their high subjectivity. Additionally, current +intelligent-assisted acupuncture systems have two major limitations: slow +acupoint localization speed and low accuracy. To address these limitations, a +new method leverages the excellent inference efficiency of the state-space +model Mamba, while retaining the advantages of the attention mechanism in the +traditional DETR architecture, to achieve efficient global information +integration and provide high-quality feature information for acupoint +localization tasks. Furthermore, by employing the concept of residual +likelihood estimation, it eliminates the need for complex upsampling processes, +thereby accelerating the acupoint localization task. Our method achieved +state-of-the-art (SOTA) accuracy on a private dataset of acupoints on the human +back, with an average Euclidean distance pixel error (EPE) of 7.792 and an +average time consumption of 10.05 milliseconds per localization task. Compared +to the second-best algorithm, our method improved both accuracy and speed by +approximately 14\%. This significant advancement not only enhances the efficacy +of acupuncture treatment but also demonstrates the commercial potential of +automated acupuncture robot systems. 
Access to our method is available at +https://github.com/Sohyu1/RT-DEMT -摘要:時間關係抽取 (TRE) 是自然語言處理 (NLP) 中的一項基本任務,涉及識別文件中事件之間的時間關係。儘管大型語言模型 (LLM) 取得進展,但它們在 TRE 中的應用仍然有限。現有的大多數方法依賴於成對分類,其中事件對被單獨考慮,導致計算效率低下且在生成的時序圖中缺乏全局一致性。在這項工作中,我們提出了一種新穎的 TRE 零次學習方法,它可以一次生成文件的完整時序圖,然後應用遞移約束最佳化來優化預測並強制關係之間的時間一致性。此外,我們引入了 OmniTemp,這是一個新的數據集,其中包含文件內所有目標事件對的完整註解。通過實驗和分析,我們證明了我們的方法明顯優於現有的零次學習方法,同時實現了與監督模型相當的性能。 +摘要:傳統的中醫針灸方法由於其高度主觀性,在臨床實務中經常面臨爭議。此外,現有的智慧輔助針灸系統有兩大限制:取穴速度慢以及準確度低。為了解決這些限制,一種新的方法利用了狀態空間模型 Mamba 優異的推理效率,同時保留了傳統 DETR 架構中注意力機制的優點,以實現高效的全局資訊整合,並為取穴任務提供高品質的特徵資訊。此外,透過採用殘差似然估計的概念,它消除了對複雜上採樣程序的需求,從而加速了取穴任務。我們的模型在人體背部穴位私人資料集上達到了最先進 (SOTA) 的準確度,平均歐幾里得距離像素誤差 (EPE) 為 7.792,平均每個取穴任務耗時 10.05 毫秒。與第二好的演算法相比,我們的模型在準確度和速度上都提高了大約 14%。這項重大進展不僅提高了針灸治療的療效,也證明了自動化針灸機器人系統的商業潛力。我們的模型可以在 https://github.com/Sohyu1/RT-DEMT 取得 ##### **Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications** 2502.11108v1 by Alexandru Lecu, Adrian Groza, Lezan Hawizy @@ -8621,156 +8606,180 @@ chatbot applications. 摘要:大型語言模型 (LLM) 已大幅推動自然語言生成的領域。然而,它們經常產生未經驗證的輸出,這會損害它們在關鍵應用中的可靠性。在本研究中,我們提出了一個創新的框架,透過檢索增強生成技術,將結構化的生物醫學知識與 LLM 結合。我們的系統透過識別和精煉與年齡相關性黃斑部病變 (AMD) 相關的醫學摘要中的因果關係和命名實體,開發一個徹底的知識圖譜。我們的框架使用基於向量的檢索流程和本地部署的語言模型,產生在脈絡上相關且可驗證的回應,並直接參考臨床證據。實驗結果顯示,此方法顯著減少了幻覺、增強了事實準確性,並改善了生成回應的清晰度,為先進的生物醫學聊天機器人應用程式提供了穩健的解決方案。 -##### **Beyond Similarity: A Gradient-based Graph Method for Instruction Tuning Data Selection** -2502.11062v1 by Yang Zhao, Li Du, Xiao Ding, Yangou Ouyang, Hepeng Wang, Kai Xiong, Jinglong Gao, Zhouhao Sun, Dongliang Xu, Yang Qing, Dongchen Li, Bing Qin, Ting Liu +##### **Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration** +2502.12204v1 by Xianbing Zhao, Yiqing Lyu, Di Wang, Buzhou Tang -Large language models (LLMs) have shown great potential across various -industries due to their remarkable ability to generalize through instruction -tuning. 
However, the limited availability of domain-specific data significantly -hampers their performance on specialized tasks. While existing methods -primarily focus on selecting training data from general datasets that are -similar to the target domain, they often fail to consider the joint -distribution of instructions, resulting in inefficient learning and suboptimal -knowledge transfer. To address these challenges, we introduce G2IS -(Gradient-based Graph Instruction Selection), a novel method that constructs a -mixed gradient-based instruction graph to capture the joint distribution and -interdependencies between instructions. By accounting for the relationships -between instructions, G2IS improves domain adaptation efficiency. Additionally, -we propose a gradient walk algorithm to refine the data selection process, -enhancing both training effectiveness and efficiency. Our experiments -demonstrate that G2IS outperforms traditional methods across various domain -adaptation tasks, yielding significant performance gains, particularly in -complex, data-scarce scenarios. These results underscore the potential of G2IS -in advancing the development of large, domain-specific models. +Automatic depression detection provides cues for early clinical intervention +by clinicians. Clinical interviews for depression detection involve dialogues +centered around multiple themes. Existing studies primarily design end-to-end +neural network models to capture the hierarchical structure of clinical +interview dialogues. However, these methods exhibit defects in modeling the +thematic content of clinical interviews: 1) they fail to capture intra-theme +and inter-theme correlation explicitly, and 2) they do not allow clinicians to +intervene and focus on themes of interest. To address these issues, this paper +introduces an interactive depression detection framework. 
This framework +leverages in-context learning techniques to identify themes in clinical +interviews and then models both intra-theme and inter-theme correlation. +Additionally, it employs AI-driven feedback to simulate the interests of +clinicians, enabling interactive adjustment of theme importance. PDIMC achieves +absolute improvements of 35\% and 12\% compared to the state-of-the-art on the +depression detection dataset DAIC-WOZ, which demonstrates the effectiveness of +modeling theme correlation and incorporating interactive external feedback. -摘要:大型語言模型 (LLM) 因其透過指令微調而具備的卓越泛化能力,在各產業中展現出極大的潛力。然而,特定領域資料的取得有限,大幅影響其在專業任務上的表現。現有方法主要專注於從與目標領域類似的通用資料集中選取訓練資料,但它們通常未能考量指令的聯合分佈,導致學習效率不彰且知識傳遞不佳。為了應對這些挑戰,我們引進 G2IS(基於梯度的圖形指令選取),這是一種創新的方法,可建構一個混合的基於梯度的指令圖形,以擷取指令之間的聯合分佈和相互依賴性。透過考量指令之間的關係,G2IS 提升了領域適應的效率。此外,我們提出了一種梯度漫步演算法來優化資料選取程序,同時提升訓練效能和效率。我們的實驗證明,G2IS 在各種領域適應任務中優於傳統方法,產生顯著的效能提升,特別是在資料稀少的複雜場景中。這些結果突顯了 G2IS 在推動大型特定領域模型發展方面的潛力。 +摘要:自動憂鬱症偵測提供臨床醫師早期臨床介入的線索。憂鬱症偵測的臨床訪談涉及以多個主題為中心的對話。現有研究主要設計端對端的類神經網路模型來捕捉臨床訪談對話的階層結構。然而,這些方法在建模臨床訪談的主題內容時表現出缺陷:1)它們無法明確捕捉主題內和主題間的關聯性,以及 2)它們不允許臨床醫師介入並專注於感興趣的主題。為了解決這些問題,本文介紹了一個互動式憂鬱症偵測框架。此框架利用情境學習技術來識別臨床訪談中的主題,然後對主題內和主題間的關聯性進行建模。此外,它採用 AI 驅動的回饋來模擬臨床醫師的興趣,實現主題重要性的互動式調整。與 DAIC-WOZ 憂鬱症偵測資料集上的最新技術相比,PDIMC 的絕對改進率分別為 35% 和 12%,這證明了對主題關聯性建模和納入互動式外部回饋的有效性。 -##### **CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models** -2502.11008v1 by Yuefei Chen, Vivek K. Singh, Jing Ma, Ruxiang Tang +##### **CL-MFAP: A Contrastive Learning-Based Multimodal Foundation Model for Molecular Property Prediction and Antibiotic Screening** +2502.11001v1 by Gen Zhou, Sugitha Janarthanan, Yutong Lu, Pingzhao Hu -Counterfactual reasoning is widely recognized as one of the most challenging -and intricate aspects of causality in artificial intelligence. In this paper, -we evaluate the performance of large language models (LLMs) in counterfactual -reasoning. 
In contrast to previous studies that primarily focus on commonsense -causal reasoning, where LLMs often rely on prior knowledge for inference, we -specifically assess their ability to perform counterfactual inference using a -set of formal rules. To support this evaluation, we introduce a new benchmark -dataset, CounterBench, comprising 1K counterfactual reasoning questions. The -dataset is designed with varying levels of difficulty, diverse causal graph -structures, distinct types of counterfactual questions, and multiple -nonsensical name variants. Our experiments demonstrate that counterfactual -reasoning poses a significant challenge for LLMs, with most models performing -at levels comparable to random guessing. To enhance LLMs' counterfactual -reasoning ability, we propose a novel reasoning paradigm, CoIn, which guides -LLMs through iterative reasoning and backtracking to systematically explore -counterfactual solutions. Experimental results show that our method -significantly improves LLM performance on counterfactual reasoning tasks and -consistently enhances performance across different LLMs. Our dataset is -available at https://huggingface.co/datasets/CounterBench/CounterBench. +Due to the rise in antimicrobial resistance, identifying novel compounds with +antibiotic potential is crucial for combatting this global health issue. +However, traditional drug development methods are costly and inefficient. +Recognizing the pressing need for more effective solutions, researchers have +turned to machine learning techniques to streamline the prediction and +development of novel antibiotic compounds. While foundation models have shown +promise in antibiotic discovery, current mainstream efforts still fall short of +fully leveraging the potential of multimodal molecular data. Recent studies +suggest that contrastive learning frameworks utilizing multimodal data exhibit +excellent performance in representation learning across various domains. 
+Building upon this, we introduce CL-MFAP, an unsupervised contrastive learning +(CL)-based multimodal foundation (MF) model specifically tailored for +discovering small molecules with potential antibiotic properties (AP) using +three types of molecular data. This model employs 1.6 million bioactive +molecules with drug-like properties from the ChEMBL dataset to jointly pretrain +three encoders: (1) a transformer-based encoder with rotary position embedding +for processing SMILES strings; (2) another transformer-based encoder, +incorporating a novel bi-level routing attention mechanism to handle molecular +graph representations; and (3) a Morgan fingerprint encoder using a multilayer +perceptron, to achieve the contrastive learning purpose. The CL-MFAP +outperforms baseline models in antibiotic property prediction by effectively +utilizing different molecular modalities and demonstrates superior +domain-specific performance when fine-tuned for antibiotic-related property +prediction tasks. 
-摘要:反事實推理被廣泛認為是人工智慧中因果關係最具挑戰性和複雜的面向之一。在本文中,我們評估大型語言模型 (LLM) 在反事實推理中的表現。與主要關注常識因果推理,其中 LLM 經常依賴先驗知識來進行推理的先前研究不同,我們特別評估它們使用一組形式規則執行反事實推理的能力。為了支持此評估,我們引入了一個新的基準資料集 CounterBench,其中包含 1K 個反事實推理問題。資料集的設計具有不同的難度等級、多樣化的因果圖結構、不同類型的反事實問題和多種無意義的名稱變體。我們的實驗表明,反事實推理對 LLM 構成重大挑戰,大多數模型的表現與隨機猜測相當。為了增強 LLM 的反事實推理能力,我們提出了一種新穎的推理範例 CoIn,它引導 LLM 透過反覆推理和回溯系統性地探索反事實解。實驗結果表明,我們的方法顯著提升 LLM 在反事實推理任務上的表現,並持續增強不同 LLM 的表現。我們的資料集可在 https://huggingface.co/datasets/CounterBench/CounterBench 取得。 +摘要:由於抗菌藥物抗性上升,找出具有抗生素潛力的新型化合物對於對抗此項全球性健康議題至關重要。不過,傳統的藥物開發方法成本高昂且效率不彰。研究人員體認到對於更有效解決方案的迫切需求,因此轉向機器學習技術來簡化新型抗生素化合物的預測和開發。儘管基礎模型在抗生素發現方面展現潛力,目前的普遍做法仍未充分利用多模態分子資料的潛力。最近的研究顯示,利用多模態資料的對比學習架構在各種領域的表徵學習中展現出優異的效能。有鑑於此,我們引進 CL-MFAP,一種無監督對比學習 (CL) 為基礎的多模態基礎 (MF) 模型,專門用於使用三種類型的分子資料發現具有潛在抗生素特性的低分子。此模型採用 ChEMBL 資料集中的 160 萬個具有類藥物特性的生物活性分子,以聯合預訓練三個編碼器:(1) 一個具有旋轉位置嵌入的基於Transformer的編碼器,用於處理 SMILES 字串;(2) 另一個基於Transformer的編碼器,結合一種新穎的雙層路由注意機制來處理分子圖表表徵;以及 (3) 一個使用多層感知器的 Morgan 指紋編碼器,以達成對比學習的目的。CL-MFAP 透過有效利用不同的分子模式在抗生素特性預測方面優於基準模型,並且在針對抗生素相關特性預測任務進行微調時展現出優異的特定領域效能。 -##### **RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation** -2502.10996v1 by Pengcheng Jiang, Lang Cao, Ruike Zhu, Minhao Jiang, Yunyi Zhang, Jimeng Sun, Jiawei Han +##### **Automatic Quality Assessment of First Trimester Crown-Rump-Length Ultrasound Images** +2502.10908v1 by Sevim Cengiz, Ibraheem Hamdi, Mohammad Yaqub -Retrieval-augmented language models often struggle with knowledge-intensive -tasks due to inefficient retrieval, unstructured knowledge integration, and -single-pass architectures. We present Retrieval-And-Structuring (RAS), a novel -framework that dynamically constructs and reasons over query-specific knowledge -graphs through iterative retrieval and structuring. 
RAS introduces four key -technical innovations: (1) a theme-scoped retrieval mechanism that efficiently -narrows the search space while maintaining retrieval quality, (2) an action -planning module that determines knowledge needs and generates focused -sub-queries, (3) a dynamic knowledge structuring approach that converts -retrieved text into an evolving knowledge graph, and (4) a graph-augmented -answering component that leverages the accumulated structured information. Our -framework achieves state-of-the-art performance, surpassing leading baselines -by 6.4% with open-source language models and 7.0% with proprietary models on -seven knowledge-intensive generation datasets across all evaluation metrics. -Detailed ablation studies verify the contribution of each technical component -to the overall system performance. +Fetal gestational age (GA) is vital clinical information that is estimated +during pregnancy in order to assess fetal growth. This is usually performed by +measuring the crown-rump-length (CRL) on an ultrasound image in the dating scan, +which is then correlated with fetal age and growth trajectory. A major issue +when performing the CRL measurement is ensuring that the image is acquired at +the correct view; otherwise it could be misleading. Although clinical +guidelines specify the criteria for the correct CRL view, sonographers may not +regularly adhere to such rules. In this paper, we propose a new deep +learning-based solution that is able to verify the adherence of a CRL image to +clinical guidelines in order to assess image quality and facilitate accurate +estimation of GA. We first segment out important fetal structures, then use the +localized structures to perform a clinically-guided mapping that verifies the +adherence to the criteria. The segmentation method combines the benefits of +Convolutional Neural Network (CNN) and the Vision Transformer (ViT) to segment +fetal structures in ultrasound images and localize important fetal landmarks. 
+For segmentation purposes, we compare our proposed work with UNet and show that +our CNN/ViT-based method outperforms an optimized version of UNet. Furthermore, +we compare the output of the mapping with classification CNNs when assessing +the clinical criteria and the overall acceptability of CRL images. We show that +the proposed mapping is not only explainable but also more accurate than the +best performing classification CNNs. -摘要:检索增强语言模型通常会因检索效率低、知识整合无结构和单次通过架构而难以胜任知识密集型任务。我们提出检索和结构化 (RAS),这是一个新颖的框架,通过迭代检索和结构化,动态构建和推理特定于查询的知识图谱。RAS 引入了四项关键技术创新:(1) 主题范围检索机制,在保持检索质量的同时有效缩小搜索空间,(2) 动作规划模块,确定知识需求并生成重点子查询,(3) 动态知识结构化方法,将检索到的文本转换为不断发展的知识图谱,以及 (4) 图谱增强型回答组件,利用累积的结构化信息。我们的框架实现了最先进的性能,在七个知识密集型生成数据集上,使用开源语言模型提高了 6.4%,使用专有模型提高了 7.0%,超越了领先的基线,且所有评估指标均如此。详细的消融研究验证了每个技术组件对整体系统性能的贡献。 +摘要:胎兒妊娠年齡 (GA) 是重要的臨床資訊,會在懷孕期間估計,以評估胎兒生長。這通常是透過在約會掃描中測量超音波影像中的頭臀長度 (CRL) 來執行,然後與胎兒年齡和生長軌跡相關聯。執行 CRL 測量時的一個主要問題是確保影像是在正確的視角下取得,否則可能會產生誤導。儘管臨床指南規定了正確 CRL 視角的標準,但超音波檢查員可能不會定期遵守這些規則。在本文中,我們提出了一個新的深度學習解決方案,能夠驗證 CRL 影像是否符合臨床指南,以評估影像品質並促進對 GA 的準確估計。我們首先分割出重要的胎兒結構,然後使用局部結構來執行臨床指導的對應,以驗證標準的遵守情況。分割方法結合了卷積神經網路 (CNN) 和視覺轉換器 (ViT) 的優點,以分割超音波影像中的胎兒結構並定位重要的胎兒標誌。為了分割目的,我們將我們提出的工作與 UNet 進行比較,並顯示我們基於 CNN/ViT 的方法優於 UNet 的最佳化版本。此外,我們在評估臨床標準和 CRL 影像的整體可接受性時,將對應的輸出與分類 CNN 進行比較。我們表明,所提出的對應不僅可以解釋,而且比效能最佳的分類 CNN 更準確。 -##### **Developing Conversational Speech Systems for Robots to Detect Speech Biomarkers of Cognition in People Living with Dementia** -2502.10896v1 by Rohith Perumandla, Young-Ho Bae, Diego Izaguirre, Esther Hwang, Andrew Murphy, Long-Jing Hsu, Selma Sabanovic, Casey C. Bennett +##### **Breaking Down the Hierarchy: A New Approach to Leukemia Classification** +2502.10899v1 by Ibraheem Hamdi, Hosam El-Gendy, Ahmed Sharshar, Mohamed Saeed, Muhammad Ridzuan, Shahrukh K. 
Hashmi, Naveed Syed, Imran Mirza, Shakir Hussain, Amira Mahmoud Abdalla, Mohammad Yaqub -This study presents the development and testing of a conversational speech -system designed for robots to detect speech biomarkers indicative of cognitive -impairments in people living with dementia (PLwD). The system integrates a -backend Python WebSocket server and a central core module with a large language -model (LLM) fine-tuned for dementia to process user input and generate robotic -conversation responses in real-time in less than 1.5 seconds. The frontend user -interface, a Progressive Web App (PWA), displays information and biomarker -score graphs on a smartphone in real-time to human users (PLwD, caregivers, -clinicians). Six speech biomarkers based on the existing literature - Altered -Grammar, Pragmatic Impairments, Anomia, Disrupted Turn-Taking, Slurred -Pronunciation, and Prosody Changes - were developed for the robot conversation -system using two datasets, one that included conversations of PLwD with a human -clinician (DementiaBank dataset) and one that included conversations of PLwD -with a robot (Indiana dataset). We also created a composite speech biomarker -that combined all six individual biomarkers into a single score. The speech -system's performance was first evaluated on the DementiaBank dataset showing -moderate correlation with MMSE scores, with the composite biomarker score -outperforming individual biomarkers. Analysis of the Indiana dataset revealed -higher and more variable biomarker scores, suggesting potential differences due -to study populations (e.g. severity of dementia) and the conversational -scenario (human-robot conversations are different from human-human). The -findings underscore the need for further research on the impact of -conversational scenarios on speech biomarkers and the potential clinical -applications of robotic speech systems. 
+The complexities inherent to leukemia, a multifaceted cancer affecting white +blood cells, pose considerable diagnostic and treatment challenges, primarily +due to reliance on laborious morphological analyses and expert judgment that +are susceptible to errors. Addressing these challenges, this study presents a +refined, comprehensive strategy leveraging advanced deep-learning techniques +for the classification of leukemia subtypes. We commence by developing a +hierarchical label taxonomy, paving the way for differentiating between various +subtypes of leukemia. The research further introduces a novel hierarchical +approach inspired by clinical procedures capable of accurately classifying +diverse types of leukemia alongside reactive and healthy cells. An integral +part of this study involves a meticulous examination of the performance of +Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) as +classifiers. The proposed method exhibits an impressive success rate, achieving +approximately 90\% accuracy across all leukemia subtypes, as substantiated by +our experimental results. A visual representation of the experimental findings +is provided to enhance the model's explainability and aid in understanding the +classification process. 
-摘要:本研究展示了對話式語音系統的開發和測試,該系統專為機器人設計,用於偵測失智症患者(PLwD)認知障礙的語言生物標記。該系統整合了後端 Python WebSocket 伺服器和一個中央核心模組,其中包含針對失智症微調的大語言模型(LLM),以處理使用者輸入並在不到 1.5 秒的時間內產生機器人對話回應。前端使用者介面(漸進式網路應用程式,PWA)會在智慧型手機上即時向人類使用者(PLwD、照護者、臨床醫生)顯示資訊和生物標記評分圖表。根據現有文獻,針對機器人對話系統開發了六個語言生物標記:語法改變、實用障礙、失語症、輪流中斷、發音不清和韻律變化,使用了兩個資料集,一個包含 PLwD 與人類臨床醫生對話(DementiaBank 資料集),另一個包含 PLwD 與機器人對話(Indiana 資料集)。我們還建立了一個複合語言生物標記,將所有六個個別生物標記組合成一個單一評分。語言系統的效能首先在 DementiaBank 資料集上進行評估,顯示與 MMSE 評分有中等相關性,複合生物標記評分優於個別生物標記。對 Indiana 資料集的分析顯示出較高且變異性較大的生物標記評分,這表明由於研究族群(例如失智症的嚴重程度)和對話情境(人機對話與人際對話不同)而產生潛在差異。研究結果強調需要進一步研究對話情境對語言生物標記的影響,以及機器人語言系統的潛在臨床應用。 +摘要:白血病的复杂性源于它是一种影响白血球的多面性癌症,主要由于依赖费力的形态分析和容易出错的专家判断,因此带来了相当大的诊断和治疗挑战。为了应对这些挑战,本研究提出了一种精细且全面的策略,利用先进的深度学习技术对白血病亚型进行分类。我们首先开发了一个分层的标签分类法,为区分白血病的各种亚型铺平了道路。该研究进一步引入了一种新颖的分层方法,该方法受临床程序的启发,能够准确地对各种类型的白血病以及反应性和健康细胞进行分类。本研究的一个组成部分涉及对卷积神经网络 (CNN) 和视觉变压器 (ViT) 作为分类器的性能进行细致检查。所提出的方法展示了令人印象深刻的成功率,在所有白血病亚型中实现了大约 90% 的准确率,我们的实验结果证实了这一点。提供了实验结果的可视化表示,以增强模型的可解释性并帮助理解分类过程。 -##### **Evaluating improvements on using Large Language Models (LLMs) for property extraction in the Open Research Knowledge Graph (ORKG)** -2502.10768v1 by Sandra Schaftner +##### **An Empirical Analysis of Uncertainty in Large Language Model Evaluations** +2502.10709v1 by Qiujie Xie, Qingqiu Li, Zhuohao Yu, Yuejie Zhang, Yue Zhang, Linyi Yang -Current research highlights the great potential of Large Language Models -(LLMs) for constructing Scholarly Knowledge Graphs (SKGs). One particularly -complex step in this process is relation extraction, aimed at identifying -suitable properties to describe the content of research. This study builds -directly on previous research of three Open Research Knowledge Graph (ORKG) -team members who assessed the readiness of LLMs such as GPT-3.5, Llama 2, and -Mistral for property extraction in scientific literature. 
Given the moderate -performance observed, the previous work concluded that fine-tuning is needed to -improve these models' alignment with scientific tasks and their emulation of -human expertise. Expanding on this prior experiment, this study evaluates the -impact of advanced prompt engineering techniques and demonstrates that these -techniques can significantly enhance the results. Additionally, this -study extends the property extraction process to include property matching to -existing ORKG properties, which are retrieved via the API. The evaluation -reveals that results generated through advanced prompt engineering achieve a -higher proportion of matches with ORKG properties, further emphasizing the -enhanced alignment achieved. Moreover, this lays the groundwork for addressing -challenges such as the inconsistency of ORKG properties, an issue highlighted -in prior studies. By assigning unique URIs and using standardized terminology, -this work increases the consistency of the properties, fulfilling a crucial -aspect of Linked Data and FAIR principles - core commitments of ORKG. This, in -turn, significantly enhances the applicability of ORKG content for subsequent -tasks such as comparisons of research publications. Finally, the study -concludes with recommendations for future improvements in the overall property -extraction process. +As LLM-as-a-Judge emerges as a new paradigm for assessing large language +models (LLMs), concerns have been raised regarding the alignment, bias, and +stability of LLM evaluators. While substantial work has focused on alignment +and bias, little research has concentrated on the stability of LLM evaluators. +In this paper, we conduct extensive experiments involving 9 widely used LLM +evaluators across 2 different evaluation settings to investigate the +uncertainty in model-based LLM evaluations. We find that LLM evaluators +exhibit varying uncertainty based on model families and sizes. 
With careful +comparative analyses, we find that employing special prompting strategies, +whether during inference or post-training, can alleviate evaluation uncertainty +to some extent. By utilizing uncertainty to enhance LLM's reliability and +detection capability in Out-Of-Distribution (OOD) data, we further fine-tune an +uncertainty-aware LLM evaluator named ConfiLM using a human-annotated +fine-tuning set and assess ConfiLM's OOD evaluation ability on a manually +designed test set sourced from the 2024 Olympics. Experimental results +demonstrate that incorporating uncertainty as additional information during the +fine-tuning phase can largely improve the model's evaluation performance in OOD +scenarios. The code and data are released at: +https://github.com/hasakiXie123/LLM-Evaluator-Uncertainty. -摘要:目前的調查強調大語言模型 (LLM) 在建構學術知識圖譜 (SKG) 上的巨大潛力。此過程中特別複雜的步驟是關係萃取,目標是找出合適的屬性來描述研究內容。本研究直接建立在三位開放研究知識圖譜 (ORKG) 團隊成員先前研究的基礎上,他們評估了 GPT-3.5、Llama 2 和 Mistral 等 LLM 在科學文獻中萃取屬性的準備情況。鑑於觀察到的表現中等,先前的研究結論是需要微調,以改善這些模型與科學任務的一致性,以及它們對人類專業知識的模擬。本研究擴展了先前的實驗,評估了進階提示工程技術的影響,並證明這些技術可以大幅顯著地提升結果。此外,本研究將屬性萃取流程擴展到包含與現有 ORKG 屬性的屬性比對,這些屬性是透過 API 擷取的。評估結果顯示,透過進階提示工程產生的結果與 ORKG 屬性有更高的比對比例,進一步強調所達成的進階一致性。此外,這也為了解決先前的研究中強調的問題,例如 ORKG 屬性的不一致性,奠定了基礎。透過指定唯一的 URI 並使用標準化的術語,本研究增加了屬性的相容性,達成了連結資料和 FAIR 原則的重要層面,這是 ORKG 的核心承諾。這反過來大幅提升了 ORKG 內容在後續任務中的適用性,例如研究出版品的比較。最後,本研究以針對整體屬性萃取流程未來改進的建議作為結論。 +摘要:隨著 LLM 作為法官的新典範出現,用於評估大型語言模型 (LLM) 的 LLM 評估器在對齊、偏差和穩定性方面引發了關注。儘管大量工作集中在對齊和偏差上,但很少有研究集中在 LLM 評估器的穩定性上。在本文中,我們進行了廣泛的實驗,涉及 9 個廣泛使用的 LLM 評估器,跨越 2 個不同的評估設定,以調查基於模型的 LLM 評估中的不確定性。我們精確指出 LLM 評估器根據模型系列和大小表現出不同的不確定性。通過仔細的比較分析,我們發現採用特殊的提示策略(無論是在推理過程中還是訓練後)可以在一定程度上緩解評估不確定性。通過利用不確定性來增強 LLM 在 Out-Of-Distribution (OOD) 數據中的可靠性和檢測能力,我們進一步微調了一個名為 ConfiLM 的不確定性感知 LLM 評估器,使用人工註釋的微調設置,並評估 ConfiLM 在手動設計的、來自 2024 年奧運會的測試集上的 OOD 評估能力。實驗結果表明,在微調階段將不確定性作為附加信息納入其中可以在很大程度上提高模型在 OOD 場景中的評估性能。代碼和數據發布於: +https://github.com/hasakiXie123/LLM-Evaluator-Uncertainty。 -##### **K-Edit: Language Model Editing with Contextual Knowledge Awareness** 
-2502.10626v1 by Elan Markowitz, Anil Ramakrishna, Ninareh Mehrabi, Charith Peris, Rahul Gupta, Kai-Wei Chang, Aram Galstyan +##### **Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model** +2502.10707v1 by Jiarui Jin, Haoyu Wang, Hongyan Li, Jun Li, Jiahui Pan, Shenda Hong -As the world changes, we need to be able to update our models and correct -false information without costly retraining. Knowledge-based model editing -enables precise modifications to the weights of large language models in order -to modify the information encoded within. Recent approaches have seen success -in enabling recall of edited information for thousands of edits at once. -However, these approaches fail to produce edits that account for associated -contextual information. We present K-Edit, an effective approach to generating -contextually consistent knowledge edits. By using knowledge graphs, which -maintain contextual consistency when an edge is edited, we are able to generate -additional \textit{contextual edits} that ensure consistency of related -information in the language model. Our experiments demonstrate significant -improvements in multi-hop question answering while maintaining the general -effectiveness and scalability of model edits. +Electrocardiogram (ECG) is essential for the clinical diagnosis of +arrhythmias and other heart diseases, but deep learning methods based on ECG +often face limitations due to the need for high-quality annotations. Although +previous ECG self-supervised learning (eSSL) methods have made significant +progress in representation learning from unannotated ECG data, they typically +treat ECG signals as ordinary time-series data, segmenting the signals using +fixed-size and fixed-step time windows, which often ignore the form and rhythm +characteristics and latent semantic relationships in ECG signals. 
In this work, +we introduce a novel perspective on ECG signals, treating heartbeats as words +and rhythms as sentences. Based on this perspective, we first designed the +QRS-Tokenizer, which generates semantically meaningful ECG sentences from the +raw ECG signals. Building on these, we then propose HeartLang, a novel +self-supervised learning framework for ECG language processing, learning +general representations at form and rhythm levels. Additionally, we construct +the largest heartbeat-based ECG vocabulary to date, which will further advance +the development of ECG language processing. We evaluated HeartLang across six +public ECG datasets, where it demonstrated robust competitiveness against other +eSSL methods. Our data and code are publicly available at +https://github.com/PKUDigitalHealth/HeartLang. -摘要:隨著世界變化,我們需要能夠更新我們的模型,並在不進行昂貴的重新訓練的情況下更正錯誤資訊。基於知識的模型編輯能夠對大型語言模型的權重進行精確修改,以便修改其中編碼的資訊。最近的方法在一次啟用數千次編輯的編輯資訊的召回方面取得了成功。然而,這些方法無法產生考慮相關上下文資訊的編輯。我們提出 K-Edit,這是一種產生上下文一致的知識編輯的有效方法。通過使用知識圖,在編輯邊緣時保持上下文一致性,我們能夠產生額外的「上下文編輯」,以確保語言模型中相關資訊的一致性。我們的實驗證明了多跳問題回答的顯著改進,同時保持了模型編輯的一般有效性和可擴充性。 +摘要:心電圖 (ECG) 對於心律不整和其他心臟疾病的臨床診斷至關重要,但基於心電圖的深度學習方法通常會因需要高品質註解而面臨限制。儘管先前的 ECG 自我監督學習 (eSSL) 方法在從未註解的 ECG 資料中學習表徵方面取得顯著進展,但它們通常將 ECG 訊號視為普通的時間序列資料,使用固定大小和固定步長的時窗對訊號進行分段,這通常會忽略 ECG 訊號中的形式和節律特徵以及潛在的語義關係。在這項工作中,我們對 ECG 訊號引入了新的觀點,將心跳視為單字,將節律視為句子。基於此觀點,我們首先設計了 QRS-Tokenizer,它從原始 ECG 訊號中產生語義有意義的 ECG 句子。在此基礎上,我們提出了 HeartLang,一種用於 ECG 語言處理的新型自我監督學習框架,在形式和節律層面上學習一般表徵。此外,我們構建了迄今為止最大的基於心跳的 ECG 詞彙表,這將進一步促進 ECG 語言處理的發展。我們在六個公開的 ECG 資料集上評估了 HeartLang,它展示了與其他 eSSL 方法相比的強大競爭力。我們的資料和程式碼可在 https://github.com/PKUDigitalHealth/HeartLang 公開取得。 + +##### **Self-Explaining Hypergraph Neural Networks for Diagnosis Prediction** +2502.10689v1 by Leisheng Yu, Yanxiao Cai, Minxing Zhang, Xia Hu + +The burgeoning volume of electronic health records (EHRs) has enabled deep +learning models to excel in predictive healthcare. 
However, for high-stakes +applications such as diagnosis prediction, model interpretability remains +paramount. Existing deep learning diagnosis prediction models with intrinsic +interpretability often assign attention weights to every past diagnosis or +hospital visit, providing explanations lacking flexibility and succinctness. In +this paper, we introduce SHy, a self-explaining hypergraph neural network +model, designed to offer personalized, concise and faithful explanations that +allow for interventions from clinical experts. By modeling each patient as a +unique hypergraph and employing a message-passing mechanism, SHy captures +higher-order disease interactions and extracts distinct temporal phenotypes as +personalized explanations. It also addresses the incompleteness of the EHR data +by accounting for essential false negatives in the original diagnosis record. A +qualitative case study and extensive quantitative evaluations on two real-world +EHR datasets demonstrate the superior predictive performance and +interpretability of SHy over existing state-of-the-art models. + +摘要:隨著電子健康紀錄 (EHR) 數量的激增,深度學習模型在預測保健方面表現出色。然而,對於診斷預測等高風險應用,模型的可解釋性仍然至關重要。現有的具有內在可解釋性的深度學習診斷預測模型通常會為每個過去的診斷或醫院就診分配注意力權重,提供的解釋缺乏靈活性且簡潔性。在本文中,我們介紹了 SHy,這是一個自解釋的超圖神經網路模型,旨在提供個性化、簡潔且忠實的解釋,讓臨床專家可以進行干預。通過將每個患者建模為一個獨特的超圖並採用訊息傳遞機制,SHy 捕捉到了高階疾病交互作用,並提取出不同的時間表型作為個性化解釋。它還通過考慮原始診斷記錄中的基本假陰性來解決電子健康紀錄資料的不完整性。對兩個真實世界電子健康紀錄資料集進行的定性案例研究和廣泛的定量評估表明,SHy 在預測效能和可解釋性方面優於現有的最先進模型。 ##### **ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis** 2502.10620v1 by Xueshen Li, Xinlong Hou, Ziyi Huang, Yu Gan @@ -8803,1348 +8812,1339 @@ serving as a valuable resource for training LLM. 
摘要:大型語言模型 (LLM) 最近的進展已展現出非凡的理解能力,在各種視覺語言任務中取得了顯著的突破。然而,LLM 在產生可靠的醫療診斷報告中的應用仍處於早期階段。目前,醫療 LLM 通常採用被動互動模式,醫生對患者的疑問做出回應,但很少或根本不參與分析醫療影像。相比之下,有些聊天機器人僅根據視覺輸入回應預先定義的查詢,缺乏互動對話或對病史的考量。因此,LLM 產生的患者聊天機器人互動與實際患者醫生諮詢之間存在差距。為了彌合這一差距,我們開發了一個基於 LLM 的對話系統,即主動多輪視覺語言互動,用於電腦輔助診斷 (ProMRVL-CAD),以產生對患者友善的疾病診斷報告。建議的 ProMRVL-CAD 系統允許主動對話,透過將知識圖譜整合到推薦系統中,為患者提供持續且可靠的醫療管道。具體來說,我們設計了兩個產生器:主動問題產生器 (Pro-Q Gen),用於產生引導診斷程序的主動問題,以及多視覺患者文字診斷報告產生器 (MVP-DR Gen),用於產生高品質的診斷報告。評估兩個真實世界公開可用的資料集,MIMIC-CXR 和 IU-Xray,我們的模型在產生醫療報告方面品質較佳。我們進一步證明 ProMRVL 的效能,在影像品質低的情況下仍能穩健運行。此外,我們建立了一個模擬患者和醫生之間主動診斷互動的合成醫療對話資料集,作為訓練 LLM 的寶貴資源。 -##### **GraphiT: Efficient Node Classification on Text-Attributed Graphs with Prompt Optimized LLMs** -2502.10522v1 by Shima Khoshraftar, Niaz Abedini, Amir Hajian - -The application of large language models (LLMs) to graph data has attracted a -lot of attention recently. LLMs allow us to use deep contextual embeddings from -pretrained models in text-attributed graphs, where shallow embeddings are often -used for the text attributes of nodes. However, it is still challenging to -efficiently encode the graph structure and features into a sequential form for -use by LLMs. In addition, the performance of an LLM alone, is highly dependent -on the structure of the input prompt, which limits their effectiveness as a -reliable approach and often requires iterative manual adjustments that could be -slow, tedious and difficult to replicate programmatically. In this paper, we -propose GraphiT (Graphs in Text), a framework for encoding graphs into a -textual format and optimizing LLM prompts for graph prediction tasks. Here we -focus on node classification for text-attributed graphs. We encode the graph -data for every node and its neighborhood into a concise text to enable LLMs to -better utilize the information in the graph. We then further programmatically -optimize the LLM prompts using the DSPy framework to automate this step and -make it more efficient and reproducible. 
GraphiT outperforms our LLM-based -baselines on three datasets and we show how the optimization step in GraphiT -leads to measurably better results without manual prompt tweaking. We also -demonstrated that our graph encoding approach is competitive to other graph -encoding methods while being less expensive because it uses significantly less -tokens for the same task. - -摘要:大型語言模型 (LLM) 在圖表資料的應用最近備受關注。LLM 讓我們能夠在文字標記圖表中使用預訓練模型的深度脈絡嵌入,其中淺層嵌入通常用於節點的文字屬性。然而,要有效率地將圖表結構和特徵編碼成序列形式供 LLM 使用,仍然是一項挑戰。此外,單獨 LLM 的效能高度依賴輸入提示的結構,這限制了它們作為可靠方法的有效性,而且通常需要反覆的人工調整,這可能會緩慢、繁瑣且難以透過程式複製。在本文中,我們提出 GraphiT(文字中的圖表),一個用於將圖表編碼成文字格式並最佳化 LLM 提示以進行圖表預測任務的架構。在這裡,我們專注於文字標記圖表的節點分類。我們將每個節點及其鄰域的圖表資料編碼成簡潔的文字,讓 LLM 能夠更好地利用圖表中的資訊。然後,我們進一步透過程式最佳化 LLM 提示,使用 DSPy 架構自動化這個步驟,並使其更有效率且可複製。Graphite 在三個資料集上優於我們的基於 LLM 的基準,我們展示了 GraphiT 中的最佳化步驟如何導致顯著更好的結果,而無需手動調整提示。我們還證明了我們的圖表編碼方法與其他圖表編碼方法具有競爭力,同時成本更低,因為它在相同的任務中使用了顯著更少的標記。 - -##### **Do Large Language Models Reason Causally Like Us? Even Better?** -2502.10215v1 by Hanna M. Dettki, Brenden M. Lake, Charley M. Wu, Bob Rehder - -Causal reasoning is a core component of intelligence. Large language models -(LLMs) have shown impressive capabilities in generating human-like text, -raising questions about whether their responses reflect true understanding or -statistical patterns. We compared causal reasoning in humans and four LLMs -using tasks based on collider graphs, rating the likelihood of a query variable -occurring given evidence from other variables. We find that LLMs reason -causally along a spectrum from human-like to normative inference, with -alignment shifting based on model, context, and task. Overall, GPT-4o and -Claude showed the most normative behavior, including "explaining away", whereas -Gemini-Pro and GPT-3.5 did not. Although all agents deviated from the expected -independence of causes - Claude the least - they exhibited strong associative -reasoning and predictive inference when assessing the likelihood of the effect -given its causes. 
These findings underscore the need to assess AI biases as -they increasingly assist human decision-making. - -摘要:因果推理是智能的核心組成部分。大型語言模型 (LLM) 在生成類人文本方面展現了令人印象深刻的能力,引發了關於它們的回應是否反映真實理解或統計模式的疑問。我們使用基於碰撞圖的任務比較了人類和四個 LLM 中的因果推理,根據其他變數的證據評估查詢變數發生的可能性。我們發現 LLM 沿著從類人到規範推論的光譜進行因果推理,對齊會根據模型、上下文和任務而改變。總體而言,GPT-4o 和 Claude 表現出最規範的行為,包括「解釋」,而 Gemini-Pro 和 GPT-3.5 則沒有。儘管所有代理都偏離了預期的原因獨立性 - Claude 最不偏離 - 但它們在評估給定原因的效果可能性時表現出強烈的關聯推理和預測推論。這些發現強調了評估 AI 偏差的必要性,因為它們越來越協助人類決策。 - -##### **Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages** -2502.10140v1 by Daniil Gurgurov, Ivan Vykopal, Josef van Genabith, Simon Ostermann +##### **Optimizing CNN Architectures for Advanced Thoracic Disease Classification** +2502.10614v1 by Tejas Mirthipati -Low-resource languages (LRLs) face significant challenges in natural language -processing (NLP) due to limited data. While current state-of-the-art large -language models (LLMs) still struggle with LRLs, smaller multilingual models -(mLMs) such as mBERT and XLM-R offer greater promise due to a better fit of -their capacity to low training data sizes. This study systematically -investigates parameter-efficient adapter-based methods for adapting mLMs to -LRLs, evaluating three architectures: Sequential Bottleneck, Invertible -Bottleneck, and Low-Rank Adaptation. Using unstructured text from GlotCC and -structured knowledge from ConceptNet, we show that small adaptation datasets -(e.g., up to 1 GB of free-text or a few MB of knowledge graph data) yield gains -in intrinsic (masked language modeling) and extrinsic tasks (topic -classification, sentiment analysis, and named entity recognition). We find that -Sequential Bottleneck adapters excel in language modeling, while Invertible -Bottleneck adapters slightly outperform other methods on downstream tasks due -to better embedding alignment and larger parameter counts. 
Adapter-based -methods match or outperform full fine-tuning while using far fewer parameters, -and smaller mLMs prove more effective for LRLs than massive LLMs like LLaMA-3, -GPT-4, and DeepSeek-R1-based distilled models. While adaptation improves -performance, pre-training data size remains the dominant factor, especially for -languages with extensive pre-training coverage. +Machine learning, particularly convolutional neural networks (CNNs), has +shown promise in medical image analysis, especially for thoracic disease +detection using chest X-ray images. In this study, we evaluate various CNN +architectures, including binary classification, multi-label classification, and +ResNet50 models, to address challenges like dataset imbalance, variations in +image quality, and hidden biases. We introduce advanced preprocessing +techniques such as principal component analysis (PCA) for image compression and +propose a novel class-weighted loss function to mitigate imbalance issues. Our +results highlight the potential of CNNs in medical imaging but emphasize that +issues like unbalanced datasets and variations in image acquisition methods +must be addressed for optimal model performance. 
-摘要:低資源語言 (LRL) 由於資料有限,在自然語言處理 (NLP) 中面臨重大挑戰。雖然當前最先進的大型語言模型 (LLM) 仍難以處理 LRL,但較小的多語言模型 (mLMS),例如 mBERT 和 XLM-R,由於其容量更適合低訓練資料大小,因此提供了更大的希望。本研究系統性地探討了基於參數效率適配器的適配方法,以將 mLMS 適配到 LRL,評估了三種架構:順序瓶頸、可逆瓶頸和低秩適配。使用來自 GlotCC 的非結構化文本和來自 ConceptNet 的結構化知識,我們表明小型適配資料集(例如,高達 1 GB 的自由文本或幾 MB 的知識圖譜資料)在內在(遮蔽語言模型)和外在任務(主題分類、情緒分析和命名實體識別)中產生增益。我們發現順序瓶頸適配器在語言模型中表現出色,而可逆瓶頸適配器由於更好的嵌入對齊和更大的參數數量,在下游任務上略勝於其他方法。基於適配器的方法在使用更少參數的同時,可以匹配或優於完全微調,而較小的 mLM 被證明比 LLaMA-3、GPT-4 和基於 DeepSeek-R1 的蒸餾模型等大型 LLM 更適合 LRL。雖然適配可以提高效能,但預訓練資料大小仍然是主要因素,特別是對於預訓練覆蓋範圍廣泛的語言。 +摘要:機器學習,特別是卷積神經網路 (CNN) 已在醫學影像分析中展現出潛力,特別是使用胸部 X 光影像進行胸腔疾病偵測。在此研究中,我們評估各種 CNN 架構,包括二元分類、多標籤分類和 ResNet50 模型,以解決資料集不平衡、影像品質差異和隱藏偏差等挑戰。我們導入進階前處理技術,例如主成分分析 (PCA) 以進行影像壓縮,並提出一個新穎的類別加權損失函數來緩解不平衡問題。我們的結果突顯了 CNN 在醫學影像中的潛力,但強調必須解決資料集不平衡和影像擷取方法差異等問題,才能獲得最佳模型效能。 -##### **Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models** -2502.10090v1 by Chenrui Tie, Shengxiang Sun, Jinxuan Zhu, Yiwei Liu, Jingxiang Guo, Yue Hu, Haonan Chen, Junting Chen, Ruihai Wu, Lin Shao +##### **PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation** +2502.10536v1 by Faruk Ahmed, Lin Yang, Tiam Jaroensri, Andrew Sellergren, Yossi Matias, Avinatan Hassidim, Greg S. Corrado, Dale R. Webster, Shravya Shetty, Shruthi Prabhakara, Yun Liu, Daniel Golden, Ellery Wulczyn, David F. Steiner -Humans possess an extraordinary ability to understand and execute complex -manipulation tasks by interpreting abstract instruction manuals. For robots, -however, this capability remains a substantial challenge, as they cannot -interpret abstract instructions and translate them into executable actions. In -this paper, we present Manual2Skill, a novel framework that enables robots to -perform complex assembly tasks guided by high-level manual instructions. 
Our -approach leverages a Vision-Language Model (VLM) to extract structured -information from instructional images and then uses this information to -construct hierarchical assembly graphs. These graphs represent parts, -subassemblies, and the relationships between them. To facilitate task -execution, a pose estimation model predicts the relative 6D poses of components -at each assembly step. At the same time, a motion planning module generates -actionable sequences for real-world robotic implementation. We demonstrate the -effectiveness of Manual2Skill by successfully assembling several real-world -IKEA furniture items. This application highlights its ability to manage -long-horizon manipulation tasks with both efficiency and precision, -significantly enhancing the practicality of robot learning from instruction -manuals. This work marks a step forward in advancing robotic systems capable of -understanding and executing complex manipulation tasks in a manner akin to -human capabilities. +The interpretation of histopathology cases underlies many important +diagnostic and treatment decisions in medicine. Notably, this process typically +requires pathologists to integrate and summarize findings across multiple +slides per case. Existing vision-language capabilities in computational +pathology have so far been largely limited to small regions of interest, larger +regions at low magnification, or single whole-slide images (WSIs). This limits +interpretation of findings that span multiple high-magnification regions across +multiple WSIs. By making use of Gemini 1.5 Flash, a large multimodal model +(LMM) with a 1-million token context window, we demonstrate the ability to +generate bottom-line diagnoses from up to 40,000 768x768 pixel image patches +from multiple WSIs at 10X magnification. This is the equivalent of up to 11 +hours of video at 1 fps. 
Expert pathologist evaluations demonstrate that the +generated report text is clinically accurate and equivalent to or preferred +over the original reporting for 68% (95% CI: [60%, 76%]) of multi-slide +examples with up to 5 slides. While performance decreased for examples with 6 +or more slides, this study demonstrates the promise of leveraging the +long-context capabilities of modern LMMs for the uniquely challenging task of +medical report generation where each case can contain thousands of image +patches. -摘要:人類擁有理解並執行複雜操作任務的非凡能力,方法是詮釋抽象的說明手冊。然而,對機器人來說,這項能力仍然是一項重大的挑戰,因為它們無法詮釋抽象的指令並將其轉換為可執行的動作。在本文中,我們提出了 Manual2Skill,這是一個新穎的框架,使機器人能夠在高階手冊說明的指導下執行複雜的組裝任務。我們的做法利用視覺語言模型 (VLM) 從教學圖片中提取結構化資訊,然後使用此資訊來建構階層式組裝圖。這些圖表示零件、子組件以及它們之間的關係。為了促進任務執行,姿勢估計模型會預測每個組裝步驟中組件的相對 6D 姿勢。同時,動作規劃模組會產生適用於實際機器人實作的可操作順序。我們透過成功組裝幾個真實世界的 IKEA 家具來展示 Manual2Skill 的有效性。此應用程式突顯了它以高效率和高精準度管理長時程操作任務的能力,大幅提升機器人從說明手冊中學習的實用性。這項工作標誌著機器人系統在理解和執行複雜操作任務方面向前邁進了一步,其方式類似於人類的能力。 +摘要:組織病理學病例的解讀是許多重要的醫學診斷和治療決策的基礎。值得注意的是,這個過程通常需要病理學家整合和總結每個病例的許多玻片中的發現。迄今為止,計算機病理學中現有的視覺語言功能在很大程度上僅限於小範圍的感興趣區域、低倍率下的較大區域或單一的全玻片影像 (WSI)。這限制了跨多個 WSI 中多個高倍率區域的發現的解讀。通過使用 Gemini 1.5 Flash,一個具有 100 萬個令牌上下文視窗的大型多模態模型 (LMM),我們展示了從多個 WSI 中多達 40,000 個 768x768 像素圖像貼片(10 倍放大)生成底線診斷的能力。這相當於 1 fps 下長達 11 小時的影片。專家病理學家評估表明,生成的報告文字在臨床上是準確的,並且等同於或優於 68%(95% CI:[60%,76%])的多玻片範例(最多 5 個玻片)的原始報告。儘管對於有 6 個或更多玻片的範例,其性能下降,但這項研究證明了利用現代 LMM 的長上下文功能來應對獨特挑戰性的醫療報告生成任務,其中每個病例可能包含數千個影像貼片,這項任務的前景。 -##### **Decision Information Meets Large Language Models: The Future of Explainable Operations Research** -2502.09994v1 by Yansen Zhang, Qingcan Kang, Wing Yin Yu, Hailei Gong, Xiaojin Fu, Xiongwei Han, Tao Zhong, Chen Ma +##### **Tempo: Helping Data Scientists and Domain Experts Collaboratively Specify Predictive Modeling Tasks** +2502.10526v2 by Venkatesh Sivaraman, Anika Vaishampayan, Xiaotong Li, Brian R Buck, Ziyong Ma, Richard D Boyce, Adam Perer -Operations Research (OR) is vital for decision-making in many industries. 
-While recent OR methods have seen significant improvements in automation and -efficiency through integrating Large Language Models (LLMs), they still -struggle to produce meaningful explanations. This lack of clarity raises -concerns about transparency and trustworthiness in OR applications. To address -these challenges, we propose a comprehensive framework, Explainable Operations -Research (EOR), emphasizing actionable and understandable explanations -accompanying optimization. The core of EOR is the concept of Decision -Information, which emerges from what-if analysis and focuses on evaluating the -impact of complex constraints (or parameters) changes on decision-making. -Specifically, we utilize bipartite graphs to quantify the changes in the OR -model and adopt LLMs to improve the explanation capabilities. Additionally, we -introduce the first industrial benchmark to rigorously evaluate the -effectiveness of explanations and analyses in OR, establishing a new standard -for transparency and clarity in the field. +Temporal predictive models have the potential to improve decisions in health +care, public services, and other domains, yet they often fail to effectively +support decision-makers. Prior literature shows that many misalignments between +model behavior and decision-makers' expectations stem from issues of model +specification, namely how, when, and for whom predictions are made. However, +model specifications for predictive tasks are highly technical and difficult +for non-data-scientist stakeholders to interpret and critique. To address this +challenge we developed Tempo, an interactive system that helps data scientists +and domain experts collaboratively iterate on model specifications. Using +Tempo's simple yet precise temporal query language, data scientists can quickly +prototype specifications with greater transparency about pre-processing +choices. 
Moreover, domain experts can assess performance within data subgroups +to validate that models behave as expected. Through three case studies, we +demonstrate how Tempo helps multidisciplinary teams quickly prune infeasible +specifications and identify more promising directions to explore. -摘要:作業研究 (OR) 對許多產業的決策制定至關重要。雖然近期的 OR 方法已透過整合大型語言模型 (LLM) 在自動化和效率方面取得顯著的進步,但它們在產生有意義的解釋方面仍面臨挑戰。這種缺乏明確性的情況會對 OR 應用中的透明度和可信度造成疑慮。為了應對這些挑戰,我們提出一個全面的架構,即可解釋作業研究 (EOR),強調在最佳化過程中提供可操作且易於理解的解釋。EOR 的核心是決策資訊的概念,它源自假設分析,並專注於評估複雜約束條件 (或參數) 變更對決策制定的影響。具體來說,我們利用二部圖量化 OR 模型的變化,並採用 LLM 來改善解釋能力。此外,我們引入了第一個產業基準,以嚴格評估 OR 中解釋和分析的有效性,為該領域的透明度和清晰度建立新的標準。 +摘要:時序預測模型有潛力改善醫療保健、公共服務和其他領域的決策,但它們經常無法有效支援決策者。先前的文獻顯示,模型行為與決策者期望之間的許多不一致源自於模型規範問題,也就是如何、何時以及針對誰進行預測。然而,預測任務的模型規範非常技術化,非數據科學家利害關係人難以解讀和批評。為了應對此挑戰,我們開發了 Tempo,一個互動式系統,可協助數據科學家和領域專家協同反覆運算模型規範。透過使用 Tempo 簡單但精確的時序查詢語言,數據科學家可以快速建構規範原型,並更透明地了解前處理的選擇。此外,領域專家可以評估資料子群組內的效能,以驗證模型是否如預期般運作。透過三個案例研究,我們展示 Tempo 如何協助跨領域團隊快速刪減不可行的規範,並找出更有希望探索的方向。 -##### **KGGen: Extracting Knowledge Graphs from Plain Text with Language Models** -2502.09956v1 by Belinda Mo, Kyssen Yu, Joshua Kazdan, Proud Mpala, Lisa Yu, Chris Cundy, Charilaos Kanatsoulis, Sanmi Koyejo +##### **A Robust Attack: Displacement Backdoor Attack** +2502.10490v1 by Yong Li, Han Gao -Recent interest in building foundation models for KGs has highlighted a -fundamental challenge: knowledge-graph data is relatively scarce. The -best-known KGs are primarily human-labeled, created by pattern-matching, or -extracted using early NLP techniques. While human-generated KGs are in short -supply, automatically extracted KGs are of questionable quality. We present a -solution to this data scarcity problem in the form of a text-to-KG generator -(KGGen), a package that uses language models to create high-quality graphs from -plaintext. Unlike other KG extractors, KGGen clusters related entities to -reduce sparsity in extracted KGs. 
KGGen is available as a Python library -(\texttt{pip install kg-gen}), making it accessible to everyone. Along with -KGGen, we release the first benchmark, Measure of of Information in Nodes and -Edges (MINE), that tests an extractor's ability to produce a useful KG from -plain text. We benchmark our new tool against existing extractors and -demonstrate far superior performance. +As artificial intelligence becomes more prevalent in our lives, people are +enjoying the convenience it brings, but they are also facing hidden threats, +such as data poisoning and adversarial attacks. These threats can have +disastrous consequences for the application of artificial intelligence, +especially for some applications that take effect immediately, such as +autonomous driving and medical fields. Among these threats, backdoor attacks +have left a deep impression on people with their concealment and simple +deployment, making them a threat that cannot be ignored, however, in the +process of deploying the backdoor model, the backdoor attack often has some +reasons that make it unsatisfactory in real-world applications, such as jitter +and brightness changes. Based on this, we propose a highly robust backdoor +attack that shifts the target sample and combines it with itself to form a +backdoor sample, the Displacement Backdoor Attack(DBA). Experimental results +show that the DBA attack can resist data augmentation that simulates real-world +differences, such as rotation and cropping. 
-摘要:最近对于构建知识图谱基础模型的兴趣凸显了一个基本挑战:知识图谱数据相对稀缺。最知名的知识图谱主要为人标注,由模式匹配创建,或使用早期自然语言处理技术提取。虽然人生成的知识图谱供不应求,但自动提取的知识图谱质量堪忧。我们以文本到知识图谱生成器 (KGGen) 的形式为这一数据稀缺问题提供了一个解决方案,这是一个使用语言模型从纯文本创建高质量图表的包。与其他知识图谱提取器不同,KGGen 对相关实体进行聚类以减少提取的知识图谱中的稀疏性。KGGen 可用作 Python 库(\texttt{pip install kg-gen}),使其所有人都能访问。除了 KGGen,我们还发布了第一个基准测试,即节点和边信息度量 (MINE),它测试了提取器从纯文本生成有用知识图谱的能力。我们针对现有提取器对我们的新工具进行基准测试,并展示了远超其性能。 +摘要:随着人工智能在我们的生活中变得越来越普遍,人们正在享受它带来的便利,但也面临着隐藏的威胁,例如数据中毒和对抗性攻击。这些威胁可能对人工智能的应用产生灾难性后果,特别是对于一些立即生效的应用,例如自动驾驶和医疗领域。在这些威胁中,后门攻击以其隐蔽性和简单的部署给人们留下了深刻的印象,使其成为不可忽视的威胁,然而,在部署后门模型的过程中,后门攻击往往存在一些使其在实际应用中不尽如人意的原因,例如抖动和亮度变化。基于此,我们提出了一种高度鲁棒的后门攻击,该攻击对目标样本进行平移并将其与自身结合以形成后门样本,即置换后门攻击 (DBA)。实验结果表明,DBA 攻击可以抵抗模拟真实世界差异的数据增强,例如旋转和裁剪。 -##### **ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation** -2502.09891v1 by Shu Wang, Yixiang Fang, Yingli Zhou, Xilin Liu, Yuchi Ma +##### **3D ReX: Causal Explanations in 3D Neuroimaging Classification** +2502.12181v1 by Melane Navaratnarajah, Sophie A. Martin, David A. Kelly, Nathan Blake, Hana Chocker -Retrieval-Augmented Generation (RAG) has proven effective in integrating -external knowledge into large language models (LLMs) for question-answer (QA) -tasks. The state-of-the-art RAG approaches often use the graph data as the -external data since they capture the rich semantic information and link -relationships between entities. However, existing graph-based RAG approaches -cannot accurately identify the relevant information from the graph and also -consume large numbers of tokens in the online retrieval process. To address -these issues, we introduce a novel graph-based RAG approach, called Attributed -Community-based Hierarchical RAG (ArchRAG), by augmenting the question using -attributed communities, and also introducing a novel LLM-based hierarchical -clustering method. 
To retrieve the most relevant information from the graph for -the question, we build a novel hierarchical index structure for the attributed -communities and develop an effective online retrieval method. Experimental -results demonstrate that ArchRAG outperforms existing methods in terms of both -accuracy and token cost. +Explainability remains a significant problem for AI models in medical +imaging, making it challenging for clinicians to trust AI-driven predictions. +We introduce 3D ReX, the first causality-based post-hoc explainability tool for +3D models. 3D ReX uses the theory of actual causality to generate +responsibility maps which highlight the regions most crucial to the model's +decision. We test 3D ReX on a stroke detection model, providing insight into +the spatial distribution of features relevant to stroke. -摘要:檢索增強生成 (RAG) 已證明可將外部知識整合到大型語言模型 (LLM),用於問答 (QA) 任務。最先進的 RAG 方法通常使用圖形資料作為外部資料,因為它們擷取了豐富的語意資訊和實體之間的連結關係。然而,現有的基於圖形的 RAG 方法無法準確識別圖形中的相關資訊,而且在線上檢索過程中也會消耗大量的符號。為了解決這些問題,我們提出了一種新穎的基於圖形的 RAG 方法,稱為基於屬性社群的分層 RAG (ArchRAG),透過使用屬性社群來擴充問題,並引入一種新穎的基於 LLM 的分層聚類方法。為了從圖形中檢索與問題最相關的資訊,我們為屬性社群建立了一個新穎的分層索引結構,並開發了一種有效的線上檢索方法。實驗結果證明,ArchRAG 在準確性和符號成本方面都優於現有方法。 +摘要:解釋性仍然是醫療影像中 AI 模型的一大問題,這使得臨床醫生難以信任 AI 驅動的預測。 +我們引入了 3D ReX,這是第一個用於 3D 模型的基於因果關係的事後解釋性工具。3D ReX 使用實際因果關係理論來生成責任圖,該圖突出了對模型決策至關重要的區域。我們在中風檢測模型上測試了 3D ReX,提供了與中風相關特徵的空間分佈的見解。 -##### **Visual Graph Question Answering with ASP and LLMs for Language Parsing** -2502.09211v1 by Jakob Johannes Bauer, Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch +##### **Analyzing Patient Daily Movement Behavior Dynamics Using Two-Stage Encoding Model** +2502.09947v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott -Visual Question Answering (VQA) is a challenging problem that requires to -process multimodal input. Answer-Set Programming (ASP) has shown great -potential in this regard to add interpretability and explainability to modular -VQA architectures. 
In this work, we address the problem of how to integrate ASP -with modules for vision and natural language processing to solve a new and -demanding VQA variant that is concerned with images of graphs (not graphs in -symbolic form). Images containing graph-based structures are an ubiquitous and -popular form of visualisation. Here, we deal with the particular problem of -graphs inspired by transit networks, and we introduce a novel dataset that -amends an existing one by adding images of graphs that resemble metro lines. -Our modular neuro-symbolic approach combines optical graph recognition for -graph parsing, a pretrained optical character recognition neural network for -parsing labels, Large Language Models (LLMs) for language processing, and ASP -for reasoning. This method serves as a first baseline and achieves an overall -average accuracy of 73% on the dataset. Our evaluation provides further -evidence of the potential of modular neuro-symbolic systems, in particular with -pretrained models that do not involve any further training and logic -programming for reasoning, to solve complex VQA tasks. +In the analysis of remote healthcare monitoring data, time series +representation learning offers substantial value in uncovering deeper patterns +of patient behavior, especially given the fine temporal granularity of the +data. In this study, we focus on a dataset of home activity records from people +living with Dementia. We propose a two-stage self-supervised learning approach. +The first stage involves converting time-series activities into text strings, +which are then encoded by a fine-tuned language model. In the second stage, +these time-series vectors are bi-dimensionalized for applying PageRank method, +to analyze latent state transitions to quantitatively assess participants +behavioral patterns and identify activity biases. These insights, combined with +diagnostic data, aim to support personalized care interventions. 
-摘要:視覺問答(VQA)是一項具有挑戰性的問題,需要處理多模態輸入。答案集程式設計(ASP)在這方面顯示出巨大的潛力,可以為模組化 VQA 架構增加可解釋性和說明性。在這項工作中,我們探討如何將 ASP 與視覺和自然語言處理模組整合,以解決一個新的且要求嚴格的 VQA 變體,該變體與圖形影像(而非符號形式的圖形)有關。包含圖形結構的影像是一種普遍且流行的可視化形式。在這裡,我們處理受交通網路啟發的圖形特定問題,並引入一個新的資料集,透過新增類似地鐵路線的圖形影像來修正現有資料集。我們的模組化神經符號方法結合光學圖形辨識進行圖形解析、預先訓練的光學字元辨識神經網路進行標籤解析、大型語言模型(LLM)進行語言處理,以及 ASP 進行推理。此方法作為第一個基準,在資料集上達到 73% 的整體平均準確度。我們的評估進一步證明了模組化神經符號系統的潛力,特別是預先訓練的模型,這些模型不涉及任何進一步的訓練和邏輯程式設計進行推理,以解決複雜的 VQA 任務。 +摘要:在遠程醫療監控數據分析中,時序表示學習在揭示患者行為的更深層模式方面提供了實質性的價值,特別是考慮到數據的精細時間粒度。在本研究中,我們專注於痴呆症患者居家活動記錄的數據集。我們提出了一種兩階段的自我監督學習方法。第一階段涉及將時序活動轉換為文本串,然後由微調語言模型編碼。在第二階段,這些時序向量被雙維化以應用 PageRank 方法,分析潛在狀態轉換以定量評估參與者的行為模式並識別活動偏差。這些見解與診斷數據相結合,旨在支持個性化護理干預。 -##### **Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data** -2502.08547v1 by Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai +##### **TransGUNet: Transformer Meets Graph-based Skip Connection for Medical Image Segmentation** +2502.09931v1 by Ju-Hyeon Nam, Nur Suriza Syazwany, Sang-Chul Lee -The adoption of EHRs has expanded opportunities to leverage data-driven -algorithms in clinical care and research. A major bottleneck in effectively -conducting multi-institutional EHR studies is the data heterogeneity across -systems with numerous codes that either do not exist or represent different -clinical concepts across institutions. The need for data privacy further limits -the feasibility of including multi-institutional patient-level data required to -study similarities and differences across patient subgroups. To address these -challenges, we developed the GAME algorithm. 
Tested and validated across 7 -institutions and 2 languages, GAME integrates data in several levels: (1) at -the institutional level with knowledge graphs to establish relationships -between codes and existing knowledge sources, providing the medical context for -standard codes and their relationship to each other; (2) between institutions, -leveraging language models to determine the relationships between -institution-specific codes with established standard codes; and (3) quantifying -the strength of the relationships between codes using a graph attention -network. Jointly trained embeddings are created using transfer and federated -learning to preserve data privacy. In this study, we demonstrate the -applicability of GAME in selecting relevant features as inputs for AI-driven -algorithms in a range of conditions, e.g., heart failure, rheumatoid arthritis. -We then highlight the application of GAME harmonized multi-institutional EHR -data in a study of Alzheimer's disease outcomes and suicide risk among patients -with mental health disorders, without sharing patient-level data outside -individual institutions. +Skip connection engineering is primarily employed to address the semantic gap +between the encoder and decoder, while also integrating global dependencies to +understand the relationships among complex anatomical structures in medical +image segmentation. Although several models have proposed transformer-based +approaches to incorporate global dependencies within skip connections, they +often face limitations in capturing detailed local features with high +computational complexity. In contrast, graph neural networks (GNNs) exploit +graph structures to effectively capture local and global features. 
Leveraging +these properties, we introduce an attentional cross-scale graph neural network +(ACS-GNN), which enhances the skip connection framework by converting +cross-scale feature maps into a graph structure and capturing complex +anatomical structures through node attention. Additionally, we observed that +deep learning models often produce uninformative feature maps, which degrades +the quality of spatial attention maps. To address this problem, we integrated +entropy-driven feature selection (EFS) with spatial attention, calculating an +entropy score for each channel and filtering out high-entropy feature maps. Our +innovative framework, TransGUNet, comprises ACS-GNN and EFS-based spatial +attention to effectively enhance domain generalizability across various +modalities by leveraging GNNs alongside a reliable spatial attention map, +ensuring more robust features within the skip connection. Through comprehensive +experiments and analysis, TransGUNet achieved superior segmentation performance +on six seen and eight unseen datasets, demonstrating significantly higher +efficiency compared to previous methods. 
-摘要:電子健康紀錄的採用擴大了在臨床照護和研究中利用資料驅動演算法的機會。在有效進行多機構電子健康紀錄研究時,一個主要的瓶頸是系統間資料異質性,其中有許多代碼在機構間不存在或表示不同的臨床概念。資料隱私的需求進一步限制了納入多機構患者層級資料的可行性,而這些資料對於研究患者亞群之間的相似性和差異性是必要的。為了應對這些挑戰,我們開發了 GAME 演算法。GAME 已在 7 個機構和 2 種語言中進行測試和驗證,它整合了多個層級的資料:(1) 在機構層級,使用知識圖表來建立代碼和現有知識來源之間的關係,為標準代碼及其彼此之間的關係提供醫療背景;(2) 在機構之間,利用語言模型來確定機構特定代碼與已建立的標準代碼之間的關係;(3) 使用圖形注意網路量化代碼之間關係的強度。使用遷移和聯合學習建立聯合訓練的嵌入,以保護資料隱私。在本研究中,我們展示了 GAME 在選擇相關特徵作為 AI 驅動演算法輸入時的適用性,適用於各種情況,例如心臟衰竭、類風濕性關節炎。然後,我們重點介紹了 GAME 和諧化多機構電子健康紀錄資料在阿茲海默症疾病結果和精神疾病患者自殺風險研究中的應用,而無需在個別機構之外共享患者層級資料。 +摘要:跳躍連接工程主要用於解決編碼器和解碼器之間的語義鴻溝,同時還整合全局依賴關係以了解醫學影像分割中複雜解剖結構之間的關係。儘管有幾個模型提出了基於Transformer的架構來整合跳躍連接中的全局依賴關係,但它們在以高計算複雜度擷取詳細的局部特徵時常常面臨限制。相比之下,圖神經網路 (GNN) 利用圖結構有效擷取局部和全局特徵。利用這些屬性,我們引入了注意力跨尺度圖神經網路 (ACS-GNN),它通過將跨尺度特徵圖轉換為圖結構並通過節點注意力擷取複雜的解剖結構來增強跳躍連接框架。此外,我們觀察到深度學習模型通常會產生無意義的特徵圖,這會降低空間注意力圖的品質。為了解決這個問題,我們將熵驅動特徵選擇 (EFS) 與空間注意力整合在一起,為每個通道計算熵分數並濾出高熵特徵圖。我們創新的框架 TransGUNet 包含 ACS-GNN 和基於 EFS 的空間注意力,通過利用 GNN 以及可靠的空間注意力圖有效增強跨各種模態的域泛化能力,確保跳躍連接中更強大的特徵。透過全面的實驗和分析,TransGUNet 在六個已見和八個未見的資料集上實現了優異的分割效能,證明與先前的方法相比,效率顯著提高。 -##### **Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy** -2502.08353v1 by Ruizhan Xue, Huimin Deng, Fang He, Maojun Wang, Zeyu Zhang +##### **Video2Policy: Scaling up Manipulation Tasks in Simulation through Internet Videos** +2502.09886v1 by Weirui Ye, Fangchen Liu, Zheng Ding, Yang Gao, Oleh Rybkin, Pieter Abbeel -With the extensive application of Graph Neural Networks (GNNs) across various -domains, their trustworthiness has emerged as a focal point of research. Some -existing studies have shown that the integration of large language models -(LLMs) can improve the semantic understanding and generation capabilities of -GNNs, which in turn improves the trustworthiness of GNNs from various aspects. -Our review introduces a taxonomy that offers researchers a clear framework for -comprehending the principles and applications of different methods and helps -clarify the connections and differences among various approaches. 
Then we -systematically survey representative approaches along the four categories of -our taxonomy. Through our taxonomy, researchers can understand the applicable -scenarios, potential advantages, and limitations of each approach for the the -trusted integration of GNNs with LLMs. Finally, we present some promising -directions of work and future trends for the integration of LLMs and GNNs to -improve model trustworthiness. +Simulation offers a promising approach for cheaply scaling training data for +generalist policies. To scalably generate data from diverse and realistic +tasks, existing algorithms either rely on large language models (LLMs) that may +hallucinate tasks not interesting for robotics; or digital twins, which require +careful real-to-sim alignment and are hard to scale. To address these +challenges, we introduce Video2Policy, a novel framework that leverages +internet RGB videos to reconstruct tasks based on everyday human behavior. Our +approach comprises two phases: (1) task generation in simulation from videos; +and (2) reinforcement learning utilizing in-context LLM-generated reward +functions iteratively. We demonstrate the efficacy of Video2Policy by +reconstructing over 100 videos from the Something-Something-v2 (SSv2) dataset, +which depicts diverse and complex human behaviors on 9 different tasks. Our +method can successfully train RL policies on such tasks, including complex and +challenging tasks such as throwing. Finally, we show that the generated +simulation data can be scaled up for training a general policy, and it can be +transferred back to the real robot in a Real2Sim2Real way. 
-摘要:隨著圖神經網路 (GNN) 在各種領域的廣泛應用,其可信度已成為研究的焦點。一些現有研究表明,整合大型語言模型 (LLM) 可以提升 GNN 的語意理解和生成能力,進而從各方面提升 GNN 的可信度。我們的評論介紹了一種分類法,為研究人員提供了一個清晰的架構,用於理解不同方法的原理和應用,並有助於釐清各種方法之間的關聯和差異。然後,我們系統性地針對分類法的四個類別進行代表性方法的調查。研究人員透過我們的分類法,可以了解每種方法在 GNN 與 LLM 的可信整合中適用的場景、潛在優點和限制。最後,我們提出 LLM 與 GNN 整合的一些有前景的工作方向和未來趨勢,以提升模型的可信度。 +摘要:模擬提供了一種有前途的方法,可以用於擴展訓練資料,以制定通才政策。為了從多樣化且逼真的任務中可擴充地產生資料,現有演算法仰賴大型語言模型 (LLM),這些模型可能會產生對機器人技術不感興趣的任務;或者仰賴數位雙胞胎,這需要仔細地將真實環境與模擬環境對齊,而且很難擴充。為了應對這些挑戰,我們引入了 Video2Policy,這是一個新穎的架構,它利用網路上的 RGB 影片,根據日常人類行為來重建任務。我們的做法包含兩個階段:(1) 從影片中在模擬環境中產生任務;以及 (2) 利用在情境中由 LLM 產生的獎勵函數,反覆進行強化學習。我們透過重建 Something-Something-v2 (SSv2) 資料集中的 100 多個影片來展示 Video2Policy 的效能,這些影片描繪了 9 項不同任務中多樣化且複雜的人類行為。我們的做法可以在這些任務上成功訓練 RL 政策,包括複雜且具挑戰性的任務,例如投擲。最後,我們展示了產生的模擬資料可以擴充到訓練一般政策,而且可以透過 Real2Sim2Real 的方式轉移回真實機器人。 -##### **Graph Foundation Models for Recommendation: A Comprehensive Survey** -2502.08346v3 by Bin Wu, Yihang Wang, Yuanhao Zeng, Jiawei Liu, Jiashu Zhao, Cheng Yang, Yawen Li, Long Xia, Dawei Yin, Chuan Shi +##### **HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation** +2502.09838v2 by Tianwei Lin, Wenqiao Zhang, Sijing Li, Yuqian Yuan, Binhe Yu, Haoyuan Li, Wanggui He, Hao Jiang, Mengze Li, Xiaohui Song, Siliang Tang, Jun Xiao, Hui Lin, Yueting Zhuang, Beng Chin Ooi -Recommender systems (RS) serve as a fundamental tool for navigating the vast -expanse of online information, with deep learning advancements playing an -increasingly important role in improving ranking accuracy. Among these, graph -neural networks (GNNs) excel at extracting higher-order structural information, -while large language models (LLMs) are designed to process and comprehend -natural language, making both approaches highly effective and widely adopted. 
-Recent research has focused on graph foundation models (GFMs), which integrate -the strengths of GNNs and LLMs to model complex RS problems more efficiently by -leveraging the graph-based structure of user-item relationships alongside -textual understanding. In this survey, we provide a comprehensive overview of -GFM-based RS technologies by introducing a clear taxonomy of current -approaches, diving into methodological details, and highlighting key challenges -and future directions. By synthesizing recent advancements, we aim to offer -valuable insights into the evolving landscape of GFM-based recommender systems. +We present HealthGPT, a powerful Medical Large Vision-Language Model +(Med-LVLM) that integrates medical visual comprehension and generation +capabilities within a unified autoregressive paradigm. Our bootstrapping +philosophy is to progressively adapt heterogeneous comprehension and generation +knowledge to pre-trained large language models (LLMs). This is achieved through +a novel heterogeneous low-rank adaptation (H-LoRA) technique, which is +complemented by a tailored hierarchical visual perception approach and a +three-stage learning strategy. To effectively learn the HealthGPT, we devise a +comprehensive medical domain-specific comprehension and generation dataset +called VL-Health. Experimental results demonstrate exceptional performance and +scalability of HealthGPT in medical visual unified tasks. Our project can be +accessed at https://github.com/DCDmllm/HealthGPT. 
-摘要:推薦系統 (RS) 是用於導航廣闊的線上資訊的基本工具,深度學習的進步在提升排名準確度方面扮演著日益重要的角色。其中,圖形神經網路 (GNN) 擅長萃取高階結構資訊,而大型語言模型 (LLM) 則設計用於處理和理解自然語言,這使得這兩種方法都非常有效且廣泛採用。最近的研究專注於圖形基礎模型 (GFM),它整合了 GNN 和 LLM 的優點,透過利用使用者與項目關係的圖形化結構以及文字理解,更有效率地建構複雜的 RS 問題模型。在這項調查中,我們透過介紹當前方法的明確分類、深入探討方法論細節,以及強調關鍵挑戰和未來方向,提供了 GFM 為基礎的 RS 技術的全面概觀。透過綜合最近的進展,我們旨在提供對 GFM 為基礎的推薦系統不斷演變的版圖的寶貴見解。 +摘要:我們提出 HealthGPT,一種強大的醫學大型視覺語言模型 (Med-LVLM),它整合了醫學視覺理解和生成能力於一個統一的自動迴歸範例中。我們的引導哲學是逐步調整異質理解和生成知識以預先訓練大型語言模型 (LLM)。這是通過一種新穎的異質低秩適應 (H-LoRA) 技術實現的,該技術由量身定制的分層視覺感知方法和三階段學習策略補充。為了有效學習 HealthGPT,我們設計了一個全面的醫學領域特定理解和生成數據集,稱為 VL-Health。實驗結果證明了 HealthGPT 在醫學視覺統一任務中的卓越性能和可擴展性。我們的項目可以在 https://github.com/DCDmllm/HealthGPT 中訪問。 -##### **Self-Evaluation for Job-Shop Scheduling** -2502.08684v1 by Imanol Echeverria, Maialen Murua, Roberto Santana +##### **Incentivize without Bonus: Provably Efficient Model-based Online Multi-agent RL for Markov Games** +2502.09780v1 by Tong Yang, Bo Dai, Lin Xiao, Yuejie Chi -Combinatorial optimization problems, such as scheduling and route planning, -are crucial in various industries but are computationally intractable due to -their NP-hard nature. Neural Combinatorial Optimization methods leverage -machine learning to address these challenges but often depend on sequential -decision-making, which is prone to error accumulation as small mistakes -propagate throughout the process. Inspired by self-evaluation techniques in -Large Language Models, we propose a novel framework that generates and -evaluates subsets of assignments, moving beyond traditional stepwise -approaches. Applied to the Job-Shop Scheduling Problem, our method integrates a -heterogeneous graph neural network with a Transformer to build a policy model -and a self-evaluation function. Experimental validation on challenging, -well-known benchmarks demonstrates the effectiveness of our approach, -surpassing state-of-the-art methods. 
+Multi-agent reinforcement learning (MARL) lies at the heart of a plethora of +applications involving the interaction of a group of agents in a shared unknown +environment. A prominent framework for studying MARL is Markov games, with the +goal of finding various notions of equilibria in a sample-efficient manner, +such as the Nash equilibrium (NE) and the coarse correlated equilibrium (CCE). +However, existing sample-efficient approaches either require tailored +uncertainty estimation under function approximation, or careful coordination of +the players. In this paper, we propose a novel model-based algorithm, called +VMG, that incentivizes exploration via biasing the empirical estimate of the +model parameters towards those with a higher collective best-response values of +all the players when fixing the other players' policies, thus encouraging the +policy to deviate from its current equilibrium for more exploration. VMG is +oblivious to different forms of function approximation, and permits +simultaneous and uncoupled policy updates of all players. Theoretically, we +also establish that VMG achieves a near-optimal regret for finding both the NEs +of two-player zero-sum Markov games and CCEs of multi-player general-sum Markov +games under linear function approximation in an online environment, which +nearly match their counterparts with sophisticated uncertainty quantification. 
-摘要:組合優化問題,例如排程和路線規劃,在各行各業中至關重要,但由於它們的 NP 難度,在計算上難以處理。神經組合優化方法利用機器學習來解決這些挑戰,但通常依賴於序貫決策制定,而序貫決策制定容易發生錯誤累積,因為小錯誤會在整個過程中傳播。受大型語言模型中的自我評估技術啟發,我們提出了一個新的框架,可生成和評估作業子集,超越傳統的分步方法。應用於工作車間排程問題,我們的方法將異質圖神經網路與 Transformer 整合在一起,以建立策略模型和自我評估函數。在具有挑戰性的著名基準上的實驗驗證證明了我們方法的有效性,超越了最先進的方法。 +摘要:多智能體強化學習 (MARL) 是一系列應用程式的心臟,這些應用程式涉及一群智能體在一個共用未知環境中的互動。研究 MARL 的一個著名框架是馬可夫博弈,其目標是用樣本有效率的方式找出各種均衡概念,例如納許均衡 (NE) 和粗相關均衡 (CCE)。然而,現有的樣本有效率方法需要在函數逼近下進行量身打造的不確定性估計,或謹慎協調參與者。在本文中,我們提出了一種新的基於模型的演算法,稱為 VMG,它透過將模型參數的經驗估計值偏向於在固定其他參與者政策時所有參與者的集體最佳反應值,從而激勵探索,進而鼓勵政策偏離其當前均衡以進行更多探索。VMG 不會忽略函數逼近的不同形式,並允許所有參與者同時進行非耦合的政策更新。在理論上,我們也建立了 VMG 在線上環境中使用線性函數逼近來尋找雙人零和馬可夫博弈的 NE 和多人一般和馬可夫博弈的 CCE 時,會獲得接近最佳的後悔,這幾乎與其在不確定性量化方面更為複雜的對應物相匹配。 -##### **Improving Existing Optimization Algorithms with LLMs** -2502.08298v1 by Camilo Chacón Sartori, Christian Blum +##### **The AI-Therapist Duo: Exploring the Potential of Human-AI Collaboration in Personalized Art Therapy for PICS Intervention** +2502.09757v1 by Bereket A. Yilma, Chan Mi Kim, Geke Ludden, Thomas van Rompay, Luis A. Leiva -The integration of Large Language Models (LLMs) into optimization has created -a powerful synergy, opening exciting research opportunities. This paper -investigates how LLMs can enhance existing optimization algorithms. Using their -pre-trained knowledge, we demonstrate their ability to propose innovative -heuristic variations and implementation strategies. To evaluate this, we -applied a non-trivial optimization algorithm, Construct, Merge, Solve and Adapt -(CMSA) -- a hybrid metaheuristic for combinatorial optimization problems that -incorporates a heuristic in the solution construction phase. Our results show -that an alternative heuristic proposed by GPT-4o outperforms the -expert-designed heuristic of CMSA, with the performance gap widening on larger -and denser graphs. Project URL: https://imp-opt-algo-llms.surge.sh/ +Post-intensive care syndrome (PICS) is a multifaceted condition that arises +from prolonged stays in an intensive care unit (ICU). 
While preventing PICS +among ICU patients is becoming increasingly important, interventions remain +limited. Building on evidence supporting the effectiveness of art exposure in +addressing the psychological aspects of PICS, we propose a novel art therapy +solution through a collaborative Human-AI approach that enhances personalized +therapeutic interventions using state-of-the-art Visual Art Recommendation +Systems. We developed two Human-in-the-Loop (HITL) personalization methods and +assessed their impact through a large-scale user study (N=150). Our findings +demonstrate that this Human-AI collaboration not only enhances the +personalization and effectiveness of art therapy but also supports therapists +by streamlining their workload. While our study centres on PICS intervention, +the results suggest that human-AI collaborative Art therapy could potentially +benefit other areas where emotional support is critical, such as cases of +anxiety and depression. -摘要:大型语言模型 (LLM) 与优化相结合,创造了一种强大的协同作用,开启了令人兴奋的研究机会。本文探讨了 LLM 如何增强现有的优化算法。利用其预先训练的知识,我们展示了它们提出创新启发式变体和实施策略的能力。为了评估这一点,我们应用了一种非平凡的优化算法,构建、合并、求解和适应 (CMSA)——一种用于组合优化问题的混合元启发式算法,它在求解构建阶段纳入了启发式算法。我们的结果表明,GPT-4o 提出的替代启发式算法优于 CMSA 的专家设计的启发式算法,并且随着图形变得更大、更密集,性能差距也在扩大。项目网址:https://imp-opt-algo-llms.surge.sh/ +摘要:重症後症候群 (PICS) 是一種多面向的疾病,源自於在加護病房 (ICU) 長期住院。雖然預防重症後症候群在加護病房患者中正變得越來越重要,但介入措施仍然有限。建立在支持藝術接觸在解決重症後症候群心理層面的證據上,我們提出一個創新的藝術療法解決方案,透過協作式的人工智慧方法,使用最先進的視覺藝術推薦系統,增強個人化的治療介入。我們開發了兩種人機迴路 (HITL) 個人化方法,並透過大規模使用者研究 (N=150) 評估其影響。我們的發現證明,這種人機協作不僅增強了藝術治療的個人化和有效性,也透過簡化治療師的工作量來提供支援。雖然我們的研究中心在重症後症候群介入,但結果顯示,人機協作藝術療法有可能對其他需要情緒支持的領域有益,例如焦慮和憂鬱症。 -##### **LLM4GNAS: A Large Language Model Based Toolkit for Graph Neural Architecture Search** -2502.10459v1 by Yang Gao, Hong Yang, Yizhi Chen, Junxian Wu, Peng Zhang, Haishuai Wang +##### **A CNN Approach to Automated Detection and Classification of Brain Tumors** +2502.09731v1 by Md. Zahid Hasan, Abdullah Tamim, D. M. Asadujjaman, Md. Mahfujur Rahman, Md. 
Abu Ahnaf Mollick, Nosin Anjum Dristi, Abdullah-Al-Noman -Graph Neural Architecture Search (GNAS) facilitates the automatic design of -Graph Neural Networks (GNNs) tailored to specific downstream graph learning -tasks. However, existing GNAS approaches often require manual adaptation to new -graph search spaces, necessitating substantial code optimization and -domain-specific knowledge. To address this challenge, we present LLM4GNAS, a -toolkit for GNAS that leverages the generative capabilities of Large Language -Models (LLMs). LLM4GNAS includes an algorithm library for graph neural -architecture search algorithms based on LLMs, enabling the adaptation of GNAS -methods to new search spaces through the modification of LLM prompts. This -approach reduces the need for manual intervention in algorithm adaptation and -code modification. The LLM4GNAS toolkit is extensible and robust, incorporating -LLM-enhanced graph feature engineering, LLM-enhanced graph neural architecture -search, and LLM-enhanced hyperparameter optimization. Experimental results -indicate that LLM4GNAS outperforms existing GNAS methods on tasks involving -both homogeneous and heterogeneous graphs. +Brain tumors require an assessment to ensure timely diagnosis and effective +patient treatment. Morphological factors such as size, location, texture, and +variable appearance complicate tumor inspection. Medical imaging presents +challenges, including noise and incomplete images. This research article +presents a methodology for processing Magnetic Resonance Imaging (MRI) data, +encompassing techniques for image classification and denoising. The effective +use of MRI images allows medical professionals to detect brain disorders, +including tumors. This research aims to categorize healthy brain tissue and +brain tumors by analyzing the provided MRI data. 
Unlike alternative methods +like Computed Tomography (CT), MRI technology offers a more detailed +representation of internal anatomical components, making it a suitable option +for studying data related to brain tumors. The MRI picture is first subjected +to a denoising technique utilizing an Anisotropic diffusion filter. The dataset +utilized for the models creation is a publicly accessible and validated Brain +Tumour Classification (MRI) database, comprising 3,264 brain MRI scans. SMOTE +was employed for data augmentation and dataset balancing. Convolutional Neural +Networks(CNN) such as ResNet152V2, VGG, ViT, and EfficientNet were employed for +the classification procedure. EfficientNet attained an accuracy of 98%, the +highest recorded. -摘要:圖形神經架構搜尋 (GNAS) 促進圖形神經網路 (GNN) 的自動設計,以符合特定下游圖形學習任務。然而,現有的 GNAS 方法通常需要手動調整至新的圖形搜尋空間,這需要大量的程式碼最佳化和領域特定知識。為了應對這項挑戰,我們提出 LLM4GNAS,一個利用大型語言模型 (LLM) 的生成能力的 GNAS 工具包。LLM4GNAS 包含一個基於 LLM 的圖形神經架構搜尋演算法函式庫,讓 GNAS 方法能夠透過修改 LLM 提示來適應新的搜尋空間。這種方法減少了演算法適應和程式碼修改中手動介入的需要。LLM4GNAS 工具包具有可擴充性和穩健性,整合了 LLM 增強的圖形特徵工程、LLM 增強的圖形神經架構搜尋和 LLM 增強的超參數最佳化。實驗結果表明,LLM4GNAS 在涉及同質和異質圖形的任務上優於現有的 GNAS 方法。 +摘要:腦腫瘤需要評估以確保及時診斷和有效的患者治療。大小、位置、質地和可變外觀等形態因素會使腫瘤檢查複雜化。醫學影像會呈現挑戰,包括雜訊和不完整的影像。本研究文章提出了一種處理磁共振影像 (MRI) 資料的方法,包含影像分類和去噪技術。有效使用 MRI 影像可讓醫護人員偵測腦部疾病,包括腫瘤。本研究旨在透過分析提供的 MRI 資料來分類健康的腦組織和腦瘤。與電腦斷層掃描 (CT) 等替代方法不同,MRI 技術提供了更詳細的內部解剖結構表示,使其成為研究與腦瘤相關資料的合適選擇。MRI 影像會先使用各向異性擴散濾波器進行去噪技術處理。用於建立模型的資料集是一個公開且經過驗證的腦腫瘤分類 (MRI) 資料庫,包含 3,264 個腦部 MRI 掃描。SMOTE 用於資料擴充和資料集平衡。卷積神經網路 (CNN),例如 ResNet152V2、VGG、ViT 和 EfficientNet,用於分類程序。EfficientNet 達到了 98% 的準確度,是記錄到的最高值。 -##### **ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning** -2502.08148v1 by Vy Vo, Lizhen Qu, Tao Feng, Yuncheng Hua, Xiaoxi Kang, Songhai Fan, Tim Dwyer, Lay-Ki Soon, Gholamreza Haffari +##### **Evaluating GPT's Capability in Identifying Stages of Cognitive Impairment from Electronic Health Data** +2502.09715v1 by Yu Leng, Yingnan He, Colin Magdamo, Ana-Maria Vranceanu, Christine S. Ritchie, Shibani S. 
Mukerji, Lidia M. V. R. Moura, John R. Dickson, Deborah Blacker, Sudeshna Das -Identifying cause-and-effect relationships is critical to understanding -real-world dynamics and ultimately causal reasoning. Existing methods for -identifying event causality in NLP, including those based on Large Language -Models (LLMs), exhibit difficulties in out-of-distribution settings due to the -limited scale and heavy reliance on lexical cues within available benchmarks. -Modern benchmarks, inspired by probabilistic causal inference, have attempted -to construct causal graphs of events as a robust representation of causal -knowledge, where \texttt{CRAB} \citep{romanou2023crab} is one such recent -benchmark along this line. In this paper, we introduce \texttt{ACCESS}, a -benchmark designed for discovery and reasoning over abstract causal events. -Unlike existing resources, \texttt{ACCESS} focuses on causality of everyday -life events on the abstraction level. We propose a pipeline for identifying -abstractions for event generalizations from \texttt{GLUCOSE} -\citep{mostafazadeh-etal-2020-glucose}, a large-scale dataset of implicit -commonsense causal knowledge, from which we subsequently extract $1,4$K causal -pairs. Our experiments highlight the ongoing challenges of using statistical -methods and/or LLMs for automatic abstraction identification and causal -discovery in NLP. Nonetheless, we demonstrate that the abstract causal -knowledge provided in \texttt{ACCESS} can be leveraged for enhancing QA -reasoning performance in LLMs. +Identifying cognitive impairment within electronic health records (EHRs) is +crucial not only for timely diagnoses but also for facilitating research. +Information about cognitive impairment often exists within unstructured +clinician notes in EHRs, but manual chart reviews are both time-consuming and +error-prone. 
To address this issue, our study evaluates an automated approach +using zero-shot GPT-4o to determine stage of cognitive impairment in two +different tasks. First, we evaluated the ability of GPT-4o to determine the +global Clinical Dementia Rating (CDR) on specialist notes from 769 patients who +visited the memory clinic at Massachusetts General Hospital (MGH), and achieved +a weighted kappa score of 0.83. Second, we assessed GPT-4o's ability to +differentiate between normal cognition, mild cognitive impairment (MCI), and +dementia on all notes in a 3-year window from 860 Medicare patients. GPT-4o +attained a weighted kappa score of 0.91 in comparison to specialist chart +reviews and 0.96 on cases that the clinical adjudicators rated with high +confidence. Our findings demonstrate GPT-4o's potential as a scalable chart +review tool for creating research datasets and assisting diagnosis in clinical +settings in the future. -摘要:找出因果關係對於理解現實世界的動態和最終的因果推理至關重要。現有的 NLP 事件因果關係識別方法,包括基於大型語言模型 (LLM) 的方法,由於規模有限且過度依賴於可用基準中的詞彙線索,在分佈外環境中表現出困難。受機率因果推論啟發的現代基準已嘗試建構事件的因果圖,作為因果知識的強健表示,其中 \texttt{CRAB} \citep{romanou2023crab} 是這條路徑上最近的一個基準。在本文中,我們介紹 \texttt{ACCESS},一個專門設計來探索和推理抽象因果事件的基準。與現有資源不同,\texttt{ACCESS} 專注於抽象層面上日常生活事件的因果關係。我們提出一個管道,用於從 \texttt{GLUCOSE} \citep{mostafazadeh-etal-2020-glucose} 找出事件概括的抽象,\texttt{GLUCOSE} 是隱含常識因果知識的大規模資料集,我們隨後從中萃取出 1,4K 因果對。我們的實驗突顯出使用統計方法和/或 LLM 進行 NLP 中的自動抽象識別和因果發現的持續挑戰。儘管如此,我們證明了 \texttt{ACCESS} 中提供的抽象因果知識可用於增強 LLM 中的問答推理效能。 +摘要:在電子健康記錄 (EHR) 中識別認知障礙不僅對及時診斷至關重要,也有助於促進研究。有關認知障礙的資訊通常存在於 EHR 中非結構化的臨床記錄中,但手動圖表審查既耗時又容易出錯。為了解決這個問題,我們的研究評估了一種自動化方法,使用零次學習的 GPT-4o 來確定兩種不同任務中的認知障礙分期。首先,我們評估了 GPT-4o 確定來自麻薩諸塞州總醫院 (MGH) 記憶診所 769 名患者的專科記錄的全球臨床痴呆評分 (CDR) 的能力,並獲得了 0.83 的加權 kappa 分數。其次,我們評估了 GPT-4o 在 860 名 Medicare 患者 3 年視窗中的所有記錄中區分正常認知、輕度認知障礙 (MCI) 和痴呆的能力。與專科圖表審查相比,GPT-4o 獲得了 0.91 的加權 kappa 分數,而對於臨床評審員以高度信心評估的病例,其加權 kappa 分數為 0.96。我們的研究結果證明了 GPT-4o 作為可擴充圖表審查工具的潛力,可用於建立研究資料集並協助未來臨床環境中的診斷。 + +##### **Metamorphic Testing for Pose Estimation Systems** +2502.09460v1 
by Matias Duran, Thomas Laurent, Ellen Rushe, Anthony Ventresque + +Pose estimation systems are used in a variety of fields, from sports +analytics to livestock care. Given their potential impact, it is paramount to +systematically test their behaviour and potential for failure. This is a +complex task due to the oracle problem and the high cost of manual labelling +necessary to build ground truth keypoints. This problem is exacerbated by the +fact that different applications require systems to focus on different subjects +(e.g., human versus animal) or landmarks (e.g., only extremities versus whole +body and face), which makes labelled test data rarely reusable. To combat these +problems we propose MET-POSE, a metamorphic testing framework for pose +estimation systems that bypasses the need for manual annotation while assessing +the performance of these systems under different circumstances. MET-POSE thus +allows users of pose estimation systems to assess the systems in conditions +that more closely relate to their application without having to label an ad-hoc +test dataset or rely only on available datasets, which may not be adapted to +their application domain. While we define MET-POSE in general terms, we also +present a non-exhaustive list of metamorphic rules that represent common +challenges in computer vision applications, as well as a specific way to +evaluate these rules. We then experimentally show the effectiveness of MET-POSE +by applying it to Mediapipe Holistic, a state of the art human pose estimation +system, with the FLIC and PHOENIX datasets. With these experiments, we outline +numerous ways in which the outputs of MET-POSE can uncover faults in pose +estimation systems at a similar or higher rate than classic testing using hand +labelled data, and show that users can tailor the rule set they use to the +faults and level of accuracy relevant to their application. 
-##### **Neuro-Conceptual Artificial Intelligence: Integrating OPM with Deep Learning to Enhance Question Answering Quality** -2502.09658v1 by Xin Kang, Veronika Shteingardt, Yuhan Wang, Dov Dori +摘要:姿勢估計系統應用於各種領域,從運動分析到牲畜照護。鑑於其潛在影響,系統性地測試其行為和故障潛力至關重要。由於預言機問題以及建立地面實況關鍵點所需的手動標記成本高,這是一項複雜的任務。這個問題因不同的應用需要系統專注於不同的主體(例如,人類對動物)或地標(例如,只有四肢對全身和臉部)而加劇,這使得標記的測試數據很少可以重複使用。為了解決這些問題,我們提出了 MET-POSE,這是一個姿勢估計系統的變形測試框架,在評估這些系統在不同情況下的性能時,可以繞過手動註解的需要。因此,MET-POSE 允許姿勢估計系統的使用者在更接近其應用程式的條件下評估系統,而無需標記臨時測試數據集或僅依賴可用數據集,這些數據集可能不適合其應用領域。雖然我們以一般術語定義 MET-POSE,但我們也提供了一個非詳盡的變形規則列表,這些規則代表了電腦視覺應用中的常見挑戰,以及評估這些規則的具體方法。然後,我們通過將 MET-POSE 應用於 Mediapipe Holistic(一種先進的人類姿勢估計系統),並使用 FLIC 和 PHOENIX 數據集,以實驗方式展示 MET-POSE 的有效性。通過這些實驗,我們概述了 MET-POSE 的輸出可以揭示姿勢估計系統中故障的許多方法,其速度與使用手動標記數據的傳統測試類似或更高,並表明使用者可以根據其應用程式相關的故障和準確度等級來調整他們使用的規則集。 -Knowledge representation and reasoning are critical challenges in Artificial -Intelligence (AI), particularly in integrating neural and symbolic approaches -to achieve explainable and transparent AI systems. Traditional knowledge -representation methods often fall short of capturing complex processes and -state changes. We introduce Neuro-Conceptual Artificial Intelligence (NCAI), a -specialization of the neuro-symbolic AI approach that integrates conceptual -modeling using Object-Process Methodology (OPM) ISO 19450:2024 with deep -learning to enhance question-answering (QA) quality. By converting natural -language text into OPM models using in-context learning, NCAI leverages the -expressive power of OPM to represent complex OPM elements-processes, objects, -and states-beyond what traditional triplet-based knowledge graphs can easily -capture. This rich structured knowledge representation improves reasoning -transparency and answer accuracy in an OPM-QA system. We further propose -transparency evaluation metrics to quantitatively measure how faithfully the -predicted reasoning aligns with OPM-based conceptual logic. 
Our experiments -demonstrate that NCAI outperforms traditional methods, highlighting its -potential for advancing neuro-symbolic AI by providing rich knowledge -representations, measurable transparency, and improved reasoning. +##### **Towards Virtual Clinical Trials of Radiology AI with Conditional Generative Modeling** +2502.09688v1 by Benjamin D. Killeen, Bohua Wan, Aditya V. Kulkarni, Nathan Drenkow, Michael Oberst, Paul H. Yi, Mathias Unberath -摘要:知識表徵與推理是人工智慧 (AI) 中的重大挑戰,特別是在整合神經與符號方法以實現可解釋且透明的人工智慧系統時。傳統的知識表徵方法通常無法捕捉複雜的流程和狀態變化。我們引入了神經概念人工智慧 (NCAI),一種神經符號 AI 方法的專門化,它將使用物件流程方法 (OPM) ISO 19450:2024 的概念建模與深度學習整合在一起,以提升問答 (QA) 的品質。透過使用情境學習將自然語言文字轉換為 OPM 模型,NCAI 充分利用 OPM 的表達能力來表徵複雜的 OPM 元素(流程、物件和狀態),超越傳統的三元組知識圖表容易捕捉的範圍。這種豐富的結構化知識表徵改善了 OPM-QA 系統中的推理透明度和答案準確度。我們進一步提出了透明度評估指標,以量化測量預測推理與基於 OPM 的概念邏輯的吻合程度。我們的實驗證明,NCAI 優於傳統方法,突顯了它在透過提供豐富的知識表徵、可測量的透明度和改善的推理來推進神經符號 AI 的潛力。 +Artificial intelligence (AI) is poised to transform healthcare by enabling +personalized and efficient care through data-driven insights. Although +radiology is at the forefront of AI adoption, in practice, the potential of AI +models is often overshadowed by severe failures to generalize: AI models can +have performance degradation of up to 20% when transitioning from controlled +test environments to clinical use by radiologists. This mismatch raises +concerns that radiologists will be misled by incorrect AI predictions in +practice and/or grow to distrust AI, rendering these promising technologies +practically ineffectual. Exhaustive clinical trials of AI models on abundant +and diverse data is thus critical to anticipate AI model degradation when +encountering varied data samples. Achieving these goals, however, is +challenging due to the high costs of collecting diverse data samples and +corresponding annotations. 
To overcome these limitations, we introduce a novel +conditional generative AI model designed for virtual clinical trials (VCTs) of +radiology AI, capable of realistically synthesizing full-body CT images of +patients with specified attributes. By learning the joint distribution of +images and anatomical structures, our model enables precise replication of +real-world patient populations with unprecedented detail at this scale. We +demonstrate meaningful evaluation of radiology AI models through VCTs powered +by our synthetic CT study populations, revealing model degradation and +facilitating algorithmic auditing for bias-inducing data attributes. Our +generative AI approach to VCTs is a promising avenue towards a scalable +solution to assess model robustness, mitigate biases, and safeguard patient +care by enabling simpler testing and evaluation of AI models in any desired +range of diverse patient populations. -##### **GCoT: Chain-of-Thought Prompt Learning for Graphs** -2502.08092v1 by Xingtong Yu, Chang Zhou, Zhongwei Kuai, Xinming Zhang, Yuan Fang +摘要:人工智慧 (AI) 準備透過資料驅動的見解,轉型醫療保健,並提供個人化且有效率的照護。儘管放射科處於 AI 採用的最前線,但在實務上,AI 模型的潛力往往會被嚴重的概化失敗所掩蓋:AI 模型在從受控測試環境轉移到放射科醫師的臨床使用時,效能可能會降低多達 20%。這種不匹配引發了疑慮,即放射科醫師在實務上會被不正確的 AI 預測誤導,和/或開始不信任 AI,讓這些有前景的技術在實務上形同失效。因此,在 AI 模型遭遇各種資料範例時,預期 AI 模型的衰退,對豐富且多樣化的資料進行 AI 模型的全面臨床試驗至關重要。然而,由於收集多樣化的資料範例和對應註解的成本很高,實現這些目標具有挑戰性。為了克服這些限制,我們引進一個創新的條件式生成式 AI 模型,專門用於放射科 AI 的虛擬臨床試驗 (VCT),能夠真實地合成具有特定屬性的病患全身電腦斷層 (CT) 影像。透過學習影像和解剖結構的聯合分佈,我們的模型能夠以空前的細節精確複製真實世界的病患族群。我們透過由我們合成的電腦斷層研究族群支援的 VCT,展示了放射科 AI 模型有意義的評估,揭露模型衰退,並促進演算法稽核,以找出導致偏差的資料屬性。我們對 VCT 的生成式 AI 方法,是一個有前景的途徑,可以評估模型的穩健性、減輕偏差,並透過在任何所需的各種病患族群中,進行更簡單的 AI 模型測試和評估,來保障病患照護。 -Chain-of-thought (CoT) prompting has achieved remarkable success in natural -language processing (NLP). However, its vast potential remains largely -unexplored for graphs. This raises an interesting question: How can we design -CoT prompting for graphs to guide graph models to learn step by step? 
On one -hand, unlike natural languages, graphs are non-linear and characterized by -complex topological structures. On the other hand, many graphs lack textual -data, making it difficult to formulate language-based CoT prompting. In this -work, we propose the first CoT prompt learning framework for text-free graphs, -GCoT. Specifically, we decompose the adaptation process for each downstream -task into a series of inference steps, with each step consisting of -prompt-based inference, ``thought'' generation, and thought-conditioned prompt -learning. While the steps mimic CoT prompting in NLP, the exact mechanism -differs significantly. Specifically, at each step, an input graph, along with a -prompt, is first fed into a pre-trained graph encoder for prompt-based -inference. We then aggregate the hidden layers of the encoder to construct a -``thought'', which captures the working state of each node in the current step. -Conditioned on this thought, we learn a prompt specific to each node based on -the current state. These prompts are fed into the next inference step, -repeating the cycle. To evaluate and analyze the effectiveness of GCoT, we -conduct comprehensive experiments on eight public datasets, which demonstrate -the advantage of our approach. 
+##### **Mind What You Ask For: Emotional and Rational Faces of Persuasion by Large Language Models** +2502.09687v1 by Wiktoria Mieleszczenko-Kowszewicz, Beata Bajcar, Jolanta Babiak, Berenika Dyczek, Jakub Świstak, Przemysław Biecek -摘要:鏈式思考 (CoT) 提示在自然語言處理 (NLP) 中取得了顯著的成功。然而,其龐大的潛力在圖形方面仍未得到充分探索。這提出了一個有趣的問題:我們如何設計圖形的 CoT 提示來指導圖形模型逐步學習?一方面,與自然語言不同,圖形是非線性的,並且具有複雜的拓撲結構。另一方面,許多圖形缺乏文本數據,這使得難以制定基於語言的 CoT 提示。在這項工作中,我們提出了第一個適用於無文本圖形的 CoT 提示學習框架 GCoT。具體來說,我們將每個下游任務的適應過程分解為一系列推理步驟,每個步驟都包含基於提示的推理、「思想」生成以及基於思想的提示學習。雖然這些步驟模擬了 NLP 中的 CoT 提示,但具體機制卻有很大不同。具體來說,在每一步中,一個輸入圖形連同一個提示首先被輸入到一個預訓練的圖形編碼器中進行基於提示的推理。然後,我們聚合編碼器的隱藏層以構建一個「思想」,它捕獲了當前步驟中每個節點的工作狀態。基於這個思想,我們根據當前狀態學習一個特定於每個節點的提示。這些提示被輸入到下一個推理步驟中,重複這個循環。為了評估和分析 GCoT 的有效性,我們對八個公共數據集進行了全面的實驗,這證明了我們方法的優勢。 +Be careful what you ask for, you just might get it. This saying fits with the +way large language models (LLMs) are trained, which, instead of being rewarded +for correctness, are increasingly rewarded for pleasing the recipient. So, they +are increasingly effective at persuading us that their answers are valuable. +But what tricks do they use in this persuasion? In this study, we examine what +are the psycholinguistic features of the responses used by twelve different +language models. By grouping response content according to rational or +emotional prompts and exploring social influence principles employed by LLMs, +we ask whether and how we can mitigate the risks of LLM-driven mass +misinformation. We position this study within the broader discourse on +human-centred AI, emphasizing the need for interdisciplinary approaches to +mitigate cognitive and societal risks posed by persuasive AI responses. 
-##### **Linking Cryptoasset Attribution Tags to Knowledge Graph Entities: An LLM-based Approach** -2502.10453v1 by Régnier Avice, Bernhard Haslhofer, Zhidong Li, Jianlong Zhou +摘要:小心你要求的,你可能真的會得到。這句話適用於大型語言模型 (LLM) 的訓練方式,它們不是因為正確性而獲得獎勵,而是因為取悅接收者而獲得越來越多的獎勵。因此,它們越來越有效地說服我們,它們的答案是有價值的。但是它們在這種說服中使用什麼技巧呢?在這項研究中,我們探討了十二種不同的語言模型使用的回應的心理語言特徵。通過根據理性和情緒提示對回應內容進行分組,並探討 LLM 使用的社會影響原則,我們探討是否以及如何減輕 LLM 驅動的大規模錯誤信息的風險。我們將這項研究定位在以人為中心的 AI 的更廣泛討論中,強調需要跨學科方法來減輕具有說服力的 AI 回應帶來的認知和社會風險。 -Attribution tags form the foundation of modern cryptoasset forensics. -However, inconsistent or incorrect tags can mislead investigations and even -result in false accusations. To address this issue, we propose a novel -computational method based on Large Language Models (LLMs) to link attribution -tags with well-defined knowledge graph concepts. We implemented this method in -an end-to-end pipeline and conducted experiments showing that our approach -outperforms baseline methods by up to 37.4% in F1-score across three publicly -available attribution tag datasets. By integrating concept filtering and -blocking procedures, we generate candidate sets containing five knowledge graph -entities, achieving a recall of 93% without the need for labeled data. -Additionally, we demonstrate that local LLM models can achieve F1-scores of -90%, comparable to remote models which achieve 94%. We also analyze the -cost-performance trade-offs of various LLMs and prompt templates, showing that -selecting the most cost-effective configuration can reduce costs by 90%, with -only a 1% decrease in performance. Our method not only enhances attribution tag -quality but also serves as a blueprint for fostering more reliable forensic -evidence. 
+##### **The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics** +2502.09247v1 by Danni Feng, Runzhi Li, Jing Wang, Siyu Yan, Lihong Ma, Yunli Xing -摘要:歸因標籤構成現代加密資產鑑識的基礎。 -然而,不一致或不正確的標籤會誤導調查,甚至導致錯誤的指控。為了解決這個問題,我們提出了一種基於大型語言模型 (LLM) 的新型計算方法,將歸因標籤與定義明確的知識圖譜概念連結起來。我們在端到端管道中實施了這種方法,並進行了實驗,結果顯示我們的做法在三個公開可用的歸因標籤資料集中,F1 分數比基線方法高出 37.4%。透過整合概念過濾和封鎖程序,我們生成了包含五個知識圖譜實體的候選集,在不需要標籤資料的情況下,達到了 93% 的召回率。 -此外,我們證明了本機 LLM 模型可以達到 90% 的 F1 分數,與達到 94% 的遠端模型相當。我們也分析了各種 LLM 和提示範本的成本效益權衡,結果顯示選擇最具成本效益的設定可以將成本降低 90%,而效能只下降 1%。我們的做法不僅提升了歸因標籤的品質,也作為促進更可靠鑑識證據的藍圖。 +Joint entity-relation extraction is a critical task in transforming +unstructured or semi-structured text into triplets, facilitating the +construction of large-scale knowledge graphs, and supporting various downstream +applications. Despite its importance, research on Chinese text, particularly +with complex semantics in specialized domains like medicine, remains limited. +To address this gap, we introduce the CH-DDI, a Chinese drug-drug interactions +dataset designed to capture the intricacies of medical text. Leveraging the +strengths of attention mechanisms in capturing long-range dependencies, we +propose the SEA module, which enhances the extraction of complex contextual +semantic information, thereby improving entity recognition and relation +extraction. Additionally, to address the inefficiencies of existing methods in +facilitating information exchange between entity recognition and relation +extraction, we present an interactive fusion representation module. This module +employs Cross Attention for bidirectional information exchange between the +tasks and further refines feature extraction through BiLSTM. Experimental +results on both our CH-DDI dataset and public CoNLL04 dataset demonstrate that +our model exhibits strong generalization capabilities. 
On the CH-DDI dataset, +our model achieves an F1-score of 96.73% for entity recognition and 78.43% for +relation extraction. On the CoNLL04 dataset, it attains an entity recognition +precision of 89.54% and a relation extraction accuracy of 71.64%. -##### **Deep Semantic Graph Learning via LLM based Node Enhancement** -2502.07982v1 by Chuanqi Shi, Yiyi Tao, Hang Zhang, Lun Wang, Shaoshuai Du, Yixian Shen, Yanxin Shen +摘要:聯合實體關係抽取是將非結構化或半結構化文字轉換為三元組的重要任務,有助於建構大規模知識圖譜,並支援各種下游應用程式。儘管其重要性,但針對中文文本的研究,特別是醫學等專業領域中具有複雜語義的研究仍十分有限。為了解決這個差距,我們引入了 CH-DDI,一個中文藥物-藥物交互作用資料集,旨在擷取醫學文本的複雜性。利用注意力機制在擷取長程依賴關係方面的優勢,我們提出了 SEA 模組,增強了複雜脈絡語義資訊的抽取,從而改進了實體辨識和關係抽取。此外,為了解決現有方法在促進實體辨識和關係抽取之間資訊交換方面的低效率問題,我們提出了互動式融合表示模組。此模組採用交叉注意力,在任務之間進行雙向資訊交換,並透過 BiLSTM 進一步精煉特徵抽取。在我們的 CH-DDI 資料集和公開的 CoNLL04 資料集上的實驗結果表明,我們的模型展現出強大的泛化能力。在 CH-DDI 資料集上,我們的模型在實體辨識方面達到了 96.73% 的 F1 分數,在關係抽取方面達到了 78.43% 的 F1 分數。在 CoNLL04 資料集上,它在實體辨識方面達到了 89.54% 的準確度,在關係抽取方面達到了 71.64% 的準確度。 -Graph learning has attracted significant attention due to its widespread -real-world applications. Current mainstream approaches rely on text node -features and obtain initial node embeddings through shallow embedding learning -using GNNs, which shows limitations in capturing deep textual semantics. Recent -advances in Large Language Models (LLMs) have demonstrated superior -capabilities in understanding text semantics, transforming traditional text -feature processing. This paper proposes a novel framework that combines Graph -Transformer architecture with LLM-enhanced node features. Specifically, we -leverage LLMs to generate rich semantic representations of text nodes, which -are then processed by a multi-head self-attention mechanism in the Graph -Transformer to capture both local and global graph structural information. Our -model utilizes the Transformer's attention mechanism to dynamically aggregate -neighborhood information while preserving the semantic richness provided by LLM -embeddings. 
Experimental results demonstrate that the LLM-enhanced node -features significantly improve the performance of graph learning models on node -classification tasks. This approach shows promising results across multiple -graph learning tasks, offering a practical direction for combining graph -networks with language models. +##### **From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine** +2502.09242v1 by Lukas Buess, Matthias Keicher, Nassir Navab, Andreas Maier, Soroosh Tayebi Arasteh -摘要:圖形學習因其廣泛的現實世界應用而備受關注。目前的熱門方法依賴於文本節點特徵,並通過使用 GNN 的淺層嵌入學習來獲取初始節點嵌入,這在捕捉深度文本語義方面表現出局限性。大語言模型 (LLM) 的最新進展已證明在理解文本語義方面具有優越的能力,轉換了傳統的文本特徵處理。本文提出了一種新的框架,將圖形轉換器架構與 LLM 增強的節點特徵相結合。具體來說,我們利用 LLM 來生成文本節點的豐富語義表示,然後在圖形轉換器中由多頭自我注意機制處理,以捕捉局部和全局圖形結構信息。我們的模型利用 Transformer 的注意機制來動態聚合鄰域信息,同時保留 LLM 嵌入提供的語義豐富性。實驗結果表明,LLM 增強的節點特徵顯著提高了圖形學習模型在節點分類任務上的性能。這種方法在多個圖形學習任務中顯示出有希望的結果,為將圖形網絡與語言模型相結合提供了實用的方向。 +Generative artificial intelligence (AI) models, such as diffusion models and +OpenAI's ChatGPT, are transforming medicine by enhancing diagnostic accuracy +and automating clinical workflows. The field has advanced rapidly, evolving +from text-only large language models for tasks such as clinical documentation +and decision support to multimodal AI systems capable of integrating diverse +data modalities, including imaging, text, and structured data, within a single +model. The diverse landscape of these technologies, along with rising interest, +highlights the need for a comprehensive review of their applications and +potential. This scoping review explores the evolution of multimodal AI, +highlighting its methods, applications, datasets, and evaluation in clinical +settings. Adhering to PRISMA-ScR guidelines, we systematically queried PubMed, +IEEE Xplore, and Web of Science, prioritizing recent studies published up to +the end of 2024. After rigorous screening, 144 papers were included, revealing +key trends and challenges in this dynamic field. 
Our findings underscore a +shift from unimodal to multimodal approaches, driving innovations in diagnostic +support, medical report generation, drug discovery, and conversational AI. +However, critical challenges remain, including the integration of heterogeneous +data types, improving model interpretability, addressing ethical concerns, and +validating AI systems in real-world clinical settings. This review summarizes +the current state of the art, identifies critical gaps, and provides insights +to guide the development of scalable, trustworthy, and clinically impactful +multimodal AI solutions in healthcare. -##### **Cardiverse: Harnessing LLMs for Novel Card Game Prototyping** -2502.07128v1 by Danrui Li, Sen Zhang, Sam S. Sohn, Kaidong Hu, Muhammad Usman, Mubbasir Kapadia +摘要:生成式人工智能 (AI) 模型,例如扩散模型和 OpenAI 的 ChatGPT,通过提高诊断准确性和自动化临床工作流程,正在改变医学领域。该领域已迅速发展,从用于临床文件编制和决策支持等任务的纯文本大型语言模型,发展到能够在单个模型中整合包括影像、文本和结构化数据在内的多种数据方式的多模态 AI 系统。这些技术的多样化格局以及日益增长的兴趣,凸显了全面审查其应用和潜力的必要性。本范围审查探讨了多模态 AI 的演变,重点介绍了其方法、应用、数据集和在临床环境中的评估。遵循 PRISMA-ScR 指南,我们系统地查询了 PubMed、IEEE Xplore 和 Web of Science,优先考虑截至 2024 年底发表的最新研究。经过严格筛选,纳入了 144 篇论文,揭示了这一充满活力的领域的趋势和挑战。我们的研究结果强调了从单模态方法向多模态方法的转变,推动了诊断支持、医疗报告生成、药物发现和会话式 AI 的创新。然而,关键挑战仍然存在,包括异构数据类型的整合、提高模型可解释性、解决伦理问题以及在现实世界的临床环境中验证 AI 系统。本综述总结了当前的最新技术,确定了关键差距,并提供了见解,以指导在医疗保健领域开发可扩展、可信赖且具有临床影响力的多模态 AI 解决方案。 -The prototyping of computer games, particularly card games, requires -extensive human effort in creative ideation and gameplay evaluation. Recent -advances in Large Language Models (LLMs) offer opportunities to automate and -streamline these processes. However, it remains challenging for LLMs to design -novel game mechanics beyond existing databases, generate consistent gameplay -environments, and develop scalable gameplay AI for large-scale evaluations. -This paper addresses these challenges by introducing a comprehensive automated -card game prototyping framework. 
The approach highlights a graph-based indexing -method for generating novel game designs, an LLM-driven system for consistent -game code generation validated by gameplay records, and a gameplay AI -constructing method that uses an ensemble of LLM-generated action-value -functions optimized through self-play. These contributions aim to accelerate -card game prototyping, reduce human labor, and lower barriers to entry for game -developers. +##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration** +2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano -摘要:電腦遊戲,尤其是卡牌遊戲的原型製作,需要大量的人力在創意構思和遊戲玩法評估上。大型語言模型 (LLM) 的最新進展提供了自動化和簡化這些流程的機會。然而,LLM 在設計超越現有資料庫的新穎遊戲機制、生成一致的遊戲環境,以及開發用於大規模評估的可擴充遊戲 AI 方面仍然面臨挑戰。本文通過引入一個全面的自動化卡牌遊戲原型製作框架來應對這些挑戰。該方法強調了一種基於圖表的索引方法,用於生成新穎的遊戲設計,一個由 LLM 驅動的系統,用於一致的遊戲程式碼生成,並由遊戲記錄驗證,以及一個遊戲 AI 構建方法,該方法使用由 LLM 生成的動作值函數的集合,通過自我對弈進行最佳化。這些貢獻旨在加速卡牌遊戲原型製作,減少人力,並降低遊戲開發人員的進入門檻。 +This paper presents a complete explainable system that interprets a set of +data, abstracts the underlying features and describes them in a natural +language of choice. The system relies on two crucial stages: (i) identifying +emerging properties from data and transforming them into abstract concepts, and +(ii) converting these concepts into natural language. Despite the impressive +natural language generation capabilities demonstrated by Large Language Models, +their statistical nature and the intricacy of their internal mechanism still +force us to employ these techniques as black boxes, forgoing trustworthiness. +Developing an explainable pipeline for data interpretation would allow +facilitating its use in safety-critical environments like processing medical +information and allowing non-experts and visually impaired people to access +narrated information. To this end, we believe that the fields of knowledge +representation and automated reasoning research could present a valid +alternative. 
Expanding on prior research that tackled the first stage (i), we +focus on the second stage, named Concept2Text. Being explainable, data +translation is easily modeled through logic-based rules, once again emphasizing +the role of declarative programming in achieving AI explainability. This paper +explores a Prolog/CLP-based rewriting system to interpret concepts-articulated +in terms of classes and relations, plus common knowledge-derived from a generic +ontology, generating natural language text. Its main features include +hierarchical tree rewritings, modular multilingual generation, support for +equivalent variants across semantic, grammar, and lexical levels, and a +transparent rule-based system. We outline the architecture and demonstrate its +flexibility through some examples capable of generating numerous diverse and +equivalent rewritings based on the input concept. -##### **GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural Processing Units** -2502.06921v2 by Arghadip Das, Shamik Kundu, Arnab Raha, Soumendu Ghosh, Deepak Mathaikutty, Vijay Raghunathan +摘要:這篇論文提出了一個完整的可解釋系統,它可以解釋一組資料,抽象出基礎特徵,並以選擇的自然語言描述它們。系統依賴兩個關鍵階段:(i) 從資料中識別新興屬性,並將它們轉換為抽象概念,以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力,但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子,放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它,例如處理醫療資訊,並允許非專家和視障人士存取敘述資訊。為此,我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上,我們專注於第二階段,稱為 Concept2Text。由於具有可解釋性,資料翻譯很容易透過基於邏輯的規則建模,再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統,以解釋概念,這些概念以類別和關係的形式表達,再加上從通用本体衍生的常識,產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體,以及一個透明的基於規則的系統。我們概述了架構,並透過一些範例展示了它的靈活性,這些範例能夠根據輸入概念生成許多不同的等效重寫。 -Graph Neural Networks (GNNs) are vital for learning from graph-structured -data, enabling applications in network analysis, recommendation systems, and -speech analytics. Deploying them on edge devices like client PCs and laptops -enhances real-time processing, privacy, and cloud independence. 
GNNs aid -Retrieval-Augmented Generation (RAG) for Large Language Models (LLMs) and -enable event-based vision tasks. However, irregular memory access, sparsity, -and dynamic structures cause high latency and energy overhead on -resource-constrained devices. While modern edge processors integrate CPUs, -GPUs, and NPUs, NPUs designed for data-parallel tasks struggle with irregular -GNN computations. We introduce GraNNite, the first hardware-aware framework -optimizing GNN execution on commercial-off-the-shelf (COTS) SOTA DNN -accelerators via a structured three-step methodology: (1) enabling NPU -execution, (2) optimizing performance, and (3) trading accuracy for efficiency -gains. Step 1 employs GraphSplit for workload distribution and StaGr for static -aggregation, while GrAd and NodePad handle dynamic graphs. Step 2 boosts -performance using EffOp for control-heavy tasks and GraSp for sparsity -exploitation. Graph Convolution optimizations PreG, SymG, and CacheG reduce -redundancy and memory transfers. Step 3 balances quality versus efficiency, -where QuantGr applies INT8 quantization, and GrAx1, GrAx2, and GrAx3 accelerate -attention, broadcast-add, and SAGE-max aggregation. On Intel Core Ultra AI PCs, -GraNNite achieves 2.6X to 7.6X speedups over default NPU mappings and up to -8.6X energy gains over CPUs and GPUs, delivering 10.8X and 6.7X higher -performance than CPUs and GPUs, respectively, across GNN models. +##### **Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York** +2502.09204v1 by Sanskar Sehgal, Yanhong A. 
Liu -摘要:圖形神經網路 (GNN) 對於從圖形結構資料中學習至關重要,能應用於網路分析、推薦系統和語音分析。將其部署在邊緣裝置(例如用戶端電腦和筆電)上可增強即時處理、隱私和雲端獨立性。GNN 協助大型語言模型 (LLM) 的檢索增強生成 (RAG),並支援基於事件的視覺任務。然而,不規則的記憶體存取、稀疏性和動態結構會導致資源受限裝置上的高延遲和能源負擔。儘管現代邊緣處理器整合了 CPU、GPU 和 NPU,但針對資料平行任務所設計的 NPU 難以處理不規則的 GNN 計算。我們引入了 GraNNite,這是第一個硬體感知框架,透過結構化的三步驟方法最佳化商用現成 (COTS) SOTA DNN 加速器上的 GNN 執行:(1) 啟用 NPU 執行,(2) 最佳化效能,以及 (3) 以準確度換取效率提升。步驟 1 使用 GraphSplit 進行工作負載分配,並使用 StaGr 進行靜態聚合,而 GrAd 和 NodePad 則處理動態圖形。步驟 2 使用 EffOp 提升控制密集型任務的效能,並使用 GraSp 進行稀疏性利用。圖形卷積最佳化 PreG、SymG 和 CacheG 減少了冗餘和記憶體傳輸。步驟 3 平衡品質與效率,其中 QuantGr 適用 INT8 量化,而 GrAx1、GrAx2 和 GrAx3 則加速注意力、廣播加法和 SAGE-max 聚合。在 Intel Core Ultra AI PC 上,GraNNite 在預設 NPU 映射上實現了 2.6X 到 7.6X 的加速,在 CPU 和 GPU 上實現了高達 8.6X 的能源增益,在 GNN 模型中分別提供了比 CPU 和 GPU 高出 10.8X 和 6.7X 的效能。 +Legal cases require careful logical reasoning following the laws, whereas +interactions with non-technical users must be in natural language. As an +application combining logical reasoning using Prolog and natural language +processing using large language models (LLMs), this paper presents a novel +approach and system, LogicLease, to automate the analysis of landlord-tenant +legal cases in the state of New York. LogicLease determines compliance with +relevant legal requirements by analyzing case descriptions and citing all +relevant laws. It leverages LLMs for information extraction and Prolog for +legal reasoning. By separating information extraction from legal reasoning, +LogicLease achieves greater transparency and control over the legal logic +applied to each case. We evaluate the accuracy, efficiency, and robustness of +LogicLease through a series of tests, achieving 100% accuracy and an average +processing time of 2.57 seconds. LogicLease presents advantages over +state-of-the-art LLM-based legal analysis systems by providing clear, +step-by-step reasoning, citing specific laws, and distinguishing itself by its +ability to avoid hallucinations -- a common issue in LLMs. 
-##### **Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language** -2502.06634v1 by Zhiqiang Zhong, Simon Sataa-Yu Larsen, Haoyu Guo, Tao Tang, Kuangyu Zhou, Davide Mottin +摘要:法律案件需要遵循法律进行谨慎的逻辑推理,而与非技术用户的互动必须使用自然语言。作为结合使用 Prolog 进行逻辑推理和使用大型语言模型 (LLM) 进行自然语言处理的应用程序,本文提出了一种新颖的方法和系统 LogicLease,以自动分析纽约州的房东与租户法律案件。LogicLease 通过分析案例描述并引用所有相关法律来确定是否符合相关法律要求。它利用 LLM 进行信息提取,并利用 Prolog 进行法律推理。通过将信息提取与法律推理分开,LogicLease 实现了对应用于每个案例的法律逻辑的更高透明度和控制力。我们通过一系列测试评估了 LogicLease 的准确性、效率和鲁棒性,实现了 100% 的准确性和 2.57 秒的平均处理时间。LogicLease 通过提供清晰、分步的推理,引用具体法律,并以其避免幻觉的能力而区别于最先进的基于 LLM 的法律分析系统,从而显示出优势——这是 LLM 中的常见问题。 -Recent advancements in AI for biological research focus on integrating -molecular data with natural language to accelerate drug discovery. However, the -scarcity of high-quality annotations limits progress in this area. This paper -introduces LA$^3$, a Language-based Automatic Annotation Augmentation framework -that leverages large language models to augment existing datasets, thereby -improving AI training. We demonstrate the effectiveness of LA$^3$ by creating -an enhanced dataset, LaChEBI-20, where we systematically rewrite the -annotations of molecules from an established dataset. These rewritten -annotations preserve essential molecular information while providing more -varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5 -based on a benchmark architecture to learn the mapping between molecular -representations and augmented annotations. - Experimental results on text-based *de novo* molecule generation and molecule -captioning demonstrate that LaMolT5 outperforms state-of-the-art models. -Notably, incorporating LA$^3$ leads to improvements of up to 301% over the -benchmark architecture. Furthermore, we validate the effectiveness of LA$^3$ -notable applications in *image*, *text* and *graph* tasks, affirming its -versatility and utility. 
+##### **Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia** +2502.09173v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott -摘要:人工智慧在生物研究上的最新進展,專注於將分子資料與自然語言整合,以加速藥物發現。然而,高品質註解的稀少限制了此領域的進展。這篇論文介紹了 LA$^3$,一個基於語言的自動註解擴充框架,它利用大型語言模型來擴充現有的資料集,進而改善人工智慧訓練。我們透過建立一個增強的資料集 LaChEBI-20 來展示 LA$^3$ 的有效性,我們系統性地改寫了一個既定資料集中分子的註解。這些改寫的註解保留了重要的分子資訊,同時提供了更多樣化的句子結構和詞彙。使用 LaChEBI-20,我們在基於基準架構上訓練 LaMolT5,以學習分子表示和擴充註解之間的對應。 -在基於文字的 *從頭開始* 分子生成和分子標題上的實驗結果表明,LaMolT5 優於最先進的模型。值得注意的是,納入 LA$^3$ 可讓基準架構的改進幅度高達 301%。此外,我們驗證了 LA$^3$ 在 *影像*、*文字* 和 *圖形* 任務中的有效性,肯定了它的多功能性和實用性。 +In remote healthcare monitoring, time series representation learning reveals +critical patient behavior patterns from high-frequency data. This study +analyzes home activity data from individuals living with dementia by proposing +a two-stage, self-supervised learning approach tailored to uncover low-rank +structures. The first stage converts time-series activities into text sequences +encoded by a pre-trained language model, providing a rich, high-dimensional +latent state space using a PageRank-based method. This PageRank vector captures +latent state transitions, effectively compressing complex behaviour data into a +succinct form that enhances interpretability. This low-rank representation not +only enhances model interpretability but also facilitates clustering and +transition analysis, revealing key behavioral patterns correlated with +clinical metrics such as MMSE and ADAS-COG scores. Our findings demonstrate the +framework's potential in supporting cognitive status prediction, personalized +care interventions, and large-scale health monitoring.
-##### **KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment** -2502.06472v1 by Yuxing Lu, Jinzhuo Wang +摘要:在遠程醫療監控中,時間序列表示學習揭示了高頻率數據中的關鍵患者行為模式。本研究通過提出一個兩階段、自我監督的學習方法來分析痴呆症患者的家庭活動數據,該方法專門用於發現低秩結構。第一階段將時間序列活動轉換為由預訓練語言模型編碼的文本序列,使用基於 PageRank 的方法提供了一個豐富、高維的潛在狀態空間。此 PageRank 向量捕獲潛在狀態轉換,有效地將複雜的行為數據壓縮成簡潔的形式,從而增強了解力。此低秩表示不僅增強了模型的可解釋性,還促進了聚類和轉換分析,揭示了與臨床指標(例如 MMSE 和 ADAS-COG 分數)相關的關鍵行為模式。我們的研究結果證明了該框架在支持認知狀態預測、個性化護理干預和大型健康監控方面的潛力。 -Maintaining comprehensive and up-to-date knowledge graphs (KGs) is critical -for modern AI systems, but manual curation struggles to scale with the rapid -growth of scientific literature. This paper presents KARMA, a novel framework -employing multi-agent large language models (LLMs) to automate KG enrichment -through structured analysis of unstructured text. Our approach employs nine -collaborative agents, spanning entity discovery, relation extraction, schema -alignment, and conflict resolution that iteratively parse documents, verify -extracted knowledge, and integrate it into existing graph structures while -adhering to domain-specific schema. Experiments on 1,200 PubMed articles from -three different domains demonstrate the effectiveness of KARMA in knowledge -graph enrichment, with the identification of up to 38,230 new entities while -achieving 83.1\% LLM-verified correctness and reducing conflict edges by 18.6\% -through multi-layer assessments. 
+##### **TastepepAI, An artificial intelligence platform for taste peptide de novo design** +2502.12167v1 by Jianda Yue, Tingting Li, Jian Ouyang, Jiawei Xu, Hua Tan, Zihui Chen, Changsheng Han, Huanyu Li, Songping Liang, Zhonghua Liu, Zhonghua Liu, Ying Wang -摘要:維護全面且最新的知識圖譜 (KG) 對現代 AI 系統至關重要,但手動策劃難以隨著科學文獻的快速增長而擴展。本文提出了 KARMA,一個採用多代理大型語言模型 (LLM) 的新框架,透過對非結構化文本的結構化分析來自動化 KG 豐富化。我們的做法採用九個協作代理,涵蓋實體發現、關係提取、架構比對和衝突解決,這些代理會反覆分析文件、驗證提取的知識,並將其整合到現有的圖結構中,同時遵守特定領域的架構。針對來自三個不同領域的 1,200 篇 PubMed 文章進行的實驗證明了 KARMA 在知識圖譜豐富化方面的有效性,識別出多達 38,230 個新實體,同時達到 83.1% 的 LLM 驗證正確性,並透過多層評估將衝突邊緣降低了 18.6%。 +Taste peptides have emerged as promising natural flavoring agents attributed +to their unique organoleptic properties, high safety profile, and potential +health benefits. However, the de novo identification of taste peptides derived +from animal, plant, or microbial sources remains a time-consuming and +resource-intensive process, significantly impeding their widespread application +in the food industry. Here, we present TastePepAI, a comprehensive artificial +intelligence framework for customized taste peptide design and safety +assessment. As the key element of this framework, a loss-supervised adaptive +variational autoencoder (LA-VAE) is implemented to efficiently optimize the +latent representation of sequences during training and facilitate the +generation of target peptides with desired taste profiles. Notably, our model +incorporates a novel taste-avoidance mechanism, allowing for selective flavor +exclusion. Subsequently, our in-house developed toxicity prediction algorithm +(SpepToxPred) is integrated into the framework to perform rigorous safety +evaluation of generated peptides. Using this integrated platform, we +successfully identified 73 peptides exhibiting sweet, salty, and umami tastes, +significantly expanding the current repertoire of taste peptides.
This work +demonstrates the potential of TastePepAI in accelerating taste peptide +discovery for food applications and provides a versatile framework adaptable to +broader peptide engineering challenges. -##### **RoToR: Towards More Reliable Responses for Order-Invariant Inputs** -2502.08662v1 by Soyoung Yoon, Dongha Ahn, Youngwon Lee, Minkyu Jung, HyungJoo Jang, Seung-won Hwang +摘要:味觉肽因其独特的感官特性、高安全性概况和潜在的健康益处而成为有前途的天然调味剂。然而,从动物、植物或微生物来源中从头鉴定味觉肽仍然是一个耗时且资源密集的过程,严重阻碍了它们在食品工业中的广泛应用。在此,我们提出了 TastePepAI,这是一个用于定制味觉肽设计和安全性评估的综合人工智能框架。作为该框架的关键元素,实现了损失监督自适应变分自动编码器 (LA-VAE),以在训练期间有效优化序列的潜在表示,并促进生成具有所需味觉特征的目标肽。值得注意的是,我们的模型包含了一种新颖的味觉回避机制,允许选择性排除风味。随后,我们内部开发的毒性预测算法 (SpepToxPred) 被集成到框架中,以对生成的肽进行严格的安全评估。使用这个集成平台,我们成功地鉴定了 73 种表现出甜味、咸味和鲜味的肽,极大地扩展了当前的味觉肽库。这项工作展示了 TastePepAI 在加速味觉肽发现以用于食品应用方面的潜力,并提供了一个适用于更广泛的肽工程挑战的多功能框架。 -Mitigating positional bias of language models (LMs) for listwise inputs is a -well-known and important problem (e.g., lost-in-the-middle). While zero-shot -order-invariant LMs have been proposed to solve this issue, their success on -practical listwise problems has been limited. In this work, as a first -contribution, we identify and overcome two limitations to make zero-shot -invariant LMs more practical: (1) training and inference distribution mismatch -arising from modifying positional ID assignments to enforce invariance, and (2) -failure to adapt to a mixture of order-invariant and sensitive inputs in -practical listwise problems. To overcome, we propose (1) RoToR, a zero-shot -invariant LM for genuinely order-invariant inputs with minimal modifications of -positional IDs, and (2) Selective Routing, an adaptive framework that handles -both order-invariant and order-sensitive inputs in listwise tasks. On the Lost -in the middle (LitM), Knowledge Graph Question Answering (KGQA), and MMLU -benchmarks, we show that RoToR with Selective Routing can effectively handle -practical listwise input tasks in a zero-shot manner. 
+##### **HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification** +2502.08754v1 by Valentina Vadori, Jean-Marie Graïc, Antonella Peruffo, Livio Finos, Ujwala Kiran Chaudhari, Enrico Grisan -摘要:語言模型 (LM) 的位置偏差緩解對於列表輸入來說是一個廣為人知且重要的問題(例如,迷失在中間)。雖然已經提出零次學習順序不變的 LM 來解決這個問題,但它們在實際列表問題上的成功卻很有限。在這項工作中,作為第一個貢獻,我們找出並克服了兩個限制,讓零次學習不變的 LM 更有實用性:(1) 訓練和推論分布不匹配,這是由於修改位置 ID 分配以強制不變性所造成的,以及 (2) 無法適應實際列表問題中不變和敏感輸入的組合。為了克服這些問題,我們提出 (1) RoToR,一個零次學習不變的 LM,用於真正不變的輸入,並對位置 ID 進行最小的修改,以及 (2) 選擇性路由,一個自適應框架,用於處理列表任務中不變和敏感的輸入。在迷失在中間 (LitM)、知識圖譜問答 (KGQA) 和 MMLU 基準測試中,我們展示了 RoToR 與選擇性路由可以有效地以零次學習的方式處理實際的列表輸入任務。 +Precise segmentation and classification of cell instances are vital for +analyzing the tissue microenvironment in histology images, supporting medical +diagnosis, prognosis, treatment planning, and studies of brain +cytoarchitecture. However, the creation of high-quality annotated datasets for +training remains a major challenge. This study introduces a novel single-stage +approach (HistoSmith) for generating image-label pairs to augment histology +datasets. Unlike state-of-the-art methods that utilize diffusion models with +separate components for label and image generation, our approach employs a +latent diffusion model to learn the joint distribution of cellular layouts, +classification masks, and histology images. This model enables tailored data +generation by conditioning on user-defined parameters such as cell types, +quantities, and tissue types. Trained on the Conic H&E histopathology dataset +and the Nissl-stained CytoDArk0 dataset, the model generates realistic and +diverse labeled samples. Experimental results demonstrate improvements in cell +instance segmentation and classification, particularly for underrepresented +cell types like neutrophils in the Conic dataset. These findings underscore the +potential of our approach to address data scarcity challenges. 
-##### **K-ON: Stacking Knowledge On the Head Layer of Large Language Model** -2502.06257v1 by Lingbing Guo, Yichi Zhang, Zhongpu Bo, Zhuo Chen, Mengshu Sun, Zhiqiang Zhang, Wen Zhang, Huajun Chen +摘要:精確的細胞實例分割和分類對於分析組織學影像中的組織微環境、支援醫療診斷、預後、治療規劃和腦部細胞結構研究至關重要。然而,建立用於訓練的高品質標註資料集仍然是一項重大挑戰。本研究提出了一種新穎的單階段方法 (HistoSmith),用於產生影像標籤對,以擴充組織學資料集。與利用擴散模型並將標籤和影像產生分開的組成部分的現有技術不同,我們的做法採用潛在擴散模型來學習細胞佈局、分類遮罩和組織學影像的聯合分佈。此模型能透過調整使用者定義的參數(例如細胞類型、數量和組織類型)來進行客製化資料產生。在 Conic H&E 細胞病理學資料集和 Nissl 染色的 CytoDArk0 資料集上訓練後,此模型產生逼真且多樣化的標籤樣本。實驗結果顯示細胞實例分割和分類有顯著進步,特別是對於 Conic 資料集中代表性不足的細胞類型,例如中性球。這些發現強調了我們的方法在解決資料稀少性挑戰方面的潛力。 -Recent advancements in large language models (LLMs) have significantly -improved various natural language processing (NLP) tasks. Typically, LLMs are -trained to predict the next token, aligning well with many NLP tasks. However, -in knowledge graph (KG) scenarios, entities are the fundamental units and -identifying an entity requires at least several tokens. This leads to a -granularity mismatch between KGs and natural languages. To address this issue, -we propose K-ON, which integrates KG knowledge into the LLM by employing -multiple head layers for next k-step prediction. K-ON can not only generate -entity-level results in one step, but also enables contrastive loss against -entities, which is the most powerful tool in KG representation learning. -Experimental results show that K-ON outperforms state-of-the-art methods that -incorporate text and even the other modalities. +##### **Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion** +2502.08560v1 by Lemuel Puglisi, Daniel C. 
Alexander, Daniele Ravì -摘要:大型語言模型 (LLM) 的最新進展顯著提升了各種自然語言處理 (NLP) 任務。通常,LLM 會接受訓練以預測下一個符號,這與許多 NLP 任務非常吻合。然而,在知識圖譜 (KG) 場景中,實體是基本單位,而識別實體至少需要幾個符號。這導致 KG 和自然語言之間的粒度不匹配。為了解決這個問題,我們提出了 K-ON,它透過採用多個頭部層進行下一個 k 步預測,將 KG 知識整合到 LLM 中。K-ON 不僅可以在一個步驟中產生實體層級的結果,還能針對實體啟用對比損失,這是 KG 表示學習中最有力的工具。實驗結果顯示,K-ON 優於將文字甚至其他方式納入考量的最新方法。 +The growing availability of longitudinal Magnetic Resonance Imaging (MRI) +datasets has facilitated Artificial Intelligence (AI)-driven modeling of +disease progression, making it possible to predict future medical scans for +individual patients. However, despite significant advancements in AI, current +methods continue to face challenges including achieving patient-specific +individualization, ensuring spatiotemporal consistency, efficiently utilizing +longitudinal data, and managing the substantial memory demands of 3D scans. To +address these challenges, we propose Brain Latent Progression (BrLP), a novel +spatiotemporal model designed to predict individual-level disease progression +in 3D brain MRIs. The key contributions in BrLP are fourfold: (i) it operates +in a small latent space, mitigating the computational challenges posed by +high-dimensional imaging data; (ii) it explicitly integrates subject metadata +to enhance the individualization of predictions; (iii) it incorporates prior +knowledge of disease dynamics through an auxiliary model, facilitating the +integration of longitudinal data; and (iv) it introduces the Latent Average +Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in +the predicted progression at inference time and (b) allows us to derive a +measure of the uncertainty for the prediction. We train and evaluate BrLP on +11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its +generalizability on an external test set comprising 2,257 MRIs from 962 +subjects. 
Our experiments compare BrLP-generated MRI scans with real follow-up +MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The +code is publicly available at: https://github.com/LemuelPuglisi/BrLP. -##### **LegalViz: Legal Text Visualization by Text To Diagram Generation** -2502.06147v2 by Eri Onami, Taiki Miyanishi, Koki Maeda, Shuhei Kurita +摘要:隨著縱向磁共振影像 (MRI) 資料集的日益普及,已促進人工智慧 (AI) 驅動的疾病進程建模,讓預測個別患者的未來醫學掃描成為可能。然而,儘管 AI 有顯著進展,目前的技術仍面臨挑戰,包括實現患者特定的個別化、確保時空一致性、有效利用縱向資料,以及管理 3D 掃描的大量記憶體需求。為了應對這些挑戰,我們提出腦潛在進程 (BrLP),這是一種新穎的時空模型,旨在預測 3D 腦部 MRI 中的個人層級疾病進程。BrLP 的主要貢獻有四個:(i) 它在一個小的潛在空間中運作,減輕了高維度影像資料帶來的計算挑戰;(ii) 它明確整合受試者的元資料,以增強預測的個別化;(iii) 它透過輔助模型納入疾病動態的先驗知識,促進縱向資料的整合;(iv) 它引入了潛在平均穩定化 (LAS) 演算法,該演算法 (a) 在推論時強制預測進程中的時空一致性,(b) 讓我們能夠推導預測的不確定性測量。我們對來自 2,805 名受試者的 11,730 個 T1 加權 (T1w) 腦部 MRI 進行 BrLP 訓練和評估,並在包含來自 962 名受試者的 2,257 個 MRI 的外部測試集上驗證其概括性。我們的實驗將 BrLP 生成的 MRI 掃描與實際追蹤 MRI 進行比較,與現有方法相比,展示了最先進的準確性。程式碼已公開於:https://github.com/LemuelPuglisi/BrLP。 -Legal documents including judgments and court orders require highly -sophisticated legal knowledge for understanding. To disclose expert knowledge -for non-experts, we explore the problem of visualizing legal texts with -easy-to-understand diagrams and propose a novel dataset of LegalViz with 23 -languages and 7,010 cases of legal document and visualization pairs, using the -DOT graph description language of Graphviz. LegalViz provides a simple diagram -from a complicated legal corpus identifying legal entities, transactions, legal -sources, and statements at a glance, that are essential in each judgment. In -addition, we provide new evaluation metrics for the legal diagram visualization -by considering graph structures, textual similarities, and legal contents. We -conducted empirical studies on few-shot and finetuning large language models -for generating legal diagrams and evaluated them with these metrics, including -legal content-based evaluation within 23 languages. 
Models trained with -LegalViz outperform existing models including GPTs, confirming the -effectiveness of our dataset. +##### **Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data** +2502.08547v1 by Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai -摘要:法律文件,包括判決和法院命令,需要高度專業的法律知識才能理解。為了向非專家揭露專家知識,我們探討了使用易於理解的圖表將法律文本視覺化的問題,並提出了一個新的 LegalViz 數據集,其中包含 23 種語言和 7,010 個法律文件和視覺化配對,使用 Graphviz 的 DOT 圖形描述語言。LegalViz 從複雜的法律語料庫中提供了一個簡單的圖表,可以一目了然地識別法律實體、交易、法律來源和陳述,這些在每項判決中都是必不可少的。此外,我們通過考慮圖形結構、文本相似性和法律內容,為法律圖表視覺化提供了新的評估指標。我們對少次學習和微調大型語言模型進行了實證研究,以生成法律圖表,並使用這些指標對它們進行了評估,包括在 23 種語言中基於法律內容的評估。使用 LegalViz 訓練的模型優於現有的模型,包括 GPT,證實了我們數據集的有效性。 +The adoption of EHRs has expanded opportunities to leverage data-driven +algorithms in clinical care and research. A major bottleneck in effectively +conducting multi-institutional EHR studies is the data heterogeneity across +systems with numerous codes that either do not exist or represent different +clinical concepts across institutions. The need for data privacy further limits +the feasibility of including multi-institutional patient-level data required to +study similarities and differences across patient subgroups. To address these +challenges, we developed the GAME algorithm. 
Tested and validated across 7 +institutions and 2 languages, GAME integrates data in several levels: (1) at +the institutional level with knowledge graphs to establish relationships +between codes and existing knowledge sources, providing the medical context for +standard codes and their relationship to each other; (2) between institutions, +leveraging language models to determine the relationships between +institution-specific codes with established standard codes; and (3) quantifying +the strength of the relationships between codes using a graph attention +network. Jointly trained embeddings are created using transfer and federated +learning to preserve data privacy. In this study, we demonstrate the +applicability of GAME in selecting relevant features as inputs for AI-driven +algorithms in a range of conditions, e.g., heart failure, rheumatoid arthritis. +We then highlight the application of GAME harmonized multi-institutional EHR +data in a study of Alzheimer's disease outcomes and suicide risk among patients +with mental health disorders, without sharing patient-level data outside +individual institutions. -##### **Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs** -2502.06075v1 by Han Meng, Renwen Zhang, Ganyi Wang, Yitian Yang, Peinuan Qin, Jungup Lee, Yi-Chieh Lee +摘要:電子健康紀錄的採用擴大了在臨床照護和研究中利用資料驅動演算法的機會。在有效進行多機構電子健康紀錄研究時,一個主要的瓶頸是系統間資料異質性,其中有許多代碼在機構間不存在或表示不同的臨床概念。資料隱私的需求進一步限制了納入多機構患者層級資料的可行性,而這些資料對於研究患者亞群之間的相似性和差異性是必要的。為了應對這些挑戰,我們開發了 GAME 演算法。GAME 已在 7 個機構和 2 種語言中進行測試和驗證,它整合了多個層級的資料:(1) 在機構層級,使用知識圖表來建立代碼和現有知識來源之間的關係,為標準代碼及其彼此之間的關係提供醫療背景;(2) 在機構之間,利用語言模型來確定機構特定代碼與已建立的標準代碼之間的關係;(3) 使用圖形注意網路量化代碼之間關係的強度。使用遷移和聯合學習建立聯合訓練的嵌入,以保護資料隱私。在本研究中,我們展示了 GAME 在選擇相關特徵作為 AI 驅動演算法輸入時的適用性,適用於各種情況,例如心臟衰竭、類風濕性關節炎。然後,我們重點介紹了 GAME 和諧化多機構電子健康紀錄資料在阿茲海默症疾病結果和精神疾病患者自殺風險研究中的應用,而無需在個別機構之外共享患者層級資料。 -Mental-illness stigma is a persistent social problem, hampering both -treatment-seeking and recovery. 
Accordingly, there is a pressing need to -understand it more clearly, but analyzing the relevant data is highly -labor-intensive. Therefore, we designed a chatbot to engage participants in -conversations; coded those conversations qualitatively with AI assistance; and, -based on those coding results, built causal knowledge graphs to decode stigma. -The results we obtained from 1,002 participants demonstrate that conversation -with our chatbot can elicit rich information about people's attitudes toward -depression, while our AI-assisted coding was strongly consistent with -human-expert coding. Our novel approach combining large language models (LLMs) -and causal knowledge graphs uncovered patterns in individual responses and -illustrated the interrelationships of psychological constructs in the dataset -as a whole. The paper also discusses these findings' implications for HCI -researchers in developing digital interventions, decomposing human -psychological constructs, and fostering inclusive attitudes. +##### **EEG Artifact Detection and Correction with Deep Autoencoders** +2502.08686v1 by David Aquilué-Llorens, Aureli Soria-Frisch -摘要:精神疾病的污名化是一個持續存在的社會問題,阻礙了尋求治療和康復。因此,迫切需要更清楚地了解它,但分析相關數據非常費力。因此,我們設計了一個聊天機器人,讓參與者參與對話;使用 AI 協助對這些對話進行定性編碼;並根據這些編碼結果,構建因果知識圖譜來破譯污名化。我們從 1,002 名參與者那裡獲得的結果表明,與我們的聊天機器人的對話可以引出人們對憂鬱症的豐富資訊,而我們 AI 輔助的編碼與人類專家編碼非常一致。我們將大型語言模型 (LLM) 和因果知識圖譜相結合的新方法揭示了個別反應中的模式,並說明了資料集中心理建構之間的相互關係。本文還討論了這些發現對 HCI 研究人員在開發數位介入措施、分解人類心理建構和培養包容態度方面的影響。 +EEG signals convey important information about brain activity both in healthy +and pathological conditions. However, they are inherently noisy, which poses +significant challenges for accurate analysis and interpretation. Traditional +EEG artifact removal methods, while effective, often require extensive expert +intervention. This study presents LSTEEG, a novel LSTM-based autoencoder +designed for the detection and correction of artifacts in EEG signals. 
+Leveraging deep learning, particularly LSTM layers, LSTEEG captures non-linear +dependencies in sequential EEG data. LSTEEG demonstrates superior performance +in both artifact detection and correction tasks compared to other +state-of-the-art convolutional autoencoders. Our methodology enhances the +interpretability and utility of the autoencoder's latent space, enabling +data-driven automated artifact removal in EEG and its application in downstream +tasks. This research advances the field of efficient and accurate multi-channel +EEG preprocessing, and promotes the implementation and usage of automated EEG +analysis pipelines for brain health applications. + +摘要:腦電圖訊號傳達了關於大腦活動的重要資訊,無論是在健康或病理狀況下。然而,它們本質上是有雜訊的,這對準確的分析和解釋構成了重大的挑戰。傳統的腦電圖人工製品移除方法雖然有效,但通常需要大量的專家介入。本研究提出 LSTEEG,一種新穎的基於 LSTM 的自動編碼器,用於偵測和校正腦電圖訊號中的人工製品。利用深度學習,特別是 LSTM 層,LSTEEG 捕捉序列腦電圖資料中的非線性依賴性。與其他最先進的卷積自動編碼器相比,LSTEEG 在人工製品偵測和校正任務中都展現出優異的效能。我們的做法增強了自動編碼器潛在空間的可解釋性和實用性,讓資料驅動的自動人工製品移除得以應用於腦電圖的下游任務。這項研究推動了高效且準確的多通道腦電圖前處理領域,並促進了自動腦電圖分析管線在腦部健康應用中的實作和使用。 -##### **LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification** -2502.05836v1 by Shubham Kumar Nigam, Tanmay Dubey, Govind Sharma, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya +##### **SycEval: Evaluating LLM Sycophancy** +2502.08177v1 by Aaron Fanous, Jacob Goldberg, Ank A. Agarwal, Joanna Lin, Anson Zhou, Roxana Daneshjou, Sanmi Koyejo -In this paper, we address the task of semantic segmentation of legal -documents through rhetorical role classification, with a focus on Indian legal -judgments. We introduce LegalSeg, the largest annotated dataset for this task, -comprising over 7,000 documents and 1.4 million sentences, labeled with 7 -rhetorical roles.
To benchmark performance, we evaluate multiple -state-of-the-art models, including Hierarchical BiLSTM-CRF, -TransformerOverInLegalBERT (ToInLegalBERT), Graph Neural Networks (GNNs), and -Role-Aware Transformers, alongside an exploratory RhetoricLLaMA, an -instruction-tuned large language model. Our results demonstrate that models -incorporating broader context, structural relationships, and sequential -sentence information outperform those relying solely on sentence-level -features. Additionally, we conducted experiments using surrounding context and -predicted or actual labels of neighboring sentences to assess their impact on -classification accuracy. Despite these advancements, challenges persist in -distinguishing between closely related roles and addressing class imbalance. -Our work underscores the potential of advanced techniques for improving legal -document understanding and sets a strong foundation for future research in -legal NLP. +Large language models (LLMs) are increasingly applied in educational, +clinical, and professional settings, but their tendency for sycophancy -- +prioritizing user agreement over independent reasoning -- poses risks to +reliability. This study introduces a framework to evaluate sycophantic behavior +in ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro across AMPS (mathematics) and +MedQuad (medical advice) datasets. Sycophantic behavior was observed in 58.19% +of cases, with Gemini exhibiting the highest rate (62.47%) and ChatGPT the +lowest (56.71%). Progressive sycophancy, leading to correct answers, occurred +in 43.52% of cases, while regressive sycophancy, leading to incorrect answers, +was observed in 14.66%. Preemptive rebuttals demonstrated significantly higher +sycophancy rates than in-context rebuttals (61.75% vs. 56.52%, $Z=5.87$, +$p<0.001$), particularly in computational tasks, where regressive sycophancy +increased significantly (preemptive: 8.13%, in-context: 3.54%, $p<0.001$). 
+Simple rebuttals maximized progressive sycophancy ($Z=6.59$, $p<0.001$), while +citation-based rebuttals exhibited the highest regressive rates ($Z=6.59$, +$p<0.001$). Sycophantic behavior showed high persistence (78.5%, 95% CI: +[77.2%, 79.8%]) regardless of context or model. These findings emphasize the +risks and opportunities of deploying LLMs in structured and dynamic domains, +offering insights into prompt programming and model optimization for safer AI +applications. -摘要:在本文中,我們通過修辭角色分類來探討法律文件的語義分段任務,重點關注印度法律判決。我們引入了 LegalSeg,這是此任務中最大的註釋資料集,包含超過 7,000 份文件和 140 萬個句子,並標記了 7 個修辭角色。為了評量效能,我們評估了多個最先進的模型,包括分層 BiLSTM-CRF、TransformerOverInLegalBERT (ToInLegalBERT)、圖神經網路 (GNN) 和角色感知Transformer,以及探索性的 RhetoricLLaMA,一種經過指令調整的大型語言模型。我們的結果表明,結合廣泛背景、結構關係和順序句子資訊的模型,表現優於僅依賴句子層級特徵的模型。此外,我們使用周圍的背景和鄰近句子的預測或實際標籤進行實驗,以評估它們對分類精度的影響。儘管有這些進展,但在區分密切相關的角色和解決類別不平衡方面仍存在挑戰。我們的研究強調了先進技術在改善法律文件理解方面的潛力,並為法律自然語言處理的未來研究奠定了堅實的基礎。 +摘要:大型語言模型(LLM)日益應用於教育、臨床和專業領域,但它們趨於趨炎附勢——優先考慮用戶同意而非獨立推理——對可靠性構成風險。本研究引入了一個框架來評估 ChatGPT-4o、Claude-Sonnet 和 Gemini-1.5-Pro 中的趨炎附勢行為,涉及 AMPS(數學)和 MedQuad(醫療建議)數據集。在 58.19% 的案例中觀察到了趨炎附勢行為,其中 Gemini 表現出最高比率(62.47%),而 ChatGPT 最低(56.71%)。導致正確答案的漸進式趨炎附勢發生在 43.52% 的案例中,而導致不正確答案的退步式趨炎附勢則在 14.66% 的案例中被觀察到。先發制人的反駁表現出顯著高於上下文反駁的趨炎附勢率(61.75% 對 56.52%,Z=5.87,p<0.001),特別是在計算任務中,其中退步式趨炎附勢顯著增加(先發制人:8.13%,上下文:3.54%,p<0.001)。簡單的反駁最大化了漸進式趨炎附勢(Z=6.59,p<0.001),而基於引用的反駁表現出最高的退步式比率(Z=6.59,p<0.001)。趨炎附勢行為表現出很高的持續性(78.5%,95% CI:[77.2%,79.8%]),無論上下文或模型如何。這些發現強調了在結構化和動態領域部署 LLM 的風險和機遇,為更安全的 AI 應用提供了提示編程和模型優化的見解。 -##### **LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning** -2502.05453v1 by Hanqing Yang, Jingdi Chen, Marie Siew, Tania Lorido-Botran, Carlee Joe-Wong +##### **Cancer Vaccine Adjuvant Name Recognition from Biomedical Literature using Large Language Models** +2502.09659v1 by Hasin Rehana, Jie Zheng, Leo Yeh, Benu Bansal, Nur Bengisu Çam, Christianah Jemiyo, Brett McGregor, Arzucan Özgür, Yongqun He, Junguk Hur -Developing 
intelligent agents for long-term cooperation in dynamic open-world -scenarios is a major challenge in multi-agent systems. Traditional Multi-agent -Reinforcement Learning (MARL) frameworks like centralized training -decentralized execution (CTDE) struggle with scalability and flexibility. They -require centralized long-term planning, which is difficult without custom -reward functions, and face challenges in processing multi-modal data. CTDE -approaches also assume fixed cooperation strategies, making them impractical in -dynamic environments where agents need to adapt and plan independently. To -address decentralized multi-agent cooperation, we propose Decentralized -Adaptive Knowledge Graph Memory and Structured Communication System (DAMCS) in -a novel Multi-agent Crafter environment. Our generative agents, powered by -Large Language Models (LLMs), are more scalable than traditional MARL agents by -leveraging external knowledge and language for long-term planning and -reasoning. Instead of fully sharing information from all past experiences, -DAMCS introduces a multi-modal memory system organized as a hierarchical -knowledge graph and a structured communication protocol to optimize agent -cooperation. This allows agents to reason from past interactions and share -relevant information efficiently. Experiments on novel multi-agent open-world -tasks show that DAMCS outperforms both MARL and LLM baselines in task -efficiency and collaboration. Compared to single-agent scenarios, the two-agent -scenario achieves the same goal with 63% fewer steps, and the six-agent -scenario with 74% fewer steps, highlighting the importance of adaptive memory -and structured communication in achieving long-term goals. We publicly release -our project at: https://happyeureka.github.io/damcs. +Motivation: An adjuvant is a chemical incorporated into vaccines that +enhances their efficacy by improving the immune response. 
Identifying adjuvant +names from cancer vaccine studies is essential for furthering research and +enhancing immunotherapies. However, the manual curation from the constantly +expanding biomedical literature poses significant challenges. This study +explores the automated recognition of vaccine adjuvant names using Large +Language Models (LLMs), specifically Generative Pretrained Transformers (GPT) +and Large Language Model Meta AI (Llama). Methods: We utilized two datasets: 97 +clinical trial records from AdjuvareDB and 290 abstracts annotated with the +Vaccine Adjuvant Compendium (VAC). GPT-4o and Llama 3.2 were employed in +zero-shot and few-shot learning paradigms with up to four examples per prompt. +Prompts explicitly targeted adjuvant names, testing the impact of contextual +information such as substances or interventions. Outputs underwent automated +and manual validation for accuracy and consistency. Results: GPT-4o attained +100% Precision across all situations while exhibiting notable improvements in Recall +and F1-scores, particularly when incorporating interventions. On the VAC +dataset, GPT-4o achieved a maximum F1-score of 77.32% with interventions, +surpassing Llama-3.2-3B by approximately 2%. On the AdjuvareDB dataset, GPT-4o +reached an F1-score of 81.67% for three-shot prompting with interventions, +surpassing Llama-3.2-3B's maximum F1-score of 65.62%. Conclusion: Our findings +demonstrate that LLMs excel at identifying adjuvant names, including rare +variations of naming representation. This study emphasizes the capability of +LLMs to enhance cancer vaccine development by efficiently extracting insights. +Future work aims to broaden the framework to encompass various biomedical +literature and enhance model generalizability across various vaccines and +adjuvants.
-摘要:在動態開放世界情境中開發用於長期合作的智慧代理是多重代理系統中的一項重大挑戰。傳統的多重代理強化學習 (MARL) 框架,例如集中式訓練去中心化執行 (CTDE),在可擴充性和靈活性方面面臨困難。它們需要集中式長期規劃,這在沒有自訂獎勵函數的情況下很難執行,並且在處理多模式數據時會面臨挑戰。CTDE 方法還假設固定的合作策略,這使得它們在代理需要獨立適應和規劃的動態環境中不切實際。為了解決分散式多重代理合作問題,我們在一個新穎的多重代理工匠環境中提出了分散式自適應知識圖譜記憶體和結構化通訊系統 (DAMCS)。我們的生成代理由大型語言模型 (LLM) 提供支援,透過利用外部知識和語言進行長期規劃和推理,比傳統的 MARL 代理更具可擴充性。DAMCS 沒有完全分享來自所有過去經驗的資訊,而是引入了多模式記憶體系統,該系統組織成階層式知識圖譜和結構化通訊協定,以最佳化代理合作。這允許代理根據過去的互動進行推理並有效地分享相關資訊。在新的多重代理開放世界任務上的實驗表明,DAMCS 在任務效率和協作方面優於 MARL 和 LLM 基準。與單一代理情境相比,雙重代理情境以少 63% 的步驟達成相同的目標,而六重代理情境則以少 74% 的步驟達成目標,突顯了自適應記憶體和結構化通訊在達成長期目標中的重要性。我們公開發布我們的專案於:https://happyeureka.github.io/damcs。 +摘要:動機:佐劑是一種加入疫苗的化學物質,能藉由改善免疫反應來提升疫苗的效力。從癌症疫苗研究中找出佐劑名稱對於推進研究和改善免疫療法至關重要。然而,從不斷擴展的生物醫學文獻中手動整理會造成重大挑戰。本研究探討使用大型語言模型 (LLM),特別是生成式預訓練Transformer (GPT) 和大型語言模型 Meta AI (Llama) 來自動辨識疫苗佐劑名稱。方法:我們使用兩個資料集:來自 AdjuvareDB 的 97 份臨床試驗記錄和 290 篇標註了疫苗佐劑彙編 (VAC) 的摘要。GPT-4o 和 Llama 3.2 被用於零次學習和少量學習範例,每個提示最多有四個範例。提示明確鎖定佐劑名稱,測試物質或介入措施等背景資訊的影響。輸出經過自動和手動驗證,以確保準確性和一致性。結果:GPT-4o 在所有情況下都達到 100% 的準確率,同時在召回率和 F1 分數上表現出顯著的進步,特別是在納入介入措施的情況下。在 VAC 資料集上,GPT-4o 在有介入措施的情況下達到 77.32% 的最高 F1 分數,比 Llama-3.2-3B 高出約 2%。在 AdjuvareDB 資料集上,GPT-4o 在有介入措施的三次提示中達到 81.67% 的 F1 分數,超過 Llama-3.2-3 B 的最高 F1 分數 65.62%。結論:我們的研究結果表明,LLM 在辨識佐劑名稱方面表現出色,包括命名表示的罕見變異。本研究強調了 LLM 在有效提取見解方面增強癌症疫苗開發的能力。未來的研究工作旨在擴大架構,涵蓋各種生物醫學文獻,並增強模型在各種疫苗和佐劑中的泛化能力。 -##### **SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation** -2502.05424v1 by Xingtong Yu, Zechuan Gong, Chang Zhou, Yuan Fang, Hui Zhang +##### **Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?** +2502.07963v1 by Hye Sun Yun, Karen Y. C. Zhang, Ramez Kouzy, Iain J. Marshall, Junyi Jessy Li, Byron C. Wallace -Graphs are able to model interconnected entities in many online services, -supporting a wide range of applications on the Web. This raises an important -question: How can we train a graph foundational model on multiple source -domains and adapt to an unseen target domain? 
A major obstacle is that graphs -from different domains often exhibit divergent characteristics. Some studies -leverage large language models to align multiple domains based on textual -descriptions associated with the graphs, limiting their applicability to -text-attributed graphs. For text-free graphs, a few recent works attempt to -align different feature distributions across domains, while generally -neglecting structural differences. In this work, we propose a novel Structure -Alignment framework for text-free Multi-domain Graph Pre-Training and -cross-domain adaptation (SAMGPT). It is designed to learn multi-domain -knowledge from graphs originating in multiple source domains, which can then be -adapted to address applications in an unseen target domain. Specifically, we -introduce a set of structure tokens to harmonize structure-based aggregation -across source domains during the pre-training phase. Next, for cross-domain -adaptation, we design dual prompts, namely, holistic prompts and specific -prompts, which adapt unified multi-domain structural knowledge and -fine-grained, domain-specific information, respectively, to a target domain. -Finally, we conduct comprehensive experiments on seven public datasets to -evaluate and analyze the effectiveness of SAMGPT. +Medical research faces well-documented challenges in translating novel +treatments into clinical practice. Publishing incentives encourage researchers +to present "positive" findings, even when empirical results are equivocal. +Consequently, it is well-documented that authors often spin study results, +especially in article abstracts. Such spin can influence clinician +interpretation of evidence and may affect patient care decisions. In this +study, we ask whether the interpretation of trial results offered by Large +Language Models (LLMs) is similarly affected by spin. This is important since +LLMs are increasingly being used to trawl through and synthesize published +medical evidence. 
We evaluated 22 LLMs and found that they are across the board +more susceptible to spin than humans. They might also propagate spin into their +outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into +plain language summaries that they generate. We also find, however, that LLMs +are generally capable of recognizing spin, and can be prompted in a way to +mitigate spin's impact on LLM outputs. -摘要:圖表能夠在許多線上服務中對相互關聯的實體進行建模, -支援網路上廣泛的應用程式。這提出了重要的問題:我們如何針對多個來源網域訓練圖表基礎模型,並適應未見過的目標網域?一個主要的障礙是,來自不同網域的圖表通常表現出不同的特性。一些研究利用大型語言模型,根據與圖表相關的文字描述,對齊多個網域,限制其適用性於有文字屬性的圖表。對於沒有文字的圖表,最近的一些作品嘗試對齊跨網域的不同特徵分佈,同時通常忽略結構上的差異。在這項工作中,我們提出了一個新的結構對齊框架,用於無文字多網域圖表預訓練和跨網域適應 (SAMGPT)。它被設計為從起源於多個來源網域的圖表中學習多網域知識,然後可以適應於未見過的目標網域中的應用程式。具體來說,我們引入了一組結構化代碼,以在預訓練階段,調和跨來源網域的基於結構的聚合。接下來,對於跨網域適應,我們設計了雙重提示,即整體提示和具體提示,分別將統一的多網域結構知識和細緻的、特定於網域的資訊適應到目標網域。最後,我們在七個公共資料集上進行了全面的實驗,以評估和分析 SAMGPT 的有效性。 +摘要:醫學研究在將新穎療法轉化為臨床實務上,面臨著有據可查的挑戰。發表誘因鼓勵研究人員呈現「正向」的發現,即使經驗結果模稜兩可。因此,有據可查的是,作者經常扭曲研究結果,特別是在文章摘要中。此類扭曲可能會影響臨床醫師對證據的詮釋,並可能影響病患照護決策。在本研究中,我們探討大型語言模型 (LLM) 提供的試驗結果詮釋是否也受到扭曲影響。由於 LLM 正越來越常被用於爬梳和綜合已發表的醫學證據,因此這點非常重要。我們評估了 22 個 LLM,發現它們普遍比人類更容易受到扭曲影響。它們也可能將扭曲傳播到其輸出中:例如,我們發現 LLM 會將扭曲隱含納入其產生的白話文摘要中。然而,我們也發現 LLM 通常有能力辨認扭曲,而且可以透過提示的方式減輕扭曲對 LLM 輸出的影響。 -##### **Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints** -2502.05414v1 by Ali Al-Lawati, Jason Lucas, Zhiwei Zhang, Prasenjit Mitra, Suhang Wang +##### **An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating** +2502.07755v1 by Mohammad Ali Labbaf Khaniki, Sahabeh Saadati, Mohammad Manthouri -In-context learning (ICL) effectively conditions large language models (LLMs) -for molecular tasks, such as property prediction and molecule captioning, by -embedding carefully selected demonstration examples into the input prompt. This -approach avoids the computational overhead of extensive pertaining and -fine-tuning. 
However, current prompt retrieval methods for molecular tasks have -relied on molecule feature similarity, such as Morgan fingerprints, which do -not adequately capture the global molecular and atom-binding relationships. As -a result, these methods fail to represent the full complexity of molecular -structures during inference. Moreover, small-to-medium-sized LLMs, which offer -simpler deployment requirements in specialized systems, have remained largely -unexplored in the molecular ICL literature. To address these gaps, we propose a -self-supervised learning technique, GAMIC (Graph-Aligned Molecular In-Context -learning), which aligns global molecular structures, represented by graph neural -networks (GNNs), with textual captions (descriptions) while leveraging local -feature similarity through Morgan fingerprints. In addition, we introduce a -Maximum Marginal Relevance (MMR) based diversity heuristic during retrieval to -optimize input prompt demonstration samples. Our experimental findings using -diverse benchmark datasets show GAMIC outperforms simple Morgan-based ICL -retrieval methods across all tasks by up to 45%. +This paper presents a novel Natural Language Processing (NLP) framework for +enhancing medical diagnosis through the integration of advanced techniques in +data augmentation, feature extraction, and classification. The proposed +approach employs back-translation to generate diverse paraphrased datasets, +improving robustness and mitigating overfitting in classification tasks. +Leveraging Decoding-enhanced BERT with Disentangled Attention (DeBERTa) with +Dynamic Contextual Positional Gating (DCPG), the model captures fine-grained +contextual and positional relationships, dynamically adjusting the influence of +positional information based on semantic context to produce high-quality text +embeddings. 
For classification, an Attention-Based Feedforward Neural Network +(ABFNN) is utilized, effectively focusing on the most relevant features to +improve decision-making accuracy. Applied to the classification of symptoms, +clinical notes, and other medical texts, this architecture demonstrates its +ability to address the complexities of medical data. The combination of data +augmentation, contextual embedding generation, and advanced classification +mechanisms offers a robust and accurate diagnostic tool, with potential +applications in automated medical diagnosis and clinical decision support. This +method demonstrates the effectiveness of the proposed NLP framework for medical +diagnosis, achieving remarkable results with an accuracy of 99.78%, recall of +99.72%, precision of 99.79%, and an F1-score of 99.75%. These metrics not only +underscore the model's robust performance in classifying medical texts with +exceptional precision and reliability but also highlight its superiority over +existing methods, making it a highly promising tool for automated diagnostic +systems. 
-摘要:情境學習 (ICL) 有效地調整大型語言模型 (LLM),以執行分子任務,例如屬性預測和分子標題,方法是將仔細挑選的示範範例嵌入輸入提示中。這種方法避免了廣泛相關和微調的計算開銷。然而,目前針對分子任務的提示檢索方法依賴於分子特徵相似性,例如 Morgan 指紋,而無法充分捕捉全局分子和原子鍵結關係。因此,這些方法無法在推理過程中表示分子結構的完整複雜性。此外,在專業系統中提供更簡單部署需求的小到中型的 LLM,在分子 ICL 文獻中仍未得到充分探索。為了解決這些差距,我們提出了一種自我監督學習技術,GAMIC(圖形對齊分子情境學習),它將由圖形神經網路 (GNN) 表示的全局分子結構與文字標題(描述)對齊,同時透過 Morgan 指紋利用局部特徵相似性。此外,我們在檢索過程中引入了一個基於最大邊際相關性 (MMR) 的多樣性啟發法,以最佳化輸入提示示範樣本。我們使用不同的基準資料集進行的實驗結果顯示,GAMIC 在所有任務中都優於基於 Morgan 的簡單 ICL 檢索方法,最多可達 45%。 +摘要:本文提出了一個創新的自然語言處理 (NLP) 框架,透過整合資料擴充、特徵萃取和分類的進階技術來增強醫療診斷。所提出的方法採用反向翻譯來產生多樣化的同義改寫資料集,提升穩健性並減輕分類任務中的過度擬合。透過利用具有動態脈絡位置閘控 (DCPG) 的解碼增強 BERT 與去糾纏注意力 (DeBERTa),這個模型捕捉細緻的脈絡和位置關係,根據語意脈絡動態調整位置資訊的影響,以產生高品質的文字嵌入。在分類方面,利用基於注意力的前饋神經網路 (ABFNN),有效地關注最相關的特徵,以提高決策準確度。應用於症狀、臨床筆記和其他醫療文本的分類,此架構證明了其處理醫療資料複雜性的能力。資料擴充、脈絡嵌入產生和進階分類機制的結合提供了一個穩健且準確的診斷工具,在自動化醫療診斷和臨床決策支援中具有潛在應用。此方法證明了所提出的 NLP 框架在醫療診斷中的有效性,以 99.78% 的準確度、99.72% 的召回率、99.79% 的精確度和 99.75% 的 F1 分數,取得了顯著的成果。這些指標不僅強調了模型在分類醫療文本時具有卓越的精確度和可靠性,也突顯了它優於現有方法的優越性,使其成為自動化診斷系統中極具前景的工具。 -##### **Knowledge Graph-Guided Retrieval Augmented Generation** -2502.06864v1 by Xiangrong Zhu, Yuexiang Xie, Yi Liu, Yaliang Li, Wei Hu +##### **Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension** +2502.07752v2 by Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds -Retrieval-augmented generation (RAG) has emerged as a promising technology -for addressing hallucination issues in the responses generated by large -language models (LLMs). Existing studies on RAG primarily focus on applying -semantic-based approaches to retrieve isolated relevant chunks, which ignore -their intrinsic relationships. In this paper, we propose a novel Knowledge -Graph-Guided Retrieval Augmented Generation (KG$^2$RAG) framework that utilizes -knowledge graphs (KGs) to provide fact-level relationships between chunks, -improving the diversity and coherence of the retrieved results. 
Specifically, -after performing a semantic-based retrieval to provide seed chunks, KG$^2$RAG -employs a KG-guided chunk expansion process and a KG-based chunk organization -process to deliver relevant and important knowledge in well-organized -paragraphs. Extensive experiments conducted on the HotpotQA dataset and its -variants demonstrate the advantages of KG$^2$RAG compared to existing RAG-based -approaches, in terms of both response quality and retrieval quality. +Designing efficient optimizers for large language models (LLMs) with +low-memory requirements and fast convergence is an important and challenging +problem. This paper makes a step towards the systematic design of such +optimizers through the lens of structured Fisher information matrix (FIM) +approximation. We show that many state-of-the-art efficient optimizers can be +viewed as solutions to FIM approximation (under the Frobenius norm) with +specific structural assumptions. Building on these insights, we propose two +design recommendations of practical efficient optimizers for LLMs, involving +the careful selection of structural assumptions to balance generality and +efficiency, and enhancing memory efficiency of optimizers with general +structures through a novel low-rank extension framework. We demonstrate how to +use each design approach by deriving new memory-efficient optimizers: Row and +Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation +(Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate the +effectiveness, showing faster and better convergence than existing +memory-efficient baselines and Adam with little memory overhead. Notably, Alice +achieves better than 2x faster convergence over Adam, while RACS delivers +strong performance on the 1B model with SGD-like memory. 
-摘要:檢索增強生成 (RAG) 已成為一項有前途的技術,用於解決大型語言模型 (LLM) 所產生回應中的幻覺問題。現有關於 RAG 的研究主要專注於應用基於語義的方法來檢索孤立相關的區塊,而忽略它們的內在關係。在本文中,我們提出了一個新穎的知識圖表引導檢索增強生成 (KG$^2$RAG) 框架,它利用知識圖表 (KG) 來提供區塊之間的事實層級關係,從而提高檢索結果的多樣性和一致性。具體來說,在執行基於語義的檢索以提供種子區塊後,KG$^2$RAG 採用 KG 引導的區塊擴充程序和基於 KG 的區塊組織程序,以在組織良好的段落中傳達相關且重要的知識。在 HotpotQA 資料集及其變體上進行的大量實驗證明了 KG$^2$RAG 在回應品質和檢索品質方面優於現有的基於 RAG 的方法。 +摘要:設計具有低記憶體需求和快速收斂的大型語言模型 (LLM) 的高效最佳化器是一個重要且具有挑戰性的問題。本文透過結構化 Fisher 資訊矩陣 (FIM) 近似的觀點,朝著系統化設計此類最佳化器邁出了一步。我們證明許多最先進的高效最佳化器可以視為 FIM 近似(在 Frobenius 範數下)的解,並具有特定的結構假設。基於這些見解,我們提出了 LLM 的兩個實用高效最佳化器設計建議,包括仔細選擇結構假設以平衡通用性和效率,以及透過新穎的低秩擴充框架增強一般結構最佳化器的記憶體效率。我們透過推導新的記憶體高效最佳化器來展示如何使用每種設計方法:列和欄縮放 SGD (RACS) 和自適應低維子空間估計 (Alice)。在 LLaMA 預訓練(高達 1B 參數)上的實驗驗證了其有效性,顯示比現有的記憶體高效基準和 Adam 更快且更好的收斂,且記憶體開銷很小。值得注意的是,Alice 的收斂速度比 Adam 快 2 倍以上,而 RACS 則在 1B 模型上提供類似 SGD 的記憶體的強勁效能。 -##### **Can Large Language Models Understand Intermediate Representations?** -2502.06854v1 by Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang, Hongyu Zhang, Ruoming Jin, Qiang Guan +##### **The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation** +2502.07516v2 by Raman Dutt -Intermediate Representations (IRs) are essential in compiler design and -program analysis, yet their comprehension by Large Language Models (LLMs) -remains underexplored. This paper presents a pioneering empirical study to -investigate the capabilities of LLMs, including GPT-4, GPT-3, Gemma 2, LLaMA -3.1, and Code Llama, in understanding IRs. We analyze their performance across -four tasks: Control Flow Graph (CFG) reconstruction, decompilation, code -summarization, and execution reasoning. Our results indicate that while LLMs -demonstrate competence in parsing IR syntax and recognizing high-level -structures, they struggle with control flow reasoning, execution semantics, and -loop handling. 
Specifically, they often misinterpret branching instructions, -omit critical IR operations, and rely on heuristic-based reasoning, leading to -errors in CFG reconstruction, IR decompilation, and execution reasoning. The -study underscores the necessity for IR-specific enhancements in LLMs, -recommending fine-tuning on structured IR datasets and integration of explicit -control flow models to augment their comprehension and handling of IR-related -tasks. +Generative models, particularly text-to-image (T2I) diffusion models, play a +crucial role in medical image analysis. However, these models are prone to +training data memorization, posing significant risks to patient privacy. +Synthetic chest X-ray generation is one of the most common applications in +medical image analysis with the MIMIC-CXR dataset serving as the primary data +repository for this task. This study presents the first systematic attempt to +identify prompts and text tokens in MIMIC-CXR that contribute the most to +training data memorization. Our analysis reveals two unexpected findings: (1) +prompts containing traces of de-identification procedures (markers introduced +to hide Protected Health Information) are the most memorized, and (2) among all +tokens, de-identification markers contribute the most towards memorization. +This highlights a broader issue with the standard anonymization practices and +T2I synthesis with MIMIC-CXR. To exacerbate, existing inference-time +memorization mitigation strategies are ineffective and fail to sufficiently +reduce the model's reliance on memorized text tokens. On this front, we propose +actionable strategies for different stakeholders to enhance privacy and improve +the reliability of generative models in medical imaging. Finally, our results +provide a foundation for future work on developing and benchmarking +memorization mitigation techniques for synthetic chest X-ray generation using +the MIMIC-CXR dataset. 
The anonymized code is available at +https://anonymous.4open.science/r/diffusion_memorization-8011/ -摘要:中間表徵 (IR) 在編譯器設計和程式分析中至關重要,但大型語言模型 (LLM) 對其理解仍未得到充分探討。本文提出了一項開創性的實證研究,以探討 LLM(包括 GPT-4、GPT-3、Gemma 2、LLaMA 3.1 和 Code Llama)理解 IR 的能力。我們分析了它們在四項任務中的表現:控制流程圖 (CFG) 重建、反編譯、程式碼摘要和執行推理。我們的結果表明,儘管 LLM 在解析 IR 語法和識別高階結構方面表現出能力,但它們在控制流程推理、執行語義和迴圈處理方面存在困難。具體而言,它們經常誤解分支指令、省略關鍵 IR 操作,並依賴於基於啟發式的推理,導致 CFG 重建、IR 反編譯和執行推理出現錯誤。這項研究強調了 LLM 中對 IR 特定的增強的必要性,建議對結構化的 IR 資料集進行微調,並整合明確的控制流程模型,以增強其對 IR 相關任務的理解和處理。 +摘要:生成模型,尤其是文本到影像 (T2I) 擴散模型在醫學影像分析中扮演著至關重要的角色。然而,這些模型容易訓練資料記憶,對病患隱私構成重大風險。合成胸部 X 光影像生成是醫學影像分析中最常見的應用之一,而 MIMIC-CXR 資料集則作為此任務的主要資料儲存庫。本研究提出了第一個系統化的嘗試,以識別 MIMIC-CXR 中對訓練資料記憶貢獻最大的提示和文字代碼。我們的分析揭示了兩個出乎意料的發現:(1) 包含去識別程序痕跡的提示(用於隱藏受保護健康資訊的標記)是最容易被記憶的,以及 (2) 在所有代碼中,去識別標記對記憶的貢獻最大。這突顯了標準匿名化實務和使用 MIMIC-CXR 進行 T2I 合成的更廣泛問題。更糟的是,現有的推論時間記憶減緩策略無效,無法充分降低模型對記憶文字代碼的依賴。在這個方面,我們針對不同的利害關係人提出可行的策略,以增強隱私和改善生成模型在醫學影像中的可靠性。最後,我們的結果為未來開發和評量使用 MIMIC-CXR 資料集進行合成胸部 X 光影像生成的記憶減緩技術奠定了基礎。已匿名化的程式碼可在 https://anonymous.4open.science/r/diffusion_memorization-8011/ 取得。 -##### **GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?** -2502.05252v1 by Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, Beidi Chen +##### **KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level** +2502.07288v1 by Ruining Deng, Tianyuan Yao, Yucheng Tang, Junlin Guo, Siqi Lu, Juming Xiong, Lining Yu, Quan Huu Cap, Pengzhou Cai, Libin Lan, Ze Zhao, Adrian Galdran, Amit Kumar, Gunjan Deotale, Dev Kumar Das, Inyoung Paik, Joonho Lee, Geongyu Lee, Yujia Chen, Wangkai Li, Zhaoyang Li, Xuege Hou, Zeyuan Wu, Shengjin Wang, Maximilian Fischer, Lars Kramer, Anghong Du, Le Zhang, Maria Sanchez Sanchez, Helena Sanchez Ulloa, David Ribalta Heredia, Carlos Perez de Arenaza Garcia, Shuoyu Xu, Bingdou He, Xinping Cheng, Tao Wang, Noemie Moreau, Katarzyna Bozek, Shubham Innani, Ujjwal Baid, Kaura Solomon Kefas, Bennett A. 
Landman, Yu Wang, Shilin Zhao, Mengmeng Yin, Haichun Yang, Yuankai Huo -Long-context large language models (LLMs) have recently shown strong -performance in information retrieval and long-document QA. However, to tackle -the most challenging intellectual problems, LLMs must reason effectively in -long and complex contexts (e.g., frontier mathematical research). Studying how -LLMs handle increasing reasoning complexity and context length is essential, -yet existing benchmarks lack a solid basis for quantitative evaluation. -Inspired by the abstraction of GSM-8K problems as computational graphs, and the -ability to introduce noise by adding unnecessary nodes and edges, we develop a -grade school math problem generator capable of producing arithmetic problems -with infinite difficulty and context length under fine-grained control. Using -our newly synthesized GSM-Infinite benchmark, we comprehensively evaluate -existing LLMs. We find a consistent sigmoid decline in reasoning performance as -complexity increases, along with a systematic inference scaling trend: -exponentially increasing inference computation yields only linear performance -gains. These findings underscore the fundamental limitations of current -long-context LLMs and the key challenges in scaling reasoning capabilities. Our -GSM-Infinite benchmark provides a scalable and controllable testbed for -systematically studying and advancing LLM reasoning in long and complex -contexts. +Chronic kidney disease (CKD) is a major global health issue, affecting over +10% of the population and causing significant mortality. While kidney biopsy +remains the gold standard for CKD diagnosis and treatment, the lack of +comprehensive benchmarks for kidney pathology segmentation hinders progress in +the field. 
To address this, we organized the Kidney Pathology Image +Segmentation (KPIs) Challenge, introducing a dataset that incorporates +preclinical rodent models of CKD with over 10,000 annotated glomeruli from 60+ +Periodic Acid Schiff (PAS)-stained whole slide images. The challenge includes +two tasks, patch-level segmentation and whole slide image segmentation and +detection, evaluated using the Dice Similarity Coefficient (DSC) and F1-score. +By encouraging innovative segmentation methods that adapt to diverse CKD models +and tissue conditions, the KPIs Challenge aims to advance kidney pathology +analysis, establish new benchmarks, and enable precise, large-scale +quantification for disease research and diagnosis. -摘要:長文本大型語言模型 (LLM) 最近在資訊檢索和長文件問答中展示了強大的效能。然而,若要解決最具挑戰性的智力問題,LLM 必須在長且複雜的脈絡中有效推理(例如,前沿數學研究)。研究 LLM 如何處理增加的推理複雜性和脈絡長度至關重要,但現有的基準缺乏定量評估的穩固基礎。受到 GSM-8K 問題抽象化為計算圖形的啟發,以及透過加入不必要的節點和邊緣來引入雜訊的能力,我們開發了一個小學數學問題產生器,能夠在細緻的控制下產生具有無限難度和脈絡長度的算術問題。使用我們新合成的 GSM-Infinite 基準,我們全面評估現有的 LLM。我們發現推理效能會隨著複雜性的增加而持續呈 S 形下降,並伴隨著系統性的推論縮放趨勢:指數增加的推論計算僅產生線性的效能增益。這些發現強調了當前長脈絡 LLM 的基本限制,以及擴展推理能力的主要挑戰。我們的 GSM-Infinite 基準提供了一個可擴充且可控的測試平台,用於系統性地研究和提升 LLM 在長且複雜脈絡中的推理能力。 +摘要:慢性腎臟病 (CKD) 是全球主要的健康問題,影響超過 +10% 的人口,並造成顯著的死亡率。雖然腎臟活檢 +仍然是 CKD 診斷和治療的黃金標準,但缺乏 +腎臟病理學分割的全面基準阻礙了該領域的進展。 +為了解決這個問題,我們組織了腎臟病理影像 +分割 (KPIs) 挑戰,引入了包含超過 10,000 個註解的 +CKD 臨床前嚙齒動物模型的資料集,這些註解來自 60 多個 +週期性酸性雪夫 (PAS) 染色的全幻燈片影像。挑戰包括 +兩個任務,修補層級分割和全幻燈片影像分割和 +偵測,使用 Dice 相似係數 (DSC) 和 F1 分數進行評估。 +通過鼓勵創新的分割方法來適應不同的 CKD 模型 +和組織條件,KPIs 挑戰旨在推進腎臟病理 +分析,建立新的基準,並實現精確、大規模的 +疾病研究和診斷量化。 -##### **Causality can systematically address the monsters under the bench(marks)** -2502.05085v1 by Felix Leeb, Zhijing Jin, Bernhard Schölkopf +##### **Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer** +2502.07158v2 by Jiaying Lu, Stephanie R. 
Brown, Songyuan Liu, Shifan Zhao, Kejun Dong, Del Bold, Michael Fundora, Alaa Aljiffry, Alex Fedorov, Jocelyn Grunwell, Xiao Hu -Effective and reliable evaluation is essential for advancing empirical -machine learning. However, the increasing accessibility of generalist models -and the progress towards ever more complex, high-level tasks make systematic -evaluation more challenging. Benchmarks are plagued by various biases, -artifacts, or leakage, while models may behave unreliably due to poorly -explored failure modes. Haphazard treatments and inconsistent formulations of -such "monsters" can contribute to a duplication of efforts, a lack of trust in -results, and unsupported inferences. In this position paper, we argue causality -offers an ideal framework to systematically address these challenges. By making -causal assumptions in an approach explicit, we can faithfully model phenomena, -formulate testable hypotheses with explanatory power, and leverage principled -tools for analysis. To make causal model design more accessible, we identify -several useful Common Abstract Topologies (CATs) in causal graphs which help -gain insight into the reasoning abilities in large language models. Through a -series of case studies, we demonstrate how the precise yet pragmatic language -of causality clarifies the strengths and limitations of a method and inspires -new approaches for systematic progress. +Early prediction of pediatric cardiac arrest (CA) is critical for timely +intervention in high-risk intensive care settings. We introduce PedCA-FT, a +novel transformer-based framework that fuses tabular view of EHR with the +derived textual view of EHR to fully unleash the interactions of +high-dimensional risk factors and their dynamics. By employing dedicated +transformer modules for each modality view, PedCA-FT captures complex temporal +and contextual patterns to produce robust CA risk estimates. 
Evaluated on a +curated pediatric cohort from the CHOA-CICU database, our approach outperforms +ten other artificial intelligence models across five key performance metrics +and identifies clinically meaningful risk factors. These findings underscore +the potential of multimodal fusion techniques to enhance early CA detection and +improve patient care. -摘要:有效的、可靠的評估對於推進經驗機器學習至關重要。然而,一般化模型的可及性日益提高,以及朝著更複雜、更高級別任務的進展,使得系統評估更具挑戰性。基準測試受到各種偏差、人工製品或洩漏的困擾,而模型由於探索不充分的故障模式而可能表現得不可靠。隨意處理和不一致的表述等「怪物」可能會導致重複工作、對結果缺乏信任以及不支援的推論。在本文中,我們論證因果關係提供了一個系統性解決這些挑戰的理想框架。通過在方法中明確因果假設,我們可以忠實地模擬現象,制定具有解釋力的可測試假設,並利用原則性的分析工具。為了使因果模型設計更易於使用,我們在因果圖中識別出幾個有用的通用抽象拓撲 (CAT),有助於深入了解大型語言模型中的推理能力。通過一系列案例研究,我們展示了因果關係的精確但務實的語言如何釐清方法的優缺點,並激發系統進展的新方法。 +摘要:早期預測小兒心臟驟停 (CA) 對於在高風險的重症照護環境中及時介入至關重要。我們引入了 PedCA-FT,一個新穎的基於轉換器的框架,它將 EHR 的表格視圖與 EHR 的派生文本視圖融合在一起,以充分發揮高維風險因素及其動態的交互作用。通過為每個模態視圖採用專用的轉換器模組,PedCA-FT 捕獲複雜的時間和上下文模式,以產生穩健的 CA 風險估計。在 CHOA-CICU 資料庫中策劃的小兒群體中進行評估,我們的做法在五項關鍵績效指標中優於其他十種人工智慧模型,並找出臨床上有意義的風險因素。這些發現強調了多模式融合技術在增強早期 CA 檢測和改善患者照護方面的潛力。 -##### **Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures** -2502.05078v1 by Tushar Pandey, Ara Ghukasyan, Oktay Goktas, Santosh Kumar Radha +##### **Explaining 3D Computed Tomography Classifiers with Counterfactuals** +2502.07156v1 by Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari -Large Language Models (LLMs) have demonstrated impressive reasoning -capabilities, yet their performance is highly dependent on the prompting -strategy and model scale. While reinforcement learning and fine-tuning have -been deployed to boost reasoning, these approaches incur substantial -computational and data overhead. In this work, we introduce Adaptive Graph of -Thoughts (AGoT), a dynamic, graph-based inference framework that enhances LLM -reasoning solely at test time. 
Rather than relying on fixed-step methods like -Chain of Thought (CoT) or Tree of Thoughts (ToT), AGoT recursively decomposes -complex queries into structured subproblems, forming a dynamic directed -acyclic graph (DAG) of interdependent reasoning steps. By selectively expanding -only those subproblems that require further analysis, AGoT unifies the -strengths of chain, tree, and graph paradigms into a cohesive framework that -allocates computation where it is most needed. We validate our approach on -diverse benchmarks spanning multi-hop retrieval, scientific reasoning, and -mathematical problem-solving, achieving up to 46.2% improvement on scientific -reasoning tasks (GPQA) - comparable to gains achieved through computationally -intensive reinforcement learning approaches and outperforming state-of-the-art -iterative approaches. These results suggest that dynamic decomposition and -structured recursion offer a scalable, cost-effective alternative to -post-training modifications, paving the way for more robust, general-purpose -reasoning in LLMs. +Counterfactual explanations in medical imaging are critical for understanding +the predictions made by deep learning models. We extend the Latent Shift +counterfactual generation method from 2D applications to 3D computed tomography +(CT) scans. We address the challenges associated with 3D data, such as limited +training samples and high memory demands, by implementing a slice-based +approach. This method leverages a 2D encoder trained on CT slices, which are +subsequently combined to maintain 3D context. We demonstrate this technique on +two models for clinical phenotype prediction and lung segmentation. Our +approach is both memory-efficient and effective for generating interpretable +counterfactuals in high-resolution 3D medical imaging. 
-摘要:大型語言模型 (LLM) 已展現令人印象深刻的推理能力,但其效能高度依賴於提示策略和模型規模。雖然強化學習和微調已被用於提升推理,但這些方法會造成大量的運算和資料開銷。在這項工作中,我們引入了「適應性思考圖」(AGoT),一個動態的、基於圖形的推論架構,它僅在測試時就能增強 LLM 推理。AGoT 並非依賴於鏈式思考 (CoT) 或樹狀思考 (ToT) 等固定步驟方法,而是遞迴地將複雜的查詢分解成結構化的子問題,形成一個由相互依賴的推理步驟所組成的動態有向無環圖 (DAG)。透過選擇性地僅擴充那些需要進一步分析的子問題,AGoT 將鏈式、樹狀和圖形範例的優勢統一到一個緊密的架構中,將運算分配到最需要的地方。我們在跨越多重跳躍檢索、科學推理和數學問題解決等多樣基準上驗證了我們的做法,在科學推理任務 (GPQA) 上達到了高達 46.2% 的改進,這與透過運算密集的強化學習方法所獲得的增益相當,並且優於最先進的迭代方法。這些結果表明,動態分解和結構化遞迴提供了一個可擴充、具成本效益的替代方案,用於訓練後修改,為 LLM 中更強健、更通用的推理鋪平了道路。 +摘要:反事實解釋在醫學影像中對於理解深度學習模型所做的預測至關重要。我們將 Latent Shift 反事實生成方法從 2D 應用程式延伸到 3D 電腦斷層掃描 (CT) 掃描。我們透過實作基於切片的做法,來解決與 3D 資料相關的挑戰,例如受限的訓練樣本和高記憶體需求。此方法利用經過 CT 切片訓練的 2D 編碼器,隨後將這些切片結合起來以維護 3D 背景。我們在兩個用於臨床表型預測和肺部分割的模型上展示此技術。我們的做法對於在高解析度 3D 醫學影像中產生可解釋的反事實,既節省記憶體又有效。 -##### **Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics** -2502.05239v1 by Hussam Ghanem, Christophe Cruz +##### **Interactive Data Harmonization with LLM Agents** +2502.07132v1 by Aécio Santos, Eduardo H. M. Pena, Roque Lopez, Juliana Freire -Recent advancements in large language models have demonstrated significant -potential in the automated construction of knowledge graphs from unstructured -text. This paper builds upon our previous work [16], which evaluated various -models using metrics like precision, recall, F1 score, triple matching, and -graph matching, and introduces a refined approach to address the critical -issues of hallucination and omission. We propose an enhanced evaluation -framework incorporating BERTScore for graph similarity, setting a practical -threshold of 95% for graph matching. Our experiments focus on the Mistral -model, comparing its original and fine-tuned versions in zero-shot and few-shot -settings. We further extend our experiments using examples from the KELM-sub -training dataset, illustrating that the fine-tuned model significantly improves -knowledge graph construction accuracy while reducing the exact hallucination -and omission. 
However, our findings also reveal that the fine-tuned models -perform worse in generalization tasks on the KELM-sub dataset. This study -underscores the importance of comprehensive evaluation metrics in advancing the -state-of-the-art in knowledge graph construction from textual data. +Data harmonization is an essential task that entails integrating datasets +from diverse sources. Despite years of research in this area, it remains a +time-consuming and challenging task due to schema mismatches, varying +terminologies, and differences in data collection methodologies. This paper +presents the case for agentic data harmonization as a means to both empower +experts to harmonize their data and to streamline the process. We introduce +Harmonia, a system that combines LLM-based reasoning, an interactive user +interface, and a library of data harmonization primitives to automate the +synthesis of data harmonization pipelines. We demonstrate Harmonia in a +clinical data harmonization scenario, where it helps to interactively create +reusable pipelines that map datasets to a standard format. Finally, we discuss +challenges and open problems, and suggest research directions for advancing our +vision. 
-摘要:大型語言模型的最新進展已證明在從非結構化文字自動建構知識圖譜方面具有顯著的潛力。本文建立在我們先前的研究 [16] 之上,該研究使用準確度、召回率、F1 分數、三元組匹配和圖形匹配等指標評估各種模型,並引入了一種改進的方法來解決幻覺和遺漏的關鍵問題。我們提出一個增強的評估框架,結合 BERTScore 來進行圖形相似性,並將圖形匹配的實際閾值設定為 95%。我們的實驗重點在 Mistral 模型上,比較其原始版本和微調版本在零次學習和少量學習的設定中。我們進一步使用 KELM-sub 訓練資料集中的範例來擴展我們的實驗,說明微調後的模型顯著提高了知識圖譜建構的準確度,同時減少了精確的幻覺和遺漏。然而,我們的研究結果也顯示,微調後的模型在 KELM-sub 資料集上的泛化任務表現較差。這項研究強調了全面評估指標在推進從文字資料建構知識圖譜的最新技術方面的重要性。 +摘要:資料調和是一項整合不同來源資料集的重要任務。儘管多年來針對此領域的研究不斷,但由於架構不匹配、術語不同,以及資料收集方法的差異,它仍然是一項耗時且具有挑戰性的任務。本文提出代理資料調和,作為賦能專家調和其資料並簡化流程的方法。我們介紹 Harmonia,一個結合了基於 LLM 的推理、互動式使用者介面和資料調和原語庫的系統,以自動化資料調和管線的合成。我們在臨床資料調和場景中展示了 Harmonia,它有助於互動式建立可重複使用的管線,將資料集對應至標準格式。最後,我們討論挑戰和開放性問題,並建議研究方向以推進我們的願景。 -##### **Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research** -2502.04644v1 by Junde Wu, Jiayuan Zhu, Yuyuan Liu +##### **Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML** +2502.07026v1 by Mohammad Amir Salari, Bahareh Rahmani -We introduce Agentic Reasoning, a framework that enhances large language -model (LLM) reasoning by integrating external tool-using agents. Unlike -conventional LLM-based reasoning approaches, which rely solely on internal -inference, Agentic Reasoning dynamically engages web search, code execution, -and structured reasoning-context memory to solve complex problems requiring -deep research and multi-step logical deduction. Our framework introduces the -Mind Map agent, which constructs a structured knowledge graph to track logical -relationships, improving deductive reasoning. Additionally, the integration of -web-search and coding agents enables real-time retrieval and computational -analysis, enhancing reasoning accuracy and decision-making. Evaluations on -PhD-level scientific reasoning (GPQA) and domain-specific deep research tasks -demonstrate that our approach significantly outperforms existing models, -including leading retrieval-augmented generation (RAG) systems and -closed-source LLMs. 
Moreover, our results indicate that agentic reasoning -improves expert-level knowledge synthesis, test-time scalability, and -structured problem-solving. The code is at: -https://github.com/theworldofagents/Agentic-Reasoning. +Machine learning (ML) is transforming healthcare by enabling predictive +analytics, personalized treatments, and improved patient outcomes. However, +traditional ML workflows require specialized skills, infrastructure, and +resources, limiting accessibility for many healthcare professionals. This paper +explores how Google Cloud's BigQuery ML simplifies the development and +deployment of ML models using SQL, reducing technical barriers. Through a case +study on diabetes prediction using the Diabetes Health Indicators Dataset, we +evaluate three predictive models: Logistic Regression, Boosted Tree, and Deep +Neural Network (DNN). Our results demonstrate that the Boosted Tree model +achieves the highest performance, making it highly effective for diabetes +prediction. This study highlights BigQuery ML's role in democratizing machine +learning by providing a scalable, efficient, and accessible solution for +healthcare analytics. 
-摘要:我們引入了代理推理,一個透過整合外部工具使用代理來增強大型語言模型 (LLM) 推理的框架。與僅依賴於內部推論的傳統基於 LLM 的推理方法不同,代理推理動態地運用網路搜尋、程式碼執行和結構化推理情境記憶來解決需要深入研究和多步驟邏輯推論的複雜問題。我們的框架引入了心智圖代理,它建立一個結構化的知識圖譜來追蹤邏輯關係,改善演繹推理。此外,整合網路搜尋和編碼代理能進行即時擷取和運算分析,增強推理準確度和決策制定。在博士等級科學推理 (GPQA) 和特定領域的深入研究任務上的評估顯示,我們的做法明顯優於現有模型,包括領先的檢索增強生成 (RAG) 系統和封閉原始碼 LLM。此外,我們的結果顯示,代理推理改進了專家級知識綜合、測試時間可擴充性和結構化問題解決。程式碼在:https://github.com/theworldofagents/Agentic-Reasoning。 +摘要:機器學習 (ML) 透過啟用預測分析、個人化治療和改善病患結果,正在轉型醫療保健。然而,傳統的 ML 工作流程需要專業技能、基礎設施和資源,限制了許多醫療保健專業人員的可及性。本文探討 Google Cloud 的 BigQuery ML 如何使用 SQL 簡化 ML 模型的開發和部署,降低技術障礙。透過使用糖尿病健康指標資料集對糖尿病預測進行個案研究,我們評估了三個預測模型:邏輯迴歸、提升樹和深度神經網路 (DNN)。我們的結果證明,提升樹模型達到了最高的效能,使其對於糖尿病預測非常有效。這項研究強調了 BigQuery ML 在民主化機器學習中扮演的角色,提供可擴充、有效率且可存取的醫療保健分析解決方案。 -##### **Position-aware Automatic Circuit Discovery** -2502.04577v1 by Tal Haklay, Hadas Orgad, David Bau, Aaron Mueller, Yonatan Belinkov +##### **AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements** +2502.07022v1 by Adriana Eufrosiana Bora, Pierre-Luc St-Charles, Mirko Bronzi, Arsène Fansi Tchango, Bruno Rousseau, Kerrie Mengersen -A widely used strategy to discover and understand language model mechanisms -is circuit analysis. A circuit is a minimal subgraph of a model's computation -graph that executes a specific task. We identify a gap in existing circuit -discovery methods: they assume circuits are position-invariant, treating model -components as equally relevant across input positions. This limits their -ability to capture cross-positional interactions or mechanisms that vary across -positions. To address this gap, we propose two improvements to incorporate -positionality into circuits, even on tasks containing variable-length examples. -First, we extend edge attribution patching, a gradient-based method for circuit -discovery, to differentiate between token positions. 
Second, we introduce the -concept of a dataset schema, which defines token spans with similar semantics -across examples, enabling position-aware circuit discovery in datasets with -variable length examples. We additionally develop an automated pipeline for -schema generation and application using large language models. Our approach -enables fully automated discovery of position-sensitive circuits, yielding -better trade-offs between circuit size and faithfulness compared to prior work. +Despite over a decade of legislative efforts to address modern slavery in the +supply chains of large corporations, the effectiveness of government oversight +remains hampered by the challenge of scrutinizing thousands of statements +annually. While Large Language Models (LLMs) can be considered a well +established solution for the automatic analysis and summarization of documents, +recognizing concrete modern slavery countermeasures taken by companies and +differentiating those from vague claims remains a challenging task. To help +evaluate and fine-tune LLMs for the assessment of corporate statements, we +introduce a dataset composed of 5,731 modern slavery statements taken from the +Australian Modern Slavery Register and annotated at the sentence level. This +paper details the construction steps for the dataset that include the careful +design of annotation specifications, the selection and preprocessing of +statements, and the creation of high-quality annotation subsets for effective +model evaluations. To demonstrate our dataset's utility, we propose a machine +learning methodology for the detection of sentences relevant to mandatory +reporting requirements set by the Australian Modern Slavery Act. We then follow +this methodology to benchmark modern language models under zero-shot and +supervised learning settings. 
-摘要:廣泛用於發現和了解語言模型機制的策略是電路分析。電路是模型計算圖的最小子圖,可執行特定任務。我們找出電路發現方法中的一個缺口:它們假設電路與位置無關,將模型組件視為在輸入位置中同樣相關。這限制了它們捕捉跨位置互動或在不同位置中變化的機制的能力。為了解決這個缺口,我們提出兩項改進,將位置性納入電路中,即使在包含變長範例的任務中也是如此。首先,我們擴充邊緣屬性修補,一種基於梯度的電路發現方法,以區分符號位置。其次,我們引入了資料集架構的概念,它定義了在範例中具有類似語義的符號跨距,使我們可以在具有變長範例的資料集中進行與位置相關的電路發現。此外,我們開發了一個自動化管線,用於使用大型語言模型進行架構生成和應用。我們的做法能讓位置敏感電路的發現完全自動化,與先前的研究相比,在電路大小和忠實度之間產生了更好的權衡。 +摘要:儘管立法努力超過十年,旨在解決大型企業供應鏈中的現代奴隸制,但政府監督的有效性仍然受到每年審查數千份聲明的挑戰所阻礙。雖然大型語言模型(LLM)可以被認為是文件自動分析和摘要的完善解決方案,但要辨識公司採取的具體現代奴隸制對策,並將其與含糊的聲明區分開來,仍然是一項具有挑戰性的任務。為了幫助評估和微調 LLM 以評估企業聲明,我們引入了一個由 5,731 份現代奴隸制聲明組成的資料集,這些聲明取自澳洲現代奴隸制註冊處,並在句子層級進行註解。本文詳細說明了資料集的建構步驟,其中包括註解規格的仔細設計、聲明的選擇和預處理,以及用於有效模型評估的高品質註解子集的建立。為了展示我們的資料集的效用,我們提出了一種機器學習方法,用於檢測與澳洲現代奴隸制法規定的強制性報告要求相關的句子。然後,我們遵循這種方法,在零次學習和監督學習設定下對現代語言模型進行基準測試。 -##### **Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems** -2502.04510v1 by Shangbin Feng, Zifeng Wang, Palash Goyal, Yike Wang, Weijia Shi, Huang Xia, Hamid Palangi, Luke Zettlemoyer, Yulia Tsvetkov, Chen-Yu Lee, Tomas Pfister +##### **Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium** +2502.06693v1 by Amin Adibi, Xu Cao, Zongliang Ji, Jivat Neet Kaur, Winston Chen, Elizabeth Healey, Brighton Nuwagira, Wenqian Ye, Geoffrey Woollard, Maxwell A Xu, Hejie Cui, Johnny Xi, Trenton Chang, Vasiliki Bikia, Nicole Zhang, Ayush Noori, Yuan Xia, Md. Belal Hossain, Hanna A. Frank, Alina Peluso, Yuan Pu, Shannon Zejiang Shen, John Wu, Adibvafa Fallahpour, Sazan Mahbub, Ross Duncan, Yuwei Zhang, Yurui Cao, Zuheng Xu, Michael Craig, Rahul G. Krishnan, Rahmatollah Beheshti, James M. Rehg, Mohammad Ehsanul Karim, Megan Coffee, Leo Anthony Celi, Jason Alan Fries, Mohsen Sadatsafavi, Dennis Shung, Shannon McWeeney, Jessica Dafflon, Sarah Jabbour -We propose Heterogeneous Swarms, an algorithm to design multi-LLM systems by -jointly optimizing model roles and weights. 
We represent multi-LLM systems as -directed acyclic graphs (DAGs) of LLMs with topological message passing for -collaborative generation. Given a pool of LLM experts and a utility function, -Heterogeneous Swarms employs two iterative steps: role-step and weight-step. -For role-step, we interpret model roles as learning a DAG that specifies the -flow of inputs and outputs between LLMs. Starting from a swarm of random -continuous adjacency matrices, we decode them into discrete DAGs, call the LLMs -in topological order, evaluate on the utility function (e.g. accuracy on a -task), and optimize the adjacency matrices with particle swarm optimization -based on the utility score. For weight-step, we assess the contribution of -individual LLMs in the multi-LLM systems and optimize model weights with swarm -intelligence. We propose JFK-score to quantify the individual contribution of -each LLM in the best-found DAG of the role-step, then optimize model weights -with particle swarm optimization based on the JFK-score. Experiments -demonstrate that Heterogeneous Swarms outperforms 15 role- and/or weight-based -baselines by 18.5% on average across 12 tasks. Further analysis reveals that -Heterogeneous Swarms discovers multi-LLM systems with heterogeneous model roles -and substantial collaborative gains, and benefits from the diversity of -language models. +The fourth Machine Learning for Health (ML4H) symposium was held in person on +December 15th and 16th, 2024, in the traditional, ancestral, and unceded +territories of the Musqueam, Squamish, and Tsleil-Waututh Nations in Vancouver, +British Columbia, Canada. The symposium included research roundtable sessions +to foster discussions between participants and senior researchers on timely and +relevant topics for the ML4H community. The organization of the research +roundtables at the conference involved 13 senior and 27 junior chairs across 13 +tables. 
Each roundtable session included an invited senior chair (with +substantial experience in the field), junior chairs (responsible for +facilitating the discussion), and attendees from diverse backgrounds with an +interest in the session's topic. -摘要:我們提出異質群體,一種演算法,透過共同最佳化模型角色和權重來設計多 LLM 系統。我們將多 LLM 系統表示為 LLM 的有向非循環圖 (DAG),並透過拓撲訊息傳遞進行協作產生。給定一組 LLM 專家和一個效用函數,異質群體使用兩個反覆步驟:角色步驟和權重步驟。對於角色步驟,我們將模型角色解釋為學習一個 DAG,它指定 LLM 之間輸入和輸出的流動。從一組隨機連續鄰接矩陣開始,我們將它們解碼為離散 DAG,以拓撲順序呼叫 LLM,根據效用函數(例如任務的準確度)進行評估,並根據效用分數使用粒子群最佳化最佳化鄰接矩陣。對於權重步驟,我們評估個別 LLM 在多 LLM 系統中的貢獻,並使用群體智慧最佳化模型權重。我們提出 JFK 分數來量化每個 LLM 在角色步驟中找到的最佳 DAG 中的個別貢獻,然後根據 JFK 分數使用粒子群最佳化最佳化模型權重。實驗表明,異質群體在 12 項任務中平均比 15 個基於角色和/或權重的基線高出 18.5%。進一步的分析表明,異質群體發現具有異質模型角色和大量協作收益的多 LLM 系統,並受益於語言模型的多樣性。 +摘要:第四屆醫療機器學習 (ML4H) 研討會於 2024 年 12 月 15 日和 16 日在加拿大不列顛哥倫比亞省溫哥華的 Musqueam、Squamish 和 Tsleil-Waututh 國家的傳統、祖先和未割讓領土上舉行。研討會包括研究圓桌會議,以促進參與者和高級研究人員之間關於 ML4H 社群的及時和相關主題的討論。在會議上組織研究圓桌會議涉及 13 張桌子上的 13 位高級主席和 27 位初級主席。每個圓桌會議都包括一位受邀的高級主席(在該領域擁有豐富的經驗)、初級主席(負責促進討論)以及對會議主題感興趣的來自不同背景的與會者。 -##### **MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot** -2502.04413v1 by Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao +##### **Automatic Evaluation of Healthcare LLMs Beyond Question-Answering** +2502.06666v1 by Anna Arias-Duart, Pablo Agustin Martin-Torres, Daniel Hinjos, Pablo Bernabeu-Perez, Lucia Urcelay Ganzabal, Marta Gonzalez Mallo, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Sergio Alvarez-Napagao, Dario Garcia-Gasulla -Retrieval-augmented generation (RAG) is a well-suited technique for -retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a -key module of the healthcare copilot, helping reduce misdiagnosis for -healthcare practitioners and patients. However, the diagnostic accuracy and -specificity of existing heuristic-based RAG models used in the medical domain -are inadequate, particularly for diseases with similar manifestations. 
This -paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited -reasoning for the medical domain that retrieves diagnosis and treatment -recommendations based on manifestations. MedRAG systematically constructs a -comprehensive four-tier hierarchical diagnostic KG encompassing critical -diagnostic differences of various diseases. These differences are dynamically -integrated with similar EHRs retrieved from an EHR database, and reasoned -within a large language model. This process enables more accurate and specific -decision support, while also proactively providing follow-up questions to -enhance personalized medical decision-making. MedRAG is evaluated on both a -public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD) -collected from Tan Tock Seng Hospital, and its performance is compared against -various existing RAG methods. Experimental results show that, leveraging the -information integration and relational abilities of the KG, our MedRAG provides -more specific diagnostic insights and outperforms state-of-the-art models in -reducing misdiagnosis rates. Our code will be available at -https://github.com/SNOWTEAM2023/MedRAG +Current Large Language Model (LLM) benchmarks are often based on open-ended +or close-ended QA evaluations, avoiding the requirement of human labor. +Close-ended measurements evaluate the factuality of responses but lack +expressiveness. Open-ended measurements capture the model's capacity to produce discourse +responses but are harder to assess for correctness. These two approaches are +commonly used, either independently or together, though their relationship +remains poorly understood. This work is focused on the healthcare domain, where +both factuality and discourse matter greatly. It introduces a comprehensive, +multi-axis suite for healthcare LLM evaluation, exploring correlations between +open and close benchmarks and metrics. Findings include blind spots and +overlaps in current methodologies. 
As an updated sanity check, we release a new +medical benchmark --CareQA-- with both open and closed variants. Finally, we +propose a novel metric for open-ended evaluations -- Relaxed Perplexity -- to +mitigate the identified limitations. -摘要:檢索增強生成 (RAG) 是一種適用於檢索隱私敏感的電子健康記錄 (EHR) 的技術。它可以作為醫療保健副駕駛的一個關鍵模組,協助減少醫療保健從業人員和患者的誤診。然而,在醫療領域中使用的現有基於啟發法的 RAG 模型的診斷準確性和特異性不足,特別是對於具有類似表現的疾病。本文提出 MedRAG,一種由知識圖譜 (KG) 引發的推理增強的 RAG 模型,用於醫療領域,它根據表現檢索診斷和治療建議。MedRAG 系統性地構建了一個全面的四層階層式診斷 KG,涵蓋各種疾病的關鍵診斷差異。這些差異與從 EHR 資料庫中檢索到的類似 EHR 動態整合,並在大型語言模型中進行推理。這個過程可以實現更準確和具體的決策支援,同時主動提供後續問題,以增強個人化醫療決策制定。MedRAG 在公共資料集 DDXPlus 和從陳篤生醫院收集的私人慢性疼痛診斷資料集 (CPDD) 上進行評估,並將其效能與各種現有 RAG 方法進行比較。實驗結果顯示,利用 KG 的資訊整合和關係能力,我們的 MedRAG 提供了更具體的診斷見解,並在降低誤診率方面優於最先進的模型。我們的程式碼將在 https://github.com/SNOWTEAM2023/MedRAG 提供 +摘要:當前大型語言模型 (LLM) 基準通常基於開放式或封閉式問答評量,避免了人力需求。封閉式測量評估回應的事實性,但缺乏表達力。開放式測量捕捉模型產生論述回應的能力,但較難評估正確性。這兩種方法通常獨立或合併使用,儘管它們之間的關係仍然知之甚少。這項工作專注於醫療保健領域,在該領域中,事實性和論述都非常重要。它引入了一個全面的多軸套件,用於醫療保健 LLM 評量,探索開放式和封閉式基準和指標之間的關聯性。研究結果包括當前方法中的盲點和重疊。作為更新的健全性檢查,我們發布了一個新的醫療基準--CareQA--,包含開放式和封閉式變體。最後,我們提出了一個用於開放式評量的全新指標--放鬆困惑度--以減輕已識別的限制。 -##### **Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering** -2502.03992v1 by Longquan Jiang, Junbo Huang, Cedric Möller, Ricardo Usbeck +##### **Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging** +2502.06632v1 by Mohammed Abdul Hafeez Khan, Samuel Morries Boddepalli, Siddhartha Bhattacharyya, Debasis Mitra -Most existing Knowledge Graph Question Answering (KGQA) approaches are -designed for a specific KG, such as Wikidata, DBpedia or Freebase. Due to the -heterogeneity of the underlying graph schema, topology and assertions, most -KGQA systems cannot be transferred to unseen Knowledge Graphs (KGs) without -resource-intensive training data. We present OntoSCPrompt, a novel Large -Language Model (LLM)-based KGQA approach with a two-stage architecture that -separates semantic parsing from KG-dependent interactions. 
OntoSCPrompt first -generates a SPARQL query structure (including SPARQL keywords such as SELECT, -ASK, WHERE and placeholders for missing tokens) and then fills them with -KG-specific information. To enhance the understanding of the underlying KG, we -present an ontology-guided, hybrid prompt learning strategy that integrates KG -ontology into the learning process of hybrid prompts (e.g., discrete and -continuous vectors). We also present several task-specific decoding strategies -to ensure the correctness and executability of generated SPARQL queries in both -stages. Experimental results demonstrate that OntoSCPrompt performs as well as -SOTA approaches without retraining on a number of KGQA datasets such as CWQ, -WebQSP and LC-QuAD 1.0 in a resource-efficient manner and can generalize well -to unseen domain-specific KGs like DBLP-QuAD and CoyPu KG Code: -\href{https://github.com/LongquanJiang/OntoSCPrompt}{https://github.com/LongquanJiang/OntoSCPrompt} +Accurate classification and anatomical localization are essential for +effective medical diagnostics and research, which may be efficiently performed +using deep learning techniques. However, the limited availability of labeled data +poses a significant challenge. To address this, we adapted Prototypical +Networks and the Propagation-Reconstruction Network (PRNet) for few-shot +classification and localization, respectively, in Single Photon Emission +Computed Tomography (SPECT) images. For the proof of concept, we used a +2D-sliced image cropped around the heart. The Prototypical Network, with a +pre-trained ResNet-18 backbone, classified ventricles, myocardium, and liver +tissues with 96.67% training and 93.33% validation accuracy. PRNet, adapted for +2D imaging with an encoder-decoder architecture and skip connections, achieved +a training loss of 1.395, accurately reconstructing patches and capturing +spatial relationships. 
These results highlight the potential of Prototypical +Networks for tissue classification with limited labeled data and PRNet for +anatomical landmark localization, paving the way for improved performance in +deep learning frameworks. -摘要:現有的知識圖譜問答(KGQA)方法大多是為特定 KG 而設計的,例如 Wikidata、DBpedia 或 Freebase。由於底層圖形模式、拓撲和斷言的異質性,大多數 KGQA 系統無法在沒有資源密集型訓練資料的情況下轉移到未見過的知識圖譜(KG)。我們提出 OntoSCPrompt,這是一種基於大型語言模型(LLM)的新型 KGQA 方法,採用兩階段架構,將語義解析與依賴 KG 的互動分開。OntoSCPrompt 首先生成 SPARQL 查詢結構(包括 SPARQL 關鍵字,例如 SELECT、ASK、WHERE 和缺失令牌的佔位符),然後用 KG 特定的資訊填寫它們。為了增強對底層 KG 的理解,我們提出了一種由本体指導的混合提示學習策略,將 KG 本体整合到混合提示(例如,離散和連續向量)的學習過程中。我們還提出了多種特定任務的解碼策略,以確保在兩個階段中生成的 SPARQL 查詢的正確性和可執行性。實驗結果表明,OntoSCPrompt 在 CWQ、WebQSP 和 LC-QuAD 1.0 等多個 KGQA 資料集上執行時,效能與 SOTA 方法一樣好,且資源使用效率高,並且可以很好地概括到未見過的特定領域 KG,例如 DBLP-QuAD 和 CoyPu KG Code: -\href{https://github.com/LongquanJiang/OntoSCPrompt}{https://github.com/LongquanJiang/OntoSCPrompt} +摘要:精確的分類和解剖定位對於有效的醫療診斷和研究至關重要,而這可以使用深度學習技術有效執行。然而,標記資料有限的取得會造成重大的挑戰。為了解決這個問題,我們分別調整了原型網路和傳播重建網路 (PRNet),用於單光子發射電腦斷層掃描 (SPECT) 影像中的少量分類和定位。為了證明這個概念,我們使用圍繞心臟裁切的 2D 切片影像。原型網路,使用預先訓練的 ResNet-18 主幹,對心室、心肌和肝臟組織進行分類,訓練準確度為 96.67%,驗證準確度為 93.33%。PRNet,調整為使用編碼器解碼器架構和跳躍連接的 2D 影像,達到了 1.395 的訓練損失,精確地重建了區塊並擷取了空間關係。這些結果突出了原型網路在標記資料有限的情況下進行組織分類的潛力,以及 PRNet 在解剖標誌定位方面的潛力,為深度學習架構中效能的提升鋪平了道路。 -##### **Multimodal Medical Code Tokenizer** -2502.04397v2 by Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik +##### **Illegal Waste Detection in Remote Sensing Images: A Case Study** +2502.06607v2 by Federico Gibellini, Piero Fraternali, Giacomo Boracchi, Luca Morandini, Andrea Diecidue, Simona Malegori -Foundation models trained on patient electronic health records (EHRs) require -tokenizing medical data into sequences of discrete vocabulary items. Existing -tokenizers treat medical codes from EHRs as isolated textual tokens. 
However, -each medical code is defined by its textual description, its position in -ontological hierarchies, and its relationships to other codes, such as disease -co-occurrences and drug-treatment associations. Medical vocabularies contain -more than 600,000 codes with critical information for clinical reasoning. We -introduce MedTok, a multimodal medical code tokenizer that uses the text -descriptions and relational context of codes. MedTok processes text using a -language model encoder and encodes the relational structure with a graph -encoder. It then quantizes both modalities into a unified token space, -preserving modality-specific and cross-modality information. We integrate -MedTok into five EHR models and evaluate it on operational and clinical tasks -across in-patient and out-patient datasets, including outcome prediction, -diagnosis classification, drug recommendation, and risk stratification. -Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR -models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with -the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate -using MedTok tokenizer with medical QA systems. Our results demonstrate the -potential of MedTok as a unified tokenizer for medical codes, improving -tokenization for medical foundation models. +Environmental crime currently represents the third largest criminal activity +worldwide while threatening ecosystems as well as human health. Among the +crimes related to this activity, improper waste management can nowadays be +countered more easily thanks to the increasing availability and decreasing cost +of Very-High-Resolution Remote Sensing images, which enable semi-automatic +territory scanning in search of illegal landfills. This paper proposes a +pipeline, developed in collaboration with professionals from a local +environmental agency, for detecting candidate illegal dumping sites leveraging +a classifier of Remote Sensing images. 
To identify the best configuration for +such classifier, an extensive set of experiments was conducted and the impact +of diverse image characteristics and training settings was thoroughly analyzed. +The local environmental agency was then involved in an experimental exercise +where outputs from the developed classifier were integrated in the experts' +everyday work, resulting in time savings with respect to manual +photo-interpretation. The classifier was eventually run with valuable results +on a location outside of the training area, highlighting potential for +cross-border applicability of the proposed pipeline. -摘要:在患者电子健康记录 (EHR) 上训练的基础模型需要将医学数据标记为离散词汇项序列。现有的标记器将 EHR 中的医学代码视为孤立的文本标记。然而,每个医学代码都由其文本描述、在本体层次结构中的位置以及与其他代码的关系(例如疾病共现和药物治疗关联)来定义。医学词汇表包含超过 600,000 个代码,这些代码包含临床推理的关键信息。我们引入了 MedTok,这是一种多模态医学代码标记器,它使用文本描述和代码的关系上下文。MedTok 使用语言模型编码器处理文本,并使用图编码器对关系结构进行编码。然后,它将这两种模态量化为一个统一的标记空间,保留特定于模态和跨模态的信息。我们将 MedTok 集成到五个 EHR 模型中,并在住院和门诊数据集(包括结果预测、诊断分类、药物推荐和风险分层)上对其实施操作和临床任务进行评估。用 MedTok 替换标准 EHR 标记器可提高所有 EHR 模型的 AUPRC,在 MIMIC-III 上提高 4.10%,在 MIMIC-IV 上提高 4.78%,在 EHRShot 上提高 11.30%,其中药物推荐的增益最大。除了 EHR 建模之外,我们还演示了将 MedTok 标记器与医学问答系统结合使用。我们的结果证明了 MedTok 作为医学代码的统一标记器的潜力,改进了医学基础模型的标记化。 +摘要:環境犯罪目前是全球第三大犯罪活動,威脅生態系統和人類健康。在與此活動相關的犯罪中,不當廢物管理現在可以更容易地得到解決,這要歸功於超高解析度遙測影像越來越普及且成本下降,這使得半自動領土掃描能夠搜尋非法垃圾掩埋場。本文提出了一條管道,與當地環境機構的專業人士合作開發,用於檢測候選非法傾倒地點,利用遙測影像分類器。為了找出這種分類器的最佳配置,進行了一系列廣泛的實驗,並徹底分析了不同影像特徵和訓練設定的影響。然後,當地環境機構參與了一項實驗練習,其中將已開發分類器的輸出整合到專家的日常工作中,從而節省了人工照片解譯的時間。最後在訓練區域外的某個位置執行分類器,獲得了有價值的結果,突出了所提出管道的跨境適用性潛力。 -##### **Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents** -2502.04392v1 by Chenyang Shao, Xinyuan Hu, Yutang Lin, Fengli Xu +##### **FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model** +2502.06438v1 by Anna Tegon, Thorir Mar Ingolfsson, Xiaying Wang, Luca Benini, Yawei Li -The rapid expansion of web content has made on-device AI assistants -indispensable for helping users manage the increasing 
complexity of online -tasks. The emergent reasoning ability in large language models offer a -promising path for next-generation on-device AI agents. However, deploying -full-scale Large Language Models (LLMs) on resource-limited local devices is -challenging. In this paper, we propose Division-of-Thoughts (DoT), a -collaborative reasoning framework leveraging the synergy between locally -deployed Smaller-scale Language Models (SLMs) and cloud-based LLMs. DoT -leverages a Task Decomposer to elicit the inherent planning abilities in -language models to decompose user queries into smaller sub-tasks, which allows -hybrid language models to fully exploit their respective strengths. Besides, -DoT employs a Task Scheduler to analyze the pair-wise dependency of sub-tasks -and create a dependency graph, facilitating parallel reasoning of sub-tasks and -the identification of key steps. To allocate the appropriate model based on the -difficulty of sub-tasks, DoT leverages a Plug-and-Play Adapter, which is an -additional task head attached to the SLM that does not alter the SLM's -parameters. To boost adapter's task allocation capability, we propose a -self-reinforced training method that relies solely on task execution feedback. -Extensive experiments on various benchmarks demonstrate that our DoT -significantly reduces LLM costs while maintaining competitive reasoning -accuracy. Specifically, DoT reduces the average reasoning time and API costs by -66.12% and 83.57%, while achieving comparable reasoning accuracy with the best -baseline methods. +Accurate and efficient electroencephalography (EEG) analysis is essential for +detecting seizures and artifacts in long-term monitoring, with applications +spanning hospital diagnostics to wearable health devices. Robust EEG analytics +have the potential to greatly improve patient care. 
However, traditional deep +learning models, especially Transformer-based architectures, are hindered by +their quadratic time and memory complexity, making them less suitable for +resource-constrained environments. To address these challenges, we present +FEMBA (Foundational EEG Mamba + Bidirectional Architecture), a novel +self-supervised framework that establishes new efficiency benchmarks for EEG +analysis through bidirectional state-space modeling. Unlike Transformer-based +models, which incur quadratic time and memory complexity, FEMBA scales linearly +with sequence length, enabling more scalable and efficient processing of +extended EEG recordings. Trained on over 21,000 hours of unlabeled EEG and +fine-tuned on three downstream tasks, FEMBA achieves competitive performance in +comparison with Transformer models, with significantly lower computational +cost. Specifically, it reaches 81.82% balanced accuracy (0.8921 AUROC) on TUAB +and 0.949 AUROC on TUAR, while a tiny 7.8M-parameter variant demonstrates +viability for resource-constrained devices. These results pave the way for +scalable, general-purpose EEG analytics in clinical settings and highlight FEMBA as +a promising candidate for wearable applications. 
-摘要:網頁內容快速擴充,使得行動裝置上的 AI 助理在協助使用者管理日益複雜的線上工作上變得不可或缺。大型語言模型中浮現的推理能力為新一代行動裝置上的 AI 代理提供了一條有希望的途徑。然而,在資源有限的本機裝置上部署全規模的大型語言模型 (LLM) 是一項挑戰。在本文中,我們提出了思想分工 (DoT),一個協作推理框架,利用了本地部署的小型語言模型 (SLM) 與雲端 LLM 之間的協同效應。DoT 利用任務分解器引出語言模型中固有的規劃能力,將使用者查詢分解成較小的子任務,這允許混合語言模型充分發揮其各自的優勢。此外,DoT 雇用了一個任務排程器來分析子任務的成對依賴性並建立一個依賴性圖,促進子任務的並行推理和關鍵步驟的識別。為了根據子任務的難度分配適當的模型,DoT 利用了即插即用適配器,這是一個附加在 SLM 上的任務頭,不會改變 SLM 的參數。為了提升適配器的任務分配能力,我們提出了一種自我強化訓練方法,它僅依賴於任務執行回饋。在各種基準上的廣泛實驗表明,我們的 DoT 大幅降低了 LLM 成本,同時維持了有競爭力的推理準確度。具體來說,DoT 將平均推理時間和 API 成本分別降低了 66.12% 和 83.57%,同時達到了與最佳基準方法相當的推理準確度。 +摘要:準確且有效的腦電圖 (EEG) 分析對於偵測長時間監控中的癲癇發作和偽像至關重要,其應用範圍涵蓋醫院診斷到可穿戴式健康裝置。穩健的 EEG 分析具有大幅改善病患照護的潛力。然而,傳統深度學習模型,特別是基於 Transformer 的架構,受到其二次時間和記憶體複雜度的阻礙,使其不太適合資源受限的環境。為了應對這些挑戰,我們提出 FEMBA (基礎 EEG Mamba + 雙向架構),一種創新的自我監督架構,透過雙向狀態空間建模為 EEG 分析建立新的效率基準。與會產生二次時間和記憶體複雜度的基於 Transformer 的模型不同,FEMBA 隨著序列長度線性縮放,支援更具可擴充性和效率的延伸 EEG 記錄處理。FEMBA 在超過 21,000 小時的未標記 EEG 上訓練並在三個下游任務上進行微調,與Transformer模型相比,在計算成本顯著降低的情況下,實現了具有競爭力的效能。具體來說,它在 TUAB 上達到 81.82% 的平衡準確度 (0.8921 AUROC) 和在 TUAR 上達到 0.949 AUROC,而一個微小的 7.8M 參數變體證明了其在資源受限裝置上的可行性。這些結果為臨床和可穿戴應用中可擴充的通用 EEG 分析鋪平了道路,並突顯 FEMBA 是可穿戴應用中一個有前景的候選者。 -##### **Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models** -2502.03715v1 by Rui Cai, Chao Wang, Qianyi Cai, Dazhong Shen, Hui Xiong +##### **Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?** +2502.06289v1 by Qingshan Hou, Yukun Zhou, Jocelyn Hui Lin Goh, Ke Zou, Samantha Min Er Yew, Sahana Srinivasan, Meng Wang, Thaddaeus Lo, Xiaofeng Lei, Siegfried K. Wagner, Mark A. Chia, Dawei Yang, Hongyang Jiang, AnRan Ran, Rui Santos, Gabor Mark Somfai, Juan Helen Zhou, Haoyu Chen, Qingyu Chen, Carol Yim-Lui Cheung, Pearse A. Keane, Yih Chung Tham -Knowledge Graph-based recommendations have gained significant attention due -to their ability to leverage rich semantic relationships. 
However, constructing -and maintaining Knowledge Graphs (KGs) is resource-intensive, and the accuracy -of KGs can suffer from noisy, outdated, or irrelevant triplets. Recent -advancements in Large Language Models (LLMs) offer a promising way to improve -the quality and relevance of KGs for recommendation tasks. Despite this, -integrating LLMs into KG-based systems presents challenges, such as efficiently -augmenting KGs, addressing hallucinations, and developing effective joint -learning methods. In this paper, we propose the Confidence-aware KG-based -Recommendation Framework with LLM Augmentation (CKG-LLMA), a novel framework -that combines KGs and LLMs for recommendation task. The framework includes: (1) -an LLM-based subgraph augmenter for enriching KGs with high-quality -information, (2) a confidence-aware message propagation mechanism to filter -noisy triplets, and (3) a dual-view contrastive learning method to integrate -user-item interactions and KG data. Additionally, we employ a confidence-aware -explanation generation process to guide LLMs in producing realistic -explanations for recommendations. Finally, extensive experiments demonstrate -the effectiveness of CKG-LLMA across multiple public datasets. +The advent of foundation models (FMs) is transforming the medical domain. In +ophthalmology, RETFound, a retina-specific FM pre-trained sequentially on 1.4 +million natural images and 1.6 million retinal images, has demonstrated high +adaptability across clinical applications. Conversely, DINOv2, a +general-purpose vision FM pre-trained on 142 million natural images, has shown +promise in non-medical domains. However, its applicability to clinical tasks +remains underexplored. 
To address this, we conducted head-to-head evaluations +by fine-tuning RETFound and three DINOv2 models (large, base, small) for ocular +disease detection and systemic disease prediction tasks, across eight +standardized open-source ocular datasets, as well as the Moorfields AlzEye and +the UK Biobank datasets. DINOv2-large model outperformed RETFound in detecting +diabetic retinopathy (AUROC=0.850-0.952 vs 0.823-0.944, across three datasets, +all P<=0.007) and multi-class eye diseases (AUROC=0.892 vs. 0.846, P<0.001). In +glaucoma, DINOv2-base model outperformed RETFound (AUROC=0.958 vs 0.940, +P<0.001). Conversely, RETFound achieved superior performance over all DINOv2 +models in predicting heart failure, myocardial infarction, and ischaemic stroke +(AUROC=0.732-0.796 vs 0.663-0.771, all P<0.001). These trends persisted even +with 10% of the fine-tuning data. These findings showcase the distinct +scenarios where general-purpose and domain-specific FMs excel, highlighting the +importance of aligning FM selection with task-specific requirements to optimise +clinical performance. 
-摘要:基於知識圖譜的推薦因其利用豐富語義關係的能力而備受關注。然而,構建和維護知識圖譜 (KG) 是一項資源密集型任務,而 KG 的準確性可能會受到雜訊、過時或無關的三元組的影響。大型語言模型 (LLM) 的最新進展為提高 KG 在推薦任務中的品質和相關性提供了一種有前途的方法。儘管如此,將 LLM 整合到基於 KG 的系統中會帶來挑戰,例如有效擴充 KG、處理幻覺,以及開發有效的聯合學習方法。在本文中,我們提出具有 LLM 擴充的信心感知型基於 KG 的推薦框架 (CKG-LLMA),這是一個結合 KG 和 LLM 進行推薦任務的新穎框架。該框架包括:(1) 一個基於 LLM 的子圖擴充器,用於使用高品質資訊豐富 KG,(2) 一個信心感知型訊息傳播機制,用於過濾雜訊三元組,以及 (3) 一個雙視圖對比學習方法,用於整合使用者-項目互動和 KG 資料。此外,我們採用一個信心感知型解釋產生程序,以引導 LLM 為推薦產生逼真的解釋。最後,大量的實驗證明了 CKG-LLMA 在多個公開資料集中的有效性。 +摘要:基礎模型 (FM) 的出現正在轉變醫療領域。在眼科,RETFound 是一個視網膜專用 FM,依序使用 140 萬張自然影像和 160 萬張視網膜影像進行預訓練,已展現出高度適應性,可應用於各種臨床應用。相反地,DINOv2 是一個通用視覺 FM,使用 1.42 億張自然影像進行預訓練,已展現出在非醫療領域的潛力。然而,其在臨床任務中的適用性仍未被充分探索。為了解決這個問題,我們針對眼部疾病偵測和全身性疾病預測任務,對 RETFound 和三個 DINOv2 模型(大型、基礎、小型)進行微調,並進行一對一的評估,使用八個標準化的開源眼科資料集,以及 Moorfields AlzEye 和 UK Biobank 資料集。DINOv2 大型模型在糖尿病視網膜病變偵測方面優於 RETFound(三個資料集的 AUROC=0.850-0.952,相較於 0.823-0.944,所有 P<=0.007)和多類眼部疾病(AUROC=0.892,相較於 0.846,P<0.001)。在青光眼方面,DINOv2 基礎模型優於 RETFound(AUROC=0.958,相較於 0.940,P<0.001)。相反地,RETFound 在預測心臟衰竭、心肌梗塞和缺血性中風方面優於所有 DINOv2 模型(AUROC=0.732-0.796,相較於 0.663-0.771,所有 P<0.001)。即使使用 10% 的微調資料,這些趨勢仍然持續。這些發現展示了通用和領域專用 FM 各自擅長的場景,突顯了根據任務特定需求調整 FM 選擇,以最佳化臨床表現的重要性。 -##### **A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)** -2502.03450v1 by Yiye Chen, Harpreet Sawhney, Nicholas Gydé, Yanan Jian, Jack Saunders, Patricio Vela, Ben Lundell +##### **Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning** +2502.06134v1 by Liuqing Chen, Shuhong Xiao, Shixian Ding, Shanhai Hu, Lingyun Sun -Scene graphs have emerged as a structured and serializable environment -representation for grounded spatial reasoning with Large Language Models -(LLMs). In this work, we propose SG-RwR, a Schema-Guided Retrieve-while-Reason -framework for reasoning and planning with scene graphs. 
Our approach employs -two cooperative, code-writing LLM agents: a (1) Reasoner for task planning and -information queries generation, and a (2) Retriever for extracting -corresponding graph information following the queries. Two agents collaborate -iteratively, enabling sequential reasoning and adaptive attention to graph -information. Unlike prior works, both agents are prompted only with the scene -graph schema rather than the full graph data, which reduces the hallucination -by limiting input tokens, and drives the Reasoner to generate reasoning trace -abstractly.Following the trace, the Retriever programmatically query the scene -graph data based on the schema understanding, allowing dynamic and global -attention on the graph that enhances alignment between reasoning and retrieval. -Through experiments in multiple simulation environments, we show that our -framework surpasses existing LLM-based approaches in numerical Q\&A and -planning tasks, and can benefit from task-level few-shot examples, even in the -absence of agent-level demonstrations. Project code will be released. +Medical time series are often irregular and face significant missingness, +posing challenges for data analysis and clinical decision-making. Existing +methods typically adopt a single modeling perspective, either treating series +data as sequences or transforming them into image representations for further +classification. In this paper, we propose a joint learning framework that +incorporates both sequence and image representations. We also design three +self-supervised learning strategies to facilitate the fusion of sequence and +image representations, capturing a more generalizable joint representation. The +results indicate that our approach outperforms seven other state-of-the-art +models in three representative real-world clinical datasets. 
We further +validate our approach by simulating two major types of real-world missingness +through leave-sensors-out and leave-samples-out techniques. The results +demonstrate that our approach is more robust and significantly surpasses other +baselines in terms of classification performance. -摘要:場景圖表已成為大型語言模型 (LLM) 以基礎空間推理為基礎的結構化且可序列化的環境表徵。在這項工作中,我們提出 SG-RwR,一個以綱要為導向的檢索與推理框架,用於場景圖表的推理和規劃。我們的做法採用了兩個協作的、編寫程式碼的 LLM 代理:一個 (1) 推論器,用於任務規劃和資訊查詢產生,以及一個 (2) 檢索器,用於根據查詢提取對應的圖形資訊。兩個代理反覆合作,實現對圖形資訊的順序推理和適應性關注。與先前的作品不同,兩個代理僅提示場景圖表綱要,而不是完整的圖形資料,這透過限制輸入代碼減少了幻覺,並驅使推論器抽象地產生推理軌跡。根據軌跡,檢索器根據綱要理解以程式化方式查詢場景圖形資料,允許對圖形進行動態和整體關注,增強推理和檢索之間的一致性。透過在多個模擬環境中的實驗,我們表明我們的框架在數值問答和規劃任務中超越了現有的基於 LLM 的方法,並且可以受益於任務級別的少次範例,即使在沒有代理級別示範的情況下也是如此。專案程式碼將會釋出。 +摘要:醫療時間序列通常不規則且會面臨顯著的缺失,對資料分析和臨床決策制定構成挑戰。現有方法通常採用單一建模觀點,將序列資料視為序列或將其轉換為影像表示以進行進一步分類。在本文中,我們提出了一個聯合學習架構,結合序列和影像表示。我們還設計了三種自我監督學習策略,以促進序列和影像表示的融合,捕捉更具概括性的聯合表示。結果表明,我們的做法在三個具有代表性的真實世界臨床資料集中優於其他七個最先進的模型。我們進一步通過留出感測器和留出樣本的技術模擬兩種主要的真實世界缺失類型來驗證我們的做法。結果表明,我們的做法更強大,並且在分類效能方面顯著優於其他基準。 -##### **SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs** -2502.03283v2 by Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin +##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation** +2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek -Recent advancements have highlighted that Large Language Models (LLMs) are -prone to hallucinations when solving complex reasoning problems, leading to -erroneous results. To tackle this issue, researchers incorporate Knowledge -Graphs (KGs) to improve the reasoning ability of LLMs. 
However, existing -methods face two limitations: 1) they typically assume that all answers to the -questions are contained in KGs, neglecting the incompleteness issue of KGs, and -2) they treat the KG as a static repository and overlook the implicit logical -reasoning structures inherent in KGs. In this paper, we introduce SymAgent, an -innovative neural-symbolic agent framework that achieves collaborative -augmentation between KGs and LLMs. We conceptualize KGs as dynamic environments -and transform complex reasoning tasks into a multi-step interactive process, -enabling KGs to participate deeply in the reasoning process. SymAgent consists -of two modules: Agent-Planner and Agent-Executor. The Agent-Planner leverages -LLM's inductive reasoning capability to extract symbolic rules from KGs, -guiding efficient question decomposition. The Agent-Executor autonomously -invokes predefined action tools to integrate information from KGs and external -documents, addressing the issues of KG incompleteness. Furthermore, we design a -self-learning framework comprising online exploration and offline iterative -policy updating phases, enabling the agent to automatically synthesize -reasoning trajectories and improve performance. Experimental results -demonstrate that SymAgent with weak LLM backbones (i.e., 7B series) yields -better or comparable performance compared to various strong baselines. Further -analysis reveals that our agent can identify missing triples, facilitating -automatic KG updates. +We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS), +an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS +predicts future PHTs using transformer-based architectures. The Adaptive Risk +Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk +probabilities for clinician-defined critical events. 
ARES incorporates a +personalized explainability module that identifies key clinical factors +influencing risk estimates for individual patients. ARES was evaluated on the +MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its +performance against traditional early warning systems and machine learning +models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs, +with 60% including hospital admissions. The dataset contained over 357 million +tokens. ETHOS outperformed benchmark models in predicting hospital admissions, +ICU admissions, and prolonged hospital stays, achieving superior AUC scores. +ETHOS-based risk estimates demonstrated robustness across demographic subgroups +with strong model reliability, confirmed via calibration curves. The +personalized explainability module provides insights into patient-specific +factors contributing to risk. ARES, powered by ETHOS, advances predictive +healthcare AI by providing dynamic, real-time, and personalized risk estimation +with patient-specific explainability to enhance clinician trust. Its +adaptability and superior accuracy position it as a transformative tool for +clinical decision-making, potentially improving patient outcomes and resource +allocation in emergency and inpatient settings. We release the full code at +github.com/ipolharvard/ethos-ares to facilitate future research. 
-摘要:最近的進展強調出,大型語言模型 (LLM) 在解決複雜推理問題時容易出現幻覺,導致錯誤的結果。為了解決這個問題,研究人員結合知識圖譜 (KG) 來改善 LLM 的推理能力。然而,現有方法面臨兩個限制:1) 它們通常假設問題的所有答案都包含在 KG 中,忽略了 KG 的不完整性問題,以及 2) 它們將 KG 視為一個靜態儲存庫,而忽略了 KG 中固有的隱式邏輯推理結構。在本文中,我們介紹了 SymAgent,一個創新的神經符號代理架構,它在 KG 和 LLM 之間實現了協作擴充。我們將 KG 概念化為動態環境,並將複雜的推理任務轉化為一個多步驟的互動過程,使 KG 能夠深入參與推理過程。SymAgent 包含兩個模組:代理規劃器和代理執行器。代理規劃器利用 LLM 的歸納推理能力從 KG 中提取符號規則,指導有效的問題分解。代理執行器自主地調用預定義的動作工具來整合來自 KG 和外部文件的資訊,解決 KG 不完整性的問題。此外,我們設計了一個自學習框架,包括線上探索和離線反覆的政策更新階段,使代理能夠自動合成推理軌跡並改善效能。實驗結果表明,具有弱 LLM 主幹的 SymAgent(例如,7B 系列)與各種強大的基線相比,產生了更好或相當的效能。進一步的分析表明,我們的代理可以識別遺失的三元組,促進自動 KG 更新。 +摘要:我們開發了增強型健康結果模擬轉換器 (ETHOS), +一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS +使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組,可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估,並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT,其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型,並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性,並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估,以及患者特定的可解釋性來增強臨床醫生的信任,從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具,有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼,以利未來的研究。 -##### **Analyze Feature Flow to Enhance Interpretation and Steering in Language Models** -2502.03032v2 by Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov +##### **Can ChatGPT Diagnose Alzheimer's Disease?** +2502.06907v1 by Quoc-Toan Nguyen, Linh Le, Xuan-The Tran, Thomas Do, Chin-Teng Lin -We introduce a new approach to systematically map features discovered by -sparse autoencoder across consecutive layers of large language models, -extending earlier work that examined inter-layer feature links. By using a -data-free cosine similarity technique, we trace how specific features persist, -transform, or first appear at each stage. 
This method yields granular flow -graphs of feature evolution, enabling fine-grained interpretability and -mechanistic insights into model computations. Crucially, we demonstrate how -these cross-layer feature maps facilitate direct steering of model behavior by -amplifying or suppressing chosen features, achieving targeted thematic control -in text generation. Together, our findings highlight the utility of a causal, -cross-layer interpretability framework that not only clarifies how features -develop through forward passes but also provides new means for transparent -manipulation of large language models. +Can ChatGPT diagnose Alzheimer's Disease (AD)? AD is a devastating +neurodegenerative condition that affects approximately 1 in 9 individuals aged +65 and older, profoundly impairing memory and cognitive function. This paper +utilises 9300 electronic health records (EHRs) with data from Magnetic +Resonance Imaging (MRI) and cognitive tests to address an intriguing question: +As a general-purpose task solver, can ChatGPT accurately detect AD using EHRs? +We present an in-depth evaluation of ChatGPT using a black-box approach with +zero-shot and multi-shot methods. This study unlocks ChatGPT's capability to +analyse MRI and cognitive test results, as well as its potential as a +diagnostic tool for AD. By automating aspects of the diagnostic process, this +research opens a transformative approach for the healthcare system, +particularly in addressing disparities in resource-limited regions where AD +specialists are scarce. Hence, it offers a foundation for a promising method +for early detection, supporting individuals with timely interventions, which is +paramount for Quality of Life (QoL). 
-摘要:我們提出了一種新方法,用於系統性地繪製大型語言模型連續層中稀疏自動編碼器發現的功能,擴展了先前研究層間特徵連結的工作。透過使用無資料餘弦相似性技術,我們追蹤特定特徵在每個階段如何持續、轉換或首次出現。此方法產生了特徵演化的細粒度流程圖,實現了細粒度的可解釋性和對模型運算的機制見解。至關重要的是,我們展示了這些跨層特徵圖如何透過放大或抑制所選特徵來促進模型行為的直接引導,在文字生成中實現目標主題控制。我們的研究結果共同突出了因果、跨層可解釋性框架的效用,不僅闡明了特徵如何透過前向傳遞發展,還提供了新的方法來透明地操作大型語言模型。 +摘要:ChatGPT 能否診斷出阿茲海默症 (AD)?AD 是一種毀滅性的神經退化性疾病,影響約 1/9 的 65 歲及以上人士,嚴重損害記憶力和認知功能。這篇論文利用了 9300 份電子健康紀錄 (EHR),其中包含磁共振成像 (MRI) 和認知測試的數據,來解決一個有趣的問題:作為一個通用任務解決器,ChatGPT 能否使用 EHR 準確地檢測出 AD?我們使用黑盒方法對 ChatGPT 進行了深入評估,採用零次嘗試和多次嘗試的方法。這項研究揭示了 ChatGPT 分析 MRI 和認知測試結果的能力,以及其作為 AD 診斷工具的潛力。通過自動化診斷過程的各個方面,這項研究為醫療保健系統開啟了一種變革性的方法,特別是在解決資源有限的地區中 AD 專家稀缺的不平等問題方面。因此,它為一種有希望的早期檢測方法奠定了基礎,通過及時干預來支持個人,這對於生活品質 (QoL) 至關重要。 -##### **A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs** -2502.02896v1 by Bradley P. Allen, Paul T. Groth +##### **Protecting Intellectual Property of EEG-based Neural Networks with Watermarking** +2502.05931v1 by Ahmed Abdelaziz, Ahmed Fathi, Ahmed Fares -Evaluating large language models (LLMs) for tasks like fact extraction in -support of knowledge graph construction frequently involves computing accuracy -metrics using a ground truth benchmark based on a knowledge graph (KG). These -evaluations assume that errors represent factual disagreements. However, human -discourse frequently features metalinguistic disagreement, where agents differ -not on facts but on the meaning of the language used to express them. Given the -complexity of natural language processing and generation using LLMs, we ask: do -metalinguistic disagreements occur between LLMs and KGs? Based on an -investigation using the T-REx knowledge alignment dataset, we hypothesize that -metalinguistic disagreement does in fact occur between LLMs and KGs, with -potential relevance for the practice of knowledge graph engineering. We propose -a benchmark for evaluating the detection of factual and metalinguistic -disagreements between LLMs and KGs. 
An initial proof of concept of such a -benchmark is available on Github. +EEG-based neural networks, pivotal in medical diagnosis and brain-computer +interfaces, face significant intellectual property (IP) risks due to their +reliance on sensitive neurophysiological data and resource-intensive +development. Current watermarking methods, particularly those using abstract +trigger sets, lack robust authentication and fail to address the unique +challenges of EEG models. This paper introduces a cryptographic wonder +filter-based watermarking framework tailored for EEG-based neural networks. +Leveraging collision-resistant hashing and public-key encryption, the wonder +filter embeds the watermark during training, ensuring minimal distortion ($\leq +5\%$ drop in EEG task accuracy) and high reliability (100\% watermark +detection). The framework is rigorously evaluated against adversarial attacks, +including fine-tuning, transfer learning, and neuron pruning. Results +demonstrate persistent watermark retention, with classification accuracy for +watermarked states remaining above 90\% even after aggressive pruning, while +primary task performance degrades faster, deterring removal attempts. Piracy +resistance is validated by the inability to embed secondary watermarks without +severe accuracy loss ( $>10\%$ in EEGNet and CCNN models). Cryptographic +hashing ensures authentication, reducing brute-force attack success +probabilities. Evaluated on the DEAP dataset across models (CCNN, EEGNet, +TSception), the method achieves $>99.4\%$ null-embedding accuracy, effectively +eliminating false positives. By integrating wonder filters with EEG-specific +adaptations, this work bridges a critical gap in IP protection for +neurophysiological models, offering a secure, tamper-proof solution for +healthcare and biometric applications. 
The framework's robustness against +adversarial modifications underscores its potential to safeguard sensitive EEG +models while maintaining diagnostic utility. -摘要:評估大型語言模型 (LLM) 執行知識圖譜建構支援事實萃取等任務時,通常會使用基於知識圖譜 (KG) 的基準事實計算準確度指標。這些評估假設錯誤代表事實上的分歧。然而,人類話語經常出現元語言分歧,其中代理人之間的差異不在於事實,而在於用於表達事實的語言的含義。鑑於使用 LLM 處理和產生自然語言的複雜性,我們提出疑問:LLM 和 KG 之間是否會發生元語言分歧?根據使用 T-REx 知識比對資料集進行的調查,我們假設元語言分歧確實會發生在 LLM 和 KG 之間,並可能與知識圖譜工程實務有關。我們提出一個基準,用於評估 LLM 和 KG 之間的事實和元語言分歧的偵測。此基準的初步概念驗證可在 Github 上取得。 +摘要:基於 EEG 的神經網路在醫學診斷和腦電腦介面中至關重要,由於其依賴敏感的神經生理資料和資源密集型的開發,面臨重大的智慧財產權 (IP) 風險。目前的浮水印方法,特別是那些使用抽象觸發集的方法,缺乏強健的驗證,且無法解決 EEG 模型的獨特挑戰。本文介紹了一個專為基於 EEG 的神經網路量身打造的密碼學 wonder 濾波器浮水印架構。利用抗碰撞雜湊和公開金鑰加密,wonder 濾波器在訓練期間嵌入浮水印,確保最小的失真(EEG 任務準確度下降 $\leq 5\%$)和高可靠性(100% 浮水印檢測)。該架構針對對抗性攻擊進行了嚴格的評估,包括微調、遷移學習和神經元剪枝。結果證明了持續的浮水印保留,即使在激進的剪枝後,浮水印狀態的分類準確度仍保持在 90% 以上,而主要任務的性能下降得更快,阻止了移除嘗試。盜版抵抗力通過無法嵌入次要浮水印而得到驗證,而不會造成嚴重的準確度損失(在 EEGNet 和 CCNN 模型中 $>10\%$)。密碼學雜湊確保驗證,降低了暴力攻擊成功機率。在 DEAP 資料集上針對模型(CCNN、EEGNet、TSception)進行評估,該方法達到了 $>99.4\%$ 的空嵌入準確度,有效地消除了假陽性。透過將 wonder 濾波器與 EEG 特定的適應相整合,這項工作彌補了神經生理模型 IP 保護中的關鍵差距,為醫療保健和生物特徵應用提供了一個安全、防篡改的解決方案。該架構對抗敵對修改的強健性突顯了其在維護診斷效用的同時保護敏感 EEG 模型的潛力。 -##### **Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization** -2502.02810v1 by Chanhui Lee, Yuheon Song, YongJun Jeong, Hanbum Ko, Rodrigo Hormazabal, Sehui Han, Kyunghoon Bae, Sungbin Lim, Sungwoong Kim +##### **Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models** +2502.05879v1 by Shiyu Teng, Jiaqing Liu, Rahul Kumar Jain, Shurong Chai, Ruibo Hou, Tomoko Tateyama, Lanfen Lin, Yen-wei Chen -Recent advances in Large Language Models (LLMs) have motivated the -development of general LLMs for molecular tasks. While several studies have -demonstrated that fine-tuned LLMs can achieve impressive benchmark -performances, they are far from genuine generalist molecular LLMs due to a lack -of fundamental understanding of molecular structure. 
Specifically, when given -molecular task instructions, LLMs trained with naive next-token prediction -training assign similar likelihood scores to both original and negatively -corrupted molecules, revealing their lack of molecular structure understanding -that is crucial for reliable and general molecular LLMs. To overcome this -limitation and obtain a true generalist molecular LLM, we introduce a novel -multi-modal training method based on a thorough multi-modal instruction tuning -as well as a molecular structure preference optimization between chosen and -rejected graphs. On various molecular benchmarks, the proposed generalist -molecular LLM, called Mol-LLM, achieves state-of-the-art performances among -generalist LLMs on most tasks, at the same time, surpassing or comparable to -state-of-the-art specialist LLMs. Moreover, Mol-LLM also shows superior -generalization performances in reaction prediction tasks, demonstrating the -effect of the molecular structure understanding for generalization perspective. +Depression is one of the leading causes of disability worldwide, posing a +severe burden on individuals, healthcare systems, and society at large. Recent +advancements in Large Language Models (LLMs) have shown promise in addressing +mental health challenges, including the detection of depression through +text-based analysis. However, current LLM-based methods often struggle with +nuanced symptom identification and lack a transparent, step-by-step reasoning +process, making it difficult to accurately classify and explain mental health +conditions. To address these challenges, we propose a Chain-of-Thought +Prompting approach that enhances both the performance and interpretability of +LLM-based depression detection. Our method breaks down the detection process +into four stages: (1) sentiment analysis, (2) binary depression classification, +(3) identification of underlying causes, and (4) assessment of severity. 
By +guiding the model through these structured reasoning steps, we improve +interpretability and reduce the risk of overlooking subtle clinical indicators. +We validate our method on the E-DAIC dataset, where we test multiple +state-of-the-art large language models. Experimental results indicate that our +Chain-of-Thought Prompting technique yields superior performance in both +classification accuracy and the granularity of diagnostic insights, compared to +baseline approaches. -摘要:大型語言模型 (LLM) 的近期進展激勵了針對分子任務開發通用 LLM。雖然多項研究已證明微調 LLM 可實現令人印象深刻的基準效能,但由於缺乏對分子結構的基本理解,它們遠非真正的通才分子 LLM。具體來說,當給予分子任務說明時,使用天真的下一個符號預測訓練訓練的 LLM 會將類似的可能性評分分配給原始分子和負面損壞分子,這顯示出它們缺乏對分子結構的理解,而這對於可靠且通用的分子 LLM 至關重要。為了克服這個限制並獲得真正的通才分子 LLM,我們引入了一種新穎的多模態訓練方法,該方法基於徹底的多模態說明調整以及在所選和拒絕圖形之間的分子結構偏好最佳化。在各種分子基準測試中,所提出的通才分子 LLM(稱為 Mol-LLM)在多數任務中實現了通才 LLM 中的最新效能,同時超越或與最新的專家 LLM 相當。此外,Mol-LLM 在反應預測任務中也展現出優異的泛化效能,證明了分子結構理解對泛化觀點的影響。 +摘要:憂鬱症是全球殘障的主要原因之一,對個人、醫療保健系統和整個社會造成嚴重負擔。大型語言模型 (LLM) 的最新進展已展現出解決心理健康挑戰的希望,包括透過基於文字的分析來偵測憂鬱症。然而,現有的基於 LLM 的方法通常難以辨識細微的症狀,而且缺乏透明且逐步的推理過程,這使得準確分類和解釋心理健康狀況變得困難。為了應對這些挑戰,我們提出了一種思考鏈提示方法,它增強了基於 LLM 的憂鬱症偵測的效能和可解釋性。我們的這項方法將偵測過程分解為四個階段:(1) 情緒分析,(2) 二元憂鬱症分類,(3) 找出潛在原因,以及 (4) 評估嚴重程度。透過引導模型完成這些結構化的推理步驟,我們提升了可解釋性,並降低了忽略細微臨床指標的風險。我們在 E-DAIC 資料集上驗證了我們的這項方法,並在其中測試了多種最先進的大型語言模型。實驗結果顯示,與基線方法相比,我們的思考鏈提示技術在分類準確度和診斷見解的精細度方面都表現出優異的效能。 -##### **Leveraging the true depth of LLMs** -2502.02790v1 by Ramón Calvo González, Daniele Paliotta, Matteo Pagliardini, Martin Jaggi, François Fleuret +##### **LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison** +2502.06890v1 by Gabriele De Vito, Filomena Ferrucci, Athanasios Angelakis -Large Language Models demonstrate remarkable capabilities at the cost of high -compute requirements. While recent research has shown that intermediate layers -can be removed or have their order shuffled without impacting performance -significantly, these findings have not been employed to reduce the -computational cost of inference. 
We investigate several potential ways to -reduce the depth of pre-trained LLMs without significantly affecting -performance. Leveraging our insights, we present a novel approach that exploits -this decoupling between layers by grouping some of them into pairs that can be -evaluated in parallel. - This modification of the computational graph -- through better parallelism -- -results in an average improvement of around 1.20x on the number of tokens -generated per second, without re-training nor fine-tuning, while retaining -95%-99% of the original accuracy. Empirical evaluation demonstrates that this -approach significantly improves serving efficiency while maintaining model -performance, offering a practical improvement for large-scale LLM deployment. +The increasing volume of drug combinations in modern therapeutic regimens +needs reliable methods for predicting drug-drug interactions (DDIs). While +Large Language Models (LLMs) have revolutionized various domains, their +potential in pharmaceutical research, particularly in DDI prediction, remains +largely unexplored. This study thoroughly investigates LLMs' capabilities in +predicting DDIs by uniquely processing molecular structures (SMILES), target +organisms, and gene interaction data as raw text input from the latest DrugBank +dataset. We evaluated 18 different LLMs, including proprietary models (GPT-4, +Claude, Gemini) and open-source variants (from 1.5B to 72B parameters), first +assessing their zero-shot capabilities in DDI prediction. We then fine-tuned +selected models (GPT-4, Phi-3.5 2.7B, Qwen-2.5 3B, Gemma-2 9B, and Deepseek R1 +distilled Qwen 1.5B) to optimize their performance. Our comprehensive +evaluation framework included validation across 13 external DDI datasets, +comparing against traditional approaches such as l2-regularized logistic +regression. 
Fine-tuned LLMs demonstrated superior performance, with Phi-3.5 +2.7B achieving a sensitivity of 0.978 in DDI prediction, with an accuracy of +0.919 on balanced datasets (50% positive, 50% negative cases). This result +represents an improvement over both zero-shot predictions and state-of-the-art +machine-learning methods used for DDI prediction. Our analysis reveals that +LLMs can effectively capture complex molecular interaction patterns and cases +where drug pairs target common genes, making them valuable tools for practical +applications in pharmaceutical research and clinical settings. -摘要:大型语言模型展示了其强大的功能,但代价是较高的计算需求。虽然最近的研究表明,中间层可以被移除或重新排列其顺序,而不会显著影响性能,但这些发现尚未被用来降低推理的计算成本。我们研究了几种潜在的方法来减少预训练 LLM 的深度,而不会显著影响性能。利用我们的见解,我们提出了一种新颖的方法,该方法通过将其中一些分组为可以并行评估的成对来利用层之间的这种解耦。 -通过更好的并行性对计算图进行修改,平均而言,每秒生成的令牌数量提高了约 1.20 倍,而无需重新训练或微调,同时保留了 95%-99% 的原始准确性。经验评估表明,这种方法显著提高了服务效率,同时保持了模型性能,为大规模 LLM 部署提供了实际改进。 +摘要:現代治療方案中藥物組合的數量越來越多,需要可靠的方法來預測藥物間交互作用 (DDI)。儘管大型語言模型 (LLM) 已在各個領域掀起革命,它們在藥物研究中的潛力,特別是在 DDI 預測中的潛力,仍未得到充分探索。本研究通過獨特地處理分子結構 (SMILES)、目標生物和基因交互資料作為來自最新 DrugBank 資料集的原始文字輸入,徹底調查了 LLM 在預測 DDI 中的能力。我們評估了 18 種不同的 LLM,包括專有模型(GPT-4、Claude、Gemini)和開源變體(從 1.5B 到 72B 參數),首先評估它們在 DDI 預測中的零次學習能力。然後,我們微調選定的模型(GPT-4、Phi-3.5 2.7B、Qwen-2.5 3B、Gemma-2 9B 和 Deepseek R1 蒸餾 Qwen 1.5B)以最佳化其效能。我們的全面評估框架包括跨 13 個外部 DDI 資料集進行驗證,並與傳統方法(例如 l2 正則化邏輯迴歸)進行比較。微調後的 LLM 表現出優異的效能,其中 Phi-3.5 2.7B 在 DDI 預測中達到 0.978 的靈敏度,在平衡資料集(50% 正例,50% 反例)上的準確度為 0.919。此結果優於零次學習預測和用於 DDI 預測的最新機器學習方法。我們的分析表明,LLM 可以有效捕捉複雜的分子交互模式和藥物對靶向共同基因的情況,使其成為藥物研究和臨床環境中實用應用的寶貴工具。 -##### **Modular Training of Neural Networks aids Interpretability** -2502.02470v2 by Satvik Golechha, Maheep Chaudhary, Joan Velja, Alessandro Abate, Nandi Schoots +##### **Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)** +2502.07815v1 by Lokesh Koli, Shubham Kalra, Karanpreet Singh -An approach to improve neural network interpretability is via clusterability, -i.e., splitting a model into disjoint 
clusters that can be studied -independently. We define a measure for clusterability and show that pre-trained -models form highly enmeshed clusters via spectral graph clustering. We thus -train models to be more modular using a "clusterability loss" function that -encourages the formation of non-interacting clusters. Using automated -interpretability techniques, we show that our method can help train models that -are more modular and learn different, disjoint, and smaller circuits. We -investigate CNNs trained on MNIST and CIFAR, small transformers trained on -modular addition, and language models. Our approach provides a promising -direction for training neural networks that learn simpler functions and are -easier to interpret. +Detecting sensitive data such as Personally Identifiable Information (PII) +and Protected Health Information (PHI) is critical for data security platforms. +This study evaluates regex-based pattern matching algorithms and exact-match +search techniques to optimize detection speed, accuracy, and scalability. Our +benchmarking results indicate that Google RE2 provides the best balance of +speed (10-15 ms/MB), memory efficiency (8-16 MB), and accuracy (99.5%) among +regex engines, outperforming PCRE while maintaining broader hardware +compatibility than Hyperscan. For exact matching, Aho-Corasick demonstrated +superior performance (8 ms/MB) and scalability for large datasets. Performance +analysis revealed that regex processing time scales linearly with dataset size +and pattern complexity. A hybrid AI + Regex approach achieved the highest F1 +score (91. 6%) by improving recall and minimizing false positives. Device +benchmarking confirmed that our solution maintains efficient CPU and memory +usage on both high-performance and mid-range systems. Despite its +effectiveness, challenges remain, such as limited multilingual support and the +need for regular pattern updates. 
Future work should focus on expanding +language coverage, integrating data security and privacy management (DSPM) with +data loss prevention (DLP) tools, and enhancing regulatory compliance for +broader global adoption. -摘要:一種改善神經網路可解釋性的方法是透過群集性, -也就是將模型分割成可獨立研究的不相交群集。我們定義一個群集性的度量,並顯示預訓練的 -模型透過光譜圖形群集形成高度糾纏的群集。因此,我們使用「群集性損失」函數訓練模型,使其更具模組化, -這鼓勵形成非交互群集。使用自動化可解釋性技術,我們顯示我們的模型可以幫助訓練更具模組化的模型,並學習不同、不相交且較小的電路。我們 -研究了在 MNIST 和 CIFAR 上訓練的 CNN,在模組化加法上訓練的小型Transformer,以及語言模型。我們的做法為訓練學習更簡單函數且更容易解釋的神經網路提供了有希望的方向。 +摘要:偵測個人身分資訊 (PII) 和受保護健康資訊 (PHI) 等敏感資料,對於資料安全平台至關重要。本研究評估基於 regex 的模式配對演算法和精確配對搜尋技術,以最佳化偵測速度、準確度和可擴充性。我們的基準測試結果顯示,在 regex 引擎中,Google RE2 在速度 (10-15 ms/MB)、記憶體效率 (8-16 MB) 和準確度 (99.5%) 方面取得最佳平衡,優於 PCRE,同時比 Hyperscan 擁有更廣泛的硬體相容性。對於精確配對,Aho-Corasick 展現出優異的效能 (8 ms/MB) 和大資料集的可擴充性。效能分析顯示,regex 處理時間會隨著資料集大小和模式複雜度線性擴充。混合 AI + Regex 方法透過提升召回率和將假陽性降至最低,達到了最高的 F1 分數 (91. 6%)。裝置基準測試確認我們的解決方案在高性能和中階系統上都能維持高效的 CPU 和記憶體使用率。儘管有效,但仍有挑戰存在,例如多語言支援有限,以及需要定期更新模式。未來的研究應著重於擴展語言涵蓋範圍,將資料安全和隱私管理 (DSPM) 與資料遺失防護 (DLP) 工具整合,以及加強法規遵循以利更廣泛的全球採用。 -##### **Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs** -2502.02362v3 by Sagnik Mukherjee, Abhinav Chinta, Takyoung Kim, Tarun Anoop Sharma, Dilek Hakkani-Tür +##### **WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch** +2502.05783v1 by Ying Lei, Yancheng Cao, Will Wang, Yuanzhe Dong, Changchang Yin, Weidan Cao, Ping Zhang, Jingzhen Yang, Bingsheng Yao, Yifan Peng, Chunhua Weng, Randy Auerbach, Lena Mamykina, Dakuo Wang, Yuntao Wang, Xuhai Xu -Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large -language models (LLMs) by enabling detailed step-by-step solutions. However, -due to the verbosity of LLMs, the resulting reasoning chains can be long, -making it harder to verify the reasoning steps and trace issues resulting from -dependencies between the steps that may be farther away in the sequence of -steps. 
Importantly, mathematical reasoning allows each step to be derived from -a small set of premises, which are a subset of the preceding steps in the -reasoning chain. In this paper, we present a framework that identifies the -premises for each step, to improve the evaluation of reasoning. We restructure -conventional linear reasoning chains into Premise Augmented Reasoning Chains -(PARC) by introducing premise links, resulting in a directed acyclic graph -where the nodes are the steps and the edges are the premise links. Through -experiments with a PARC-based dataset that we built, namely PERL (Premises and -ERrors identification in LLMs), we demonstrate that LLMs can reliably identify -premises within complex reasoning chains. In particular, even open-source LLMs -achieve 90% recall in premise identification. We also show that PARC helps to -identify errors in reasoning chains more reliably. The accuracy of error -identification improves by 6% to 16% absolute when step-by-step verification is -carried out in PARC under the premises. Our findings highlight the utility of -premise-centric representations in addressing complex problem-solving tasks and -open new avenues for improving the reliability of LLM-based reasoning -evaluations. +While just-in-time interventions (JITIs) have effectively targeted common +health behaviors, individuals often have unique needs to intervene in personal +undesirable actions that can negatively affect physical, mental, and social +well-being. We present WatchGuardian, a smartwatch-based JITI system that +empowers users to define custom interventions for these personal actions with a +small number of samples. For the model to detect new actions based on limited +new data samples, we developed a few-shot learning pipeline that finetuned a +pre-trained inertial measurement unit (IMU) model on public hand-gesture +datasets. 
We then designed a data augmentation and synthesis process to train +additional classification layers for customization. Our offline evaluation with +26 participants showed that with three, five, and ten examples, our approach +achieved an average accuracy of 76.8%, 84.7%, and 87.7%, and an F1 score of +74.8%, 84.2%, and 87.2% We then conducted a four-hour intervention study to +compare WatchGuardian against a rule-based intervention. Our results +demonstrated that our system led to a significant reduction by 64.0 +- 22.6% in +undesirable actions, substantially outperforming the baseline by 29.0%. Our +findings underscore the effectiveness of a customizable, AI-driven JITI system +for individuals in need of behavioral intervention in personal undesirable +actions. We envision that our work can inspire broader applications of +user-defined personalized intervention with advanced AI solutions. -摘要:思考鏈(CoT)提示透過提供詳細的逐步解法,增強大型語言模型(LLM)的數學推理能力。然而,由於 LLM 的冗長,產生的推理鏈可能很長,這使得驗證推理步驟和追蹤由步驟之間相依關係所產生的問題變得更加困難,而這些步驟可能在步驟順序中相距較遠。重要的是,數學推理允許每個步驟從一組小的前提中推導出來,這些前提是推理鏈中前一個步驟的子集。在本文中,我們提出了一個框架,用於識別每個步驟的前提,以改進推理評估。我們透過引入前提連結,將傳統的線性推理鏈重組為前提擴充推理鏈(PARC),產生一個有向無環圖,其中節點是步驟,而邊緣是前提連結。透過我們建立的基於 PARC 的資料集(即 PERL(LLM 中的前提和錯誤識別))進行的實驗,我們證明 LLM 能夠在複雜的推理鏈中可靠地識別前提。特別是,即使是開源 LLM 在前提識別中也能達到 90% 的召回率。我們還表明,PARC 有助於更可靠地識別推理鏈中的錯誤。在前提下於 PARC 中執行逐步驗證時,錯誤識別的準確度提高了 6% 到 16%。我們的研究結果突顯了以前提為中心的表示在解決複雜問題解決任務中的效用,並為改進基於 LLM 的推理評估的可靠性開闢了新途徑。 +摘要:雖然即時介入(JITIs)有效地針對常見的健康行為,但個人通常有獨特的需求來介入可能會對身心和社會福祉產生負面影響的個人不良行為。我們提出 WatchGuardian,這是一個基於智慧手錶的 JITI 系統,它使用少數樣本讓使用者能夠為這些個人行為定義自訂介入措施。為了讓模型根據有限的新資料樣本偵測新行為,我們開發了一個小樣本學習管道,微調了公共手勢資料集上的預訓練慣性測量單元(IMU)模型。然後,我們設計了一個資料擴充和合成流程,以訓練其他分類層以進行自訂。我們對 26 位參與者進行的離線評估顯示,我們的做法使用三個、五個和十個範例,達到了 76.8%、84.7% 和 87.7% 的平均準確度,以及 74.8%、84.2% 和 87.2% 的 F1 分數。然後,我們進行了一項為時四小時的介入研究,以將 WatchGuardian 與基於規則的介入進行比較。我們的結果表明,我們的系統導致不良行為顯著減少了 64.0 +- 22.6%,大幅優於基線 29.0%。我們的研究結果強調了可自訂、AI 驅動的 JITI 系統對需要行為介入以應對個人不良行為的個人的有效性。我們預計我們的研究可以激勵使用者定義個人化介入的更廣泛應用,並採用先進的 AI 解決方案。 -##### **AdaptBot: Combining LLM with 
Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement** -2502.02067v1 by Shivam Singh, Karthik Swaminathan, Nabanita Dash, Ramandeep Singh, Snehasis Banerjee, Mohan Sridharan, Madhava Krishna +##### **RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care** +2502.05740v1 by Ziqi Yang, Yuxuan Lu, Jennifer Bagdasarian, Vedant Das Swain, Ritu Agarwal, Collin Campbell, Waddah Al-Refaire, Jehan El-Bayoumi, Guodong Gao, Dakuo Wang, Bingsheng Yao, Nawar Shara -Embodied agents assisting humans are often asked to complete a new task in a -new scenario. An agent preparing a particular dish in the kitchen based on a -known recipe may be asked to prepare a new dish or to perform cleaning tasks in -the storeroom. There may not be sufficient resources, e.g., time or labeled -examples, to train the agent for these new situations. Large Language Models -(LLMs) trained on considerable knowledge across many domains are able to -predict a sequence of abstract actions for such new tasks and scenarios, -although it may not be possible for the agent to execute this action sequence -due to task-, agent-, or domain-specific constraints. Our framework addresses -these challenges by leveraging the generic predictions provided by LLM and the -prior domain-specific knowledge encoded in a Knowledge Graph (KG), enabling an -agent to quickly adapt to new tasks and scenarios. The robot also solicits and -uses human input as needed to refine its existing knowledge. Based on -experimental evaluation over cooking and cleaning tasks in simulation domains, -we demonstrate that the interplay between LLM, KG, and human input leads to -substantial performance gains compared with just using the LLM output. 
+Cancer surgery is a key treatment for gastrointestinal (GI) cancers, a group
+of cancers that account for more than 35% of cancer-related deaths worldwide,
+but postoperative complications are unpredictable and can be life-threatening.
+In this paper, we investigate how recent advancements in large language models
+(LLMs) can benefit remote patient monitoring (RPM) systems through clinical
+integration by designing RECOVER, an LLM-powered RPM system for postoperative
+GI cancer care. To closely engage stakeholders in the design process, we first
+conducted seven participatory design sessions with five clinical staff and
+interviewed five cancer patients to derive six major design strategies for
+integrating clinical guidelines and information needs into LLM-based RPM
+systems. We then designed and implemented RECOVER, which features an
+LLM-powered conversational agent for cancer patients and an interactive
+dashboard for clinical staff to enable efficient postoperative RPM. Finally, we
+used RECOVER as a pilot system to assess the implementation of our design
+strategies with four clinical staff and five patients, providing design
+implications by identifying crucial design elements, offering insights on
+responsible AI, and outlining opportunities for future LLM-powered RPM systems.

-摘要:具身代理协助人类时,通常需要在新的情境中完成新的任务。基于已知食谱在厨房准备特定菜肴的代理可能会被要求准备新菜肴或在储藏室执行清洁任务。可能没有足够资源(例如时间或标记的示例)来训练代理以应对这些新情况。在许多领域接受大量知识训练的大型语言模型 (LLM) 能够预测此类新任务和情境的抽象动作序列,尽管代理可能无法执行此动作序列,因为任务、代理或特定于域的约束。我们的框架通过利用 LLM 提供的通用预测和知识图 (KG) 中编码的先前特定于域的知识来应对这些挑战,使代理能够快速适应新任务和情境。该机器人还会根据需要征求并使用人类输入来完善其现有知识。基于在模拟域中对烹饪和清洁任务的实验评估,我们证明了 LLM、KG 和人类输入之间的相互作用与仅使用 LLM 输出相比带来了巨大的性能提升。
+摘要:癌症手術是胃腸道 (GI) 癌症的主要治療方式,這類癌症佔全球癌症相關死亡人數的 35% 以上,但術後併發症無法預測,且可能危及生命。在本文中,我們探討大型語言模型 (LLM) 的近期進展如何透過臨床整合造福遠端病患監控 (RPM) 系統,方法是設計 RECOVER,一個由 LLM 驅動的 RPM 系統,用於術後胃腸道癌症照護。為了讓利害關係人密切參與設計流程,我們首先與五位臨床人員進行七場參與式設計會議,並訪談五位癌症患者,以找出六項整合臨床指南和資訊需求至基於 LLM 的 RPM 系統的主要設計策略。接著,我們設計並實作 RECOVER,其特色在於一個由 LLM 驅動的對話式代理人,供癌症患者使用,以及一個互動式儀表板,供臨床人員使用,以進行有效的術後 RPM。最後,我們使用 RECOVER 作為試點系統,與四位臨床人員和五位患者評估我們設計策略的實作,並透過找出重要的設計元素、提供對負責任 AI 的見解,以及概述未來由 LLM 驅動的 RPM 系統的機會,提出設計意涵。

-##### **On Bob Dylan: A Computational Perspective**
-2502.01772v1 by Prashant Garg
+##### **4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis**
+2502.05713v1 by An Zhao, Moucheng Xu, Ahmed H. Shahin, Wim Wuyts, Mark G. Jones, Joseph Jacob, Daniel C. Alexander

-Cass Sunstein's essay 'On Bob Dylan' describes Dylan's 'dishabituating' style
--- a constant refusal to conform to expectation and a penchant for reinventing
-his musical and lyrical identity. In this paper, I extend Sunstein's
-observations through a large-scale computational analysis of Dylan's lyrics
-from 1962 to 2012. Using o3-mini-high (a large language model), I extract
-concept-to-concept relationships from the lyrics and construct directed
-knowledge graphs that capture Dylan's thematic structure. I then quantify
-shifts in sentiment, metaphorical expression, thematic diversity, and network
-complexity over time. The results indicate that Dylan's lyrics increasingly
-rely on metaphor, display an evolving sentiment profile, and exhibit heightened
-dishabituation -- measured here as a growing variance in the network centrality
-of key concepts. I also find that references to movement, protest, and mythic
-imagery fluctuate in ways that align with well-known phases of Dylan's career,
-reflecting the dynamic and unpredictable quality of his art. These findings not
-only deepen our empirical understanding of Sunstein's thesis but also introduce
-a novel computational method for analyzing an artist's evolution-offering
-broader applicability to the study of cultural and creative change.
+Understanding the progression trajectories of diseases is crucial for early
+diagnosis and effective treatment planning. This is especially vital for
+life-threatening conditions such as Idiopathic Pulmonary Fibrosis (IPF), a
+chronic, progressive lung disease with a prognosis comparable to many cancers.
+Computed tomography (CT) imaging has been established as a reliable diagnostic
+tool for IPF. Accurately predicting future CT scans of early-stage IPF patients
+can aid in developing better treatment strategies, thereby improving survival
+outcomes. In this paper, we propose 4D Vector Quantised Generative Adversarial
+Networks (4D-VQ-GAN), a model capable of generating realistic CT volumes of IPF
+patients at any time point. The model is trained using a two-stage approach. In
+the first stage, a 3D-VQ-GAN is trained to reconstruct CT volumes. In the
+second stage, a Neural Ordinary Differential Equation (ODE) based temporal
+model is trained to capture the temporal dynamics of the quantised embeddings
+generated by the encoder in the first stage. We evaluate different
+configurations of our model for generating longitudinal CT scans and compare
+the results against ground truth data, both quantitatively and qualitatively.
+For validation, we conduct survival analysis using imaging biomarkers derived
+from generated CT scans and achieve a C-index comparable to that of biomarkers
+derived from the real CT scans. The survival analysis results demonstrate the
+potential clinical utility inherent to generated longitudinal CT scans, showing
+that they can reliably predict survival outcomes.

-摘要:卡斯·桑斯坦的論文「論鮑伯·迪倫」描述了迪倫「去習慣化」的風格 --- 這種風格不斷拒絕符合預期,並熱衷於重新塑造他的音樂和歌詞認同。在本文中,我透過對迪倫 1962 年至 2012 年歌詞進行大規模的運算分析,來延伸桑斯坦的觀察。使用 o3-mini-high(一個大型語言模型),我從歌詞中提取概念對概念的關係,並建構有向知識圖,以捕捉迪倫的主題結構。然後,我量化情緒、隱喻表達、主題多樣性和網路複雜性隨時間的變化。結果顯示,迪倫的歌詞越來越依賴隱喻,展現出不斷演化的情緒輪廓,並表現出高度的去習慣化 -- 在這裡測量為關鍵概念的網路中心性的變異增加。我也發現,對運動、抗議和神話意象的引用,會以與迪倫職業生涯中眾所周知階段一致的方式波動,反映了他藝術的動態和不可預測的品質。這些發現不僅加深了我們對桑斯坦論文的經驗理解,也引入了分析藝術家演變的新穎運算方法,為文化和創造性變化的研究提供了更廣泛的適用性。
+摘要:了解疾病的進程軌跡對於早期診斷和有效的治療計畫至關重要。這對於特發性肺纖維化 (IPF) 等威脅生命的疾病尤其重要,IPF 是一種慢性、進行性肺部疾病,其預後與許多癌症相當。電腦斷層掃描 (CT) 影像已被確立為 IPF 的可靠診斷工具。準確預測早期 IPF 患者的未來 CT 掃描有助於制定更好的治療策略,從而改善存活結果。在本文中,我們提出 4D 向量量化生成對抗網路 (4D-VQ-GAN),這是一個模型,能夠在任何時間點生成 IPF 患者的逼真 CT 體積。該模型使用兩階段方法進行訓練。在第一階段,訓練 3D-VQ-GAN 以重建 CT 體積。在第二階段,訓練基於神經常微分方程 (ODE) 的時間模型,以捕捉第一階段編碼器生成的量化嵌入的時間動態。我們評估了我們的模型的不同配置,以生成縱向 CT 掃描,並在定量和定性方面將結果與真實數據進行比較。為了驗證,我們使用從生成的 CT 掃描中得出的影像生物標記進行存活分析,並獲得與從真實 CT 掃描中得出的生物標記相當的 C 指數。存活分析結果證明了生成縱向 CT 掃描固有的潛在臨床效用,表明它們可以可靠地預測存活結果。

-##### **VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos**
-2502.01549v1 by Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang
+##### **KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy**
+2502.05651v1 by Hyunjong Kim, Suyeon Lee, Yeongjae Cho, Eunseo Ryu, Yohan Jo, Suran Seong, Sungzoon Cho

-Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in
-enhancing Large Language Models (LLMs) through external knowledge integration,
-yet its application has primarily focused on textual content, leaving the rich
-domain of multi-modal video knowledge predominantly unexplored. This paper
-introduces VideoRAG, the first retrieval-augmented generation framework
-specifically designed for processing and understanding extremely long-context
-videos. Our core innovation lies in its dual-channel architecture that
-seamlessly integrates (i) graph-based textual knowledge grounding for capturing
-cross-video semantic relationships, and (ii) multi-modal context encoding for
-efficiently preserving visual features. This novel design empowers VideoRAG to
-process unlimited-length videos by constructing precise knowledge graphs that
-span multiple videos while maintaining semantic dependencies through
-specialized multi-modal retrieval paradigms. Through comprehensive empirical
-evaluation on our proposed LongerVideos benchmark-comprising over 160 videos
-totaling 134+ hours across lecture, documentary, and entertainment
-categories-VideoRAG demonstrates substantial performance compared to existing
-RAG alternatives and long video understanding methods. The source code of
-VideoRAG implementation and the benchmark dataset are openly available at:
-https://github.com/HKUDS/VideoRAG.
+The increasing demand for mental health services has led to the rise of
+AI-driven mental health chatbots, though challenges related to privacy, data
+collection, and expertise persist. Motivational Interviewing (MI) is gaining
+attention as a theoretical basis for boosting expertise in the development of
+these chatbots. However, existing datasets are showing limitations for training
+chatbots, leading to a substantial demand for publicly available resources in
+the field of MI and psychotherapy. These challenges are even more pronounced in
+non-English languages, where they receive less attention. In this paper, we
+propose a novel framework that simulates MI sessions enriched with the
+expertise of professional therapists. We train an MI forecaster model that
+mimics the behavioral choices of professional therapists and employ Large
+Language Models (LLMs) to generate utterances through prompt engineering. Then,
+we present KMI, the first synthetic dataset theoretically grounded in MI,
+containing 1,000 high-quality Korean Motivational Interviewing dialogues.
+Through an extensive expert evaluation of the generated dataset and the
+dialogue model trained on it, we demonstrate the quality, expertise, and
+practicality of KMI. We also introduce novel metrics derived from MI theory in
+order to evaluate dialogues from the perspective of MI.

-摘要:檢索增強生成 (RAG) 已證明在透過外部知識整合增強大型語言模型 (LLM) 方面取得顯著成功,但其應用主要集中在文字內容上,而豐富的多模態影片知識領域則鮮少被探索。本文介紹 VideoRAG,這是第一個檢索增強生成架構,專門設計用於處理和理解極長語境的影片。我們的核心創新在於其雙通道架構,它無縫整合 (i) 基於圖形文字知識基礎,用於擷取跨影片語義關係,以及 (ii) 多模態語境編碼,用於有效保留視覺特徵。這個新穎的設計讓 VideoRAG 能夠透過建構跨越多個影片的精確知識圖譜來處理長度不限的影片,同時透過專門的多模態檢索範例來維持語義依賴性。透過我們提出的 LongerVideos 基準的全面經驗評估,該基準包含超過 160 部影片,總時數超過 134 小時,涵蓋演講、紀錄片和娛樂類別,VideoRAG 與現有的 RAG 替代方案和長影片理解方法相比,展現出顯著的效能。VideoRAG 實作的原始碼和基準資料集已公開於:https://github.com/HKUDS/VideoRAG。
+摘要:由於對心理健康服務的需求日益增加,導致以人工智慧為基礎的心理健康聊天機器人興起,儘管與隱私、資料蒐集和專業知識相關的挑戰依然存在。動機性訪談 (MI) 正作為提升這些聊天機器人在開發方面專業知識的理論基礎而備受關注。然而,現有的資料集顯示出訓練聊天機器人的限制,導致對 MI 和心理治療領域中公開可用資源的需求大幅增加。這些挑戰在非英語語言中更加明顯,因為它們受到的關注較少。在本文中,我們提出了一個新穎的架構,它模擬了豐富專業治療師專業知識的 MI 課程。我們訓練了一個 MI 預測模型,它模擬了專業治療師的行為選擇,並採用大型語言模型 (LLM) 透過提示工程來產生話語。然後,我們展示了 KMI,這是第一個理論上以 MI 為基礎的合成資料集,其中包含 1,000 個高品質的韓語動機性訪談對話。透過對所產生的資料集和在該資料集上訓練的對話模型進行廣泛的專家評估,我們展示了 KMI 的品質、專業知識和實用性。我們還引入了從 MI 理論中衍生的新指標,以便從 MI 的角度評估對話。

-##### **Transformers trained on proteins can learn to attend to Euclidean distance**
-2502.01533v1 by Isaac Ellmen, Constantin Schneider, Matthew I. J. Raybould, Charlotte M. Deane
+##### **ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports**
+2502.05638v1 by Aynur Guluzade, Naguib Heiba, Zeyd Boukhers, Florim Hamiti, Jahid Hasan Polash, Yehya Mohamad, Carlos A Velasco

-While conventional Transformers generally operate on sequence data, they can
-be used in conjunction with structure models, typically SE(3)-invariant or
-equivariant graph neural networks (GNNs), for 3D applications such as protein
-structure modelling. These hybrids typically involve either (1)
-preprocessing/tokenizing structural features as input for Transformers or (2)
-taking Transformer embeddings and processing them within a structural
-representation. However, there is evidence that Transformers can learn to
-process structural information on their own, such as the AlphaFold3 structural
-diffusion model. In this work we show that Transformers can function
-independently as structure models when passed linear embeddings of coordinates.
-We first provide a theoretical explanation for how Transformers can learn to
-filter attention as a 3D Gaussian with learned variance. We then validate this
-theory using both simulated 3D points and in the context of masked token
-prediction for proteins. Finally, we show that pre-training protein Transformer
-encoders with structure improves performance on a downstream task, yielding
-better performance than custom structural models. Together, this work provides
-a basis for using standard Transformers as hybrid structure-language models.
+Europe's healthcare systems require enhanced interoperability and
+digitalization, driving a demand for innovative solutions to process legacy
+clinical data. This paper presents the results of our project, which aims to
+leverage Large Language Models (LLMs) to extract structured information from
+unstructured clinical reports, focusing on patient history, diagnoses,
+treatments, and other predefined categories. We developed a workflow with a
+user interface and evaluated LLMs of varying sizes through prompting strategies
+and fine-tuning. Our results show that fine-tuned smaller models match or
+surpass larger counterparts in performance, offering efficiency for
+resource-limited settings. A new dataset of 60,000 annotated English clinical
+summaries and 24,000 German translations was validated with automated and
+manual checks. The evaluations used ROUGE, BERTScore, and entity-level metrics.
+The work highlights the approach's viability and outlines future improvements.

-摘要:雖然傳統的 Transformer 通常處理序列資料,但它們可用於結構模型,通常是 SE(3) 不變式或等變式圖神經網路 (GNN),用於蛋白質結構建模等 3D 應用。這些混合模型通常包含 (1) 將結構特徵預處理/標記化為 Transformer 的輸入或 (2) 取用 Transformer 嵌入並在結構表示中處理它們。然而,有證據表明 Transformer 可以自行學習處理結構資訊,例如 AlphaFold3 結構擴散模型。在這項工作中,我們展示了 Transformer 在傳遞座標的線性嵌入時,可以獨立作為結構模型運作。我們首先提供了 Transformer 如何學習將注意力濾波為具有學習變異的 3D 高斯的理論解釋。然後我們使用模擬 3D 點和在蛋白質遮罩標記預測的背景下驗證此理論。最後,我們展示了使用結構預訓練蛋白質 Transformer 編碼器會改善下游任務的效能,產生比自訂結構模型更好的效能。綜合來說,這項工作提供了使用標準 Transformer 作為混合結構語言模型的基礎。
+摘要:歐洲的醫療保健系統需要增強互通性和數位化,這驅動了對創新解決方案的需求,以處理傳統的臨床數據。本文介紹了我們專案的成果,該專案旨在利用大型語言模型 (LLM) 從非結構化的臨床報告中提取結構化的資訊,重點放在病歷、診斷、治療和其他預定義類別上。我們開發了一個具有使用者介面的工作流程,並透過提示策略和微調來評估不同規模的 LLM。我們的結果顯示,微調後的較小模型在效能上與較大的模型相匹配或超越它們,為資源有限的環境提供了效率。一個包含 60,000 個註解英文臨床摘要和 24,000 個德文翻譯的新資料集已透過自動化和手動檢查進行驗證。評估使用了 ROUGE、BERTScore 和實體層級的指標。這項工作突出了這種方法的可行性,並概述了未來的改進。

diff --git a/__pycache__/config.cpython-310.pyc b/__pycache__/config.cpython-310.pyc
index 7476fee6de9c2576128fd2852d55c491bd2124ea..81c5faa2dc88083095c864e1b606ca4c9c30ab75 100644
GIT binary patch
delta 19
Zcmdnav7Lh}pO=@50SKaxZRFBr0RSv|1StRj

delta 19
Zcmdnav7Lh}pO=@50SM}aH*)E+001e|1DXH;

diff --git a/__pycache__/util4translation.cpython-310.pyc b/__pycache__/util4translation.cpython-310.pyc
index 6f43a990206725f06b7f53c7eed02057fec568de..a9fb52a25ebe728f09307d1c85bf817e4261ace2 100644
GIT binary patch
delta 19
ZcmccbdEb*OpO=@50SMxbZREPD0suTd1;zjX

delta 19
ZcmccbdEb*OpO=@50SKCeH*#H70RTCd1vdZy

diff --git a/database/logs/runtime.log b/database/logs/runtime.log
index a0f3ce8233..5e313cdba4 100644
--- a/database/logs/runtime.log
+++ b/database/logs/runtime.log
@@ -21438,3 +21438,7 @@ KeyError: 'paper_summary_zh'
 2025-02-23 20:25:02.056 | SUCCESS | __main__:parse:267 - handle [2/4] | topic=`AI` subtopic=`Medical`
 2025-02-23 20:25:02.608 | SUCCESS | __main__:parse:267 - handle [3/4] | topic=`AI` subtopic=`LLM`
 2025-02-23 20:25:02.612 | SUCCESS | __main__:parse:267 - handle [4/4] | topic=`AI` subtopic=`Knowledge Graphs`
+2025-02-24 09:08:05.732 | SUCCESS | __main__:parse:267 - handle [1/4] | topic=`AI` subtopic=`LLM`
+2025-02-24 09:08:05.748 | SUCCESS | __main__:parse:267 - handle [2/4] | topic=`AI` subtopic=`Knowledge Graphs`
+2025-02-24 09:08:05.850 | SUCCESS | __main__:parse:267 - handle [3/4] | topic=`AI` subtopic=`Medical explainable AI`
+2025-02-24 09:08:05.914 | SUCCESS | __main__:parse:267 - handle [4/4] | topic=`AI` subtopic=`Medical`
diff --git a/database/storage/storage_2025-02-24.md b/database/storage/storage_2025-02-24.md
new file mode 100644
index 0000000000..daca996fb7
--- /dev/null
+++ b/database/storage/storage_2025-02-24.md
@@ -0,0 +1,10150 @@
+# arxiv-daily
+ Automated deployment @ 2025-02-24 09:08:05 Asia/Taipei
+> Welcome to contribute! Add your topics and keywords in [`topic.yml`](https://github.com/jawatech/arxiv-daily-in-place/blob/main/database/topic.yml).
+> You can also view historical data through the [storage](https://github.com/jawatech/arxiv-daily-in-place/blob/main/database/storage).
+
+## AI
+
+### LLM
+|Publish Date|Title|Authors|Homepage|Code|
+| :---: | :---: | :---: | :---: | :---: |
+|**2025-02-20**|**LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention**|Shang Yang et.al.|[2502.14866v1](http://arxiv.org/abs/2502.14866v1)|null|
+|**2025-02-20**|**Interpretable Text Embeddings and Text Similarity Explanation: A Primer**|Juri Opitz et.al.|[2502.14862v1](http://arxiv.org/abs/2502.14862v1)|null|
+|**2025-02-20**|**Aligning LLMs to Ask Good Questions A Case Study in Clinical Reasoning**|Shuyue Stella Li et.al.|[2502.14860v1](http://arxiv.org/abs/2502.14860v1)|null|
+|**2025-02-20**|**FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling**|Weilin Zhao et.al.|[2502.14856v1](http://arxiv.org/abs/2502.14856v1)|null|
+|**2025-02-20**|**Prompt-to-Leaderboard**|Evan Frick et.al.|[2502.14855v1](http://arxiv.org/abs/2502.14855v1)|null|
+|**2025-02-20**|**CLIPPER: Compression enables long-context synthetic data generation**|Chau Minh Pham et.al.|[2502.14854v1](http://arxiv.org/abs/2502.14854v1)|null|
+|**2025-02-20**|**GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks**|Jianwen Luo et.al.|[2502.14848v1](http://arxiv.org/abs/2502.14848v1)|null|
+|**2025-02-20**|**Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation**|Yue Yang et.al.|[2502.14846v1](http://arxiv.org/abs/2502.14846v1)|null|
+|**2025-02-20**|**Revealing and Mitigating Over-Attention in Knowledge Editing**|Pinzheng Wang et.al.|[2502.14838v1](http://arxiv.org/abs/2502.14838v1)|null|
+|**2025-02-20**|**Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs**|Tao Ji et.al.|[2502.14837v1](http://arxiv.org/abs/2502.14837v1)|null|
+|**2025-02-20**|**LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models**|Shangqing Tu et.al.|[2502.14834v1](http://arxiv.org/abs/2502.14834v1)|null|
+|**2025-02-20**|**Improving the Diffusability of Autoencoders**|Ivan Skorokhodov et.al.|[2502.14831v1](http://arxiv.org/abs/2502.14831v1)|null|
+|**2025-02-20**|**Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs**|Danni Liu et.al.|[2502.14830v1](http://arxiv.org/abs/2502.14830v1)|null|
+|**2025-02-20**|**Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps**|Martin Tutek et.al.|[2502.14829v1](http://arxiv.org/abs/2502.14829v1)|null|
+|**2025-02-20**|**Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison**|Aiswarya Baby et.al.|[2502.14827v1](http://arxiv.org/abs/2502.14827v1)|null|
+|**2025-02-20**|**eC-Tab2Text: Aspect-Based Text Generation from e-Commerce Product Tables**|Luis Antonio Gutiérrez Guanilo et.al.|[2502.14820v1](http://arxiv.org/abs/2502.14820v1)|null|
+|**2025-02-20**|**Optimizing Model Selection for Compound AI Systems**|Lingjiao Chen et.al.|[2502.14815v1](http://arxiv.org/abs/2502.14815v1)|null|
+|**2025-02-20**|**FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis**|Fadillah Maani et.al.|[2502.14807v1](http://arxiv.org/abs/2502.14807v1)|null|
+|**2025-02-20**|**From RAG to Memory: Non-Parametric Continual Learning for Large Language Models**|Bernal Jiménez Gutiérrez et.al.|[2502.14802v1](http://arxiv.org/abs/2502.14802v1)|[link](https://github.com/osu-nlp-group/hipporag)|
+|**2025-02-20**|**A Survey on Text-Driven 360-Degree Panorama Generation**|Hai Wang et.al.|[2502.14799v1](http://arxiv.org/abs/2502.14799v1)|null|
+|**2025-02-20**|**Rapid Word Learning Through Meta In-Context Learning**|Wentao Wang et.al.|[2502.14791v1](http://arxiv.org/abs/2502.14791v1)|null|
+|**2025-02-20**|**SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features**|Michael Tschannen et.al.|[2502.14786v1](http://arxiv.org/abs/2502.14786v1)|[link](https://github.com/google-research/big_vision)|
+|**2025-02-20**|**ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting**|Abhijit Mishra et.al.|[2502.14780v1](http://arxiv.org/abs/2502.14780v1)|null|
+|**2025-02-20**|**Harnessing PDF Data for Improving Japanese Large Multimodal Models**|Jeonghun Baek et.al.|[2502.14778v1](http://arxiv.org/abs/2502.14778v1)|null|
+|**2025-02-20**|**Making Universal Policies Universal**|Niklas Höpner et.al.|[2502.14777v1](http://arxiv.org/abs/2502.14777v1)|null|
+|**2025-02-20**|**SurveyX: Academic Survey Automation via Large Language Models**|Xun Liang et.al.|[2502.14776v1](http://arxiv.org/abs/2502.14776v1)|null|
+|**2025-02-20**|**Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning**|Tian Xie et.al.|[2502.14768v1](http://arxiv.org/abs/2502.14768v1)|null|
+|**2025-02-20**|**Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis**|Priyanka Kargupta et.al.|[2502.14767v1](http://arxiv.org/abs/2502.14767v1)|null|
+|**2025-02-20**|**Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning**|Juraj Vladika et.al.|[2502.14765v1](http://arxiv.org/abs/2502.14765v1)|null|
+|**2025-02-20**|**EquivaMap: Leveraging LLMs for Automatic Equivalence Checking of Optimization Formulations**|Haotian Zhai et.al.|[2502.14760v1](http://arxiv.org/abs/2502.14760v1)|null|
+|**2025-02-20**|**On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation Systems**|Juraj Vladika et.al.|[2502.14759v1](http://arxiv.org/abs/2502.14759v1)|null|
+|**2025-02-20**|**MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders**|Maya Varma et.al.|[2502.14753v1](http://arxiv.org/abs/2502.14753v1)|null|
+|**2025-02-20**|**TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators**|Jianling Li et.al.|[2502.14752v1](http://arxiv.org/abs/2502.14752v1)|null|
+|**2025-02-20**|**Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of LLMs**|Zongxia Li et.al.|[2502.14748v1](http://arxiv.org/abs/2502.14748v1)|null|
+|**2025-02-20**|**HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States**|Yilei Jiang et.al.|[2502.14744v1](http://arxiv.org/abs/2502.14744v1)|null|
+|**2025-02-20**|**Multi-Agent Coordination across Diverse Applications: A Survey**|Lijun Sun et.al.|[2502.14743v1](http://arxiv.org/abs/2502.14743v1)|null|
+|**2025-02-20**|**YOLOv12: A Breakdown of the Key Architectural Features**|Mujadded Al Rabbani Alif et.al.|[2502.14740v1](http://arxiv.org/abs/2502.14740v1)|null|
+|**2025-02-20**|**SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines**|M-A-P Team et.al.|[2502.14739v1](http://arxiv.org/abs/2502.14739v1)|null|
+|**2025-02-20**|**EAGER-LLM: Enhancing Large Language Models as Recommenders through Exogenous Behavior-Semantic Integration**|Minjie Hong et.al.|[2502.14735v1](http://arxiv.org/abs/2502.14735v1)|null|
+|**2025-02-20**|**Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models**|Hongji Li et.al.|[2502.14734v1](http://arxiv.org/abs/2502.14734v1)|null|
+|**2025-02-20**|**WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models**|Yifu Chen et.al.|[2502.14727v1](http://arxiv.org/abs/2502.14727v1)|null|
+|**2025-02-20**|**Entity Framing and Role Portrayal in the News**|Tarek Mahmoud et.al.|[2502.14718v1](http://arxiv.org/abs/2502.14718v1)|null|
+|**2025-02-20**|**From Knowledge Generation to Knowledge Verification: Examining the BioMedical Generative Capabilities of ChatGPT**|Ahmed Abdeen Hamed et.al.|[2502.14714v1](http://arxiv.org/abs/2502.14714v1)|null|
+|**2025-02-20**|**Data-Efficient Pretraining with Group-Level Data Influence Modeling**|Zichun Yu et.al.|[2502.14709v1](http://arxiv.org/abs/2502.14709v1)|null|
+|**2025-02-20**|**Human Misperception of Generative-AI Alignment: A Laboratory Experiment**|Kevin He et.al.|[2502.14708v1](http://arxiv.org/abs/2502.14708v1)|null|
+|**2025-02-20**|**Not All Data are Good Labels: On the Self-supervised Labeling for Time Series Forecasting**|Yuxuan Yang et.al.|[2502.14704v1](http://arxiv.org/abs/2502.14704v1)|null|
+|**2025-02-20**|**I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search**|Zujie Liang et.al.|[2502.14693v1](http://arxiv.org/abs/2502.14693v1)|null|
+|**2025-02-20**|**Bridging the Gap: Transforming Natural Language Questions into SQL Queries via Abstract Query Pattern and Contextual Schema Markup**|Yonghui Kong et.al.|[2502.14682v1](http://arxiv.org/abs/2502.14682v1)|null|
+|**2025-02-20**|**How to Get Your LLM to Generate Challenging Problems for Evaluation**|Arkil Patel et.al.|[2502.14678v1](http://arxiv.org/abs/2502.14678v1)|null|
+|**2025-02-20**|**Data-Constrained Synthesis of Training Data for De-Identification**|Thomas Vakili et.al.|[2502.14677v1](http://arxiv.org/abs/2502.14677v1)|null|
+|**2025-02-20**|**BP-SGCN: Behavioral Pseudo-Label Informed Sparse Graph Convolution Network for Pedestrian and Heterogeneous Trajectory Prediction**|Ruochen Li et.al.|[2502.14676v1](http://arxiv.org/abs/2502.14676v1)|null|
+|**2025-02-20**|**Explanations of Deep Language Models Explain Language Representations in the Brain**|Maryam Rahimi et.al.|[2502.14671v1](http://arxiv.org/abs/2502.14671v1)|null|
+|**2025-02-20**|**AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO**|Alan Dao et.al.|[2502.14669v1](http://arxiv.org/abs/2502.14669v1)|null|
+|**2025-02-20**|**InstructAgent: Building User Controllable Recommender via LLM Agent**|Wujiang Xu et.al.|[2502.14662v1](http://arxiv.org/abs/2502.14662v1)|null|
+|**2025-02-20**|**Edit Once, Update Everywhere: A Simple Framework for Cross-Lingual Knowledge Synchronization in LLMs**|Yuchen Wu et.al.|[2502.14645v1](http://arxiv.org/abs/2502.14645v1)|null|
+|**2025-02-20**|**LIFT: Improving Long Context Understanding of Large Language Models through Long Input Fine-Tuning**|Yansheng Mao et.al.|[2502.14644v1](http://arxiv.org/abs/2502.14644v1)|null|
+|**2025-02-20**|**Length-Controlled Margin-Based Preference Optimization without Reference Model**|Gengxu Li et.al.|[2502.14643v1](http://arxiv.org/abs/2502.14643v1)|null|
+|**2025-02-20**|**How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation**|Rui Li et.al.|[2502.14642v1](http://arxiv.org/abs/2502.14642v1)|null|
+|**2025-02-20**|**NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization**|Zheyuan Zhang et.al.|[2502.14638v1](http://arxiv.org/abs/2502.14638v1)|null|
+|**2025-02-20**|**ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation**|Angxiao Yue et.al.|[2502.14637v1](http://arxiv.org/abs/2502.14637v1)|[link](https://github.com/AngxiaoYue/ReQFlow)|
+|**2025-02-20**|**PEARL: Towards Permutation-Resilient LLMs**|Liang Chen et.al.|[2502.14628v1](http://arxiv.org/abs/2502.14628v1)|null|
+|**2025-02-20**|**ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors**|Yuguo Yin et.al.|[2502.14627v1](http://arxiv.org/abs/2502.14627v1)|null|
+|**2025-02-20**|**Multi-Record Web Page Information Extraction From News Websites**|Alexander Kustenkov et.al.|[2502.14625v1](http://arxiv.org/abs/2502.14625v1)|null|
+|**2025-02-20**|**Exploring RWKV for Sentence Embeddings: Layer-wise Analysis and Baseline Comparison for Semantic Similarity**|Xinghan Pan et.al.|[2502.14620v1](http://arxiv.org/abs/2502.14620v1)|[link](https://github.com/PStarH/RWKV-embedding)|
+|**2025-02-20**|**Reward Models Identify Consistency, Not Causality**|Yuhui Xu et.al.|[2502.14619v1](http://arxiv.org/abs/2502.14619v1)|null|
+|**2025-02-20**|**FIND: Fine-grained Information Density Guided Adaptive Retrieval-Augmented Generation for Disease Diagnosis**|Mingyi Jia et.al.|[2502.14614v1](http://arxiv.org/abs/2502.14614v1)|null|
+|**2025-02-20**|**Behavioral Analysis of Information Salience in Large Language Models**|Jan Trienes et.al.|[2502.14613v1](http://arxiv.org/abs/2502.14613v1)|null|
+|**2025-02-20**|**A Theory for Conditional Generative Modeling on Multiple Data Sources**|Rongzhen Wang et.al.|[2502.14583v1](http://arxiv.org/abs/2502.14583v1)|null|
+|**2025-02-20**|**A Statistical Case Against Empirical Human-AI Alignment**|Julian Rodemann et.al.|[2502.14581v1](http://arxiv.org/abs/2502.14581v1)|null|
+|**2025-02-20**|**ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification**|Hyunseok Lee et.al.|[2502.14565v1](http://arxiv.org/abs/2502.14565v1)|null|
+|**2025-02-20**|**Plan-over-Graph: Towards Parallelable LLM Agent Schedule**|Shiqi Zhang et.al.|[2502.14563v1](http://arxiv.org/abs/2502.14563v1)|null|
+|**2025-02-20**|**Can LLMs Predict Citation Intent? An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs**|Paris Koloveas et.al.|[2502.14561v1](http://arxiv.org/abs/2502.14561v1)|null|
+|**2025-02-20**|**Less is More: Improving LLM Alignment via Preference Data Selection**|Xun Deng et.al.|[2502.14560v1](http://arxiv.org/abs/2502.14560v1)|null|
+|**2025-02-20**|**FUIA: Model Inversion Attack against Federated Unlearning**|Lei Zhou et.al.|[2502.14558v1](http://arxiv.org/abs/2502.14558v1)|null|
+|**2025-02-20**|**Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling**|Eric Egli et.al.|[2502.14553v1](http://arxiv.org/abs/2502.14553v1)|null|
+|**2025-02-20**|**Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks**|Maya Bechler-Speicher et.al.|[2502.14546v1](http://arxiv.org/abs/2502.14546v1)|null|
+|**2025-02-20**|**LLM-based User Profile Management for Recommender System**|Seunghwan Bang et.al.|[2502.14541v1](http://arxiv.org/abs/2502.14541v1)|null|
+|**2025-02-20**|**LoRA-GGPO: Mitigating Double Descent in LoRA Fine-Tuning via Gradient-Guided Perturbation Optimization**|Yupeng Chang et.al.|[2502.14538v1](http://arxiv.org/abs/2502.14538v1)|null|
+|**2025-02-20**|**CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models**|Zhenhong Zhou et.al.|[2502.14529v1](http://arxiv.org/abs/2502.14529v1)|null|
+|**2025-02-20**|**Small Graph Is All You Need: DeepStateGNN for Scalable Traffic Forecasting**|Yannick Wölker et.al.|[2502.14525v1](http://arxiv.org/abs/2502.14525v1)|null|
+|**2025-02-20**|**Generative adversarial networks vs large language models: a comparative study on synthetic tabular data generation**|Austin A. Barr et.al.|[2502.14523v1](http://arxiv.org/abs/2502.14523v1)|null|
+|**2025-02-20**|**MultiSlav: Using Cross-Lingual Knowledge Transfer to Combat the Curse of Multilinguality**|Artur Kot et.al.|[2502.14509v1](http://arxiv.org/abs/2502.14509v1)|null|
+|**2025-02-20**|**Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases**|Rena Gao et.al.|[2502.14507v1](http://arxiv.org/abs/2502.14507v1)|null|
+|**2025-02-20**|**PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models**|Yu Meng et.al.|[2502.14504v1](http://arxiv.org/abs/2502.14504v1)|null|
+|**2025-02-20**|**How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?**|Sergey Pletenev et.al.|[2502.14502v1](http://arxiv.org/abs/2502.14502v1)|null|
+|**2025-02-20**|**Towards a Perspectivist Turn in Argument Quality Assessment**|Julia Romberg et.al.|[2502.14501v1](http://arxiv.org/abs/2502.14501v1)|null|
+|**2025-02-20**|**MLGym: A New Framework and Benchmark for Advancing AI Research Agents**|Deepak Nathani et.al.|[2502.14499v1](http://arxiv.org/abs/2502.14499v1)|null|
+|**2025-02-20**|**Stories that (are) Move(d by) Markets: A Causal Exploration of Market Shocks and Semantic Shifts across Different Partisan Groups**|Felix Drinkall et.al.|[2502.14497v1](http://arxiv.org/abs/2502.14497v1)|null|
+|**2025-02-20**|**Enhancing Language Multi-Agent Learning with Multi-Agent Credit Re-Assignment for Interactive Environment Generalization**|Zhitao He et.al.|[2502.14496v1](http://arxiv.org/abs/2502.14496v1)|null|
+|**2025-02-20**|**StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following**|Jinnan Li et.al.|[2502.14494v1](http://arxiv.org/abs/2502.14494v1)|null|
+|**2025-02-20**|**Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk**|Elija Perrier et.al.|[2502.14491v1](http://arxiv.org/abs/2502.14491v1)|null|
+|**2025-02-20**|**Temporal Misalignment and Probabilistic Neurons**|Velibor Bojković et.al.|[2502.14487v1](http://arxiv.org/abs/2502.14487v1)|null|
+|**2025-02-20**|**How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation**|Zhuohang Long et.al.|[2502.14486v1](http://arxiv.org/abs/2502.14486v1)|null|
+|**2025-02-20**|**NLoRA: Nyström-Initiated Low-Rank Adaptation for Large Language Models**|Chenlu Guo et.al.|[2502.14482v1](http://arxiv.org/abs/2502.14482v1)|null|
+|**2025-02-20**|**Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression**|Haoyu Wang et.al.|[2502.14477v1](http://arxiv.org/abs/2502.14477v1)|null|
+|**2025-02-20**|**Argument-Based Comparative Question Answering Evaluation Benchmark**|Irina Nikishina et.al.|[2502.14476v1](http://arxiv.org/abs/2502.14476v1)|null|
+|**2025-02-20**|**Enhancing Smart Environments with Context-Aware Chatbots using Large Language Models**|Aurora Polo-Rodríguez et.al.|[2502.14469v1](http://arxiv.org/abs/2502.14469v1)|null|
+|**2025-02-20**|**Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing**|Aviv Bick et.al.|[2502.14458v1](http://arxiv.org/abs/2502.14458v1)|null|
+|**2025-02-20**|**Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization**|Ran Ding et.al.|[2502.14456v1](http://arxiv.org/abs/2502.14456v1)|null|
+|**2025-02-20**|**Optimal word order for non-causal text generation with Large Language Models: the Spanish case**|Andrea Busto-Castiñeira et.al.|[2502.14451v1](http://arxiv.org/abs/2502.14451v1)|null|
+
+#### Abstracts
+##### **LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention**
+2502.14866v1 by Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han
+
+Large language models (LLMs) have shown remarkable potential in processing
+long sequences, yet efficiently serving these long-context models remains
+challenging due to the quadratic computational complexity of attention in the
+prefilling stage and the large memory footprint of the KV cache in the decoding
+stage. To address these issues, we introduce LServe, an efficient system that
+accelerates long-sequence LLM serving via hybrid sparse attention. This method
+unifies different hardware-friendly, structured sparsity patterns for both
+prefilling and decoding attention into a single framework, where computations
+on less important tokens are skipped block-wise. LServe demonstrates the
+compatibility of static and dynamic sparsity in long-context LLM attention.
+This design enables multiplicative speedups by combining these optimizations.
+Specifically, we convert half of the attention heads to nearly free streaming
+heads in both the prefilling and decoding stages. Additionally, we find that
+only a constant number of KV pages is required to preserve long-context
+capabilities, irrespective of context length. We then design a hierarchical KV
+page selection policy that dynamically prunes KV pages based on query-centric
+similarity. On average, LServe accelerates LLM prefilling by up to 2.9x and
+decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy. Code is
+released at https://github.com/mit-han-lab/omniserve.
+ +摘要:大型語言模型 (LLM) 在處理長序列方面展現出驚人的潛力,但由於預填充階段注意力的二次計算複雜度和解碼階段 KV 快取的大量記憶體使用量,有效提供這些長語境模型服務仍然具有挑戰性。為了解決這些問題,我們引入了 LServe,一個透過混合稀疏注意力加速長序列 LLM 服務的高效系統。此方法將不同的硬體友善的結構化稀疏模式統一到一個單一的架構中,用於預填充和解碼注意力,其中對較不重要的符號的運算會以區塊方式略過。LServe 證明了靜態和動態稀疏性在長語境 LLM 注意力中的相容性。此設計透過結合這些最佳化來實現倍增加速。具體來說,我們將一半的注意力頭轉換為預填充和解碼階段中幾乎免費的串流頭。此外,我們發現僅需要恆定的 KV 頁數來保留長語境功能,而與語境長度無關。然後,我們設計了一個分層式 KV 頁面選擇策略,根據以查詢為中心的相似性動態刪除 KV 頁面。平均而言,LServe 將 LLM 預填充加速了 2.9 倍,將解碼加速了 1.3-2.1 倍,同時維持長語境的準確性。程式碼已發布在 https://github.com/mit-han-lab/omniserve。 + +##### **Interpretable Text Embeddings and Text Similarity Explanation: A Primer** +2502.14862v1 by Juri Opitz, Lucas Möller, Andrianos Michail, Simon Clematide + +Text embeddings and text embedding models are a backbone of many AI and NLP +systems, particularly those involving search. However, interpretability +challenges persist, especially in explaining obtained similarity scores, which +is crucial for applications requiring transparency. In this paper, we give a +structured overview of interpretability methods specializing in explaining +those similarity scores, an emerging research area. We study the methods' +individual ideas and techniques, evaluating their potential for improving +interpretability of text embeddings and explaining predicted similarities. + +摘要:文字嵌入和文字嵌入模型是許多 AI 和 NLP 系統的骨幹,特別是那些涉及搜尋的系統。然而,可解釋性的挑戰依然存在,特別是在解釋獲得的相似度分數時,這對於需要透明度的應用程式至關重要。在本文中,我們對專門用於解釋這些相似度分數的可解釋性方法給予結構化的概述,這是一個新興的研究領域。我們研究了這些方法的個別想法和技術,評估它們改善文字嵌入的可解釋性和解釋預測相似度的潛力。 + +##### **Aligning LLMs to Ask Good Questions A Case Study in Clinical Reasoning** +2502.14860v1 by Shuyue Stella Li, Jimin Mun, Faeze Brahman, Jonathan S. Ilgen, Yulia Tsvetkov, Maarten Sap + +Large language models (LLMs) often fail to ask effective questions under +uncertainty, making them unreliable in domains where proactive +information-gathering is essential for decisionmaking. 
We present ALFA, a +framework that improves LLM question-asking by (i) decomposing the notion of a +"good" question into a set of theory-grounded attributes (e.g., clarity, +relevance), (ii) controllably synthesizing attribute-specific question +variations, and (iii) aligning models via preference-based optimization to +explicitly learn to ask better questions along these fine-grained attributes. +Focusing on clinical reasoning as a case study, we introduce the MediQ-AskDocs +dataset, composed of 17k real-world clinical interactions augmented with 80k +attribute-specific preference pairs of follow-up questions, as well as a novel +expert-annotated interactive healthcare QA task to evaluate question-asking +abilities. Models aligned with ALFA reduce diagnostic errors by 56.6% on +MediQ-AskDocs compared to SOTA instruction-tuned LLMs, with a question-level +win-rate of 64.4% and strong generalizability. Our findings suggest that +explicitly guiding question-asking with structured, fine-grained attributes +offers a scalable path to improve LLMs, especially in expert application +domains. 
+ +摘要:大型語言模型 (LLM) 經常在不確定性下無法提出有效問題,這使得它們在主動收集資訊對於決策制定至關重要的領域中不可靠。我們提出 ALFA,一個透過 (i) 將「良好」問題的概念分解成一組以理論為基礎的屬性(例如,清晰度、相關性),(ii) 可控地合成屬性特定的問題變體,以及 (iii) 透過基於偏好的最佳化調整模型,明確學習沿著這些細緻屬性提出更好的問題,來改善 LLM 提問的架構。專注於臨床推理作為案例研究,我們引入了 MediQ-AskDocs 資料集,由 17k 個真實世界的臨床互動組成,並增加了 80k 個屬性特定的後續問題偏好配對,以及一個由專家註解的互動式醫療保健問答任務來評估提問能力。與 SOTA 指令調整的 LLM 相比,與 ALFA 對齊的模型將 MediQ-AskDocs 上的診斷錯誤減少了 56.6%,問題層級的勝率為 64.4%,並且具有很強的普遍性。我們的研究結果表明,明確地以結構化、細緻的屬性來引導提問,提供了一條可擴充的途徑來改善 LLM,特別是在專家應用領域。 + +##### **FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling** +2502.14856v1 by Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, Yuxiang Huang, Kaihuo Zhang, Weilun Zhao, Yuxuan Li, Jianyong Wang, Zhiyuan Liu, Maosong Sun + +Speculative sampling has emerged as an important technique for accelerating +the auto-regressive generation process of large language models (LLMs) by +utilizing a draft-then-verify mechanism to produce multiple tokens per forward +pass. While state-of-the-art speculative sampling methods use only a single +layer and a language modeling (LM) head as the draft model to achieve +impressive layer compression, their efficiency gains are substantially reduced +for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. +To address this, we present FR-Spec, a frequency-ranked speculative sampling +framework that optimizes draft candidate selection through vocabulary space +compression. By constraining the draft search to a frequency-prioritized token +subset, our method reduces LM Head computation overhead by 75% while ensuring +the equivalence of the final output distribution. Experiments across multiple +datasets demonstrate an average of 1.12$\times$ speedup over the +state-of-the-art speculative sampling method EAGLE-2. 
+ +摘要:推測取樣已成為一種重要的技術,可用於透過利用先起草後驗證的機制來加速大型語言模型 (LLM) 的自迴歸生成過程,並在每次前向傳遞中產生多個代幣。儘管最先進的推測取樣方法只使用單一層和語言建模 (LM) 頭作為起草模型,以達成令人印象深刻的層壓縮,但對於大型詞彙表 LLM(例如詞彙表包含 128k 個代幣的 Llama-3-8B),其效率提升會大幅降低。為了解決這個問題,我們提出了 FR-Spec,這是一種頻率排序推測取樣架構,它透過詞彙空間壓縮來最佳化起草候選選取。我們的這個方法透過將起草搜尋限制在優先於頻率的代幣子集中,將 LM 頭部運算開銷減少了 75%,同時確保最終輸出分佈的等效性。透過多個資料集的實驗證明,與最先進的推測取樣方法 EAGLE-2 相比,平均提速了 1.12 倍。 + +##### **Prompt-to-Leaderboard** +2502.14855v1 by Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anastasios N. Angelopoulos, Ion Stoica + +Large language model (LLM) evaluations typically rely on aggregated metrics +like accuracy or human preference, averaging across users and prompts. This +averaging obscures user- and prompt-specific variations in model performance. +To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces +leaderboards specific to a prompt. The core idea is to train an LLM taking +natural language prompts as input to output a vector of Bradley-Terry +coefficients which are then used to predict the human preference vote. The +resulting prompt-dependent leaderboards allow for unsupervised task-specific +evaluation, optimal routing of queries to models, personalization, and +automated evaluation of model strengths and weaknesses. Data from Chatbot Arena +suggest that P2L better captures the nuanced landscape of language model +performance than the averaged leaderboard. Furthermore, our findings suggest +that P2L's ability to produce prompt-specific evaluations follows a power law +scaling similar to that observed in LLMs themselves. In January 2025, the +router we trained based on this methodology achieved the \#1 spot in the +Chatbot Arena leaderboard. Our code is available at this GitHub link: +https://github.com/lmarena/p2l. 
+ +摘要:大型語言模型 (LLM) 評估通常依賴於彙總的指標,例如準確性或人類偏好,平均值跨使用者和提示。此平均值模糊了使用者和提示特定的模型效能變異。為了解決此問題,我們提出提示到排行榜 (P2L),一種產生特定於提示的排行榜的方法。核心概念是訓練 LLM,將自然語言提示作為輸入,以輸出 Bradley-Terry 係數向量,然後用於預測人類偏好投票。產生的提示相關排行榜允許無監督任務特定評估、最佳查詢路由至模型、個人化以及模型優缺點的自動化評估。來自 Chatbot Arena 的資料表明,P2L 比平均排行榜更能捕捉語言模型效能的細微變化。此外,我們的研究結果表明,P2L 產生提示特定評估的能力遵循類似於 LLM 本身觀察到的冪律縮放。2025 年 1 月,我們根據此方法訓練的路由器在 Chatbot Arena 排行榜中獲得了第一名。我們的程式碼可在 GitHub 連結取得:https://github.com/lmarena/p2l。 + +##### **CLIPPER: Compression enables long-context synthetic data generation** +2502.14854v1 by Chau Minh Pham, Yapei Chang, Mohit Iyyer + +LLM developers are increasingly reliant on synthetic data, but generating +high-quality data for complex long-context reasoning tasks remains challenging. +We introduce CLIPPER, a compression-based approach for generating synthetic +data tailored to narrative claim verification - a task that requires reasoning +over a book to verify a given claim. Instead of generating claims directly from +the raw text of the book, which results in artifact-riddled claims, CLIPPER +first compresses the book into chapter outlines and book summaries and then +uses these intermediate representations to generate complex claims and +corresponding chain-of-thoughts. Compared to naive approaches, CLIPPER produces +claims that are more valid, grounded, and complex. Using CLIPPER, we construct +a dataset of 19K synthetic book claims paired with their source texts and +chain-of-thought reasoning, and use it to fine-tune three open-weight models. +Our best model achieves breakthrough results on narrative claim verification +(from 28% to 76% accuracy on our test set) and sets a new state-of-the-art for +sub-10B models on the NoCha leaderboard. Further analysis shows that our models +generate more detailed and grounded chain-of-thought reasoning while also +improving performance on other narrative understanding tasks (e.g., +NarrativeQA). 
+ +摘要:LLM 開發人員越來越依賴合成資料,但為複雜的長語境推理任務生成高品質資料仍然具有挑戰性。我們引入了 CLIPPER,一種基於壓縮的方法,用於生成針對敘事性聲明驗證量身打造的合成資料,這項任務需要對一本書進行推理才能驗證給定的聲明。CLIPPER 沒有直接從書籍的原始文字生成聲明,這會產生充滿人工製品的聲明,而是先將書籍壓縮成章節大綱和書籍摘要,然後使用這些中間表示來生成複雜的聲明和對應的思維鏈。與天真的方法相比,CLIPPER 產生的聲明更有效、更有根據且更複雜。使用 CLIPPER,我們構建了一個包含 19K 個合成書籍聲明及其原始文字和思維鏈推理的資料集,並用於微調三個開放權重模型。我們最好的模型在敘事性聲明驗證方面取得了突破性的結果(在我們的測試集中準確率從 28% 提升到 76%),並在 NoCha 排行榜上為低於 10B 的模型設定了新的技術水準。進一步的分析表明,我們的模型生成了更詳細且有根據的思維鏈推理,同時也提高了其他敘事理解任務(例如 NarrativeQA)的效能。 + +##### **GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks** +2502.14848v1 by Jianwen Luo, Yiming Huang, Jinxiang Meng, Fangyu Lei, Shizhu He, Xiao Liu, Shanshan Jiang, Bin Dong, Jun Zhao, Kang Liu + +Large Language Models (LLMs) have shown great promise in tool-making, yet +existing frameworks often struggle to efficiently construct reliable toolsets +and are limited to single-task settings. To address these challenges, we +propose GATE (Graph-based Adaptive Tool Evolution), an adaptive framework that +dynamically constructs and evolves a hierarchical graph of reusable tools +across multiple scenarios. We evaluate GATE on open-ended tasks (Minecraft), +agent-based tasks (TextCraft, DABench), and code generation tasks (MATH, Date, +TabMWP). Our results show that GATE achieves up to 4.3x faster milestone +completion in Minecraft compared to the previous SOTA, and provides an average +improvement of 9.23% over existing tool-making methods in code generation tasks +and 10.03% in agent tasks. GATE demonstrates the power of adaptive evolution, +balancing tool quantity, complexity, and functionality while maintaining high +efficiency. Code and data are available at +https://github.com/ayanami2003/GATE.
+ +摘要:大型語言模型 (LLM) 在工具製作方面展現出極大的潛力,然而現有的框架經常難以有效地建構可靠的工具組,並且僅限於單一任務設定。為了應對這些挑戰,我們提出了 GATE(基於圖形的自適應工具演化),這是一個自適應框架,可跨多個場景動態建構和演化可重複使用的工具階層圖。我們在開放式任務(Minecraft)、基於代理的任務(TextCraft、DABench)和程式碼生成任務(MATH、Date、TabMWP)上評估了 GATE。我們的結果顯示,與先前的 SOTA 相比,GATE 在 Minecraft 中實現了高達 4.3 倍的里程碑完成速度,並且在程式碼生成任務中提供了比現有工具製作方法平均提升 9.23%,在代理任務中提升了 10.03%。GATE 展示了自適應演化的力量,在保持高效率的同時,平衡了工具數量、複雜性和功能性。程式碼和資料可在 https://github.com/ayanami2003/GATE 取得。 + +##### **Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation** +2502.14846v1 by Yue Yang, Ajay Patel, Matt Deitke, Tanmay Gupta, Luca Weihs, Andrew Head, Mark Yatskar, Chris Callison-Burch, Ranjay Krishna, Aniruddha Kembhavi, Christopher Clark + +Reasoning about images with rich text, such as charts and documents, is a +critical application of vision-language models (VLMs). However, VLMs often +struggle in these domains due to the scarcity of diverse text-rich +vision-language data. To address this challenge, we present CoSyn, a framework +that leverages the coding capabilities of text-only large language models +(LLMs) to automatically create synthetic text-rich multimodal data. Given input +text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts +an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic +images. With the underlying code as textual representations of the synthetic +images, CoSyn can generate high-quality instruction-tuning data, again relying +on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K +images and 2.7M rows of vision-language instruction-tuning data. Comprehensive +experiments on seven benchmarks demonstrate that models trained on our +synthetic data achieve state-of-the-art performance among competitive +open-source models, including Llama 3.2, and surpass proprietary models such as +GPT-4V and Gemini 1.5 Flash.
Furthermore, CoSyn can produce synthetic pointing +data, enabling VLMs to ground information within input images, showcasing its +potential for developing multimodal agents capable of acting in real-world +environments. + +摘要:透過豐富文字(例如圖表和文件)對影像進行推理,是視覺語言模型 (VLM) 的重要應用。然而,由於多元化文字豐富的視覺語言資料稀少,VLM 在這些領域中經常會遇到困難。為了應對這個挑戰,我們提出了 CoSyn,一個利用純文字大型語言模型 (LLM) 的編碼能力來自動建立合成文字豐富多模態資料的架構。給定描述目標網域的輸入文字(例如「營養成分標籤」),CoSyn 會提示 LLM 產生用於合成影像渲染的程式碼(Python、HTML、LaTeX 等)。透過將底層程式碼作為合成影像的文字表示,CoSyn 可以產生高品質的指令調整資料,再次依賴純文字 LLM。使用 CoSyn,我們建構了一個包含 40 萬張影像和 270 萬列視覺語言指令調整資料的資料集。在七個基準上的全面實驗證明,在我們的合成資料上訓練的模型在競爭對手的開源模型(包括 Llama 3.2)中達到了最先進的效能,並超越了 GPT-4V 和 Gemini 1.5 Flash 等專有模型。此外,CoSyn 可以產生合成指向資料,讓 VLM 能在輸入影像中建立資訊基礎,展示其在開發能夠在真實世界環境中運作的多模態代理方面的潛力。 + +##### **Revealing and Mitigating Over-Attention in Knowledge Editing** +2502.14838v1 by Pinzheng Wang, Zecheng Tang, Keyan Zhou, Juntao Li, Qiaoming Zhu, Min Zhang + +Large Language Models have demonstrated superior performance across a wide +range of tasks, but they still exhibit undesirable errors due to incorrect +knowledge learned from the training data. To avoid this, knowledge editing +methods emerged to precisely edit the specific model knowledge via efficiently +modifying a very small percentage of parameters. However, those methods can +lead to the problem of Specificity Failure, where the existing knowledge and +capabilities are severely degraded due to editing. Our preliminary study +indicates that Specificity Failure +primarily stems from the model's attention heads assigning excessive attention +scores to entities related to the edited knowledge, thereby unduly focusing on +specific snippets within the context, which we denote as the Attention Drift +phenomenon.
To mitigate this Attention Drift issue, we introduce a simple yet +effective method, Selective Attention Drift Restriction (SADR), which introduces +an additional regularization term during the knowledge editing process to +restrict changes in the attention weight distribution, thereby preventing undue +focus on the edited entity. Experiments on five frequently used strong LLMs +demonstrate the effectiveness of our method, where SADR can significantly +mitigate Specificity Failure in the predominant knowledge editing tasks. + +摘要:大型語言模型已在廣泛任務中展現出卓越的效能,但由於從訓練資料中學習到不正確的知識,它們仍會出現令人不滿意的錯誤。為避免此情況,知識編輯方法應運而生,透過有效修改極少數參數來精準編輯特定模型知識。然而,這些方法可能會導致特異性失敗問題,因為現有知識和能力會因編輯而嚴重降低。我們的初步研究表明,特異性失敗主要源於模型的注意力頭將過度注意力分數分配給與已編輯知識相關的實體,從而過度關注文中特定的片段,我們將此現象稱為注意力偏移。為減輕這種注意力偏移問題,我們引入了一個簡單但有效的方法:選擇性注意力偏移限制 (SADR),在知識編輯過程中引入一個額外的正則化項來限制注意力權重分配的變動,從而防止過度關注已編輯實體。在五個經常使用的強大 LLM 上進行的實驗證明了我們方法的有效性,其中 SADR 可以顯著減輕主要知識編輯任務中的特異性失敗。 + +##### **Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs** +2502.14837v1 by Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui + +Multi-head Latent Attention (MLA) is an innovative architecture proposed by +DeepSeek, designed to ensure efficient and economical inference by +significantly compressing the Key-Value (KV) cache into a latent vector. +Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its +variants such as Grouped-Query Attention (GQA) exhibit significant cost +disadvantages. Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA +without pre-training from scratch is both meaningful and challenging.
This +paper proposes the first data-efficient fine-tuning method for transitioning +from MHA to MLA (MHA2MLA), which includes two key components: for partial-RoPE, +we remove RoPE from dimensions of queries and keys that contribute less to the +attention scores; for low-rank approximation, we introduce joint SVD +approximations based on the pre-trained parameters of keys and values. These +carefully designed strategies enable MHA2MLA to recover performance using only +a small fraction (0.3% to 0.6%) of the data, significantly reducing inference +costs while seamlessly integrating with compression techniques such as KV cache +quantization. For example, the KV cache size of Llama2-7B is reduced by 92.19%, +with only a 0.5% drop in LongBench performance. + +摘要:多頭潛在注意力 (MLA) 是 DeepSeek 提出的一種創新架構,旨在通過將鍵值 (KV) 快取大幅壓縮成潛在向量,確保有效率且經濟的推論。與 MLA 相比,採用多頭注意力 (MHA) 及其變體(例如分組查詢注意力 (GQA))的標準 LLM 會出現顯著的成本劣勢。讓訓練完善的 LLM(例如 Llama)能夠快速適應 MLA,而無需從頭開始預訓練,這既有意義又具有挑戰性。本文提出了第一個資料有效微調方法,用於從 MHA 轉換到 MLA (MHA2MLA),其中包含兩個關鍵組成部分:對於部分 RoPE,我們從查詢和鍵的維度中移除對注意力分數貢獻較小的 RoPE;對於低秩近似,我們基於鍵和值的預訓練參數引入聯合 SVD 近似。這些經過仔細設計的策略讓 MHA2MLA 能夠僅使用一小部分資料 (0.3% 至 0.6%) 來恢復效能,大幅降低推論成本,同時與壓縮技術(例如 KV 快取量化)無縫整合。例如,Llama2-7B 的 KV 快取大小減少了 92.19%,而 LongBench 效能僅下降了 0.5%。 + +##### **LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models** +2502.14834v1 by Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li + +Existing Large Vision-Language Models (LVLMs) can process inputs with context +lengths up to 128k visual and text tokens, yet they struggle to generate +coherent outputs beyond 1,000 words. We find that the primary limitation is the +absence of long output examples during supervised fine-tuning (SFT). To tackle +this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158 +examples, each with multiple input images, an instruction, and corresponding +outputs ranging from 0 to 10,000 words.
Moreover, to achieve long outputs that +maintain high fidelity to the input images, we apply Direct Preference +Optimization (DPO) to the SFT model. Given the high cost of collecting human +feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which +breaks long outputs into segments and uses iterative corrections to form +preference pairs with the original outputs. Additionally, we develop +MMLongBench-Write, a benchmark featuring six tasks to evaluate the +long-generation capabilities of VLMs. Our 7B parameter model, trained with +LongWriter-V-22k and IterDPO, achieves impressive performance on this +benchmark, outperforming larger proprietary models like GPT-4o. Code and data: +https://github.com/THU-KEG/LongWriter-V + +摘要:現有的大型視覺語言模型 (LVLMs) 能處理長度達 128k 視覺和文字符號的輸入內容,但卻難以產生超過 1,000 字的連貫輸出。我們發現,主要限制在於監督微調 (SFT) 期間缺少長輸出範例。為了解決此問題,我們引入了 LongWriter-V-22k,這是一個 SFT 資料集,包含 22,158 個範例,每個範例都有多個輸入影像、一個說明和對應的輸出,範圍從 0 到 10,000 字。此外,為了產生與輸入影像高度保真的長輸出,我們對 SFT 模型採用直接偏好最佳化 (DPO)。考量到為長篇輸出(例如 3,000 字)收集人類回饋的成本很高,我們提出 IterDPO,它會將長輸出區分成幾個區塊,並使用反覆修正來形成與原始輸出的偏好配對。此外,我們開發了 MMLongBench-Write,這是一個基準,包含六項任務,用於評估 VLM 的長生成能力。我們的 7B 參數模型使用 LongWriter-V-22k 和 IterDPO 進行訓練,在這個基準上取得令人印象深刻的效能,超越了 GPT-4o 等更大型的專有模型。程式碼和資料:https://github.com/THU-KEG/LongWriter-V + +##### **Improving the Diffusability of Autoencoders** +2502.14831v1 by Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, Aliaksandr Siarohin + +Latent diffusion models have emerged as the leading approach for generating +high-quality images and videos, utilizing compressed latent representations to +reduce the computational burden of the diffusion process. While recent +advancements have primarily focused on scaling diffusion backbones and +improving autoencoder reconstruction quality, the interaction between these +components has received comparatively less attention.
In this work, we perform +a spectral analysis of modern autoencoders and identify inordinate +high-frequency components in their latent spaces, which are especially +pronounced in the autoencoders with a large bottleneck channel size. We +hypothesize that this high-frequency component interferes with the +coarse-to-fine nature of the diffusion synthesis process and hinders the +generation quality. To mitigate the issue, we propose scale equivariance: a +simple regularization strategy that aligns latent and RGB spaces across +frequencies by enforcing scale equivariance in the decoder. It requires minimal +code changes and only up to 20K autoencoder fine-tuning steps, yet +significantly improves generation quality, reducing FID by 19% for image +generation on ImageNet-1K 256x256 and FVD by at least 44% for video generation +on Kinetics-700 17x256x256. + +摘要:潛在擴散模型已成為生成高品質影像和影片的主流方法,利用壓縮潛在表示來降低擴散過程的計算負擔。雖然近期的進展主要集中在擴充擴散主幹並提升自編碼器重建品質,但這些組成之間的交互作用卻鮮少受到關注。在這項研究中,我們對現代自編碼器進行頻譜分析,並在它們的潛在空間中找出不適當的高頻率組成,這在瓶頸通道尺寸較大的自編碼器中特別明顯。我們假設這種高頻率組成會干擾擴散合成過程由粗到細的性質,並阻礙生成品質。為了緩解這個問題,我們提出規模等變性:一種簡單的正則化策略,透過在解碼器中強制執行規模等變性,使潛在空間和 RGB 空間在各個頻率中保持一致。它只需要最小的程式碼變更,且僅需最多 20K 個自編碼器微調步驟,就能顯著提升生成品質,將 ImageNet-1K 256x256 上的影像生成的 FID 降低 19%,並將 Kinetics-700 17x256x256 上的影片生成的 FVD 降低至少 44%。 + +##### **Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs** +2502.14830v1 by Danni Liu, Jan Niehues + +While large language models demonstrate remarkable capabilities in +task-specific applications through fine-tuning, extending these benefits across +diverse languages is essential for broad accessibility. However, effective +cross-lingual transfer is hindered by LLM performance gaps across languages and +the scarcity of fine-tuning data in many languages. Through analysis of LLM +internal representations from over 1,000 language pairs, we discover that +middle layers exhibit the strongest potential for cross-lingual alignment.
+Building on this finding, we propose a middle-layer alignment objective +integrated into task-specific training. Our experiments on slot filling, +machine translation, and structured text generation show consistent +improvements in cross-lingual transfer, especially to lower-resource languages. +The method is robust to the choice of alignment languages and generalizes to +languages unseen during alignment. Furthermore, we show that separately trained +alignment modules can be merged with existing task-specific modules, improving +cross-lingual capabilities without full re-training. Our code is publicly +available (https://github.com/dannigt/mid-align). + +摘要:儘管大型語言模型在特定任務應用中透過微調展現出卓越的能力,但要讓這些好處擴及各種語言,對於廣泛的可及性來說至關重要。然而,有效的跨語言轉移受到跨語言 LLM 效能差距以及許多語言中微調資料的稀少性所阻礙。透過分析來自 1,000 多種語言對的 LLM 內部表示,我們發現中間層展現出最強的跨語言對齊潛力。根據這個發現,我們提出一個整合到特定任務訓練中的中間層對齊目標。我們在插槽填補、機器翻譯和結構化文字生成方面的實驗顯示,跨語言轉移持續改善,特別是對於低資源語言。此方法對於對齊語言的選擇具有穩健性,並推廣到對齊期間未曾見過的語言。此外,我們展示了單獨訓練的對齊模組可以與現有的特定任務模組合併,在不完全重新訓練的情況下改善跨語言能力。我們的程式碼已公開(https://github.com/dannigt/mid-align)。 + +##### **Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps** +2502.14829v1 by Martin Tutek, Fateme Hashemi Chaleshtori, Ana Marasović, Yonatan Belinkov + +When prompted to think step-by-step, language models (LMs) produce a chain of +thought (CoT), a sequence of reasoning steps that the model supposedly used to +produce its prediction. However, despite much work on CoT prompting, it is +unclear if CoT reasoning is faithful to the models' parametric beliefs. We +introduce a framework for measuring parametric faithfulness of generated +reasoning, and propose Faithfulness by Unlearning Reasoning steps (FUR), an +instance of this framework. FUR erases information contained in reasoning steps +from model parameters. We perform experiments unlearning CoTs of four LMs +prompted on four multi-choice question answering (MCQA) datasets.
Our +experiments show that FUR is frequently able to change the underlying models' +prediction by unlearning key steps, indicating when a CoT is parametrically +faithful. Further analysis shows that CoTs generated by models post-unlearning +support different answers, hinting at a deeper effect of unlearning. +Importantly, CoT steps identified as important by FUR do not align well with +human notions of plausibility, emphasizing the need for specialized alignment. + +摘要:当提示逐步思考时,语言模型 (LM) 会产生一系列思考 (CoT),这是模型用来产生预测的一系列推理步骤。然而,尽管在 CoT 提示上做了很多工作,但尚不清楚 CoT 推理是否符合模型的参数化信念。我们引入了一个框架来衡量生成推理的参数化保真度,并提出了通过取消学习推理步骤 (FUR) 的保真度,这是该框架的一个实例。FUR 从模型参数中擦除推理步骤中包含的信息。我们执行实验,取消学习提示在四个多项选择问答 (MCQA) 数据集上的四个 LM 的 CoT。我们的实验表明,FUR 经常能够通过取消学习关键步骤来改变底层模型的预测,表明 CoT 在参数上是保真的。进一步的分析表明,模型在取消学习后生成的 CoT 支持不同的答案,暗示取消学习具有更深层次的影响。重要的是,FUR 确定的 CoT 步骤与人类对合理性的概念不太一致,强调了专门对齐的必要性。 + +##### **Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison** +2502.14827v1 by Aiswarya Baby, Tintu Thankom Koshy + +Visual Question Answering (VQA) has emerged as a pivotal task at the +intersection of computer vision and natural language processing, requiring +models to understand and reason about visual content in response to natural +language questions. Analyzing VQA datasets is essential for developing robust +models that can handle the complexities of multimodal reasoning. Several +approaches have been developed to examine these datasets, each offering +distinct perspectives on question diversity, answer distribution, and +visual-textual correlations. Despite significant progress, existing VQA models +face challenges related to dataset bias, limited model complexity, commonsense +reasoning gaps, rigid evaluation methods, and generalization to real-world +scenarios. This paper presents a comprehensive comparative study of five +advanced VQA models: ABC-CNN, KICNLE, Masked Vision and Language Modeling, +BLIP-2, and OFA, each employing distinct methodologies to address these +challenges.
+ +摘要:視覺問答 (VQA) 已成為電腦視覺與自然語言處理交會中的關鍵任務,要求模型理解和推理視覺內容以回應自然語言問題。分析 VQA 資料集對於開發健全的模型至關重要,這些模型能夠處理多模態推理的複雜性。已經開發出多種方法來檢驗這些資料集,每種方法都提供有關問題多樣性、答案分佈和視覺文本關聯性的不同觀點。儘管有顯著進展,現有的 VQA 模型仍面臨與資料集偏差、模型複雜性有限、常識推理差距、僵化的評估方法和推廣到現實世界場景相關的挑戰。本文對五個先進的 VQA 模型進行了全面的比較研究:ABC-CNN、KICNLE、Masked Vision and Language Modeling、BLIP-2 和 OFA,每個模型都採用不同的方法來應對這些挑戰。 + +##### **eC-Tab2Text: Aspect-Based Text Generation from e-Commerce Product Tables** +2502.14820v1 by Luis Antonio Gutiérrez Guanilo, Mir Tafseer Nayeem, Cristian López, Davood Rafiei + +Large Language Models (LLMs) have demonstrated exceptional versatility across +diverse domains, yet their application in e-commerce remains underexplored due +to a lack of domain-specific datasets. To address this gap, we introduce +eC-Tab2Text, a novel dataset designed to capture the intricacies of e-commerce, +including detailed product attributes and user-specific queries. Leveraging +eC-Tab2Text, we focus on text generation from product tables, enabling LLMs to +produce high-quality, attribute-specific product reviews from structured +tabular data. Fine-tuned models were rigorously evaluated using standard +Table2Text metrics, alongside correctness, faithfulness, and fluency +assessments. Our results demonstrate substantial improvements in generating +contextually accurate reviews, highlighting the transformative potential of +tailored datasets and fine-tuning methodologies in optimizing e-commerce +workflows. This work highlights the potential of LLMs in e-commerce workflows +and the essential role of domain-specific datasets in tailoring them to +industry-specific challenges. 
+ +摘要:大型語言模型 (LLM) 在各種領域展現出非凡的多功能性,但由於缺乏特定領域的資料集,因此它們在電子商務中的應用仍未得到充分探索。為了解決這個差距,我們引入了 eC-Tab2Text,這是一個新穎的資料集,旨在捕捉電子商務的複雜性,包括詳細的產品屬性和使用者特定的查詢。利用 eC-Tab2Text,我們專注於從產品表格中產生文字,使 LLM 能夠從結構化的表格資料中產生高品質、特定屬性的產品評論。微調模型使用標準的 Table2Text 指標,以及正確性、忠實度和流利度評估進行嚴格評估。我們的結果證明在產生符合語境的準確評論方面有顯著的進步,突顯了客製化資料集和微調方法在最佳化電子商務工作流程中的轉型潛力。這項工作突顯了 LLM 在電子商務工作流程中的潛力,以及特定領域資料集在因應產業特定挑戰中至關重要的角色。 + +##### **Optimizing Model Selection for Compound AI Systems** +2502.14815v1 by Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Matei Zaharia, James Zou, Ion Stoica + +Compound AI systems that combine multiple LLM calls, such as self-refine and +multi-agent-debate, achieve strong performance on many AI tasks. We address a +core question in optimizing compound systems: for each LLM call or module in +the system, how should one decide which LLM to use? We show that these LLM +choices have a large effect on quality, but the search space is exponential. We +propose LLMSelector, an efficient framework for model selection in compound +systems, which leverages two key empirical insights: (i) end-to-end performance +is often monotonic in how well each module performs, with all other modules +held fixed, and (ii) per-module performance can be estimated accurately by an +LLM. Building upon these insights, LLMSelector iteratively selects one module +and allocates to it the model with the highest module-wise performance, as +estimated by an LLM, until no further gain is possible. LLMSelector is +applicable to any compound system with a bounded number of modules, and its +number of API calls scales linearly with the number of modules, achieving +high-quality model allocation both empirically and theoretically. Experiments +with popular compound systems such as multi-agent debate and self-refine using +LLMs such as GPT-4o, Claude 3.5 Sonnet and Gemini 1.5 show that LLMSelector +confers 5%-70% accuracy gains compared to using the same LLM for all modules. 
+ +摘要:複合式 AI 系統結合多個 LLM 呼叫,例如自我精煉和多代理辯論,在許多 AI 任務中都能獲得強大的效能。我們解決了最佳化複合式系統中的核心問題:對於系統中的每個 LLM 呼叫或模組,應該如何決定要使用哪個 LLM?我們表明這些 LLM 選擇對品質有很大的影響,但搜尋空間是呈指數增長的。我們提出 LLMSelector,一種用於複合式系統中模型選擇的有效架構,它利用了兩個主要的經驗見解:(i) 端對端效能通常會隨著每個模組執行得有多好而單調變化,而其他所有模組保持固定,以及 (ii) 每個模組的效能都可以由 LLM 精準估計。LLMSelector 建立在這些見解之上,反覆選擇一個模組,並根據 LLM 估計的模組最佳效能,將模型分配給它,直到無法再進一步提升為止。LLMSelector 適用於任何具有有限數量的模組的複合式系統,其 API 呼叫數量與模組數量成線性比例,在經驗和理論上都實現了高品質的模型配置。使用 GPT-4o、Claude 3.5 Sonnet 和 Gemini 1.5 等 LLM,對多代理辯論和自我精煉等熱門複合式系統進行的實驗表明,與對所有模組使用相同的 LLM 相比,LLMSelector 可帶來 5%-70% 的準確度提升。 + +##### **FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis** +2502.14807v1 by Fadillah Maani, Numan Saeed, Tausifa Saleem, Zaid Farooq, Hussain Alasmawi, Werner Diehl, Ameera Mohammad, Gareth Waring, Saudabi Valappi, Leanne Bricker, Mohammad Yaqub + +Foundation models are becoming increasingly effective in the medical domain, +offering pre-trained models on large datasets that can be readily adapted for +downstream tasks. Despite progress, fetal ultrasound images remain a +challenging domain for foundation models due to their inherent complexity, +often requiring substantial additional training and facing limitations due to +the scarcity of paired multimodal data. To overcome these challenges, here we +introduce FetalCLIP, a vision-language foundation model capable of generating +universal representation of fetal ultrasound images. FetalCLIP was pre-trained +using a multimodal learning approach on a diverse dataset of 210,035 fetal +ultrasound images paired with text. This represents the largest paired dataset +of its kind used for foundation model development to date. This unique training +approach allows FetalCLIP to effectively learn the intricate anatomical +features present in fetal ultrasound images, resulting in robust +representations that can be used for a variety of downstream applications. 
In +extensive benchmarking across a range of key fetal ultrasound applications, +including classification, gestational age estimation, congenital heart defect +(CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all +baselines while demonstrating remarkable generalizability and strong +performance even with limited labeled data. We plan to release the FetalCLIP +model publicly for the benefit of the broader scientific community. + +摘要:基礎模型在醫療領域正變得越來越有效, +提供在大型資料集上預先訓練的模型,可輕鬆適應 +下游任務。儘管有進展,但胎兒超音波影像仍然是 +基礎模型的挑戰領域,因為它們固有的複雜性, +通常需要大量的額外訓練,並且由於配對多模態數據的稀缺而面臨限制。為了克服這些挑戰,我們在此 +介紹 FetalCLIP,一種能夠產生 +胎兒超音波影像通用表示的視覺語言基礎模型。FetalCLIP 使用多模態學習方法在包含 210,035 張胎兒 +超音波影像與文字配對的多樣化資料集上進行預訓練。這代表迄今為止用於基礎模型開發的最大配對資料集。這種獨特的訓練 +方法使 FetalCLIP 能夠有效地學習胎兒超音波影像中存在的複雜解剖特徵,從而產生強大的 +表示,可應用於各種下游應用。在涵蓋一系列關鍵胎兒超音波應用(包括分類、胎齡估算、先天性心臟缺陷 +(CHD) 偵測和胎兒結構分割)的廣泛基準測試中,FetalCLIP 在展現出卓越的泛化能力和強勁的 +效能,即使標記資料有限,也優於所有基準。我們計畫公開發布 FetalCLIP 模型,造福廣大的科學界。 + +##### **From RAG to Memory: Non-Parametric Continual Learning for Large Language Models** +2502.14802v1 by Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, Yu Su + +Our ability to continuously acquire, organize, and leverage knowledge is a +key feature of human intelligence that AI systems must approximate to unlock +their full potential. Given the challenges in continual learning with large +language models (LLMs), retrieval-augmented generation (RAG) has become the +dominant way to introduce new information. However, its reliance on vector +retrieval hinders its ability to mimic the dynamic and interconnected nature of +human long-term memory. Recent RAG approaches augment vector embeddings with +various structures like knowledge graphs to address some of these gaps, namely +sense-making and associativity. However, their performance on more basic +factual memory tasks drops considerably below standard RAG. 
We address this +unintended deterioration and propose HippoRAG 2, a framework that outperforms +standard RAG comprehensively on factual, sense-making, and associative memory +tasks. HippoRAG 2 builds upon the Personalized PageRank algorithm used in +HippoRAG and enhances it with deeper passage integration and more effective +online use of an LLM. This combination pushes this RAG system closer to the +effectiveness of human long-term memory, achieving a 7% improvement in +associative memory tasks over the state-of-the-art embedding model while also +exhibiting superior factual knowledge and sense-making memory capabilities. +This work paves the way for non-parametric continual learning for LLMs. Our +code and data will be released at https://github.com/OSU-NLP-Group/HippoRAG. + +摘要:我們持續獲取、組織和利用知識的能力是人類智慧的一項關鍵特徵,而人工智慧系統必須近似於此才能發揮其全部潛力。由於大型語言模型 (LLM) 持續學習的挑戰,檢索增強生成 (RAG) 已成為引入新資訊的主流方式。然而,它依賴向量檢索阻礙了它模擬人類長期記憶的動態和相互連結的本質。最近的 RAG 方法用各種結構(如知識圖譜)增強向量嵌入,以解決其中一些差距,即意義建構和聯想性。然而,它們在更基本的實際記憶任務上的表現遠低於標準 RAG。我們解決了這種意外的惡化,並提出了 HippoRAG 2,這是一個在實際、意義建構和聯想記憶任務上全面優於標準 RAG 的框架。HippoRAG 2 建立在 HippoRAG 中使用的 Personalized PageRank 演算法之上,並透過更深入的段落整合和更有效的 LLM 線上使用來增強它。這種組合將此 RAG 系統推向更接近人類長期記憶的效能,在聯想記憶任務上比最先進的嵌入模型提升了 7%,同時也展現出優異的實際知識和意義建構記憶能力。這項工作為 LLM 的非參數持續學習鋪平了道路。我們的程式碼和資料將在 https://github.com/OSU-NLP-Group/HippoRAG 上發布。 + +##### **A Survey on Text-Driven 360-Degree Panorama Generation** +2502.14799v1 by Hai Wang, Xiaoyu Xiang, Weihao Xia, Jing-Hao Xue + +The advent of text-driven 360-degree panorama generation, enabling the +synthesis of 360-degree panoramic images directly from textual descriptions, +marks a transformative advancement in immersive visual content creation. This +innovation significantly simplifies the traditionally complex process of +producing such content. Recent progress in text-to-image diffusion models has +accelerated the rapid development in this emerging field. 
This survey presents +a comprehensive review of text-driven 360-degree panorama generation, offering +an in-depth analysis of state-of-the-art algorithms and their expanding +applications in 360-degree 3D scene generation. Furthermore, we critically +examine current limitations and propose promising directions for future +research. A curated project page with relevant resources and research papers is +available at https://littlewhitesea.github.io/Text-Driven-Pano-Gen/. + +摘要:文字驅動 360 度全景圖生成技術的出現,使能從文字描述中直接合成 360 度全景圖像,標誌著沉浸式視覺內容創作的變革性進展。這項創新顯著簡化了傳統上複雜的製作此類內容的過程。最近在文字轉圖像擴散模型方面的進展加速了這個新興領域的快速發展。本調查提供了對文字驅動 360 度全景圖生成的全面回顧,深入分析了最先進的演算法及其在 360 度 3D 場景生成中的擴展應用。此外,我們批判性地審視了當前的限制,並提出了未來研究的有希望的方向。一個精選的專案頁面,其中包含相關資源和研究論文,可在 https://littlewhitesea.github.io/Text-Driven-Pano-Gen/ 獲得。 + +##### **Rapid Word Learning Through Meta In-Context Learning** +2502.14791v1 by Wentao Wang, Guangyuan Jiang, Tal Linzen, Brenden M. Lake + +Humans can quickly learn a new word from a few illustrative examples, and +then systematically and flexibly use it in novel contexts. Yet the abilities of +current language models for few-shot word learning, and methods for improving +these abilities, are underexplored. In this study, we introduce a novel method, +Meta-training for IN-context learNing Of Words (Minnow). This method trains +language models to generate new examples of a word's usage given a few +in-context examples, using a special placeholder token to represent the new +word. This training is repeated on many new words to develop a general +word-learning ability. We find that training models from scratch with Minnow on +human-scale child-directed language enables strong few-shot word learning, +comparable to a large language model (LLM) pre-trained on orders of magnitude +more data. 
Furthermore, through discriminative and generative evaluations, we +demonstrate that finetuning pre-trained LLMs with Minnow improves their ability +to discriminate between new words, identify syntactic categories of new words, +and generate reasonable new usages and definitions for new words, based on one +or a few in-context examples. These findings highlight the data efficiency of +Minnow and its potential to improve language model performance in word learning +tasks. + +摘要:人類可以從幾個說明性的範例中快速學習一個新字詞,然後系統性且靈活地將其用於新的脈絡中。然而,目前語言模型在少量字詞學習中的能力,以及改善這些能力的方法,尚未得到充分探討。在這項研究中,我們引入了一種新方法,即「用於字詞情境學習的元訓練」(Minnow)。此方法訓練語言模型在給定幾個情境範例的情況下,產生字詞用法的範例,並使用特殊佔位符標記來表示新的字詞。此訓練會在許多新字詞上重複進行,以培養一般的字詞學習能力。我們發現,從頭開始使用 Minnow 在人類規模的兒童導向語言上訓練模型,可以實現強大的少量字詞學習能力,這與預先在大量資料上訓練的大型語言模型 (LLM) 相當。此外,透過區辨性和生成性評估,我們證明使用 Minnow 微調預先訓練的 LLM 可以提升其區辨新字詞、識別新字詞的句法類別,以及根據一個或幾個情境範例產生合理的新用法和定義的能力。這些發現突顯了 Minnow 的資料效率,以及它在字詞學習任務中提升語言模型效能的潛力。 + +##### **SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features** +2502.14786v1 by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai + +We introduce SigLIP 2, a family of new multilingual vision-language encoders +that build on the success of the original SigLIP. In this second iteration, we +extend the original image-text training objective with several prior, +independently developed techniques into a unified recipe -- this includes +captioning-based pretraining, self-supervised losses (self-distillation, masked +prediction) and online data curation. With these changes, SigLIP 2 models +outperform their SigLIP counterparts at all model scales in core capabilities, +including zero-shot classification, image-text retrieval, and transfer +performance when extracting visual representations for Vision-Language Models +(VLMs). 
Furthermore, the new training recipe leads to significant improvements +on localization and dense prediction tasks. We also train variants which +support multiple resolutions and preserve the input's native aspect ratio. +Finally, we train on a more diverse data-mixture that includes de-biasing +techniques, leading to much better multilingual understanding and improved +fairness. To allow users to trade off inference cost with performance, we +release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), +and g (1B). + +摘要:我們推出了 SigLIP 2,這是一個新的多語言視覺語言編碼器系列,它建立在 SigLIP 的成功基礎上。在這個第二個版本中,我們將原來的圖像文字訓練目標與幾個先前獨立開發的技術擴展到一個統一的配方中,其中包括基於標題的預訓練、自我監督損失(自我蒸餾、遮罩預測)和線上數據策展。有了這些改變,SigLIP 2 模型在所有模型規模上都超越了 SigLIP 的對應模型,包括零次分類、圖像文字檢索和在為視覺語言模型 (VLM) 提取視覺表示時傳輸效能。此外,新的訓練配方也大幅改善了定位和密集預測任務。我們還訓練了支援多種解析度和保留輸入原生長寬比的變體。最後,我們在一個更為多樣化的數據組合上進行訓練,其中包括去偏見技術,從而大幅提升多語言理解力並改善公平性。為了讓使用者權衡推理成本與效能,我們發布了四種大小的模型檢查點:ViT-B (86M)、L (303M)、So400m (400M) 和 g (1B)。 + +##### **ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting** +2502.14780v1 by Abhijit Mishra, Richard Noh, Hsiang Fu, Mingda Li, Minji Kim + +Efficient and privacy-preserving multimodal interaction is essential as AR, +VR, and modern smartphones with powerful cameras become primary interfaces for +human-computer communication. Existing powerful large vision-language models +(VLMs) enabling multimodal interaction often rely on cloud-based processing, +raising significant concerns about (1) visual privacy by transmitting sensitive +vision data to servers, and (2) their limited real-time, on-device usability. +This paper explores Visual Instruction Rewriting, a novel approach that +transforms multimodal instructions into text-only commands, allowing seamless +integration of lightweight on-device instruction rewriter VLMs (250M +parameters) with existing conversational AI systems, enhancing vision data +privacy. 
To achieve this, we present a dataset of over 39,000 examples across +14 domains and develop a compact VLM, pretrained on image captioning datasets +and fine-tuned for instruction rewriting. Experimental results, evaluated +through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic +parsing analysis, demonstrate that even a quantized version of the model +(<500MB storage footprint) can achieve effective instruction rewriting, thus +enabling privacy-focused, multimodal AI applications. + +摘要:高效且重視隱私的多模態互動至關重要,因為 AR、VR 和配備強大相機的現代智慧型手機已成為人機溝通的主要介面。現有的強大大型視覺語言模型 (VLM) 能支援多模態互動,通常仰賴雲端處理,這引發了重大的疑慮,包括:(1) 將敏感的視覺資料傳輸至伺服器,會造成視覺隱私問題,以及 (2) 其有限的即時、裝置上可用性。本文探討視覺指令改寫,這是一種新穎的方法,可將多模態指令轉換為純文字指令,讓輕量級的裝置上指令改寫 VLM (250M 參數) 與現有的對話式 AI 系統無縫整合,進而強化視覺資料的隱私。為達成此目標,我們提供一個跨越 14 個領域、超過 39,000 個範例的資料集,並開發一個精簡的 VLM,在圖片標題資料集上進行預訓練,並針對指令改寫進行微調。實驗結果透過 NLG 指標(例如 BLEU、METEOR 和 ROUGE)以及語意解析分析進行評估,證明即使是模型的量化版本(<500MB 儲存空間佔用量)也能有效執行指令改寫,進而支援注重隱私的多模態 AI 應用程式。 + +##### **Harnessing PDF Data for Improving Japanese Large Multimodal Models** +2502.14778v1 by Jeonghun Baek, Akiko Aizawa, Kiyoharu Aizawa + +Large Multimodal Models (LMMs) have demonstrated strong performance in +English, but their effectiveness in Japanese remains limited due to the lack of +high-quality training data. Current Japanese LMMs often rely on translated +English datasets, restricting their ability to capture Japan-specific cultural +knowledge. To address this, we explore the potential of Japanese PDF data as a +training resource, an area that remains largely underutilized. We introduce a +fully automated pipeline that leverages pretrained models to extract image-text +pairs from PDFs through layout analysis, OCR, and vision-language pairing, +removing the need for manual annotation. Additionally, we construct instruction +data from extracted image-text pairs to enrich the training data. 
To evaluate +the effectiveness of PDF-derived data, we train Japanese LMMs and assess their +performance on the Japanese LMM Benchmark. Our results demonstrate substantial +improvements, with performance gains ranging from 3.9% to 13.8% on Heron-Bench. +Further analysis highlights the impact of PDF-derived data on various factors, +such as model size and language models, reinforcing its value as a multimodal +resource for Japanese LMMs. We plan to make the source code and data publicly +available upon acceptance. + +摘要:大型多模態模型 (LMM) 已在英語中表現出強勁的效能,但由於缺乏高品質的訓練資料,它們在日語中的效能仍然有限。目前的日語 LMM 通常依賴於翻譯後的英語資料集,限制了它們擷取特定於日本的文化知識的能力。為了解決這個問題,我們探索了日語 PDF 資料作為訓練資源的潛力,這個領域在很大程度上仍然未被充分利用。我們引入了一個全自動的管道,利用預先訓練好的模型透過版面分析、光學字元辨識和視覺語言配對從 PDF 中擷取影像文字對,消除了手動註解的需要。此外,我們從擷取的影像文字對中建構說明資料,以豐富訓練資料。為了評估 PDF 衍生資料的效能,我們訓練了日語 LMM,並在日語 LMM 基準上評估它們的效能。我們的結果證明了顯著的進步,在 Heron-Bench 上的效能提升幅度從 3.9% 到 13.8%。進一步的分析重點說明了 PDF 衍生資料對各種因素的影響,例如模型大小和語言模型,加強了其作為日語 LMM 的多模態資源的價值。我們計畫在接受後公開原始程式碼和資料。 + +##### **Making Universal Policies Universal** +2502.14777v1 by Niklas Höpner, David Kuric, Herke van Hoof + +The development of a generalist agent capable of solving a wide range of +sequential decision-making tasks remains a significant challenge. We address +this problem in a cross-agent setup where agents share the same observation +space but differ in their action spaces. Our approach builds on the universal +policy framework, which decouples policy learning into two stages: a +diffusion-based planner that generates observation sequences and an inverse +dynamics model that assigns actions to these plans. We propose a method for +training the planner on a joint dataset composed of trajectories from all +agents. This method offers the benefit of positive transfer by pooling data +from different agents, while the primary challenge lies in adapting shared +plans to each agent's unique constraints. 
We evaluate our approach on the +BabyAI environment, covering tasks of varying complexity, and demonstrate +positive transfer across agents. Additionally, we examine the planner's +generalisation ability to unseen agents and compare our method to traditional +imitation learning approaches. By training on a pooled dataset from multiple +agents, our universal policy achieves an improvement of up to $42.20\%$ in task +completion accuracy compared to a policy trained on a dataset from a single +agent. + +摘要:開發一種能夠解決廣泛順序決策任務的通才代理仍然是一項重大挑戰。我們在跨代理設置中解決這個問題,其中代理共享相同的觀察空間,但在其動作空間中有所不同。我們的做法建立在通用策略框架之上,該框架將策略學習解耦為兩個階段:生成觀察序列的基於擴散的規劃器和將動作分配給這些計劃的逆動態模型。我們提出了一種在由所有代理的軌跡組成的聯合數據集上訓練規劃器的方法。這種方法提供了通過彙總來自不同代理的數據來進行正向傳輸的好處,而主要的挑戰在於將共享計劃適應於每個代理的唯一約束。我們在 BabyAI 環境中評估了我們的做法,涵蓋了不同複雜程度的任務,並展示了跨代理的正向傳輸。此外,我們檢查了規劃器對未見代理的概括能力,並將我們的做法與傳統的模仿學習方法進行了比較。通過在來自多個代理的彙總數據集上進行訓練,我們的通用策略在任務完成準確度方面實現了高達 42.20% 的改進,而從單個代理的數據集上訓練的策略。 + +##### **SurveyX: Academic Survey Automation via Large Language Models** +2502.14776v1 by Xun Liang, Jiawei Yang, Yezhaohui Wang, Chen Tang, Zifan Zheng, Simin Niu, Shichao Song, Hanyu Wang, Bo Tang, Feiyu Xiong, Keming Mao, Zhiyu li + +Large Language Models (LLMs) have demonstrated exceptional comprehension +capabilities and a vast knowledge base, suggesting that LLMs can serve as +efficient tools for automated survey generation. However, recent research +related to automated survey generation remains constrained by some critical +limitations like finite context window, lack of in-depth content discussion, +and absence of systematic evaluation frameworks. Inspired by human writing +processes, we propose SurveyX, an efficient and organized system for automated +survey generation that decomposes the survey composing process into two phases: +the Preparation and Generation phases. 
By innovatively introducing online +reference retrieval, a pre-processing method called AttributeTree, and a +re-polishing process, SurveyX significantly enhances the efficacy of survey +composition. Experimental evaluation results show that SurveyX outperforms +existing automated survey generation systems in content quality (0.259 +improvement) and citation quality (1.76 enhancement), approaching human expert +performance across multiple evaluation dimensions. Examples of surveys +generated by SurveyX are available on www.surveyx.cn. + +摘要:大型語言模型 (LLM) 已展現出卓越的理解能力和廣泛的知識庫,表示 LLM 可作為自動調查生成的有用工具。然而,與自動調查生成相關的最新研究仍受到一些關鍵限制的約束,例如有限的上下文視窗、缺乏深入的內容討論以及系統評估架構的缺失。受到人類寫作過程的啟發,我們提出 SurveyX,這是一個用於自動調查生成的有效且有組織的系統,它將調查組成過程分解為兩個階段:準備和生成階段。透過創新地引入線上參考檢索、一種稱為 AttributeTree 的預處理方法和重新潤飾過程,SurveyX 大幅提升了調查組成的效能。實驗評估結果顯示,SurveyX 在內容品質(提升 0.259)和引用品質(提升 1.76)方面優於現有的自動調查生成系統,在多個評估面向中接近人類專家的表現。由 SurveyX 生成的調查範例可在 www.surveyx.cn 取得。 + +##### **Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning** +2502.14768v1 by Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, Chong Luo + +Inspired by the success of DeepSeek-R1, we explore the potential of +rule-based reinforcement learning (RL) in large reasoning models. To analyze +reasoning dynamics, we use synthetic logic puzzles as training data due to +their controllable complexity and straightforward answer verification. We make +some key technical contributions that lead to effective and stable RL training: +a system prompt that emphasizes the thinking and answering process, a stringent +format reward function that penalizes outputs for taking shortcuts, and a +straightforward training recipe that achieves stable convergence. Our 7B model +develops advanced reasoning skills, such as reflection, verification, and +summarization, that are absent from the logic corpus. 
Remarkably, after training +on just 5K logic problems, it demonstrates generalization abilities to the +challenging math benchmarks AIME and AMC. + +摘要:在 DeepSeek-R1 成功案例的启发下,我们探索了基于规则的强化学习 (RL) 在大型推理模型中的潜力。为了分析推理动态,我们使用合成逻辑难题作为训练数据,因为它们的可控复杂性和直接的答案验证。我们做出了一些关键的技术贡献,这些贡献导致了有效且稳定的 RL 训练:一个强调思考和回答过程的系统提示、一个严格的格式奖励函数,用于惩罚采取捷径的输出,以及一个实现稳定收敛的直接训练配方。我们的 7B 模型发展了高级推理技能,例如反射、验证和总结,这些技能在逻辑语料库中是不存在的。值得注意的是,在仅对 5K 个逻辑问题进行训练后,它展示了对具有挑战性的数学基准 AIME 和 AMC 的泛化能力。 + +##### **Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis** +2502.14767v1 by Priyanka Kargupta, Ishika Agarwal, Tal August, Jiawei Han + +With the exponential growth of research facilitated by modern technology and +improved accessibility, scientific discoveries have become increasingly +fragmented within and across fields. This makes it challenging to assess the +significance, novelty, incremental findings, and equivalent ideas between +related works, particularly those from different research communities. Large +language models (LLMs) have recently demonstrated strong quantitative and +qualitative reasoning abilities, and multi-agent LLM debates have shown promise +in handling complex reasoning tasks by exploring diverse perspectives and +reasoning paths. Inspired by this, we introduce Tree-of-Debate (ToD), a +framework which converts scientific papers into LLM personas that debate their +respective novelties. To emphasize structured, critical reasoning rather than +focusing solely on outcomes, ToD dynamically constructs a debate tree, enabling +fine-grained analysis of independent novelty arguments within scholarly +articles. Through experiments on scientific literature across various domains, +evaluated by expert researchers, we demonstrate that ToD generates informative +arguments, effectively contrasts papers, and supports researchers in their +literature review. 
+ +摘要:隨著現代科技促進的研究呈指數成長,加上可近性的提升,科學發現已在各領域內外變得越來越分散。這使得評估相關作品之間的重要性、新穎性、漸進式發現和等價概念變得具有挑戰性,特別是來自不同研究社群的作品。大型語言模型 (LLM) 近期已展現出強大的量化和質化推理能力,而多重代理 LLM 辯論已在處理複雜推理任務方面展現出潛力,方法是探索不同的觀點和推理路徑。受到此啟發,我們引入了辯論樹 (ToD),這是一個將科學論文轉換為 LLM 人格的架構,這些人格會辯論各自的新穎性。為了強調結構化、批判性推理,而非僅專注於結果,ToD 會動態建構一個辯論樹,讓使用者能夠深入分析學術文章中獨立的新穎性論點。透過在不同領域的科學文獻上進行實驗,並由專家研究員進行評估,我們證明了 ToD 能產生有見地的論點、有效對比論文,並在研究人員的文獻回顧中提供協助。 + +##### **Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning** +2502.14765v1 by Juraj Vladika, Ivana Hacajová, Florian Matthes + +Fact verification (FV) aims to assess the veracity of a claim based on +relevant evidence. The traditional approach for automated FV includes a +three-part pipeline relying on short evidence snippets and encoder-only +inference models. More recent approaches leverage the multi-turn nature of LLMs +to address FV as a step-by-step problem where questions inquiring additional +context are generated and answered until there is enough information to make a +decision. This iterative method makes the verification process rational and +explainable. While these methods have been tested for encyclopedic claims, +exploration on domain-specific and realistic claims is missing. In this work, +we apply an iterative FV system on three medical fact-checking datasets and +evaluate it with multiple settings, including different LLMs, external web +search, and structured reasoning using logic predicates. We demonstrate +improvements in the final performance over traditional approaches and the high +potential of step-by-step FV systems for domain-specific claims. 
+ +摘要:事實驗證 (FV) 旨在根據相關證據評估主張的真實性。自動化 FV 的傳統方法包括依賴於短證據片段和僅編碼器推論模型的三部分管道。最近的方法利用 LLM 的多輪特性,將 FV 視為一個逐步問題,其中會產生問題來詢問額外背景並回答,直到有足夠的資訊可以做出決定。這種迭代方法使驗證過程合理且可解釋。雖然這些方法已針對百科全書式主張進行測試,但缺乏對特定領域和現實主張的探討。在這項工作中,我們在三個醫學事實查核資料集上應用了一個迭代 FV 系統,並使用多種設定對其進行評估,包括不同的 LLM、外部網路搜尋和使用邏輯謂詞的結構化推理。我們展示了傳統方法的最終效能改進,以及逐步 FV 系統對特定領域主張的高潛力。 + +##### **EquivaMap: Leveraging LLMs for Automatic Equivalence Checking of Optimization Formulations** +2502.14760v1 by Haotian Zhai, Connor Lawless, Ellen Vitercik, Liu Leqi + +A fundamental problem in combinatorial optimization is identifying equivalent +formulations, which can lead to more efficient solution strategies and deeper +insights into a problem's computational complexity. The need to automatically +identify equivalence between problem formulations has grown as optimization +copilots--systems that generate problem formulations from natural language +descriptions--have proliferated. However, existing approaches to checking +formulation equivalence lack grounding, relying on simple heuristics which are +insufficient for rigorous validation. Inspired by Karp reductions, in this work +we introduce quasi-Karp equivalence, a formal criterion for determining when +two optimization formulations are equivalent based on the existence of a +mapping between their decision variables. We propose EquivaMap, a framework +that leverages large language models to automatically discover such mappings, +enabling scalable and reliable equivalence verification. To evaluate our +approach, we construct the first open-source dataset of equivalent optimization +formulations, generated by applying transformations such as adding slack +variables or valid inequalities to existing formulations. Empirically, +EquivaMap significantly outperforms existing methods, achieving substantial +improvements in correctly identifying formulation equivalence. 
+ +摘要:組合優化中的基本問題在於識別等效公式,這可能導致更有效的解決策略,並更深入地了解問題的計算複雜性。隨著優化輔助系統(從自然語言描述中產生問題公式的系統)的普及,自動識別問題公式之間等價性的需求也隨之增加。然而,現有的公式等價性檢查方法缺乏依據,依賴於簡單的啟發法,而這對於嚴格驗證來說是不夠的。受 Karp 遞減啟發,我們在這項工作中引入了準 Karp 等價性,這是一個正式標準,用於根據決策變數之間的映射存在性來確定兩個優化公式何時等效。我們提出了 EquivaMap,一個利用大型語言模型自動發現此類映射的框架,實現可擴充且可靠的等價性驗證。為了評估我們的做法,我們構建了第一個等效優化公式的開源資料集,該資料集是通過對現有公式套用轉換(例如添加鬆弛變數或有效不等式)產生的。根據經驗,EquivaMap 明顯優於現有方法,在正確識別公式等價性方面取得了顯著進展。 + +##### **On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation Systems** +2502.14759v1 by Juraj Vladika, Florian Matthes + +Retrieval-augmented generation (RAG) has emerged as an approach to augment +large language models (LLMs) by reducing their reliance on static knowledge and +improving answer factuality. RAG retrieves relevant context snippets and +generates an answer based on them. Despite its increasing industrial adoption, +systematic exploration of RAG components is lacking, particularly regarding the +ideal size of provided context, and the choice of base LLM and retrieval +method. To help guide development of robust RAG systems, we evaluate various +context sizes, BM25 and semantic search as retrievers, and eight base LLMs. +Moving away from the usual RAG evaluation with short answers, we explore the +more challenging long-form question answering in two domains, where a good +answer has to utilize the entire context. Our findings indicate that final QA +performance improves steadily with up to 15 snippets but stagnates or declines +beyond that. Finally, we show that the general-purpose LLMs that excel in the +biomedical domain differ from those that excel in the encyclopedic one, and +that open-domain evidence retrieval in large corpora is challenging. 
+ +摘要:檢索增強生成 (RAG) 已成為一種方法,可透過減少大型語言模型 (LLM) 對靜態知識的依賴,並改善答案的真實性,來增強大型語言模型 (LLM)。RAG 會擷取相關的內容片段,並根據這些片段產生答案。儘管其產業採用率不斷提高,但缺乏對 RAG 組成的系統性探討,特別是在提供的內容的理想大小,以及基礎 LLM 和檢索方法的選擇方面。為了協助引導穩健 RAG 系統的開發,我們評估了各種內容大小、BM25 和語意搜尋作為檢索器,以及八個基礎 LLM。我們不再使用簡短答案進行常見的 RAG 評估,而是探討在兩個領域中更具挑戰性的長篇問答,其中一個好的答案必須利用整個內容。我們的研究結果指出,最終的問答效能會隨著多達 15 個片段而穩定提升,但在超過這個數量後就會停滯或下降。最後,我們表明不同的通用 LLM 在生物醫學領域比百科全書領域更為出色,而且在大型語料庫中進行開放領域證據檢索具有挑戰性。 + +##### **MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders** +2502.14753v1 by Maya Varma, Ashwin Kumar, Rogier van der Sluijs, Sophie Ostmeier, Louis Blankemeier, Pierre Chambon, Christian Bluethgen, Jip Prince, Curtis Langlotz, Akshay Chaudhari + +Medical images are acquired at high resolutions with large fields of view in +order to capture fine-grained features necessary for clinical decision-making. +Consequently, training deep learning models on medical images can incur large +computational costs. In this work, we address the challenge of downsizing +medical images in order to improve downstream computational efficiency while +preserving clinically-relevant features. We introduce MedVAE, a family of six +large-scale 2D and 3D autoencoders capable of encoding medical images as +downsized latent representations and decoding latent representations back to +high-resolution images. We train MedVAE autoencoders using a novel two-stage +training approach with 1,052,730 medical images. Across diverse tasks obtained +from 20 medical image datasets, we demonstrate that (1) utilizing MedVAE latent +representations in place of high-resolution images when training downstream +models can lead to efficiency benefits (up to 70x improvement in throughput) +while simultaneously preserving clinically-relevant features and (2) MedVAE can +decode latent representations back to high-resolution images with high +fidelity. 
Our work demonstrates that large-scale, generalizable autoencoders +can help address critical efficiency challenges in the medical domain. Our code +is available at https://github.com/StanfordMIMI/MedVAE. + +摘要:医学影像以高解析度和广阔的视野获取,以便捕捉临床决策所需的细微特征。因此,在医学影像上训练深度学习模型可能会产生巨大的计算成本。在这项工作中,我们解决了缩小医学影像以提高下游计算效率同时保留临床相关特征的挑战。我们介绍了 MedVAE,这是一个由六个大型 2D 和 3D 自动编码器组成的系列,能够将医学影像编码为缩小的潜在表示,并将潜在表示解码回高分辨率影像。我们使用一种新颖的两阶段训练方法,利用 1,052,730 张医学影像来训练 MedVAE 自动编码器。在从 20 个医学影像数据集获得的不同任务中,我们证明了 (1) 在训练下游模型时,利用 MedVAE 潜在表示代替高分辨率影像可以带来效率优势(吞吐量提高高达 70 倍),同时保留临床相关特征;(2) MedVAE 可以将潜在表示解码回高分辨率影像,且保真度高。我们的工作表明,大规模、可推广的自动编码器可以帮助解决医学领域的重大效率挑战。我们的代码可在 https://github.com/StanfordMIMI/MedVAE 获得。 + +##### **TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators** +2502.14752v1 by Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, Zhiyuan Liu, Maosong Sun + +Triton, a high-level Python-like language designed for building efficient GPU +kernels, is widely adopted in deep learning frameworks due to its portability, +flexibility, and accessibility. However, programming and parallel optimization +still require considerable trial and error from Triton developers. Despite +advances in large language models (LLMs) for conventional code generation, +these models struggle to generate accurate, performance-optimized Triton code, +as they lack awareness of its specifications and the complexities of GPU +programming. More critically, there is an urgent need for systematic +evaluations tailored to Triton. In this work, we introduce TritonBench, the +first comprehensive benchmark for Triton operator generation. TritonBench +features two evaluation channels: a curated set of 184 real-world operators +from GitHub and a collection of operators aligned with PyTorch interfaces. 
+Unlike conventional code benchmarks prioritizing functional correctness, +TritonBench also profiles efficiency performance on widely deployed GPUs +aligned with industry applications. Our study reveals that current +state-of-the-art code LLMs struggle to generate efficient Triton operators, +highlighting a significant gap in high-performance code generation. TritonBench +will be available at https://github.com/thunlp/TritonBench. + +摘要:Triton 是一種高階的類 Python 語言,專門用於建構高效的 GPU 核心,由於其可移植性、靈活性及可存取性,已廣泛採用於深度學習框架中。然而,編程和並行最佳化仍需要 Triton 開發人員進行大量的試驗和錯誤。儘管大型語言模型 (LLM) 在傳統程式碼產生方面取得了進展,但這些模型在產生準確且效能最佳化的 Triton 程式碼時仍面臨困難,因為它們缺乏對其規格和 GPU 編程複雜性的認識。更重要的是,迫切需要針對 Triton 量身打造的系統性評估。在這項工作中,我們介紹 TritonBench,這是第一個針對 Triton 算子產生進行全面評比的基準。TritonBench 具有兩個評估管道:一組來自 GitHub 的 184 個真實世界算子,以及一組與 PyTorch 介面對齊的算子。與優先考慮功能正確性的傳統程式碼基準不同,TritonBench 還剖析了與產業應用對齊的廣泛部署 GPU 上的效能表現。我們的研究表明,目前最先進的程式碼 LLM 難以產生高效的 Triton 算子,突顯了高性能程式碼產生中的重大差距。TritonBench 將在 https://github.com/thunlp/TritonBench 提供。 + +##### **Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of LLMs** +2502.14748v1 by Zongxia Li, Lorena Calvo-Bartolomé, Alexander Hoyle, Paiheng Xu, Alden Dima, Juan Francisco Fung, Jordan Boyd-Graber + +A common use of NLP is to facilitate the understanding of large document +collections, with a shift from using traditional topic models to Large Language +Models. Yet the effectiveness of using LLM for large corpus understanding in +real-world applications remains under-explored. This study measures the +knowledge users acquire with unsupervised, supervised LLM-based exploratory +approaches or traditional topic models on two datasets. While LLM-based methods +generate more human-readable topics and show higher average win probabilities +than traditional models for data exploration, they produce overly generic +topics for domain-specific datasets that do not easily allow users to learn +much about the documents. 
Adding human supervision to the LLM generation +process improves data exploration by mitigating hallucination and +over-genericity but requires greater human effort. In contrast, traditional +models like Latent Dirichlet Allocation (LDA) remain effective for exploration +but are less user-friendly. We show that LLMs struggle to describe the haystack +of large corpora without human help, particularly domain-specific data, and +face scaling and hallucination limitations due to context length constraints. +Dataset available at https://huggingface.co/datasets/zli12321/Bills. + +摘要:NLP 的常見用途是促進對大型文件集合的理解,從使用傳統主題模型轉向大型語言模型。然而,在現實世界的應用中使用 LLM 了解大型語料庫的有效性仍未得到充分探索。本研究衡量了使用者在兩個資料集上使用無監督、監督的基於 LLM 的探索性方法或傳統主題模型獲得的知識。雖然基於 LLM 的方法會產生更多人類可讀的主題,並且顯示出比傳統模型更高的平均獲勝機率,但它們會為特定領域的資料集產生過於通用的主題,而這些主題不容易讓使用者對文件有深入了解。在 LLM 生成過程中加入人類監督可透過減輕幻覺和過度泛化來改善資料探索,但需要更多的人力。相反地,傳統模型(如潛在狄利克雷配置 (LDA))仍然有效於探索,但使用者友善度較低。我們表明,LLM 難以在沒有人類幫助的情況下描述大型語料庫的乾草堆,特別是特定領域的資料,並且會因上下文長度限制而面臨擴充性和幻覺限制。資料集可於 https://huggingface.co/datasets/zli12321/Bills 取得。 + +##### **HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States** +2502.14744v1 by Yilei Jiang, Xinyan Gao, Tianshuo Peng, Yingshui Tan, Xiaoyong Zhu, Bo Zheng, Xiangyu Yue + +The integration of additional modalities increases the susceptibility of +large vision-language models (LVLMs) to safety risks, such as jailbreak +attacks, compared to their language-only counterparts. While existing research +primarily focuses on post-hoc alignment techniques, the underlying safety +mechanisms within LVLMs remain largely unexplored. In this work, we +investigate whether LVLMs inherently encode safety-relevant signals within +their internal activations during inference. Our findings reveal that LVLMs +exhibit distinct activation patterns when processing unsafe prompts, which can +be leveraged to detect and mitigate adversarial inputs without requiring +extensive fine-tuning. 
Building on this insight, we introduce HiddenDetect, a +novel tuning-free framework that harnesses internal model activations to +enhance safety. Experimental results show that HiddenDetect surpasses +state-of-the-art methods in detecting jailbreak attacks against LVLMs. By +utilizing intrinsic safety-aware patterns, our method provides an efficient and +scalable solution for strengthening LVLM robustness against multimodal threats. +Our code will be released publicly at +https://github.com/leigest519/HiddenDetect. + +摘要:整合其他模态会增加大型视觉语言模型 (LVLMs) 对安全风险的敏感性,例如越狱攻击,与仅语言的对应模型相比。虽然现有的研究主要集中于事后对齐技术,但 LVLMs 内部的基本安全机制在很大程度上仍未得到探索。在这项工作中,我们调查了 LVLMs 在推理过程中是否在其内部激活中固有地编码了与安全相关的信号。我们的研究结果表明,LVLMs 在处理不安全提示时表现出不同的激活模式,这可以用来检测和缓解对抗性输入,而无需进行广泛的微调。基于这一见解,我们引入了 HiddenDetect,这是一个新颖的无调优框架,利用内部模型激活来增强安全性。实验结果表明,HiddenDetect 在检测针对 LVLMs 的越狱攻击方面超越了最先进的方法。通过利用内在的安全感知模式,我们的方法为加强 LVLM 对多模态威胁的鲁棒性提供了一种高效且可扩展的解决方案。我们的代码将在 https://github.com/leigest519/HiddenDetect 公开发布。 + +##### **Multi-Agent Coordination across Diverse Applications: A Survey** +2502.14743v1 by Lijun Sun, Yijun Yang, Qiqi Duan, Yuhui Shi, Chao Lyu, Yu-Cheng Chang, Chin-Teng Lin, Yang Shen + +Multi-agent coordination studies the underlying mechanism enabling the +trending spread of diverse multi-agent systems (MAS) and has received +increasing attention, driven by the expansion of emerging applications and +rapid AI advances. This survey outlines the current state of coordination +research across applications through a unified understanding that answers four +fundamental coordination questions: (1) what is coordination; (2) why +coordination; (3) who to coordinate with; and (4) how to coordinate. Our +purpose is to explore existing ideas and expertise in coordination and their +connections across diverse applications, while identifying and highlighting +emerging and promising research directions. First, general coordination +problems that are essential to varied applications are identified and analyzed.
+Second, a number of MAS applications are surveyed, ranging from widely studied +domains, e.g., search and rescue, warehouse automation and logistics, and +transportation systems, to emerging fields including humanoid and +anthropomorphic robots, satellite systems, and large language models (LLMs). +Finally, open challenges about the scalability, heterogeneity, and learning +mechanisms of MAS are analyzed and discussed. In particular, we identify the +hybridization of hierarchical and decentralized coordination, human-MAS +coordination, and LLM-based MAS as promising future directions. + +摘要:多智能體協調研究探討了促成各種多智能體系統 (MAS) 流行擴散的底層機制,並隨著新興應用擴展和 AI 快速進展而受到越來越多的關注。這項調查透過統一的理解來概述協調研究的現狀,回答了四個基本的協調問題:(1) 什麼是協調;(2) 為什麼協調;(3) 與誰協調;以及 (4) 如何協調。我們的目的是探索協調中現有的想法和專業知識,以及它們在不同應用中的關聯,同時找出並強調新興且有前景的研究方向。首先,找出並分析了對各種應用至關重要的協調問題。其次,調查了許多 MAS 應用,範圍從廣泛研究的領域(例如搜尋和救援、倉庫自動化和物流,以及運輸系統),到新興領域,包括人形機器人和擬人機器人、衛星系統和大語言模型 (LLM)。最後,分析並討論了有關 MAS 的可擴充性、異質性和學習機制的開放挑戰。特別是,我們將分層協調和分散式協調、人類-MAS 協調和基於 LLM 的 MAS 的混合視為有前景的未來方向。 + +##### **YOLOv12: A Breakdown of the Key Architectural Features** +2502.14740v1 by Mujadded Al Rabbani Alif, Muhammad Hussain + +This paper presents an architectural analysis of YOLOv12, a significant +advancement in single-stage, real-time object detection building upon the +strengths of its predecessors while introducing key improvements. The model +incorporates an optimised backbone (R-ELAN), 7x7 separable convolutions, and +FlashAttention-driven area-based attention, improving feature extraction, +enhanced efficiency, and robust detections. With multiple model variants, +similar to its predecessors, YOLOv12 offers scalable solutions for both +latency-sensitive and high-accuracy applications. Experimental results manifest +consistent gains in mean average precision (mAP) and inference speed, making +YOLOv12 a compelling choice for applications in autonomous systems, security, +and real-time analytics. 
By achieving an optimal balance between computational +efficiency and performance, YOLOv12 sets a new benchmark for real-time computer +vision, facilitating deployment across diverse hardware platforms, from edge +devices to high-performance clusters. + +摘要:本文提出 YOLOv12 的架構分析,這是在單階段即時物件偵測領域的重大進展,它建立在前任的優勢之上,同時引入了關鍵改進。該模型結合了最佳化的主幹 (R-ELAN)、7x7 可分離卷積和 FlashAttention 驅動的基於區域的注意力,改進了特徵提取、增強了效率和穩健的偵測。與其前身類似,YOLOv12 具有多種模型變體,為低延遲敏感型和高準確度應用程式提供了可擴充的解決方案。實驗結果顯示在平均準確度 (mAP) 和推論速度方面都有顯著的提升,這使得 YOLOv12 成為自動化系統、安全性和即時分析應用程式的理想選擇。透過在運算效率和效能之間取得最佳平衡,YOLOv12 為即時電腦視覺樹立了新的基準,促進了在各種硬體平台(從邊緣裝置到高性能叢集)上的部署。 + +##### **SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines** +2502.14739v1 by M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixing Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming Xu, Zhenzhu Yang, Zekun Moore Wang, Junting Zhou, Yuelin Bai, Xingyuan Bu, Chenglin Cai, Liang Chen, Yifan Chen, Chengtuo Cheng, Tianhao Cheng, Keyi Ding, Siming Huang, Yun Huang, Yaoru Li, Yizhe Li, Zhaoqun Li, Tianhao Liang, Chengdong Lin, Hongquan Lin, Yinghao Ma, Zhongyuan Peng, Zifan Peng, Qige Qi, Shi Qiu, Xingwei Qu, Yizhou Tan, Zili Wang, Chenqing Wang, Hao Wang, Yiya Wang, Yubo Wang, Jiajun Xu, Kexin Yang, Ruibin Yuan, Yuanhao Yue, Tianyang Zhan, Chun Zhang, Jingyang Zhang, Xiyue Zhang, Xingjian Zhang, Yue Zhang, Yongchi Zhao, Xiangyu Zheng, Chenghua Zhong, Yang Gao, Zhoujun Li, Dayiheng Liu, Qian Liu, Tianyu Liu, Shiwen Ni, Junran Peng, Yujia Qin, Wenbo Su, Guoyin Wang, Shi Wang, Jian Yang, Min Yang, Meng Cao, Xiang Yue, Zhaoxiang Zhang, Wangchunshu Zhou, Jiaheng Liu, Qunshu Lin, Wenhao Huang, Ge Zhang + +Large language models (LLMs) have demonstrated remarkable proficiency in +mainstream academic disciplines such as mathematics, 
physics, and computer +science. However, human knowledge encompasses over 200 specialized disciplines, +far exceeding the scope of existing benchmarks. The capabilities of LLMs in +many of these specialized fields, particularly in light industry, agriculture, +and service-oriented disciplines, remain inadequately evaluated. To address this +gap, we present SuperGPQA, a comprehensive benchmark that evaluates +graduate-level knowledge and reasoning capabilities across 285 disciplines. Our +benchmark employs a novel Human-LLM collaborative filtering mechanism to +eliminate trivial or ambiguous questions through iterative refinement based on +both LLM responses and expert feedback. Our experimental results reveal +significant room for improvement in the performance of current state-of-the-art +LLMs across diverse knowledge domains (e.g., the reasoning-focused model +DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting +the considerable gap between current model capabilities and artificial general +intelligence. Additionally, we present comprehensive insights from our +management of a large-scale annotation process, involving over 80 expert +annotators and an interactive Human-LLM collaborative system, offering valuable +methodological guidance for future research initiatives of comparable scope.
+ +摘要:大型語言模型 (LLM) 已展現出在主流學術領域(如數學、物理和電腦科學)的卓越能力。然而,人類知識包含超過 200 個專業領域,遠遠超過現有基準的範圍。LLM 在許多這些專業領域(特別是在輕工業、農業和服務導向領域)的能力仍未得到充分評估。為了解決這個差距,我們提出了 SuperGPQA,這是一個綜合基準,用於評估 285 個領域的研究生級知識和推理能力。我們的基準採用新穎的人類-LLM 協同過濾機制,透過基於 LLM 回應和專家回饋的迭代改進,來消除瑣碎或模稜兩可的問題。我們的實驗結果顯示,當前最先進的 LLM 在不同知識領域的表現仍有很大的改進空間(例如,以推理為重點的模型 DeepSeek-R1 在 SuperGPQA 上達到了 61.82% 的最高準確度),突顯了當前模型能力與人工通用智慧之間的巨大差距。此外,我們從管理大型註釋過程(涉及 80 多位專家註釋者和一個互動式人類-LLM 協作系統)中提出了全面的見解,為未來具有可比規模的研究計畫提供了寶貴的方法論指導。 + +##### **EAGER-LLM: Enhancing Large Language Models as Recommenders through Exogenous Behavior-Semantic Integration** +2502.14735v1 by Minjie Hong, Yan Xia, Zehan Wang, Jieming Zhu, Ye Wang, Sihang Cai, Xiaoda Yang, Quanyu Dai, Zhenhua Dong, Zhimeng Zhang, Zhou Zhao + +Large language models (LLMs) are increasingly leveraged as foundational +backbones in the development of advanced recommender systems, offering enhanced +capabilities through their extensive knowledge and reasoning. Existing +LLM-based recommender systems (RSs) often face challenges due to the +significant differences between the linguistic semantics of pre-trained LLMs +and the collaborative semantics essential for RSs. These systems use +pre-trained linguistic semantics but learn collaborative semantics from scratch +via the LLM backbone. However, LLMs are not designed for recommendations, +leading to inefficient collaborative learning, weak result correlations, and +poor integration of traditional RS features. To address these challenges, we +propose EAGER-LLM, a decoder-only LLM-based generative recommendation framework +that integrates endogenous and exogenous behavioral and semantic information in +a non-intrusive manner.
Specifically, we propose 1) dual-source knowledge-rich +item indices that integrate indexing sequences for exogenous signals, enabling +efficient link-wide processing; 2) non-invasive multiscale alignment +reconstruction tasks that guide the model toward a deeper understanding of both +collaborative and semantic signals; 3) an annealing adapter designed to finely +balance the model's recommendation performance with its comprehension +capabilities. We demonstrate EAGER-LLM's effectiveness through rigorous testing +on three public benchmarks. + +摘要:大型語言模型(LLM)正日益被用作先進推薦系統開發中的基礎主幹,透過其廣泛的知識和推理能力提供增強功能。現有的基於 LLM 的推薦系統(RS)通常會因為預先訓練的 LLM 語言語義與 RS 必備的協作語義之間的顯著差異而面臨挑戰。這些系統使用預先訓練的語言語義,但透過 LLM 主幹從頭學習協作語義。然而,LLM 並非專為推薦而設計,導致協作學習效率低落、結果關聯性薄弱,以及與傳統 RS 功能整合不佳。為了應對這些挑戰,我們提出 EAGER-LLM,這是一種僅解碼器、基於 LLM 的生成推薦架構,能以非侵入性方式整合內生和外生行為和語義資訊。具體來說,我們提出 1) 雙來源、知識豐富的項目索引,它整合了外生訊號的索引序列,實現了高效的鏈路廣泛處理;2) 非侵入式多尺度對齊重建任務引導模型更深入地理解協作和語義訊號;3) 退火適配器旨在精細地平衡模型的推薦效能與其理解能力。我們透過在三個公共基準上的嚴格測試證明了 EAGER-LLM 的有效性。 + +##### **Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models** +2502.14734v1 by Hongji Li, Andrianos Michail, Reto Gubelmann, Simon Clematide, Juri Opitz + +We propose the Sentence Smith framework that enables controlled and specified +manipulation of text meaning. It consists of three main steps: 1. Parsing a +sentence into a semantic graph, 2. Applying human-designed semantic +manipulation rules, and 3. Generating text from the manipulated graph. A final +filtering step (4.) ensures the validity of the applied transformation. To +demonstrate the utility of Sentence Smith in an application study, we use it to +generate hard negative pairs that challenge text embedding models.
Since the +controllable generation makes it possible to clearly isolate different types of +semantic shifts, we can gain deeper insights into the specific strengths and +weaknesses of widely used text embedding models, also addressing an issue in +current benchmarking where linguistic phenomena remain opaque. Human validation +confirms that the generations produced by Sentence Smith are highly accurate. + +摘要:我們提出 Sentence Smith 框架,它能控制並指定文本含義的處理。它包含三個主要步驟:1. 將句子解析成語義圖形,2. 套用人為設計的語義處理規則,3. 從處理過的圖形生成文本。最後的過濾步驟 (4.) 確保套用轉換的有效性。為了在應用研究中展示 Sentence Smith 的效用,我們使用它來產生挑戰文本嵌入模型的困難負面對。由於可控生成能清楚地隔離不同類型的語義轉移,我們能更深入地了解廣泛使用的文本嵌入模型的具體優點和缺點,同時也解決了語言現象在當前基準測試中仍然不透明的問題。人為驗證確認 Sentence Smith 產生的生成高度準確。 + +##### **WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models** +2502.14727v1 by Yifu Chen, Shengpeng Ji, Haoxiao Wang, Ziqing Wang, Siyu Chen, Jinzheng He, Jin Xu, Zhou Zhao + +Retrieval Augmented Generation (RAG) has gained widespread adoption owing to +its capacity to empower large language models (LLMs) to integrate external +knowledge. However, existing RAG frameworks are primarily designed for +text-based LLMs and rely on Automatic Speech Recognition to process speech +input, which discards crucial audio information, risks transcription errors, +and increases computational overhead. Therefore, we introduce WavRAG, the first +retrieval augmented generation framework with native, end-to-end audio support. +WavRAG offers two key features: 1) Bypassing ASR, WavRAG directly processes raw +audio for both embedding and retrieval. 2) WavRAG integrates audio and text +into a unified knowledge representation. Specifically, we propose the +WavRetriever to facilitate the retrieval from a text-audio hybrid knowledge +base, and further enhance the in-context capabilities of spoken dialogue models +through the integration of chain-of-thought reasoning. 
In comparison to +state-of-the-art ASR-Text RAG pipelines, WavRAG achieves comparable retrieval +performance while delivering a 10x acceleration. Furthermore, WavRAG's unique +text-audio hybrid retrieval capability extends the boundaries of RAG to the +audio modality. + +摘要:檢索增強生成 (RAG) 因其賦能大型語言模型 (LLM) 整合外部知識的能力而獲得廣泛採用。然而,現有的 RAG 框架主要設計用於基於文字的 LLM,並依賴自動語音辨識處理語音輸入,這會捨棄重要的音訊資訊、有轉錄錯誤的風險,並增加運算負擔。因此,我們引入了 WavRAG,這是第一個具備原生端對端音訊支援的檢索增強生成框架。WavRAG 提供兩個主要功能:1) 繞過 ASR,WavRAG 直接處理原始音訊以進行嵌入和檢索。2) WavRAG 將音訊和文字整合到統一的知識表示中。具體來說,我們提出了 WavRetriever 以利於從文字音訊混合知識庫中進行檢索,並透過整合思考鏈推理進一步增強對話模型的語境能力。與最先進的 ASR 文字 RAG 管線相比,WavRAG 達到了相當的檢索效能,同時提供了 10 倍的加速。此外,WavRAG 獨特的文字音訊混合檢索能力將 RAG 的界線延伸到音訊模式。 + +##### **Entity Framing and Role Portrayal in the News** +2502.14718v1 by Tarek Mahmoud, Zhuohan Xie, Dimitar Dimitrov, Nikolaos Nikolaidis, Purificação Silvano, Roman Yangarber, Shivam Sharma, Elisa Sartori, Nicolas Stefanovitch, Giovanni Da San Martino, Jakub Piskorski, Preslav Nakov + +We introduce a novel multilingual hierarchical corpus annotated for entity +framing and role portrayal in news articles. The dataset uses a unique taxonomy +inspired by storytelling elements, comprising 22 fine-grained roles, or +archetypes, nested within three main categories: protagonist, antagonist, and +innocent. Each archetype is carefully defined, capturing nuanced portrayals of +entities such as guardian, martyr, and underdog for protagonists; tyrant, +deceiver, and bigot for antagonists; and victim, scapegoat, and exploited for +innocents. The dataset includes 1,378 recent news articles in five languages +(Bulgarian, English, Hindi, European Portuguese, and Russian) focusing on two +critical domains of global significance: the Ukraine-Russia War and Climate +Change. Over 5,800 entity mentions have been annotated with role labels. This +dataset serves as a valuable resource for research into role portrayal and has +broader implications for news analysis. 
We describe the characteristics of the +dataset and the annotation process, and we report evaluation results on +fine-tuned state-of-the-art multilingual transformers and hierarchical +zero-shot learning using LLMs at the level of a document, a paragraph, and a +sentence. + +摘要:我們引進一個新穎的多語言層級語料庫,其中註解了新聞文章中的實體框架和角色描繪。此資料集使用了一個獨特的分類法,其靈感來自講故事元素,包含 22 個細緻的角色或原型,嵌套在三個主要類別中:主角、對手和無辜者。每個原型都經過仔細定義,捕捉了實體的細微描繪,例如主角的監護人、烈士和弱者;對手的暴君、欺騙者和偏執狂;以及無辜者的受害者、替罪羊和被剝削者。該資料集包括五種語言(保加利亞語、英語、印地語、歐洲葡萄牙語和俄語)中的 1,378 篇近期新聞文章,重點關注兩個具有全球意義的關鍵領域:烏克蘭-俄羅斯戰爭和氣候變遷。超過 5,800 個實體提及已註解為角色標籤。此資料集作為角色描繪研究的寶貴資源,並對新聞分析有更廣泛的影響。我們描述了資料集的特徵和註解過程,並報告了對使用 LLM 在文件、段落和句子層級進行微調的最新多語言轉換器和層級零次學習的評估結果。 + +##### **From Knowledge Generation to Knowledge Verification: Examining the BioMedical Generative Capabilities of ChatGPT** +2502.14714v1 by Ahmed Abdeen Hamed, Byung Suk Lee + +The generative capabilities of LLM models present opportunities in +accelerating tasks and concerns with the authenticity of the knowledge it +produces. To address the concerns, we present a computational approach that +systematically evaluates the factual accuracy of biomedical knowledge that an +LLM model has been prompted to generate. Our approach encompasses two +processes: the generation of disease-centric associations and the verification +of them using the semantic knowledge of the biomedical ontologies. Using +ChatGPT as the select LLM model, we designed a set of prompt-engineering +processes to generate linkages between diseases, drugs, symptoms, and genes to +establish grounds for assessments. Experimental results demonstrate high +accuracy in identifying disease terms (88%-97%), drug names (90%-91%), and +genetic information (88%-98%). The symptom term identification accuracy was +notably lower (49%-61%), as verified against the DOID, ChEBI, SYMPTOM, and GO +ontologies accordingly. The verification of associations reveals literature +coverage rates of (89%-91%) among disease-drug and disease-gene associations. 
+The low identification accuracy for symptom terms also contributed to the +verification of symptom-related associations (49%-62%). + +摘要:LLM 模型的生成能力為加速任務和對其產生的知識真實性的疑慮提供了機會。為了解決這些疑慮,我們提出了計算方法,系統性評估 LLM 模型受提示而產生的生物醫學知識的事實準確性。我們的做法包括兩個過程:生成以疾病為中心的關聯,並使用生物醫學本体的語義知識驗證它們。使用 ChatGPT 作為選定的 LLM 模型,我們設計了一組提示工程流程,以生成疾病、藥物、症狀和基因之間的關聯,作為評估的依據。實驗結果證明在識別疾病術語 (88%-97%)、藥物名稱 (90%-91%) 和遺傳資訊 (88%-98%) 方面具有很高的準確性。症狀術語識別準確性顯著較低 (49%-61%),並根據 DOID、ChEBI、SYMPTOM 和 GO 本体進行驗證。關聯驗證顯示疾病-藥物和疾病-基因關聯的文獻覆蓋率為 (89%-91%)。症狀術語的低識別準確性也影響了症狀相關關聯的驗證 (49%-62%)。 + +##### **Data-Efficient Pretraining with Group-Level Data Influence Modeling** +2502.14709v1 by Zichun Yu, Fei Peng, Jie Lei, Arnold Overwijk, Wen-tau Yih, Chenyan Xiong + +Data-efficient pretraining has shown tremendous potential to elevate scaling +laws. This paper argues that effective pretraining data should be curated at +the group level, treating a set of data points as a whole rather than as +independent contributors. To achieve that, we propose Group-Level Data +Influence Modeling (Group-MATES), a novel data-efficient pretraining method +that captures and optimizes group-level data utility. Specifically, Group-MATES +collects oracle group-level influences by locally probing the pretraining model +with data sets. It then fine-tunes a relational data influence model to +approximate oracles as relationship-weighted aggregations of individual +influences. The fine-tuned model selects the data subset by maximizing its +group-level influence prediction, with influence-aware clustering to enable +efficient inference. Experiments on the DCLM benchmark demonstrate that +Group-MATES achieves a 10% relative core score improvement on 22 downstream +tasks over DCLM-Baseline and 5% over individual-influence-based methods, +establishing a new state-of-the-art. Further analyses highlight the +effectiveness of relational data influence models in capturing intricate +interactions between data points. 
+ +摘要:資料有效的預訓練已展現出提升規模化定律的巨大潛力。本文認為,有效的預訓練資料應在群組層級中進行策展,將資料點集合視為一個整體,而非獨立的貢獻者。為達成此目的,我們提出群組層級資料影響建模(Group-MATES),這是一種新穎的資料有效預訓練方法,可擷取和最佳化群組層級資料效用。具體而言,Group-MATES 透過使用資料集在區域探測預訓練模型,收集神諭群組層級影響。接著,微調關係資料影響模型,以關係加權聚合個別影響來近似神諭。微調模型透過最大化其群組層級影響預測,選取資料子集,並透過考量影響的群集,啟用有效率的推論。在 DCLM 基準上的實驗證明,與 DCLM-Baseline 相比,Group-MATES 在 22 個下游任務上達成 10% 的相對核心分數提升,並比基於個別影響的方法高出 5%,建立了新的技術水準。進一步的分析強調了關係資料影響模型在擷取資料點之間的複雜互動上的有效性。 + +##### **Human Misperception of Generative-AI Alignment: A Laboratory Experiment** +2502.14708v1 by Kevin He, Ran Shorrer, Mengjia Xia + +We conduct an incentivized laboratory experiment to study people's perception +of generative artificial intelligence (GenAI) alignment in the context of +economic decision-making. Using a panel of economic problems spanning the +domains of risk, time preference, social preference, and strategic +interactions, we ask human subjects to make choices for themselves and to +predict the choices made by GenAI on behalf of a human user. We find that +people overestimate the degree of alignment between GenAI's choices and human +choices. In every problem, human subjects' average prediction about GenAI's +choice is substantially closer to the average human-subject choice than it is +to the GenAI choice. At the individual level, different subjects' predictions +about GenAI's choice in a given problem are highly correlated with their own +choices in the same problem. We explore the implications of people +overestimating GenAI alignment in a simple theoretical model. 
+ +摘要:我們進行一項誘因實驗室實驗,以研究人們對生成式人工智慧 (GenAI) 在經濟決策制定中的對齊認知。使用涵蓋風險、時間偏好、社會偏好和策略性互動領域的經濟問題小組,我們要求受試者為自己做出選擇,並預測 GenAI 代表人類使用者做出的選擇。我們發現人們高估了 GenAI 選擇和人類選擇之間的對齊程度。在每個問題中,受試者對 GenAI 選擇的平均預測都比對 GenAI 選擇的預測更接近於平均人類受試者選擇。在個人層面上,不同受試者對特定問題中 GenAI 選擇的預測與他們在同一個問題中的選擇高度相關。我們在一個簡單的理論模型中探討了人們高估 GenAI 對齊的影響。 + +##### **Not All Data are Good Labels: On the Self-supervised Labeling for Time Series Forecasting** +2502.14704v1 by Yuxuan Yang, Dalin Zhang, Yuxuan Liang, Hua Lu, Huan Li, Gang Chen + +Time Series Forecasting (TSF) is a crucial task in various domains, yet +existing TSF models rely heavily on high-quality data and insufficiently +exploit all available data. This paper explores a novel self-supervised +approach to re-label time series datasets by inherently constructing candidate +datasets. During the optimization of a simple reconstruction network, +intermediates are used as pseudo labels in a self-supervised paradigm, +improving generalization for any predictor. We introduce the Self-Correction +with Adaptive Mask (SCAM), which discards overfitted components and selectively +replaces them with pseudo labels generated from reconstructions. Additionally, +we incorporate Spectral Norm Regularization (SNR) to further suppress +overfitting from a loss landscape perspective. Our experiments on eleven +real-world datasets demonstrate that SCAM consistently improves the performance +of various backbone models. This work offers a new perspective on constructing +datasets and enhancing the generalization of TSF models through self-supervised +learning. 
+ +摘要:時間序列預測 (TSF) 在各個領域中都是一項重要的任務,但現有的 TSF 模型極度依賴高品質的資料,且無法充分利用所有可用的資料。本文探討了一種新穎的自監督方法,藉由內建地建構候選資料集來重新標記時間序列資料集。在最佳化一個簡單的重建網路過程中,中間產物會在自監督範例中作為偽標籤,進而改善任何預測器的概化能力。我們引入了帶有自適應遮罩 (SCAM) 的自我修正,它會捨棄過度擬合的組成,並選擇性地以從重建產生的偽標籤取代它們。此外,我們納入了頻譜範數正規化 (SNR) 來進一步抑制從損失景觀觀點來看產生的過度擬合。我們在 11 個真實世界的資料集上進行的實驗,證明 SCAM 持續改善各種主幹模型的效能。這項工作提供了建構資料集和透過自監督學習來提升 TSF 模型概化能力的新觀點。 + +##### **I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search** +2502.14693v1 by Zujie Liang, Feng Wei, Wujiang Xu, Lin Chen, Yuxi Qian, Xinhui Wu + +Recent advancements in large language models (LLMs) have shown remarkable +potential in automating machine learning tasks. However, existing LLM-based +agents often struggle with low-diversity and suboptimal code generation. While +recent work has introduced Monte Carlo Tree Search (MCTS) to address these +issues, limitations persist in the quality and diversity of thoughts generated, +as well as in the scalar value feedback mechanisms used for node selection. In +this study, we introduce Introspective Monte Carlo Tree Search (I-MCTS), a +novel approach that iteratively expands tree nodes through an introspective +process that meticulously analyzes solutions and results from parent and +sibling nodes. This facilitates a continuous refinement of the node in the +search tree, thereby enhancing the overall decision-making process. Furthermore, +we integrate a Large Language Model (LLM)-based value model to facilitate +direct evaluation of each node's solution prior to conducting comprehensive +computational rollouts. A hybrid rewarding mechanism is implemented to +seamlessly transition the Q-value from LLM-estimated scores to actual +performance scores. This allows higher-quality nodes to be traversed +earlier. Applied to the various ML tasks, our approach demonstrates a 6% +absolute improvement in performance compared to the strong open-source AutoML +agents, showcasing its effectiveness in enhancing agentic AutoML systems.
+ +摘要:大型語言模型 (LLM) 的最新進展已展現出自動化機器學習任務的顯著潛力。然而,現有的基於 LLM 的代理通常會遇到低多樣性和次優代碼生成的問題。雖然最近的工作已引入蒙地卡羅樹搜尋 (MCTS) 來解決這些問題,但仍存在於所產生想法的品質和多樣性,以及用於節點選擇的標量值回饋機制中。在本研究中,我們介紹了內省蒙地卡羅樹搜尋 (I-MCTS),這是一種透過內省過程反覆擴展樹節點的新方法,該過程會細緻地分析來自父節點和同層節點的解決方案和結果。這有助於持續改善搜尋樹中的節點,進而增強整體決策制定過程。此外,我們整合了一個基於大型語言模型 (LLM) 的值模型,以便在進行全面運算展開之前直接評估每個節點的解決方案。實作了一種混合獎勵機制,以無縫地將 Q 值從 LLM 估計分數轉換為實際效能分數。這允許較高品質的節點更早被遍歷。應用於各種 ML 任務,我們的做法展示出比強大的開源 AutoML 代理高出 6% 的絕對效能提升,證明了其在增強代理式 AutoML 系統方面的有效性。 + +##### **Bridging the Gap: Transforming Natural Language Questions into SQL Queries via Abstract Query Pattern and Contextual Schema Markup** +2502.14682v1 by Yonghui Kong, Hongbing Hu, Dan Zhang, Siyuan Chai, Fan Zhang, Wei Wang + +Large language models have demonstrated excellent performance in many tasks, +including Text-to-SQL, due to their powerful in-context learning capabilities. +They are becoming the mainstream approach for Text-to-SQL. However, these +methods still have a significant gap compared to human performance, especially +on complex questions. As the complexity of questions increases, the gap between +questions and SQLs increases. We identify two important gaps: the structural +mapping gap and the lexical mapping gap. To tackle these two gaps, we propose +PAS-SQL, an efficient SQL generation pipeline based on LLMs, which alleviates +gaps through Abstract Query Pattern (AQP) and Contextual Schema Markup (CSM). +AQP aims to obtain the structural pattern of the question by removing +database-related information, which enables us to find structurally similar +demonstrations. CSM aims to associate database-related text span in the +question with specific tables or columns in the database, which alleviates the +lexical mapping gap. Experimental results on the Spider and BIRD datasets +demonstrate the effectiveness of our proposed method. 
Specifically, PAS-SQL + +GPT-4o sets a new state-of-the-art on the Spider benchmark with an execution +accuracy of 87.9%, and achieves leading results on the BIRD dataset with an +execution accuracy of 64.67%. + +摘要:大型語言模型在許多任務中表現出色,包括文字轉 SQL,這歸功於它們強大的情境學習能力。它們正成為文字轉 SQL 的主流方法。然而,這些方法與人類的表現仍有顯著差距,特別是在複雜的問題上。隨著問題的複雜性增加,問題和 SQL 之間的差距也隨之增加。我們找出兩個重要的差距:結構對應差距和詞彙對應差距。為了解決這兩個差距,我們提出 PAS-SQL,一種基於 LLM 的高效 SQL 產生管道,它透過抽象查詢模式 (AQP) 和情境架構標記 (CSM) 來縮小差距。AQP 旨在透過移除與資料庫相關的資訊來取得問題的結構模式,這使我們能夠找到結構上相似的範例。CSM 旨在將問題中與資料庫相關的文字範圍與資料庫中的特定表格或欄位關聯起來,這可以縮小詞彙對應差距。在 Spider 和 BIRD 資料集上的實驗結果證明了我們所提出的方法的有效性。具體來說,PAS-SQL + GPT-4o 在 Spider 基準測試中設定了一個新的技術水準,執行準確度為 87.9%,並在 BIRD 資料集上取得領先的結果,執行準確度為 64.67%。 + +##### **How to Get Your LLM to Generate Challenging Problems for Evaluation** +2502.14678v1 by Arkil Patel, Siva Reddy, Dzmitry Bahdanau + +The pace of evolution of Large Language Models (LLMs) necessitates new +approaches for rigorous and comprehensive evaluation. Traditional human +annotation is increasingly impracticable due to the complexities and costs +involved in generating high-quality, challenging problems. In this work, we +introduce CHASE, a unified framework to synthetically generate challenging +problems using LLMs without human involvement. For a given task, our approach +builds a hard problem in a bottom-up manner from simpler components. Moreover, +our framework decomposes the generation process into independently verifiable +sub-tasks, thereby ensuring a high level of quality and correctness. We +implement CHASE to create evaluation benchmarks across three diverse domains: +(1) document-based question answering, (2) repository-level code completion, +and (3) math reasoning. The performance of state-of-the-art LLMs on these +synthetic benchmarks lies in the range of 40-60% accuracy, thereby +demonstrating the effectiveness of our framework at generating challenging +problems. We publicly release our benchmarks and code.
+ +摘要:大型語言模型 (LLM) 的演化速度需要新的方法來進行嚴謹且全面的評估。由於產生高品質、具挑戰性的問題所涉及的複雜性和成本,傳統的人工標註正變得越來越不可行。在這項工作中,我們介紹了 CHASE,一個統一的框架,用於使用 LLM 合成產生具有挑戰性的問題,而無需人工參與。對於給定的任務,我們的做法是以自下而上的方式從更簡單的組成部分來建立一個困難的問題。此外,我們的框架將生成過程分解為獨立可驗證的子任務,從而確保高品質和正確性。我們實作 CHASE 來建立三個不同領域的評估基準:(1) 基於文件的問答、(2) 儲存庫層級的程式碼完成,以及 (3) 數學推理。最先進的 LLM 在這些合成基準上的效能落在 40-60% 的準確度範圍內,從而證明了我們的框架在產生具有挑戰性的問題上的有效性。我們公開發布我們的基準和程式碼。 + +##### **Data-Constrained Synthesis of Training Data for De-Identification** +2502.14677v1 by Thomas Vakili, Aron Henriksson, Hercules Dalianis + +Many sensitive domains -- such as the clinical domain -- lack widely +available datasets due to privacy risks. The increasing generative capabilities +of large language models (LLMs) have made synthetic datasets a viable path +forward. In this study, we domain-adapt LLMs to the clinical domain and +generate synthetic clinical texts that are machine-annotated with tags for +personally identifiable information using capable encoder-based NER models. The +synthetic corpora are then used to train synthetic NER models. The results show +that training NER models using synthetic corpora incurs only a small drop in +predictive performance. The limits of this process are investigated in a +systematic ablation study -- using both Swedish and Spanish data. Our analysis +shows that smaller datasets can be sufficient for domain-adapting LLMs for data +synthesis. Instead, the effectiveness of this process is almost entirely +contingent on the performance of the machine-annotating NER models trained +using the original data. 
+ +摘要:許多敏感領域(例如臨床領域)由於隱私風險而缺乏廣泛可用的資料集。大型語言模型 (LLM) 不斷增強的生成能力已使合成資料集成為可行的途徑。在這項研究中,我們將領域適應 LLM 應用於臨床領域,並生成使用具備編碼器功能的 NER 模型以個人可識別資訊標籤進行機器標註的合成臨床文本。然後使用合成語料庫來訓練合成 NER 模型。結果顯示,使用合成語料庫訓練 NER 模型僅會導致預測效能略微下降。在系統消融研究中調查此程序的限制,同時使用瑞典語和西班牙語資料。我們的分析顯示,較小的資料集足以用於領域適應 LLM 以進行資料合成。相反地,此程序的有效性幾乎完全取決於使用原始資料訓練的機器標註 NER 模型的效能。 + +##### **BP-SGCN: Behavioral Pseudo-Label Informed Sparse Graph Convolution Network for Pedestrian and Heterogeneous Trajectory Prediction** +2502.14676v1 by Ruochen Li, Stamos Katsigiannis, Tae-Kyun Kim, Hubert P. H. Shum + +Trajectory prediction allows better decision-making in applications of +autonomous vehicles or surveillance by predicting the short-term future +movement of traffic agents. It is classified into pedestrian or heterogeneous +trajectory prediction. The former exploits the relatively consistent behavior +of pedestrians, but is limited in real-world scenarios with heterogeneous +traffic agents such as cyclists and vehicles. The latter typically relies on +extra class label information to distinguish the heterogeneous agents, but such +labels are costly to annotate and cannot be generalized to represent different +behaviors within the same class of agents. In this work, we introduce the +behavioral pseudo-labels that effectively capture the behavior distributions of +pedestrians and heterogeneous agents solely based on their motion features, +significantly improving the accuracy of trajectory prediction. To implement the +framework, we propose the Behavioral Pseudo-Label Informed Sparse Graph +Convolution Network (BP-SGCN) that learns pseudo-labels and informs to a +trajectory predictor. For optimization, we propose a cascaded training scheme, +in which we first learn the pseudo-labels in an unsupervised manner, and then +perform end-to-end fine-tuning on the labels in the direction of increasing the +trajectory prediction accuracy. 
Experiments show that our pseudo-labels +effectively model different behavior clusters and improve trajectory +prediction. Our proposed BP-SGCN outperforms existing methods using both +pedestrian (ETH/UCY, pedestrian-only SDD) and heterogeneous agent datasets +(SDD, Argoverse 1). + +摘要:軌跡預測允許在自動駕駛車輛或監視應用中做出更好的決策,藉由預測交通代理的短期未來移動。它被分類為行人或異質軌跡預測。前者利用行人相對一致的行為,但受限於與自行車騎士和車輛等異質交通代理的真實世界場景。後者通常依賴額外的類別標籤資訊來區分異質代理,但此類標籤的註解成本很高,且無法概括為表示同一類別代理中的不同行為。在這項工作中,我們引入了行為偽標籤,它僅根據行人和異質代理的運動特徵有效捕捉行為分佈,顯著提升軌跡預測的準確度。為實作架構,我們提出了行為偽標籤告知稀疏圖形卷積網路 (BP-SGCN),它學習偽標籤並告知軌跡預測器。針對最佳化,我們提出了一種串聯訓練方案,其中我們首先以非監督的方式學習偽標籤,然後在標籤上執行端到端微調,朝著提升軌跡預測準確度的方向進行。實驗顯示我們的偽標籤有效建模不同的行為叢集,並提升軌跡預測。我們提出的 BP-SGCN 使用行人 (ETH/UCY,僅限行人的 SDD) 和異質代理資料集 (SDD,Argoverse 1) 都優於現有方法。 + +##### **Explanations of Deep Language Models Explain Language Representations in the Brain** +2502.14671v1 by Maryam Rahimi, Yadollah Yaghoobzadeh, Mohammad Reza Daliri + +Recent advances in artificial intelligence have given rise to large language +models (LLMs) that not only achieve human-like performance but also share +computational principles with the brain's language processing mechanisms. While +previous research has primarily focused on aligning LLMs' internal +representations with neural activity, we introduce a novel approach that +leverages explainable AI (XAI) methods to forge deeper connections between the +two domains. Using attribution methods, we quantified how preceding words +contribute to an LLM's next-word predictions and employed these explanations to +predict fMRI recordings from participants listening to the same narratives. Our +findings demonstrate that attribution methods robustly predict brain activity +across the language network, surpassing traditional internal representations in +early language areas. This alignment is hierarchical: early-layer explanations +correspond to the initial stages of language processing in the brain, while +later layers align with more advanced stages. 
Moreover, the layers more +influential on LLM next-word prediction, those with higher +attribution scores, exhibited stronger alignment with neural +activity. This work establishes a bidirectional bridge between AI and +neuroscience. First, we demonstrate that attribution methods offer a powerful +lens for investigating the neural mechanisms of language comprehension, +revealing how meaning emerges from preceding context. Second, we propose using +brain alignment as a metric to evaluate the validity of attribution methods, +providing a framework for assessing their biological plausibility. + +摘要:最近的人工智能的進展產生了大型語言模型 (LLM),它不僅達到類似人類的表現,還與大腦的語言處理機制共享計算原理。雖然先前的研究主要集中於將 LLM 的內部表徵與神經活動對齊,但我們引入了一種新穎的方法,該方法利用可解釋 AI (XAI) 方法在兩個域之間建立更深層的聯繫。使用歸因方法,我們量化了前一個單詞如何促成 LLM 的下一個單詞預測,並利用這些解釋來預測參與者在聆聽相同敘述時的大腦功能性磁共振造影 (fMRI) 記錄。我們的發現表明,歸因方法可以穩健地預測整個語言網路中的大腦活動,超越了早期語言區域中的傳統內部表徵。這種對齊是分層的:早期層次解釋對應於大腦中語言處理的初始階段,而後續層次則與更進階的階段對齊。此外,對 LLM 下一個單詞預測影響力較大的層次(即歸因分數較高的層次)表現出與神經活動更強的對齊。這項工作在 AI 與神經科學之間建立了一個雙向橋樑。首先,我們證明歸因方法提供了一個強大的視角,用於研究語言理解的神經機制,揭示意義如何從先前的脈絡中產生。其次,我們建議使用大腦對齊作為評估歸因方法有效性的指標,提供了一個評估其生物學合理性的框架。 + +##### **AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO** +2502.14669v1 by Alan Dao, Dinh Bach Vu + +Large Language Models (LLMs) have demonstrated impressive capabilities in +language processing, yet they often struggle with tasks requiring genuine +visual spatial reasoning. In this paper, we introduce a novel two-stage +training framework designed to equip standard LLMs with visual reasoning +abilities for maze navigation. First, we leverage Supervised Fine Tuning (SFT) +on a curated dataset of tokenized maze representations to teach the model to +predict step-by-step movement commands. Next, we apply Group Relative Policy +Optimization (GRPO), a technique used in DeepSeekR1, with a carefully crafted +reward function to refine the model's sequential decision-making and encourage +emergent chain-of-thought behaviors.
Experimental results on synthetically +generated mazes show that while a baseline model fails to navigate the maze, +the SFT-trained model achieves 86% accuracy, and further GRPO fine-tuning +boosts accuracy to 93%. Qualitative analyses reveal that GRPO fosters more +robust and self-corrective reasoning, highlighting the potential of our +approach to bridge the gap between language models and visual spatial tasks. +These findings offer promising implications for applications in robotics, +autonomous navigation, and other domains that require integrated visual and +sequential reasoning. + +摘要:大型語言模型(LLM)在語言處理方面展現出令人印象深刻的能力,但它們經常難以應付需要真正視覺空間推理的任務。在本文中,我們介紹了一種新穎的兩階段訓練架構,旨在為標準 LLM 提供迷宮導航的視覺推理能力。首先,我們在標記化迷宮表示的策展資料集上利用監督微調(SFT)來教導模型預測逐步移動指令。接下來,我們使用 DeepSeekR1 中使用的技術,即群體相對策略最佳化(GRPO),並搭配精心設計的獎勵函數來優化模型的順序決策制定,並鼓勵出現連貫的思考行為。在合成產生的迷宮上進行的實驗結果顯示,雖然基準模型無法導航迷宮,但經過 SFT 訓練的模型達到 86% 的準確度,而進一步的 GRPO 微調將準確度提升至 93%。定性分析顯示,GRPO 促進更強健且自我修正的推理,凸顯了我們的方法在彌合語言模型與視覺空間任務之間差距的潛力。這些發現為機器人、自主導航和其他需要整合視覺和順序推理的領域的應用提供了有希望的啟示。 + +##### **InstructAgent: Building User Controllable Recommender via LLM Agent** +2502.14662v1 by Wujiang Xu, Yunxiao Shi, Zujie Liang, Xuying Ning, Kai Mei, Kun Wang, Xi Zhu, Min Xu, Yongfeng Zhang + +Traditional recommender systems usually take the user-platform paradigm, +where users are directly exposed under the control of the platform's +recommendation algorithms. However, the defect of recommendation algorithms may +put users in very vulnerable positions under this paradigm. First, many +sophisticated models are often designed with commercial objectives in mind, +focusing on the platform's benefits, which may hinder their ability to protect +and capture users' true interests. Second, these models are typically optimized +using data from all users, which may overlook individual user's preferences. 
+Due to these shortcomings, users may experience several disadvantages under the +traditional user-platform direct exposure paradigm, such as lack of control +over the recommender system, potential manipulation by the platform, echo +chamber effects, or lack of personalization for less active users due to the +dominance of active users during collaborative learning. Therefore, there is an +urgent need to develop a new paradigm to protect user interests and alleviate +these issues. Recently, some researchers have introduced LLM agents to simulate +user behaviors, but these approaches primarily aim to optimize platform-side +performance, leaving core issues in recommender systems unresolved. To address +these limitations, we propose a new user-agent-platform paradigm, where an agent +serves as the protective shield between user and recommender system that +enables indirect exposure. To this end, we first construct four recommendation +datasets, denoted as $\dataset$, along with user instructions for each record. + +摘要:傳統推薦系統通常採用使用者-平台範例, +其中使用者直接暴露在平台推薦演算法的控制之下。然而,推薦演算法的缺陷可能會讓使用者在這個範例中處於非常脆弱的位置。首先,許多精密的模型通常在設計時就考慮到商業目標,專注於平台的利益,這可能會阻礙它們保護和掌握使用者真正興趣的能力。其次,這些模型通常使用所有使用者的資料進行最佳化,這可能會忽略個別使用者的偏好。由於這些缺點,使用者可能會在傳統使用者-平台直接暴露範例中遇到一些缺點,例如缺乏對推薦系統的控制、平台的潛在操縱、同溫層效應,或由於活躍使用者在協作學習中的主導地位而缺乏針對較不活躍使用者的個人化。因此,迫切需要開發一種新的範例來保護使用者利益並緩解這些問題。最近,一些研究人員引入了 LLM 代理程式來模擬使用者行為,這些方法主要旨在最佳化平台端的效能,而未解決推薦系統中的核心問題。為了解決這些限制,我們提出了一種新的使用者-代理程式-平台範例,其中代理程式作為使用者和推薦系統之間的保護盾,實現間接暴露。為此,我們首先構建了四個推薦資料集,表示為 $\dataset$,以及每條記錄的使用者說明。 + +##### **Edit Once, Update Everywhere: A Simple Framework for Cross-Lingual Knowledge Synchronization in LLMs** +2502.14645v1 by Yuchen Wu, Liang Ding, Li Shen, Dacheng Tao + +Knowledge editing allows for efficient adaptation of large language models +(LLMs) to new information or corrections without requiring full retraining.
+However, prior methods typically focus on either single-language editing or +basic multilingual editing, failing to achieve true cross-linguistic knowledge +synchronization. To address this, we present a simple and practical +state-of-the-art (SOTA) recipe, Cross-Lingual Knowledge Democracy Edit (X-KDE), +designed to propagate knowledge from a dominant language to other languages +effectively. Our X-KDE comprises two stages: (i) Cross-lingual Edition +Instruction Tuning (XE-IT), which fine-tunes the model on a curated parallel +dataset to modify in-scope knowledge while preserving unrelated information, +and (ii) Target-language Preference Optimization (TL-PO), which applies +advanced optimization techniques to ensure consistency across languages, +fostering the transfer of updates. Additionally, we contribute a high-quality, +cross-lingual dataset, specifically designed to enhance knowledge transfer +across languages. Extensive experiments on the Bi-ZsRE and MzsRE benchmarks +show that X-KDE significantly enhances cross-lingual performance, achieving an +average improvement of +8.19%, while maintaining high accuracy in monolingual +settings. + +摘要:知識編輯允許大語言模型 (LLM) 有效地適應新資訊或修正,而無需進行完整的再訓練。 +然而,先前的做法通常專注於單一語言編輯或基本的多語言編輯,未能實現真正的跨語言知識同步。為了解決這個問題,我們提出了一個簡單且實用的最先進 (SOTA) 配方,即跨語言知識民主編輯 (X-KDE),旨在有效地從主導語言傳播知識到其他語言。我們的 X-KDE 包含兩個階段:(i) 跨語言版本指令調整 (XE-IT),它微調模型,在經過整理的平行資料集上修改範圍內的知識,同時保留不相關的資訊,以及 (ii) 目標語言偏好最佳化 (TL-PO),它應用先進的最佳化技術,以確保跨語言的一致性,促進更新的傳輸。此外,我們貢獻了一個高品質的跨語言資料集,特別設計用於增強跨語言的知識傳輸。在 Bi-ZsRE 和 MzsRE 基準上的廣泛實驗表明,X-KDE 大幅提升了跨語言效能,在單語言設定中維持高準確度的同時,平均提升了 +8.19%。 + +##### **LIFT: Improving Long Context Understanding of Large Language Models through Long Input Fine-Tuning** +2502.14644v1 by Yansheng Mao, Yufei Xu, Jiaqi Li, Fanxu Meng, Haotong Yang, Zilong Zheng, Xiyuan Wang, Muhan Zhang + +Long context understanding remains challenging for large language models due +to their limited context windows.
This paper presents Long Input Fine-Tuning +(LIFT), a novel framework for long-context modeling that can improve the +long-context performance of arbitrary (short-context) LLMs by dynamically +adapting model parameters based on the long input. Importantly, LIFT, rather +than endlessly extending the context window size to accommodate increasingly +longer inputs in context, chooses to store and absorb the long input in +parameter. By fine-tuning the long input into model parameters, LIFT allows +short-context LLMs to answer questions even when the required information is +not provided in the context during inference. Furthermore, to enhance LIFT +performance while maintaining the original in-context learning (ICL) +capabilities, we introduce Gated Memory, a specialized attention adapter that +automatically balances long input memorization and ICL. We provide a +comprehensive analysis of the strengths and limitations of LIFT on long context +understanding, offering valuable directions for future research. + +摘要:由於大型語言模型的上下文視窗有限,因此對於它們而言,長語境理解仍然具有挑戰性。本文提出了長輸入微調 (LIFT),這是一個用於長語境建模的新穎架構,它可以通過根據長輸入動態調整模型參數來改善任意(短語境)LLM 的長語境效能。重要的是,LIFT 沒有無限擴充上下文視窗大小以容納語境中越來越長的輸入,而是選擇將長輸入儲存在參數中並吸收它。通過將長輸入微調到模型參數中,LIFT 允許短語境 LLM 回答問題,即使在推理期間語境中沒有提供所需資訊也是如此。此外,為了在保持原始語境中學習 (ICL) 能力的同時增強 LIFT 效能,我們引入了閘控記憶體,這是一個自動平衡長輸入記憶和 ICL 的特殊注意力適配器。我們對 LIFT 在長語境理解方面的優缺點進行了全面的分析,為未來的研究提供了有價值的方向。 + +##### **Length-Controlled Margin-Based Preference Optimization without Reference Model** +2502.14643v1 by Gengxu Li, Tingyu Xia, Yi Chang, Yuan Wu + +Direct Preference Optimization (DPO) is a widely adopted offline algorithm +for preference-based reinforcement learning from human feedback (RLHF), +designed to improve training simplicity and stability by redefining reward +functions. However, DPO is hindered by several limitations, including length +bias, memory inefficiency, and probability degradation. 
To address these +challenges, we propose Length-Controlled Margin-Based Preference Optimization +(LMPO), a more efficient and robust alternative. LMPO introduces a uniform +reference model as an upper bound for the DPO loss, enabling a more accurate +approximation of the original optimization objective. Additionally, an average +log-probability optimization strategy is employed to minimize discrepancies +between training and inference phases. A key innovation of LMPO lies in its +Length-Controlled Margin-Based loss function, integrated within the +Bradley-Terry framework. This loss function regulates response length while +simultaneously widening the margin between preferred and rejected outputs. By +doing so, it mitigates probability degradation for both accepted and discarded +responses, addressing a significant limitation of existing methods. We evaluate +LMPO against state-of-the-art preference optimization techniques on two +open-ended large language models, Mistral and LLaMA3, across six conditional +benchmarks. Our experimental results demonstrate that LMPO effectively controls +response length, reduces probability degradation, and outperforms existing +approaches. The code is available at https://github.com/gengxuli/LMPO. + +摘要:直接偏好優化 (DPO) 是一種廣泛採用的離線演算法,用於從人類回饋 (RLHF) 中進行基於偏好的強化學習,旨在透過重新定義獎勵函數來提升訓練的簡潔性和穩定性。然而,DPO 受到若干限制的阻礙,包括長度偏差、記憶體效率低下和機率下降。為了解決這些挑戰,我們提出長度控制邊際偏好優化 (LMPO),一種更有效率且穩健的替代方案。LMPO 引入統一參考模型作為 DPO 損失的上限,能夠更準確地近似原始最佳化目標。此外,採用平均對數機率最佳化策略來最小化訓練和推論階段之間的差異。LMPO 的一項關鍵創新在於其長度控制邊際損失函數,整合在 Bradley-Terry 架構中。此損失函數調節回應長度,同時擴大偏好和拒絕輸出之間的邊際。藉由這麼做,它減輕了已接受和已捨棄回應的機率下降,解決了現有方法的重大限制。我們在兩個開放式大型語言模型 Mistral 和 LLaMA3 上,針對六個條件基準,評估 LMPO 與最先進的偏好優化技術。我們的實驗結果證明,LMPO 有效控制回應長度,減少機率下降,並優於現有方法。程式碼可在 https://github.com/gengxuli/LMPO 取得。 + +##### **How Far are LLMs from Being Our Digital Twins?
A Benchmark for Persona-Based Behavior Chain Simulation** +2502.14642v1 by Rui Li, Heming Xia, Xinfeng Yuan, Qingxiu Dong, Lei Sha, Wenjie Li, Zhifang Sui + +Recently, LLMs have garnered increasing attention across academic disciplines +for their potential as human digital twins, virtual proxies designed to +replicate individuals and autonomously perform tasks such as decision-making, +problem-solving, and reasoning on their behalf. However, current evaluations of +LLMs primarily emphasize dialogue simulation while overlooking human behavior +simulation, which is crucial for digital twins. To address this gap, we +introduce BehaviorChain, the first benchmark for evaluating LLMs' ability to +simulate continuous human behavior. BehaviorChain comprises diverse, +high-quality, persona-based behavior chains, totaling 15,846 distinct behaviors +across 1,001 unique personas, each with detailed history and profile metadata. +For evaluation, we integrate persona metadata into LLMs and employ them to +iteratively infer contextually appropriate behaviors within dynamic scenarios +provided by BehaviorChain. Comprehensive evaluation results demonstrated that +even state-of-the-art models struggle with accurately simulating continuous +human behavior. + +摘要:最近,LLM 在各個學科中備受關注,因為它們具有作為人類數位雙胞胎的潛力,也就是虛擬代理人,旨在複製個人並自主執行任務,例如代表他們進行決策、解決問題和推理。然而,LLM 目前的評估主要強調對話模擬,同時忽視了人類行為模擬,這對數位雙胞胎至關重要。為了解決這個差距,我們引入了 BehaviorChain,這是第一個用於評估 LLM 模擬連續人類行為能力的基準。BehaviorChain 包含多樣化、高品質、基於角色的行為鏈,總共涵蓋 1,001 個獨特角色的 15,846 種不同行為,每個角色都有詳細的歷史和個人資料元數據。在評估中,我們將角色元數據整合到 LLM 中,並使用它們在 BehaviorChain 提供的動態場景中反覆推斷出在情境中適當的行為。全面的評估結果表明,即使是最先進的模型在準確模擬連續人類行為方面也存在困難。 + +##### **NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization** +2502.14638v1 by Zheyuan Zhang, Runze Li, Tasnim Kabir, Jordan Boyd-Graber + +Image geo-localization is the task of predicting the specific location of an +image and requires complex reasoning across visual, geographical, and cultural +contexts. 
While prior Vision Language Models (VLMs) have the best accuracy at +this task, there is a dearth of high-quality datasets and models for analytical +reasoning. We first create NaviClues, a high-quality dataset derived from +GeoGuessr, a popular geography game, to supply examples of expert reasoning +from language. Using this dataset, we present Navig, a comprehensive image +geo-localization framework integrating global and fine-grained image +information. By reasoning with language, Navig reduces the average distance +error by 14% compared to previous state-of-the-art models while requiring fewer +than 1000 training samples. Our dataset and code are available at +https://github.com/SparrowZheyuan18/Navig/. + +摘要:影像地理定位是預測影像特定位置的任務,需要跨視覺、地理和文化脈絡進行複雜的推理。雖然先前的視覺語言模型 (VLM) 在此任務中擁有最佳準確度,但缺乏高品質的資料集和分析推理模型。我們首先建立 NaviClues,這是一個源自 GeoGuessr 的高品質資料集,GeoGuessr 是一款流行的地理遊戲,可提供來自語言的專家推理範例。使用此資料集,我們提出 Navig,這是一個綜合性的影像地理定位架構,整合了全球和細緻的影像資訊。透過語言推理,Navig 將平均距離誤差減少了 14%,與先前的最先進模型相比,同時只需要不到 1000 個訓練樣本。我們的資料集和程式碼可在 https://github.com/SparrowZheyuan18/Navig/ 取得。 + +##### **ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation** +2502.14637v1 by Angxiao Yue, Zichong Wang, Hongteng Xu + +Protein backbone generation plays a central role in de novo protein design +and is significant for many biological and medical applications. Although +diffusion and flow-based generative models provide potential solutions to this +challenging task, they often generate proteins with undesired designability and +suffer computational inefficiency. In this study, we propose a novel rectified +quaternion flow (ReQFlow) matching method for fast and high-quality protein +backbone generation. In particular, our method generates a local translation +and a 3D rotation from random noise for each residue in a protein chain, which +represents each 3D rotation as a unit quaternion and constructs its flow by +spherical linear interpolation (SLERP) in an exponential format. 
We train the +model by quaternion flow (QFlow) matching with guaranteed numerical stability +and rectify the QFlow model to accelerate its inference and improve the +designability of generated protein backbones, leading to the proposed ReQFlow +model. Experiments show that ReQFlow achieves state-of-the-art performance in +protein backbone generation while requiring much fewer sampling steps and +significantly less inference time (e.g., being 37x faster than RFDiffusion and +62x faster than Genie2 when generating a backbone of length 300), demonstrating +its effectiveness and efficiency. The code is available at +https://github.com/AngxiaoYue/ReQFlow. + +摘要:蛋白骨架生成在從頭蛋白質設計中扮演核心角色,且對於許多生物和醫學應用來說意義重大。儘管擴散和基於流的生成模型提供了解決此項挑戰性任務的潛在方案,但它們經常生成具有不受歡迎的可設計性的蛋白質,且遭受運算效率不彰之苦。在本研究中,我們提出了一種新穎的修正四元數流 (ReQFlow) 匹配方法,用於快速且高品質的蛋白質骨架生成。特別是,我們的模型會為蛋白質鏈中的每個殘基從隨機雜訊中生成一個局部平移和一個 3D 旋轉,將每個 3D 旋轉表示為單位四元數,並以指數格式透過球面線性插值 (SLERP) 建構其流。我們透過四元數流 (QFlow) 匹配訓練模型,並保證數值穩定性,並修正 QFlow 模型以加速其推論並改善生成蛋白質骨架的可設計性,進而提出建議的 ReQFlow 模型。實驗顯示,ReQFlow 在蛋白質骨架生成中達成最先進的效能,同時所需採樣步驟少得多,且推論時間大幅減少(例如,在生成長度為 300 的骨架時比 RFDiffusion 快 37 倍,比 Genie2 快 62 倍),證明其有效性和效率。程式碼可在 https://github.com/AngxiaoYue/ReQFlow 取得。 + +##### **PEARL: Towards Permutation-Resilient LLMs** +2502.14628v1 by Liang Chen, Li Shen, Yang Deng, Xiaoyan Zhao, Bin Liang, Kam-Fai Wong + +The in-context learning (ICL) capability of large language models (LLMs) +enables them to perform challenging tasks using provided demonstrations. +However, ICL is highly sensitive to the ordering of demonstrations, leading to +instability in predictions. This paper shows that this vulnerability can be +exploited to design a natural attack - difficult for model providers to detect +- that achieves nearly 80% success rate on LLaMA-3 by simply permuting the +demonstrations. Existing mitigation methods primarily rely on post-processing +and fail to enhance the model's inherent robustness to input permutations, +raising concerns about safety and reliability of LLMs. 
To address this issue, +we propose Permutation-resilient learning (PEARL), a novel framework based on +distributionally robust optimization (DRO), which optimizes model performance +against the worst-case input permutation. Specifically, PEARL consists of a +permutation-proposal network (P-Net) and the LLM. The P-Net generates the most +challenging permutations by treating it as an optimal transport problem, which +is solved using an entropy-constrained Sinkhorn algorithm. Through minimax +optimization, the P-Net and the LLM iteratively optimize against each other, +progressively improving the LLM's robustness. Experiments on synthetic +pre-training and real-world instruction tuning tasks demonstrate that PEARL +effectively mitigates permutation attacks and enhances performance. Notably, +despite being trained on fewer shots and shorter contexts, PEARL achieves +performance gains of up to 40% when scaled to many-shot and long-context +scenarios, highlighting its efficiency and generalization capabilities. + +摘要:大型語言模型 (LLM) 的語境學習 (ICL) 能力使其能夠透過提供的示範來執行具有挑戰性的任務。然而,ICL 對示範的排序非常敏感,導致預測不穩定。本文顯示,可以利用此漏洞來設計一種自然攻擊,讓模型提供者難以偵測,透過簡單地排列示範,在 LLaMA-3 上達到近 80% 的成功率。現有的緩解方法主要依賴後處理,且無法增強模型對輸入排列的固有穩健性,引發了對 LLM 的安全性與可靠性的疑慮。為了解決此問題,我們提出了一種基於分配穩健最佳化 (DRO) 的新型架構,稱為排列彈性學習 (PEARL),它針對最差情況的輸入排列來最佳化模型效能。具體來說,PEARL 包含排列建議網路 (P-Net) 和 LLM。P-Net 將其視為最優傳輸問題來產生最具挑戰性的排列,並使用熵約束 Sinkhorn 演算法來解決。透過極小極大最佳化,P-Net 和 LLM 迭代地相互最佳化,逐步改善 LLM 的穩健性。在合成預訓練和真實世界指令調整任務上的實驗證明,PEARL 有效地減輕了排列攻擊並增強了效能。值得注意的是,儘管在較少的次數和較短的語境中進行訓練,但 PEARL 在擴展到多重次數和長語境場景時仍可獲得高達 40% 的效能提升,突顯了其效率和泛化能力。 + +##### **ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors** +2502.14627v1 by Yuguo Yin, Yuxin Xie, Wenyuan Yang, Dongchao Yang, Jinghan Ru, Xianwei Zhuang, Liming Liang, Yuexian Zou + +Multilingual audio-text retrieval (ML-ATR) is a challenging task that aims to +retrieve audio clips or multilingual texts from databases. 
However, existing +ML-ATR schemes suffer from inconsistencies for instance similarity matching +across languages. We theoretically analyze the inconsistency in terms of both +multilingual modal alignment direction error and weight error, and propose the +theoretical weight error upper bound for quantifying the inconsistency. Based +on the analysis of the weight error upper bound, we find that the inconsistency +problem stems from the data distribution error caused by random sampling of +languages. We propose a consistent ML-ATR scheme using 1-to-k contrastive +learning and audio-English co-anchor contrastive learning, aiming to mitigate +the negative impact of data distribution error on recall and consistency in +ML-ATR. Experimental results on the translated AudioCaps and Clotho datasets +show that our scheme achieves state-of-the-art performance on recall and +consistency metrics for eight mainstream languages, including English. Our code +will be available at https://github.com/ATRI-ACL/ATRI-ACL. + +摘要:多模態多語言音訊文字檢索 (ML-ATR) 是一項具有挑戰性的任務,旨在從資料庫中檢索音訊片段或多語言文字。然而,現有的 ML-ATR 架構存在不一致的情況,例如跨語言的相似性比對。我們在理論上分析了不一致性,包括多模態多語言對齊方向誤差和權重誤差,並提出理論權重誤差上限以量化不一致性。根據權重誤差上限的分析,我們發現不一致性問題源於由語言隨機取樣造成的資料分佈誤差。我們提出一個一致的 ML-ATR 架構,採用 1 對 k 對比學習和音訊-英語共同錨點對比學習,旨在減輕資料分佈誤差對 ML-ATR 中召回率和一致性的負面影響。在已翻譯的 AudioCaps 和 Clotho 資料集上的實驗結果顯示,我們的架構在包括英語在內的八種主流語言的召回率和一致性指標上達到了最先進的效能。我們的程式碼將在 https://github.com/ATRI-ACL/ATRI-ACL 中提供。 + +##### **Multi-Record Web Page Information Extraction From News Websites** +2502.14625v1 by Alexander Kustenkov, Maksim Varlamov, Alexander Yatskov + +In this paper, we focused on the problem of extracting information from web +pages containing many records, a task of growing importance in the era of +massive web data. Recently, the development of neural network methods has +improved the quality of information extraction from web pages. Nevertheless, +most of the research and datasets are aimed at studying detailed pages. 
This +has left multi-record "list pages" relatively understudied, despite their +widespread presence and practical significance. + To address this gap, we created a large-scale, open-access dataset +specifically designed for list pages. This is the first dataset for this task +in the Russian language. Our dataset contains 13,120 web pages with news lists, +significantly exceeding existing datasets in both scale and complexity. Our +dataset contains attributes of various types, including optional and +multi-valued, providing a realistic representation of real-world list pages. +These features make our dataset a valuable resource for studying information +extraction from pages containing many records. + Furthermore, we proposed our own multi-stage information extraction methods. +In this work, we explore and demonstrate several strategies for applying +MarkupLM to the specific challenges of multi-record web pages. Our experiments +validate the advantages of our methods. + By releasing our dataset to the public, we aim to advance the field of +information extraction from multi-record pages. + +摘要:在本文中,我們專注於從包含大量記錄的網頁中提取資訊的問題,這項任務在海量網路資料的時代中越來越重要。最近,神經網路方法的發展已改善從網頁中提取資訊的品質。儘管如此,大多數的研究和資料集都旨在研究詳細的網頁。儘管多記錄「清單網頁」廣泛存在且具有實用意義,但它們相對來說研究較少。 +為了解決這個差距,我們建立了一個專門針對清單網頁設計的大規模、開放存取的資料集。這是俄語中第一個針對此任務的資料集。我們的資料集包含 13,120 個包含新聞清單的網頁,在規模和複雜度上都遠遠超過現有的資料集。我們的資料集包含各種類型的屬性,包括可選和多值,提供真實世界清單網頁的實際表示。這些特點使我們的資料集成為研究從包含大量記錄的網頁中提取資訊的寶貴資源。 +此外,我們提出了我們自己的多階段資訊提取方法。在這項工作中,我們探討並展示了將 MarkupLM 應用於多記錄網頁特定挑戰的幾種策略。我們的實驗驗證了我們方法的優點。 +透過向公眾發布我們的資料集,我們旨在推進從多記錄網頁中提取資訊的領域。 + +##### **Exploring RWKV for Sentence Embeddings: Layer-wise Analysis and Baseline Comparison for Semantic Similarity** +2502.14620v1 by Xinghan Pan + +This paper investigates the efficacy of RWKV, a novel language model +architecture known for its linear attention mechanism, for generating sentence +embeddings in a zero-shot setting. 
I conduct a layer-wise analysis to evaluate +the semantic similarity captured by embeddings from different hidden layers of +a pre-trained RWKV model. The performance is assessed on the Microsoft Research +Paraphrase Corpus (MRPC) dataset using Spearman correlation and compared +against a GloVe-based baseline. My results indicate that while RWKV embeddings +capture some semantic relatedness, they underperform compared to the GloVe +baseline in terms of Spearman correlation. I also analyze the inference time +and GPU memory usage, highlighting the computational trade-offs associated with +RWKV embeddings. The findings suggest that while RWKV offers potential +advantages in terms of linear scaling, its zero-shot sentence embedding quality +for semantic similarity tasks requires further investigation and potential +task-specific fine-tuning to match or exceed simpler baselines. + +摘要:本文探討 RWKV 的效能,這是一種以線性注意力機制聞名的語言模型架構,可用於在零次學習設定中產生句子嵌入。我進行逐層分析,以評估預先訓練的 RWKV 模型中不同隱藏層的嵌入所擷取的語義相似性。效能評估使用 Microsoft Research Paraphrase Corpus (MRPC) 資料集,採用 Spearman 相關係數,並與基於 GloVe 的基準進行比較。我的結果顯示,雖然 RWKV 嵌入可以擷取一些語義相關性,但與 GloVe 基準相比,在 Spearman 相關係數方面表現不佳。我也分析了推論時間和 GPU 記憶體使用量,強調與 RWKV 嵌入相關的運算折衷。這些發現表明,雖然 RWKV 在線性縮放方面具有潛在優勢,但其在語義相似性任務中的零次學習句子嵌入品質需要進一步探討,並需要潛在的特定任務微調,才能達到或超越較簡單的基準。 + +##### **Reward Models Identify Consistency, Not Causality** +2502.14619v1 by Yuhui Xu, Hanze Dong, Lei Wang, Caiming Xiong, Junnan Li + +Reward models (RMs) play a crucial role in aligning large language models +(LLMs) with human preferences and enhancing reasoning quality. Traditionally, +RMs are trained to rank candidate outputs based on their correctness and +coherence. However, in this work, we present several surprising findings that +challenge common assumptions about RM behavior. Our analysis reveals that +state-of-the-art reward models prioritize structural consistency over causal +correctness. 
Specifically, removing the problem statement has minimal impact on +reward scores, whereas altering numerical values or disrupting the reasoning +flow significantly affects RM outputs. Furthermore, RMs exhibit a strong +dependence on complete reasoning trajectories: truncated or incomplete steps +lead to significant variations in reward assignments, indicating that RMs +primarily rely on learned reasoning patterns rather than explicit problem +comprehension. These findings hold across multiple architectures, datasets, and +tasks, leading to three key insights: (1) RMs primarily assess coherence rather +than true reasoning quality; (2) The role of explicit problem comprehension in +reward assignment is overstated; (3) Current RMs may be more effective at +ranking responses than verifying logical validity. Our results suggest a +fundamental limitation in existing reward modeling approaches, emphasizing the +need for a shift toward causality-aware reward models that go beyond +consistency-driven evaluation. + +摘要:獎勵模型 (RM) 在將大型語言模型 (LLM) 與人類偏好對齊並提升推理品質方面扮演至關重要的角色。傳統上,RM 會訓練來根據候選輸出的正確性和一致性進行排名。然而,在這項工作中,我們提出幾個令人驚訝的發現,挑戰了關於 RM 行為的常見假設。我們的分析顯示,最先進的獎勵模型優先考慮結構一致性,而不是因果正確性。具體來說,移除問題陳述對獎勵分數的影響很小,而改變數值或中斷推理流程則會顯著影響 RM 輸出。此外,RM 表現出對完整推理軌跡的強烈依賴性,截斷或不完整的步驟會導致獎勵分配產生重大變化,這表示 RM 主要依賴於學習到的推理模式,而不是明確的問題理解。這些發現適用於多種架構、資料集和任務,得出三個關鍵見解:(1) RM 主要評估一致性,而不是真正的推理品質;(2) 在獎勵分配中,明確問題理解的角色被誇大了;(3) 目前的 RM 在排名回應方面可能比驗證邏輯有效性更有效。我們的結果表明現有獎勵建模方法存在根本限制,強調需要轉向因果感知獎勵模型,超越以一致性為導向的評估。 + +##### **FIND: Fine-grained Information Density Guided Adaptive Retrieval-Augmented Generation for Disease Diagnosis** +2502.14614v1 by Mingyi Jia, Junwen Duan, Yan Song, Jianxin Wang + +Retrieval-Augmented Large Language Models (LLMs), which integrate external +knowledge into LLMs, have shown remarkable performance in various medical +domains, including clinical diagnosis.
However, existing RAG methods struggle +to effectively assess task difficulty to make retrieval decisions, thereby +failing to meet the clinical requirements for balancing efficiency and +accuracy. So in this paper, we propose FIND (**F**ine-grained +**In**formation **D**ensity Guided Adaptive RAG), a novel framework +that improves the reliability of RAG in disease diagnosis scenarios. FIND +incorporates a fine-grained adaptive control module to determine whether +retrieval is necessary based on the information density of the input. By +optimizing the retrieval process and implementing a knowledge filtering module, +FIND ensures that the retrieval is better suited to clinical scenarios. +Experiments on three Chinese electronic medical record datasets demonstrate +that FIND significantly outperforms various baseline methods, highlighting its +effectiveness in clinical diagnosis tasks. + +摘要:檢索增強大型語言模型 (LLM),將外部知識整合至 LLM,已於各種醫療領域展現出卓越效能,包括臨床診斷。然而,現有的 RAG 方法難以有效評估任務難度以做出檢索決策,因此無法滿足平衡效率和精確度的臨床需求。因此,我們在本文中提出 FIND(**F**ine-grained **In**formation **D**ensity Guided Adaptive RAG),一種新穎架構,可提升 RAG 在疾病診斷場景中的可靠性。FIND 整合一個細緻化的自適應控制模組,根據輸入的資訊密度判斷是否需要檢索。透過最佳化檢索程序並實作一個知識過濾模組,FIND 確保檢索更適合臨床場景。在三個中文電子病歷資料集上的實驗顯示,FIND 明顯優於各種基線方法,突顯其在臨床診斷任務中的有效性。 + +##### **Behavioral Analysis of Information Salience in Large Language Models** +2502.14613v1 by Jan Trienes, Jörg Schlötterer, Junyi Jessy Li, Christin Seifert + +Large Language Models (LLMs) excel at text summarization, a task that +requires models to select content based on its importance. However, the exact +notion of salience that LLMs have internalized remains unclear. To bridge this +gap, we introduce an explainable framework to systematically derive and +investigate information salience in LLMs through their summarization behavior.
+Using length-controlled summarization as a behavioral probe into the content +selection process, and tracing the answerability of Questions Under Discussion +throughout, we derive a proxy for how models prioritize information. Our +experiments on 13 models across four datasets reveal that LLMs have a nuanced, +hierarchical notion of salience, generally consistent across model families and +sizes. While models show highly consistent behavior and hence salience +patterns, this notion of salience cannot be accessed through introspection, and +only weakly correlates with human perceptions of information salience. + +摘要:大型語言模型 (LLM) 在文字摘要方面表現出色,這項任務需要模型根據重要性來選擇內容。然而,LLM 內化的顯著性準確概念仍不清楚。為了彌補這個差距,我們引入了一個可解釋的架構,透過摘要行為系統性地推導和調查 LLM 中的資訊顯著性。使用長度控制摘要作為行為探測來探討內容選擇過程,並追蹤討論中問題的可回答性,我們推導出一個模型優先處理資訊的方式代理。我們針對四個資料集中的 13 個模型進行的實驗揭示,LLM 具有細緻入微、階層式的顯著性概念,通常在模型系列和大小之間保持一致。雖然模型表現出高度一致的行為,因此具有顯著性模式,但這個顯著性概念無法透過內省來存取,而且與人類對資訊顯著性的認知僅有微弱相關性。 + +##### **A Theory for Conditional Generative Modeling on Multiple Data Sources** +2502.14583v1 by Rongzhen Wang, Yan Zhang, Chenyu Zheng, Chongxuan Li, Guoqiang Wu + +The success of large generative models has driven a paradigm shift, +leveraging massive multi-source data to enhance model capabilities. However, +the interaction among these sources remains theoretically underexplored. This +paper takes the first step toward a rigorous analysis of multi-source training +in conditional generative modeling, where each condition represents a distinct +data source. Specifically, we establish a general distribution estimation error +bound in average total variation distance for conditional maximum likelihood +estimation based on the bracketing number. Our result shows that when source +distributions share certain similarities and the model is expressive enough, +multi-source training guarantees a sharper bound than single-source training. 
+We further instantiate the general theory on conditional Gaussian estimation +and deep generative models including autoregressive and flexible energy-based +models, by characterizing their bracketing numbers. The results highlight that +the number of sources and similarity among source distributions improve the +advantage of multi-source training. Simulations and real-world experiments +validate our theory. Code is available at: +https://github.com/ML-GSAI/Multi-Source-GM. + +摘要:大型生成模型的成功推動了範例轉移,利用大量多來源資料來增強模型功能。然而,這些來源之間的互動在理論上仍未得到充分探討。本文踏出了嚴謹分析條件生成模型中多來源訓練的第一步,其中每個條件代表一個不同的資料來源。具體來說,我們建立了一個基於括號數的條件最大似然估計的平均總變異距離中的通用分佈估計誤差界限。我們的結果表明,當來源分佈具有一定的相似性且模型具有足夠的表達力時,多來源訓練保證了比單來源訓練更嚴格的界限。我們進一步在條件高斯估計和深度生成模型(包括自迴歸和靈活的基於能量的模型)上例證了通用理論,通過表徵它們的括號數。結果強調了來源數和來源分佈之間的相似性提高了多來源訓練的優勢。模擬和真實世界的實驗驗證了我們的理論。程式碼可在 https://github.com/ML-GSAI/Multi-Source-GM 取得。 + +##### **A Statistical Case Against Empirical Human-AI Alignment** +2502.14581v1 by Julian Rodemann, Esteban Garces Arias, Christoph Luther, Christoph Jansen, Thomas Augustin + +Empirical human-AI alignment aims to make AI systems act in line with +observed human behavior. While noble in its goals, we argue that empirical +alignment can inadvertently introduce statistical biases that warrant caution. +This position paper thus advocates against naive empirical alignment, offering +prescriptive alignment and a posteriori empirical alignment as alternatives. We +substantiate our principled argument by tangible examples like human-centric +decoding of language models.
+ +摘要:經驗主義的人工智慧校準旨在使人工智慧系統根據觀察到的人類行為採取行動。儘管目標崇高,我們認為經驗主義校準可能會無意中引入需要謹慎對待的統計偏差。因此,本立場文件主張反對天真的經驗主義校準,提供規範性校準和後驗經驗主義校準作為替代方案。我們以具體的例子(例如以人為中心的語言模型解碼)來證明我們的原則性論點。 + +##### **ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification** +2502.14565v1 by Hyunseok Lee, Seunghyuk Oh, Jaehyung Kim, Jinwoo Shin, Jihoon Tack + +Self-awareness, i.e., the ability to assess and correct one's own generation, +is a fundamental aspect of human intelligence, making its replication in large +language models (LLMs) an important yet challenging task. Previous works tackle +this by employing extensive reinforcement learning or rather relying on large +external verifiers. In this work, we propose Refine via Intrinsic +Self-Verification (ReVISE), an efficient and effective framework that enables +LLMs to self-correct their outputs through self-verification. The core idea of +ReVISE is to enable LLMs to verify their reasoning processes and continually +rethink reasoning trajectories based on its verification. We introduce a +structured curriculum based upon online preference learning to implement this +efficiently. Specifically, as ReVISE involves two challenging tasks (i.e., +self-verification and reasoning correction), we tackle each task sequentially +using curriculum learning, collecting both failed and successful reasoning +paths to construct preference pairs for efficient training. During inference, +our approach enjoys natural test-time scaling by integrating self-verification +and correction capabilities, further enhanced by our proposed confidence-aware +decoding mechanism. Our experiments on various reasoning tasks demonstrate that +ReVISE achieves efficient self-correction and significantly improves reasoning +performance. 
+ +摘要:自我覺察,亦即評估和修正自身產出的能力,是人類智慧的基本面向,使其能在大型語言模型 (LLM) 中複製,是一項重要且具挑戰性的任務。先前的研究透過採用廣泛的強化學習或依賴大型外部驗證器來解決這個問題。在這項研究中,我們提出透過內在自我驗證 (ReVISE) 進行精煉,一個有效率且有效的架構,使 LLM 能透過自我驗證來自我修正其產出。ReVISE 的核心概念是讓 LLM 能驗證其推理過程,並根據驗證結果持續重新思考推理軌跡。我們導入一個建構於線上偏好學習的結構化課程,以有效率地實作這項功能。具體來說,由於 ReVISE 涉及兩項具有挑戰性的任務(即自我驗證和推理修正),我們使用課程學習循序漸進地處理每一項任務,收集失敗和成功的推理路徑,以建構偏好對,進行有效率的訓練。在推論期間,我們的作法透過整合自我驗證和修正功能,享有自然的測試時間擴充,並進一步透過我們提出的具備信心感知的解碼機制進行強化。我們在各種推理任務上的實驗顯示,ReVISE 達到有效率的自我修正,並顯著提升推理效能。 + +##### **Plan-over-Graph: Towards Parallelable LLM Agent Schedule** +2502.14563v1 by Shiqi Zhang, Xinbei Ma, Zouying Cao, Zhuosheng Zhang, Hai Zhao + +Large Language Models (LLMs) have demonstrated exceptional abilities in +reasoning for task planning. However, challenges remain under-explored for +parallel schedules. This paper introduces a novel paradigm, plan-over-graph, in +which the model first decomposes a real-life textual task into executable +subtasks and constructs an abstract task graph. The model then understands this +task graph as input and generates a plan for parallel execution. To enhance the +planning capability of complex, scalable graphs, we design an automated and +controllable pipeline to generate synthetic graphs and propose a two-stage +training scheme. Experimental results show that our plan-over-graph method +significantly improves task performance on both API-based LLMs and trainable +open-sourced LLMs. By normalizing complex tasks as graphs, our method naturally +supports parallel execution, demonstrating global efficiency. The code and data +are available at https://github.com/zsq259/Plan-over-Graph. + +摘要:大型語言模型 (LLM) 已展現出在任務規劃推理方面的非凡能力。然而,對於並行時程表的挑戰仍未充分探討。本文介紹了一個新穎的範例,即圖形規劃,其中模型首先將現實生活中的文字任務分解為可執行的子任務,並建構一個抽象任務圖。然後,模型將此任務圖理解為輸入,並產生一個並行執行的計畫。為了增強複雜、可擴充圖形的規劃能力,我們設計了一個自動化且可控的管道來產生合成圖形,並提出了一個兩階段訓練方案。實驗結果表明,我們的圖形規劃方法顯著提升了基於 API 的 LLM 和可訓練的開源 LLM 的任務效能。透過將複雜任務標準化為圖形,我們的模型自然支援並行執行,展現出整體效率。程式碼和資料可在 https://github.com/zsq259/Plan-over-Graph 取得。 + +##### **Can LLMs Predict Citation Intent? 
An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs** +2502.14561v1 by Paris Koloveas, Serafeim Chatzopoulos, Thanasis Vergoulis, Christos Tryfonopoulos + +This work investigates the ability of open Large Language Models (LLMs) to +predict citation intent through in-context learning and fine-tuning. Unlike +traditional approaches that rely on pre-trained models like SciBERT, which +require extensive domain-specific pretraining and specialized architectures, we +demonstrate that general-purpose LLMs can be adapted to this task with minimal +task-specific data. We evaluate twelve model variations across five prominent +open LLM families using zero, one, few, and many-shot prompting to assess +performance across scenarios. Our experimental study identifies the +top-performing model through extensive experimentation of in-context +learning-related parameters, which we fine-tune to further enhance task +performance. The results highlight the strengths and limitations of LLMs in +recognizing citation intents, providing valuable insights for model selection +and prompt engineering. Additionally, we make our end-to-end evaluation +framework and models openly available for future use. + +摘要:本研究探討開放式大型語言模型 (LLM) 透過情境學習和微調來預測引文意圖的能力。與依賴於預訓練模型(例如 SciBERT)的傳統方法不同,後者需要廣泛的特定領域預訓練和專業架構,我們證明了通用 LLM 可以使用最少的特定任務數據來適應此任務。我們使用零次、一次、少次和多次提示評估五個著名的開放式 LLM 家族中的十二個模型變體,以評估不同場景的效能。我們的實驗研究透過廣泛的實驗來識別情境學習相關參數中效能最佳的模型,我們微調這些參數以進一步增強任務效能。結果突顯了 LLM 在識別引文意圖方面的優點和限制,為模型選擇和提示工程提供了有價值的見解。此外,我們將端到端評估架構和模型公開供未來使用。 + +##### **Less is More: Improving LLM Alignment via Preference Data Selection** +2502.14560v1 by Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, Xiangnan He + +Direct Preference Optimization (DPO) has emerged as a promising approach for +aligning large language models with human preferences. While prior work mainly +extends DPO from the aspect of the objective function, we instead improve DPO +from the largely overlooked but critical aspect of data selection. 
+Specifically, we address the issue of parameter shrinkage caused by noisy data +by proposing a novel margin-maximization principle for dataset curation in DPO +training. To accurately estimate margins for data selection, we propose a +dual-margin guided approach that considers both external reward margins and +implicit DPO reward margins. Extensive experiments demonstrate that our method +reduces computational cost dramatically while improving performance. +Remarkably, by using just 10% of the Ultrafeedback dataset, our approach +achieves 3% to 8% improvements across various Llama and Mistral series models +on the AlpacaEval 2.0 benchmark. Furthermore, our approach seamlessly extends +to iterative DPO, yielding a roughly 3% improvement with 25% online data, +while further reducing training time. These results highlight the potential of +data selection strategies for advancing preference optimization. + +摘要:直接偏好最佳化 (DPO) 已成為一種有希望的方法,可將大型語言模型與人類偏好保持一致。雖然先前的研究主要從目標函數的角度延伸 DPO,但我們反而從資料選擇這個極易被忽略但至關重要的角度改進 DPO。 +具體來說,我們透過提出一個用於 DPO 訓練中資料集整理的新邊際最大化原則,來解決由雜訊資料造成的參數收縮問題。為了準確估計資料選擇的邊際,我們提出一個雙邊際引導方法,它同時考慮外部獎勵邊際和隱含 DPO 獎勵邊際。大規模的實驗證明,我們的這種方法大幅降低了運算成本,同時改善了效能。 +值得注意的是,我們的這種方法僅使用 Ultrafeedback 資料集的 10%,便在 AlpacaEval 2.0 基準上,在各種 Llama 和 Mistral 系列模型中取得了 3% 到 8% 的改進。此外,我們的這種方法可以無縫地延伸到迭代 DPO,在使用 25% 線上資料的情況下產生了大約 3% 的改進,同時進一步減少了訓練時間。這些結果突顯了資料選擇策略在推進偏好最佳化方面的潛力。 + +##### **FUIA: Model Inversion Attack against Federated Unlearning** +2502.14558v1 by Lei Zhou, Youwen Zhu + +With the introduction of regulations related to the "right to be forgotten", +federated learning (FL) is facing new privacy compliance challenges. To address +these challenges, researchers have proposed federated unlearning (FU). However, +existing FU research has primarily focused on improving the efficiency of +unlearning, with less attention paid to the potential privacy vulnerabilities +inherent in these methods.
To address this gap, we draw inspiration from +gradient inversion attacks in FL and propose the federated unlearning inversion +attack (FUIA). The FUIA is specifically designed for the three types of FU +(sample unlearning, client unlearning, and class unlearning), aiming to provide +a comprehensive analysis of the privacy leakage risks associated with FU. In +FUIA, the server acts as an honest-but-curious attacker, recording and +exploiting the model differences before and after unlearning to expose the +features and labels of forgotten data. FUIA significantly leaks the privacy of +forgotten data and can target all types of FU. This attack contradicts the goal +of FU to eliminate specific data influence, instead exploiting its +vulnerabilities to recover forgotten data and expose its privacy flaws. +Extensive experimental results show that FUIA can effectively reveal the +private information of forgotten data. To mitigate this privacy leakage, we +also explore two potential defense methods, although these come at the cost of +reduced unlearning effectiveness and the usability of the unlearned model. + +摘要:隨著「被遺忘權」相關法規的推出, +聯盟學習 (FL) 面臨新的隱私合規挑戰。為了應對 +這些挑戰,研究人員提出了聯盟取消學習 (FU)。然而, +現有的 FU 研究主要集中在提高取消學習的效率,較少關注這些方法中固有的潛在隱私漏洞。為了解決這個差距,我們從 +FL 中的梯度反演攻擊中汲取靈感,並提出聯盟取消學習反演 +攻擊 (FUIA)。FUIA 專門設計用於三種類型的 FU +(樣本取消學習、客戶端取消學習和類別取消學習),旨在提供 +對與 FU 相關的隱私洩露風險的全面分析。在 +FUIA 中,伺服器充當誠實但好奇的攻擊者,記錄並 +利用取消學習前後的模型差異來揭露遺忘資料的功能和標籤。FUIA 大幅洩露遺忘資料的隱私,並且可以針對所有類型的 FU。此攻擊與 FU 消除特定資料影響的目標相矛盾,而是利用其 +漏洞來恢復遺忘資料並揭露其隱私缺陷。廣泛的實驗結果表明 FUIA 可以有效揭露遺忘資料的私人資訊。為了減輕這種隱私洩露,我們 +還探索了兩種潛在的防禦方法,儘管這些方法以降低取消學習的有效性和已取消學習模型的可用性為代價。 + +##### **Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling** +2502.14553v1 by Eric Egli, Matteo Manica, Jannis Born + +Bytes form the basis of the digital world and thus are a promising building +block for multimodal foundation models. 
Recently, Byte Language Models (BLMs) +have emerged to overcome tokenization, yet the excessive length of bytestreams +requires new architectural paradigms. Therefore, we present the Multiscale Byte +Language Model (MBLM), a model-agnostic hierarchical decoder stack that allows +training with context windows of 5M bytes on a single GPU in full model +precision. We thoroughly examine MBLM's performance with Transformer and Mamba +blocks on both unimodal and multimodal tasks. Our experiments demonstrate that +hybrid architectures are efficient in handling extremely long byte sequences +during training while achieving near-linear generational efficiency. To the +best of our knowledge, we present the first evaluation of BLMs on visual Q&A +tasks and find that, despite serializing images and the absence of an encoder, +an MBLM with pure next-token prediction can match custom CNN-LSTM architectures +with designated classification heads. We show that MBLMs exhibit strong +adaptability in integrating diverse data representations, including pixel and +image filestream bytes, underlining their potential toward omnimodal foundation +models. Source code is publicly available at: +https://github.com/ai4sd/multiscale-byte-lm + +摘要:位元組構成數位世界的基礎,因此是多模態基礎模型的一個有前途的建構模組。最近,位元組語言模型 (BLM) 已應運而生,以克服標記化,但位元組串流的過長需要新的架構範例。因此,我們提出多尺度位元組語言模型 (MBLM),這是一個與模型無關的分層解碼器堆疊,允許在單一 GPU 上以完整的模型精度訓練 500 萬位元組的內容視窗。我們徹底檢驗了 MBLM 在單模態和多模態任務上使用 Transformer 和 Mamba 區塊的效能。我們的實驗證明,混合架構在處理訓練期間極長的位元組序列時很有效率,同時達到近乎線性的生成效率。據我們所知,我們提出在視覺問答任務上對 BLM 的首次評估,並發現,儘管序列化影像且沒有編碼器,但具有純粹下一個標記預測的 MBLM 可以匹配具有指定分類標頭的客製化 CNN-LSTM 架構。我們表明,MBLM 在整合各種資料表示形式方面表現出強大的適應性,包括像素和影像檔案串流位元組,強調它們朝向全模態基礎模型的潛力。原始碼已公開於: +https://github.com/ai4sd/multiscale-byte-lm + +##### **Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks** +2502.14546v1 by Maya Bechler-Speicher, Ben Finkelshtein, Fabrizio Frasca, Luis Müller, Jan Tönshoff, Antoine Siraudin, Viktor Zaverkin, Michael M.
Bronstein, Mathias Niepert, Bryan Perozzi, Mikhail Galkin, Christopher Morris + +While machine learning on graphs has demonstrated promise in drug design and +molecular property prediction, significant benchmarking challenges hinder its +further progress and relevance. Current benchmarking practices often lack focus +on transformative, real-world applications, favoring narrow domains like +two-dimensional molecular graphs over broader, impactful areas such as +combinatorial optimization, relational databases, or chip design. Additionally, +many benchmark datasets poorly represent the underlying data, leading to +inadequate abstractions and misaligned use cases. Fragmented evaluations and an +excessive focus on accuracy further exacerbate these issues, incentivizing +overfitting rather than fostering generalizable insights. These limitations +have prevented the development of truly useful graph foundation models. This +position paper calls for a paradigm shift toward more meaningful benchmarks, +rigorous evaluation protocols, and stronger collaboration with domain experts +to drive impactful and reliable advances in graph learning research, unlocking +the potential of graph learning. + +摘要:儘管圖形上的機器學習在藥物設計和分子屬性預測方面已展現潛力,但顯著的基準挑戰阻礙了其進一步進展和相關性。目前的基準實務往往缺乏對轉型性、真實世界應用的關注,偏好於狹窄的領域,例如二維分子圖形,而不是組合最佳化、關係資料庫或晶片設計等更廣泛、更有影響力的領域。此外,許多基準資料集無法充分表示基礎資料,導致抽象化不充分和使用案例錯位。支離破碎的評估和過度關注準確性進一步加劇了這些問題,激勵過度擬合,而不是培養可概括的見解。這些限制阻礙了真正有用的圖形基礎模型的開發。這篇立場文件呼籲將範例轉變為更有意義的基準、嚴格的評估協定,以及與領域專家的更強大合作,以推動圖形學習研究中具有影響力和可靠性的進展,釋放圖形學習的潛力。 + +##### **LLM-based User Profile Management for Recommender System** +2502.14541v1 by Seunghwan Bang, Hwanjun Song + +The rapid advancement of Large Language Models (LLMs) has opened new +opportunities in recommender systems by enabling zero-shot recommendation +without conventional training. 
Despite their potential, most existing works +rely solely on users' purchase histories, leaving significant room for +improvement by incorporating user-generated textual data, such as reviews and +product descriptions. Addressing this gap, we propose PURE, a novel LLM-based +recommendation framework that builds and maintains evolving user profiles by +systematically extracting and summarizing key information from user reviews. +PURE consists of three core components: a Review Extractor for identifying user +preferences and key product features, a Profile Updater for refining and +updating user profiles, and a Recommender for generating personalized +recommendations using the most current profile. To evaluate PURE, we introduce +a continuous sequential recommendation task that reflects real-world scenarios +by adding reviews over time and updating predictions incrementally. Our +experimental results on Amazon datasets demonstrate that PURE outperforms +existing LLM-based methods, effectively leveraging long-term user information +while managing token limitations. + +摘要:大型語言模型 (LLM) 的快速進步為推薦系統開啟了新的機會,它能實現零次學習推薦,而無需傳統訓練。儘管有潛力,但現有的大部分工作僅依賴於使用者的購買記錄,透過納入使用者產生的文字資料,例如評論和產品說明,仍有很大的改進空間。針對此差距,我們提出 PURE,一個新穎的基於 LLM 的推薦架構,透過系統性地從使用者評論中提取和總結關鍵資訊,建立並維護不斷演進的使用者檔案。PURE 由三個核心組成部分組成:一個評論萃取器,用於識別使用者的喜好和產品主要功能;一個檔案更新器,用於精煉和更新使用者檔案;一個推薦器,用於使用最新的檔案產生個人化推薦。為了評估 PURE,我們引入一個連續順序推薦任務,透過隨著時間新增評論和遞增更新預測,反映真實世界的場景。我們在 Amazon 資料集上的實驗結果證明,PURE 優於現有的基於 LLM 的方法,在管理符號限制的同時,有效地利用長期使用者資訊。 + +##### **LoRA-GGPO: Mitigating Double Descent in LoRA Fine-Tuning via Gradient-Guided Perturbation Optimization** +2502.14538v1 by Yupeng Chang, Chenlu Guo, Yi Chang, Yuan Wu + +Large Language Models (LLMs) have achieved remarkable success in natural +language processing, but their full fine-tuning remains resource-intensive. +Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation +(LoRA), have emerged as a practical solution by approximating parameter updates +with low-rank matrices. 
However, LoRA often exhibits a "double descent" +phenomenon during fine-tuning, where model performance degrades due to +overfitting and limited expressiveness caused by low-rank constraints. To +address this issue, we propose LoRA-GGPO (Gradient-Guided Perturbation +Optimization), a novel method that leverages gradient and weight norms to +generate targeted perturbations. By optimizing the sharpness of the loss +landscape, LoRA-GGPO guides the model toward flatter minima, mitigating the +double descent problem and improving generalization. Extensive experiments on +natural language understanding (NLU) and generation (NLG) tasks demonstrate +that LoRA-GGPO outperforms LoRA and its state-of-the-art variants. Furthermore, +extended experiments specifically designed to analyze the double descent +phenomenon confirm that LoRA-GGPO effectively alleviates this issue, producing +more robust and generalizable models. Our work provides a robust and efficient +solution for fine-tuning LLMs, with broad applicability in real-world +scenarios. The code is available at https://github.com/llm172/LoRA-GGPO. + +摘要:大型語言模型 (LLM) 在自然語言處理方面取得了顯著的成功,但它們的完全微調仍然需要大量資源。參數高效微調 (PEFT) 方法(例如低秩適應 (LoRA))已成為一種實用的解決方案,它通過低秩矩陣近似參數更新。然而,LoRA 在微調過程中經常表現出「雙重下降」現象,其中模型性能會因過度擬合和低秩約束導致的表達能力有限而下降。為了解決這個問題,我們提出了 LoRA-GGPO(梯度引導擾動優化),這是一種利用梯度和權重範數來產生目標擾動的新方法。通過優化損失函數曲面的陡度,LoRA-GGPO 引導模型朝向更平坦的最小值,從而減輕雙重下降問題並改善泛化能力。在自然語言理解 (NLU) 和生成 (NLG) 任務中進行的廣泛實驗表明,LoRA-GGPO 優於 LoRA 及其最先進的變體。此外,專門設計用於分析雙重下降現象的延伸實驗證實,LoRA-GGPO 有效地緩解了這個問題,產生了更強大且更具泛化能力的模型。我們的研究為微調 LLM 提供了一個強大且高效的解決方案,在現實世界場景中具有廣泛的適用性。代碼可在 https://github.com/llm172/LoRA-GGPO 獲得。 + +##### **CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models** +2502.14529v1 by Zhenhong Zhou, Zherui Li, Jie Zhang, Yuanhe Zhang, Kun Wang, Yang Liu, Qing Guo + +Large Language Model-based Multi-Agent Systems (LLM-MASs) have demonstrated +remarkable real-world capabilities, effectively collaborating to complete +complex tasks. 
While these systems are designed with safety mechanisms, such as +rejecting harmful instructions through alignment, their security remains +largely unexplored. This gap leaves LLM-MASs vulnerable to targeted +disruptions. In this paper, we introduce Contagious Recursive Blocking Attacks +(Corba), a novel and simple yet highly effective attack that disrupts +interactions between agents within an LLM-MAS. Corba leverages two key +properties: its contagious nature allows it to propagate across arbitrary +network topologies, while its recursive property enables sustained depletion of +computational resources. Notably, these blocking attacks often involve +seemingly benign instructions, making them particularly challenging to mitigate +using conventional alignment methods. We evaluate Corba on two widely-used +LLM-MASs, namely, AutoGen and Camel across various topologies and commercial +models. Additionally, we conduct more extensive experiments in open-ended +interactive LLM-MASs, demonstrating the effectiveness of Corba in complex +topology structures and open-source models. Our code is available at: +https://github.com/zhrli324/Corba. + +摘要:基於大型語言模型的多主體系統(LLM-MAS)已展現出卓越的真實世界能力,有效地協作以完成複雜任務。儘管這些系統設計有安全機制,例如透過對齊拒絕有害指令,但其安全性仍未得到充分探討。此一缺口讓 LLM-MAS 易受針對性的破壞。在本文中,我們介紹了傳染性遞迴封鎖攻擊(Corba),這是一種新穎且簡單但極為有效的攻擊,會破壞 LLM-MAS 中主體之間的互動。Corba 利用了兩個關鍵特性:其傳染性使其能夠在任意網路拓撲中傳播,而其遞迴特性則能持續耗盡運算資源。值得注意的是,這些封鎖攻擊通常涉及看似良性的指令,這使得使用傳統對齊方法來減輕攻擊特別具有挑戰性。我們在兩個廣泛使用的 LLM-MAS,即 AutoGen 和 Camel 上評估了 Corba,涵蓋了各種拓撲和商業模型。此外,我們在開放式互動 LLM-MAS 中進行了更廣泛的實驗,證明了 Corba 在複雜拓撲結構和開源模型中的有效性。我們的程式碼可在以下網址取得:https://github.com/zhrli324/Corba。 + +##### **Small Graph Is All You Need: DeepStateGNN for Scalable Traffic Forecasting** +2502.14525v1 by Yannick Wölker, Arash Hajisafi, Cyrus Shahabi, Matthias Renz + +We propose a novel Graph Neural Network (GNN) model, named DeepStateGNN, for +analyzing traffic data, demonstrating its efficacy in two critical tasks: +forecasting and reconstruction. 
Unlike typical GNN methods that treat each +traffic sensor as an individual graph node, DeepStateGNN clusters sensors into +higher-level graph nodes, dubbed Deep State Nodes, based on various similarity +criteria, resulting in a fixed number of nodes in a Deep State graph. The term +"Deep State" nodes is a play on words, referencing hidden networks of power +that, like these nodes, secretly govern traffic independently of visible +sensors. These Deep State Nodes are defined by several similarity factors, +including spatial proximity (e.g., sensors located nearby in the road network), +functional similarity (e.g., sensors on similar types of freeways), and +behavioral similarity under specific conditions (e.g., traffic behavior during +rain). This clustering approach allows for dynamic and adaptive node grouping, +as sensors can belong to multiple clusters and clusters may evolve over time. +Our experimental results show that DeepStateGNN offers superior scalability and +faster training, while also delivering more accurate results than competitors. +It effectively handles large-scale sensor networks, outperforming other methods +in both traffic forecasting and reconstruction accuracy. + +摘要:我們提出一個名為 DeepStateGNN 的新穎圖形神經網路 (GNN) 模型,用於分析交通數據,並展示其在兩個關鍵任務中的效能:預測和重建。與將每個交通感測器視為個別圖形節點的典型 GNN 方法不同,DeepStateGNN 會根據各種相似性準則將感測器群集到較高層級的圖形節點中,稱為 Deep State 節點,這會在 Deep State 圖形中產生固定數量的節點。「Deep State」節點這個術語是文字遊戲,指的是隱藏的權力網路,就像這些節點一樣,秘密地獨立於可見感測器管理交通。這些 Deep State 節點由幾個相似性因素定義,包括空間接近性(例如,位於道路網路中附近的感測器)、功能相似性(例如,位於類似類型高速公路上的感測器)以及特定條件下的行為相似性(例如,雨中的交通行為)。這種群集方法允許動態和自適應節點分組,因為感測器可以屬於多個群集,而且群集可能會隨著時間演變。我們的實驗結果顯示,DeepStateGNN 提供了卓越的可擴充性和更快的訓練速度,同時也比競爭對手提供了更準確的結果。它有效地處理了大規模感測器網路,在交通預測和重建準確度方面都優於其他方法。 + +##### **Generative adversarial networks vs large language models: a comparative study on synthetic tabular data generation** +2502.14523v1 by Austin A. Barr, Robert Rozman, Eddie Guo + +We propose a new framework for zero-shot generation of synthetic tabular +data. 
Using the large language model (LLM) GPT-4o and plain-language prompting, +we demonstrate the ability to generate high-fidelity tabular data without +task-specific fine-tuning or access to real-world data (RWD) for pre-training. +To benchmark GPT-4o, we compared the fidelity and privacy of LLM-generated +synthetic data against data generated with the conditional tabular generative +adversarial network (CTGAN), across three open-access datasets: Iris, Fish +Measurements, and Real Estate Valuation. Despite the zero-shot approach, GPT-4o +outperformed CTGAN in preserving means, 95% confidence intervals, bivariate +correlations, and data privacy of RWD, even at amplified sample sizes. Notably, +correlations between parameters were consistently preserved with appropriate +direction and strength. However, refinement is necessary to better retain +distributional characteristics. These findings highlight the potential of LLMs +in tabular data synthesis, offering an accessible alternative to generative +adversarial networks and variational autoencoders. + +摘要:我們提出一個新的架構,用於合成表格資料的零次學習產生。利用大型語言模型 (LLM) GPT-4o 和自然語言提示,我們證明了在沒有特定任務微調或取得真實世界資料 (RWD) 進行預訓練的情況下,產生高保真表格資料的能力。為了對 GPT-4o 進行基準測試,我們比較了 LLM 生成的合成資料與使用條件表格生成對抗網路 (CTGAN) 生成的資料在保真度和隱私性方面的表現,比較對象是三個開放取用的資料集:鳶尾花、魚類測量和房地產估價。儘管採用零次學習方法,GPT-4o 在保留平均值、95% 信賴區間、二元關聯和 RWD 的資料隱私方面都優於 CTGAN,即使在擴增的樣本大小下也是如此。值得注意的是,參數之間的關聯始終保持適當的方向和強度。然而,需要進行改進以更好地保留分佈特徵。這些發現突顯了 LLM 在表格資料合成中的潛力,為生成對抗網路和變異自動編碼器提供了可行的替代方案。 + +##### **MultiSlav: Using Cross-Lingual Knowledge Transfer to Combat the Curse of Multilinguality** +2502.14509v1 by Artur Kot, Mikołaj Koszowski, Wojciech Chojnowski, Mieszko Rutkowski, Artur Nowakowski, Kamil Guttmann, Mikołaj Pokrywka + +Does multilingual Neural Machine Translation (NMT) lead to the Curse of +Multilinguality, or does it provide Cross-lingual Knowledge Transfer within a +language family?
In this study, we explore multiple approaches for extending +the available data-regime in NMT and we prove cross-lingual benefits even in +0-shot translation regime for low-resource languages. With this paper, we +provide state-of-the-art open-source NMT models for translating between +selected Slavic languages. We released our models on the HuggingFace Hub +(https://hf.co/collections/allegro/multislav-6793d6b6419e5963e759a683) under +the CC BY 4.0 license. Slavic language family comprises morphologically rich +Central and Eastern European languages. Although counting hundreds of millions +of native speakers, Slavic Neural Machine Translation is under-studied in our +opinion. Recently, most NMT research focuses either on: high-resource languages +like English, Spanish, and German - in WMT23 General Translation Task 7 out of +8 task directions are from or to English; massively multilingual models +covering multiple language groups; or evaluation techniques. + +摘要:多語言神經機器翻譯 (NMT) 是否會導致多語言的詛咒,或在語言家族中提供跨語言知識轉移?在這項研究中,我們探討了多種擴展 NMT 中可用資料範圍的方法,並證明了即使在低資源語言的零次學習翻譯中也有跨語言的優點。透過這篇論文,我們提供了最先進的開源 NMT 模型,用於翻譯選定的斯拉夫語。我們在 HuggingFace Hub (https://hf.co/collections/allegro/multislav-6793d6b6419e5963e759a683) 下根據 CC BY 4.0 授權發布我們的模型。斯拉夫語系包含形態豐富的中歐和東歐語言。儘管擁有數億母語人士,但我們認為斯拉夫神經機器翻譯的研究不足。最近,大多數 NMT 研究都專注於:高資源語言,例如英語、西班牙語和德語 - 在 WMT23 一般翻譯任務中,8 個任務方向中有 7 個來自英語或翻譯成英語;涵蓋多個語言群組的大規模多語言模型;或評估技術。 + +##### **Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases** +2502.14507v1 by Rena Gao, Xuetong Wu, Tatsuki Kuribayashi, Mingrui Ye, Siya Qi, Carsten Roever, Yuanxing Liu, Zheng Yuan, Jey Han Lau + +This study evaluates Large Language Models' (LLMs) ability to simulate +non-native-like English use observed in human second language (L2) learners +interfered with by their native first language (L1). 
In dialogue-based +interviews, we prompt LLMs to mimic L2 English learners with specific L1s +(e.g., Japanese, Thai, Urdu) across seven languages, comparing their outputs to +real L2 learner data. Our analysis examines L1-driven linguistic biases, such +as reference word usage and avoidance behaviors, using information-theoretic +and distributional density measures. Results show that modern LLMs (e.g., +Qwen2.5, LLAMA3.3, DeepseekV3, GPT-4o) replicate L1-dependent patterns observed +in human L2 data, with distinct influences from various languages (e.g., +Japanese, Korean, and Mandarin significantly affect tense agreement, and Urdu +influences noun-verb collocations). Our results reveal the potential of LLMs +for L2 dialogue generation and evaluation for future educational applications. + +摘要:本研究評估大型語言模型 (LLM) 模擬非母語英語使用者的能力,這些使用者會受到母語 (L1) 干擾,而母語是第二語言 (L2) 學習者。在基於對話的訪談中,我們提示 LLM 模仿具有特定 L1(例如日語、泰語、烏爾都語)的 L2 英語學習者,並比較七種語言的輸出與真實的 L2 學習者資料。我們的分析使用資訊理論和分佈密度測量來檢視 L1 驅動的語言偏差,例如參考詞使用和避免行為。結果顯示,現代 LLM(例如 Qwen2.5、LLAMA3.3、DeepseekV3、GPT-4o)複製了在人類 L2 資料中觀察到的 L1 相依模式,並受到各種語言的明顯影響(例如,日語、韓語和普通話顯著影響時態一致性,而烏爾都語影響名詞動詞搭配)。我們的結果揭示了 LLM 在 L2 對話產生和評估方面的潛力,可供未來教育應用使用。 + +##### **PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models** +2502.14504v1 by Yu Meng, Kaiyuan Li, Chenran Huang, Chen Gao, Xinlei Chen, Yong Li, Xiaoping Zhang + +Large Vision-Language Models (LVLMs) have demonstrated remarkable +capabilities across a range of multimodal tasks. However, their inference +efficiency is constrained by the large number of visual tokens processed during +decoding. To address this challenge, we propose Per-Layer Per-Head Vision Token +Pruning (PLPHP), a two-level fine-grained pruning method including Layer-Level +Retention Rate Allocation and Head-Level Vision Token Pruning. Motivated by the +Vision Token Re-attention phenomenon across decoder layers, we dynamically +adjust token retention rates layer by layer. 
Layers that exhibit stronger +attention to visual information preserve more vision tokens, while layers with +lower vision attention are aggressively pruned. Furthermore, PLPHP applies +pruning at the attention head level, enabling different heads within the same +layer to independently retain critical context. Experiments on multiple +benchmarks demonstrate that PLPHP delivers an 18% faster decoding speed and +reduces the Key-Value Cache (KV Cache) size by over 50%, all at the cost of +0.46% average performance drop, while also achieving notable performance +improvements in multi-image tasks. These results highlight the effectiveness of +fine-grained token pruning and contribute to advancing the efficiency and +scalability of LVLMs. Our source code will be made publicly available. + +摘要:大型視覺語言模型 (LVLMs) 已在各種多模態任務中展現出非凡的能力。然而,其推理效率受到解碼過程中處理的大量視覺符號的限制。為了應對這一挑戰,我們提出逐層逐頭視覺符號剪枝 (PLPHP),這是一種包括層級保留率分配和頭級視覺符號剪枝的兩級細粒度剪枝方法。受解碼器層中視覺符號重新關注現象的啟發,我們動態地逐層調整符號保留率。對視覺資訊表現出更強關注力的層保留更多視覺符號,而視覺關注力較低的層則被積極剪枝。此外,PLPHP 在關注頭級別應用剪枝,使同一層中的不同頭部可以獨立保留關鍵上下文。在多個基準測試上的實驗表明,PLPHP 的解碼速度提高了 18%,且將鍵值快取 (KV 快取) 大小減少了 50% 以上,而代價僅為平均效能下降 0.46%,同時還在多影像任務中實現了顯著的效能提升。這些結果突顯了細粒度符號剪枝的有效性,並有助於提升 LVLMs 的效率和可擴充性。我們的原始碼將公開提供。 + +##### **How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?** +2502.14502v1 by Sergey Pletenev, Maria Marina, Daniil Moskovskiy, Vasily Konovalov, Pavel Braslavski, Alexander Panchenko, Mikhail Salnikov + +The performance of Large Language Models (LLMs) on many tasks is greatly +limited by the knowledge learned during pre-training and stored in the model's +parameters. Low-rank adaptation (LoRA) is a popular and efficient training +technique for updating or domain-specific adaptation of LLMs. In this study, we +investigate how new facts can be incorporated into the LLM using LoRA without +compromising the previously learned knowledge. We fine-tuned +Llama-3.1-8B-instruct using LoRA with varying amounts of new knowledge. 
Our +experiments have shown that the best results are obtained when the training +data contains a mixture of known and new facts. However, this approach is still +potentially harmful because the model's performance on external +question-answering benchmarks declines after such fine-tuning. When the +training data is biased towards certain entities, the model tends to regress to +few overrepresented answers. In addition, we found that the model becomes more +confident and refuses to provide an answer in only few cases. These findings +highlight the potential pitfalls of LoRA-based LLM updates and underscore the +importance of training data composition and tuning parameters to balance new +knowledge integration and general model capabilities. + +摘要:大型語言模型 (LLM) 在許多任務上的表現受到預訓練期間學到的知識和儲存在模型參數中的知識的極大限制。低階適應 (LoRA) 是一種流行且有效的訓練技術,用於更新或 LLM 的特定領域適應。在這項研究中,我們探討如何使用 LoRA 將新事實納入 LLM,同時不損害先前學到的知識。我們使用不同數量的知識微調 Llama-3.1-8B-instruct。我們的實驗表明,當訓練資料包含已知和新事實的混合時,會獲得最佳結果。然而,這種方法仍然具有潛在的危害性,因為模型在外部問答基準上的表現會在這種微調後下降。當訓練資料偏向於某些實體時,模型傾向於回歸到少數過度表示的答案。此外,我們發現模型變得更有信心,並且在極少數情況下拒絕提供答案。這些發現突顯了基於 LoRA 的 LLM 更新的潛在缺點,並強調了訓練資料組成和調整參數以平衡新知識整合和一般模型能力的重要性。 + +##### **Towards a Perspectivist Turn in Argument Quality Assessment** +2502.14501v1 by Julia Romberg, Maximilian Maurer, Henning Wachsmuth, Gabriella Lapesa + +The assessment of argument quality depends on well-established logical, +rhetorical, and dialectical properties that are unavoidably subjective: +multiple valid assessments may exist, there is no unequivocal ground truth. +This aligns with recent paths in machine learning, which embrace the +co-existence of different perspectives. However, this potential remains largely +unexplored in NLP research on argument quality. One crucial reason seems to be +the yet unexplored availability of suitable datasets. We fill this gap by +conducting a systematic review of argument quality datasets. 
We assign them to +a multi-layered categorization targeting two aspects: (a) What has been +annotated: we collect the quality dimensions covered in datasets and +consolidate them in an overarching taxonomy, increasing dataset comparability +and interoperability. (b) Who annotated: we survey what information is given +about annotators, enabling perspectivist research and grounding our +recommendations for future actions. To this end, we discuss datasets suitable +for developing perspectivist models (i.e., those containing individual, +non-aggregated annotations), and we showcase the importance of a controlled +selection of annotators in a pilot study. + +摘要:論證品質的評估取決於根深蒂固的邏輯、修辭和辯證屬性,這些屬性難免具有主觀性:可能存在多種有效的評估,沒有明確的真實依據。這與機器學習中最近的途徑一致,這些途徑接受了不同觀點的共存。然而,這種潛力在論證品質的 NLP 研究中仍然很大程度上未被探索。一個關鍵原因似乎是尚未探索合適的資料集的可用性。我們通過對論證品質資料集進行系統性回顧來填補這一空白。我們將它們分配到一個多層次分類,針對兩個方面:(a) 已註釋的內容:我們收集資料集中涵蓋的品質維度,並將它們整合到一個總體分類法中,提高資料集的可比性和互操作性。(b) 誰做了註釋:我們調查了關於註釋者的哪些資訊,使觀點主義研究成為可能,並為我們對未來行動的建議奠定基礎。為此,我們討論了適合開發觀點主義模型的資料集(即那些包含個別、非聚合註釋的資料集),並在試驗研究中展示了受控選擇註釋者的重要性。 + +##### **MLGym: A New Framework and Benchmark for Advancing AI Research Agents** +2502.14499v1 by Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, Roberta Raileanu + +We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for +evaluating and developing LLM agents on AI research tasks. This is the first +Gym environment for machine learning (ML) tasks, enabling research on +reinforcement learning (RL) algorithms for training such agents. MLGym-bench +consists of 13 diverse and open-ended AI research tasks from diverse domains +such as computer vision, natural language processing, reinforcement learning, +and game theory. 
Solving these tasks requires real-world AI research skills +such as generating new ideas and hypotheses, creating and processing data, +implementing ML methods, training models, running experiments, analyzing the +results, and iterating through this process to improve on a given task. We +evaluate a number of frontier large language models (LLMs) on our benchmarks +such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 +Pro. Our MLGym framework makes it easy to add new tasks, integrate and evaluate +models or agents, generate synthetic data at scale, as well as develop new +learning algorithms for training agents on AI research tasks. We find that +current frontier models can improve on the given baselines, usually by finding +better hyperparameters, but do not generate novel hypotheses, algorithms, +architectures, or substantial improvements. We open-source our framework and +benchmark to facilitate future research in advancing the AI research +capabilities of LLM agents. + +摘要:我們推出 Meta MLGym 和 MLGym-Bench,一個用於評估和開發 AI 研究任務中 LLM 代理的新架構和基準。這是第一個用於機器學習 (ML) 任務的 Gym 環境,可針對訓練此類代理的強化學習 (RL) 演算法進行研究。MLGym-bench 包含 13 項來自不同領域的開放式 AI 研究任務,例如電腦視覺、自然語言處理、強化學習和博弈論。解決這些任務需要實際的 AI 研究技能,例如產生新想法和假設、建立和處理資料、實作 ML 方法、訓練模型、執行實驗、分析結果,並透過此流程反覆運算來改善特定任務。我們在基準上評估許多前沿大型語言模型 (LLM),例如 Claude-3.5-Sonnet、Llama-3.1 405B、GPT-4o、o1-preview 和 Gemini-1.5 Pro。我們的 MLGym 架構讓新增任務、整合和評估模型或代理、大規模產生合成資料,以及開發新的學習演算法以訓練 AI 研究任務中的代理變得容易。我們發現目前的邊界模型可以改善既定的基準,通常是透過尋找更好的超參數,但不會產生新穎的假設、演算法、架構或實質性的改進。我們開放原始碼架構和基準,以促進未來在提升 LLM 代理的 AI 研究能力方面的研究。 + +##### **Stories that (are) Move(d by) Markets: A Causal Exploration of Market Shocks and Semantic Shifts across Different Partisan Groups** +2502.14497v1 by Felix Drinkall, Stefan Zohren, Michael McMahon, Janet B. 
Pierrehumbert + +Macroeconomic fluctuations and the narratives that shape them form a mutually +reinforcing cycle: public discourse can spur behavioural changes leading to +economic shifts, which then result in changes in the stories that propagate. We +show that shifts in semantic embedding space can be causally linked to +financial market shocks -- deviations from the expected market behaviour. +Furthermore, we show how partisanship can influence the predictive power of +text for market fluctuations and shape reactions to those same shocks. We also +provide some evidence that text-based signals are particularly salient during +unexpected events such as COVID-19, highlighting the value of language data as +an exogenous variable in economic forecasting. Our findings underscore the +bidirectional relationship between news outlets and market shocks, offering a +novel empirical approach to studying their effect on each other. + +摘要:宏觀經濟波動與形塑它們的敘事形成一個相互強化的循環:公共論述可能激發導致經濟變化的行為改變,進而導致宣傳故事的改變。我們表明,語義嵌入空間的轉變可能與金融市場震盪(與預期的市場行為的偏差)有因果關係。此外,我們展示了黨派立場如何影響文字對市場波動的預測能力,以及如何形塑對這些震盪的反應。我們還提供了一些證據,證明在 COVID-19 等意外事件期間,基於文字的信號特別顯著,突顯了語言資料在經濟預測中作為外生變數的價值。我們的研究結果強調了新聞媒體與市場震盪之間的雙向關係,提供了一種研究它們對彼此影響的新穎實證方法。 + +##### **Enhancing Language Multi-Agent Learning with Multi-Agent Credit Re-Assignment for Interactive Environment Generalization** +2502.14496v1 by Zhitao He, Zijun Liu, Peng Li, May Fung, Ming Yan, Ji Zhang, Fei Huang, Yang Liu + +LLM-based agents have made significant advancements in interactive +environments, such as mobile operations and web browsing, and other domains +beyond computer using. Current multi-agent systems universally excel in +performance, compared to single agents, but struggle with generalization across +environments due to predefined roles and inadequate strategies for generalizing +language agents. The challenge of achieving both strong performance and good +generalization has hindered the progress of multi-agent systems for interactive +environments. 
To address these issues, we propose CollabUIAgents, a multi-agent +reinforcement learning framework with a novel multi-agent credit re-assignment +(CR) strategy, assigning process rewards with LLMs rather than +environment-specific rewards and learning with synthesized preference data, in +order to foster generalizable, collaborative behaviors among the role-free +agents' policies. Empirical results show that our framework improves both +performance and cross-environment generalizability of multi-agent systems. +Moreover, our 7B-parameter system achieves results on par with or exceeding strong +closed-source models and the LLM that guides the CR. We also provide insights +into using granular CR rewards effectively for environment generalization, and +accommodating trained LLMs in multi-agent systems. + +摘要:基於 LLM 的代理在互動式環境中取得重大進展,例如行動運算和網頁瀏覽,以及電腦使用以外的其他領域。與單一代理相比,目前的 Multi-Agent 系統在效能上普遍表現出色,但由於預先定義的角色和不適當的語言代理概化策略,導致難以跨環境概化。在互動式環境中,同時達成強大效能和良好概化的挑戰,阻礙了 Multi-Agent 系統的進展。為了解決這些問題,我們提出 CollabUIAgents,這是一個 Multi-Agent 強化學習架構,具備創新的 Multi-Agent 信用重新分配 (CR) 策略,使用 LLM 而不是特定於環境的獎勵來分配程序獎勵,並透過綜合偏好資料進行學習,以促進無角色代理政策之間可概化的協作行為。經驗結果顯示,我們的架構同時改善了 Multi-Agent 系統的效能和跨環境概化能力。此外,我們的 7B 參數系統在效能上與強大的閉源模型和引導 CR 的 LLM 相當或超越它們。我們也提供見解,說明如何有效地使用細粒化的 CR 獎勵來進行環境概化,以及如何在 Multi-Agent 系統中容納受過訓練的 LLM。 + +##### **StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following** +2502.14494v1 by Jinnan Li, Jinzhe Li, Yue Wang, Yi Chang, Yuan Wu + +Multi-turn instruction following capability constitutes a core competency of +large language models (LLMs) in real-world applications. Existing evaluation +benchmarks predominantly focus on fine-grained constraint satisfaction and +domain-specific capability assessment, yet overlook the crucial structural +dependency between dialogue turns that distinguishes multi-turn from +single-turn interactions.
This structural dependency not only reflects user +intent but also establishes a second dimension for instruction following +evaluation beyond constraint satisfaction. To address this gap, we propose +StructFlowBench, a multi-turn instruction following benchmark with structural +flow modeling. The benchmark innovatively defines a structural flow framework +comprising six fundamental inter-turn relationships, which not only introduces +novel structural constraints for model evaluation but also serves as generation +parameters for creating customized dialogue flows tailored to specific +scenarios. Adopting established LLM-based automatic evaluation methodologies, +we conduct systematic evaluations of 13 leading open-source and closed-source +LLMs. Experimental results reveal significant deficiencies in current models' +comprehension of multi-turn dialogue structures. The code is available at +https://github.com/MLGroupJLU/StructFlowBench. + +摘要:多輪指令遵循能力構成大型語言模型 (LLM) 在現實世界應用中的核心能力。現有的評估基準主要專注於細粒度的約束滿足和特定領域的能力評估,卻忽略了多輪與單輪互動之間區別的關鍵結構依賴性。這種結構依賴性不僅反映了使用者的意圖,也為指令遵循評估建立了超越約束滿足的第二個維度。為了解決這個差距,我們提出了 StructFlowBench,一個具有結構流建模的多輪指令遵循基準。該基準創新地定義了一個結構流框架,包含六個基本的回合間關係,這不僅引入了模型評估的新結構約束,還可用作生成參數,用於創建針對特定場景定制的對話流。採用已建立的基於 LLM 的自動評估方法,我們對 13 個領先的開源和閉源 LLM 進行了系統評估。實驗結果揭示了當前模型在理解多輪對話結構方面存在顯著缺陷。程式碼可在 https://github.com/MLGroupJLU/StructFlowBench 取得。 + +##### **Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk** +2502.14491v1 by Elija Perrier + +Evaluating AI safety requires statistically rigorous methods and risk metrics +for understanding how the use of AI affects aggregated risk. However, much AI +safety literature focuses upon risks arising from AI models in isolation, +lacking consideration of how modular use of AI affects risk distribution of +workflow components or overall risk metrics.
There is also a lack of +statistical grounding enabling sensitisation of risk models in the presence or +absence of AI to estimate causal contributions of AI. This is in part due to +the dearth of AI impact data upon which to fit distributions. In this work, we +address these gaps in two ways. First, we demonstrate how scenario modelling +(grounded in established statistical techniques such as Markov chains, copulas +and Monte Carlo simulation) can be used to model AI risk holistically. Second, +we show how lookalike distributions from phenomena analogous to AI can be used +to estimate AI impacts in the absence of directly observable data. We +demonstrate the utility of our methods for benchmarking cumulative AI risk via +risk analysis of logistic scenario simulations. + +摘要:評估 AI 安全性需要嚴格的統計方法和風險指標,以了解 AI 的使用如何影響累積風險。然而,許多 AI 安全性文獻著重於 AI 模型孤立產生的風險,缺乏考量 AI 的模組化使用如何影響工作流程組件的風險分佈或整體風險指標。在有或沒有 AI 的情況下,統計基礎也缺乏讓風險模型敏感化的能力,以估計 AI 的因果關係貢獻。這部分是因為缺乏 AI 影響資料來擬合分佈。在這項研究中,我們以兩種方式解決這些差距。首先,我們展示情境建模(建立在已建立的統計技術上,例如馬可夫鏈、copula 和蒙地卡羅模擬)如何用於整體建模 AI 風險。其次,我們展示如何使用類似於 AI 現象的相似分佈來估計在沒有直接可觀察資料的情況下 AI 的影響。我們透過後勤情境模擬的風險分析,展示了我們的方法對於評量累積 AI 風險的效用。 + +##### **Temporal Misalignment and Probabilistic Neurons** +2502.14487v1 by Velibor Bojković, Xiaofeng Wu, Bin Gu + +Spiking Neural Networks (SNNs) offer a more energy-efficient alternative to +Artificial Neural Networks (ANNs) by mimicking biological neural principles, +establishing them as a promising approach to mitigate the increasing energy +demands of large-scale neural models. However, fully harnessing the +capabilities of SNNs remains challenging due to their discrete signal +processing and temporal dynamics. ANN-SNN conversion has emerged as a practical +approach, enabling SNNs to achieve competitive performance on complex machine +learning tasks.
In this work, we identify a phenomenon in the ANN-SNN +conversion framework, termed temporal misalignment, in which random spike +rearrangement across SNN layers leads to performance improvements. Based on +this observation, we introduce biologically plausible two-phase probabilistic +(TPP) spiking neurons, further enhancing the conversion process. We demonstrate +the advantages of our proposed method both theoretically and empirically +through comprehensive experiments on CIFAR-10/100, CIFAR10-DVS, and ImageNet +across a variety of architectures, achieving state-of-the-art results. + +摘要:脈衝神經網路 (SNN) 模仿生物神經原理,提供了一種比人工神經網路 (ANN) 更省能的替代方案,確立了它們作為緩解大型神經模型日益增長能耗需求的一種有前途的方法。然而,由於 SNN 的離散訊號處理和時間動態,要充分利用 SNN 的功能仍然具有挑戰性。ANN-SNN 轉換已經成為一種實用的方法,使 SNN 能夠在複雜機器學習任務中實現競爭性能。在這項工作中,我們在 ANN-SNN 轉換框架中發現了一種現象,稱為時間錯位,其中隨機脈衝在 SNN 層之間重新排列會導致性能提升。基於這一觀察,我們引入了生物學上合理的兩階段機率 (TPP) 脈衝神經元,進一步增強了轉換過程。我們通過在 CIFAR-10/100、CIFAR10-DVS 和 ImageNet 上對各種架構進行綜合實驗,從理論和經驗上證明了我們提出的方法的優點,取得了最先進的結果。 + +##### **How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation** +2502.14486v1 by Zhuohang Long, Siyuan Wang, Shujun Liu, Yuhang Lai, Xuanjing Huang, Zhongyu Wei + +Jailbreak attacks, where harmful prompts bypass generative models' built-in +safety, raise serious concerns about model vulnerability. While many defense +methods have been proposed, the trade-offs between safety and helpfulness, and +their application to Large Vision-Language Models (LVLMs), are not well +understood. This paper systematically examines jailbreak defenses by reframing +the standard generation task as a binary classification problem to assess model +refusal tendencies for both harmful and benign queries. We identify two key +defense mechanisms: safety shift, which increases refusal rates across all +queries, and harmfulness discrimination, which improves the model's ability to +distinguish between harmful and benign inputs. 
Using these mechanisms, we +develop two ensemble defense strategies, inter-mechanism ensembles and +intra-mechanism ensembles, to balance safety and helpfulness. Experiments on the +MM-SafetyBench and MOSSBench datasets with LLaVA-1.5 models show that these +strategies effectively improve model safety or optimize the trade-off between +safety and helpfulness. + +摘要:越獄攻擊,其中有害提示繞過生成模型內建的安全機制,引發了對模型漏洞的嚴重疑慮。雖然已提出許多防禦方法,但安全性與有益性之間的取捨,以及它們在大型視覺語言模型 (LVLMs) 中的應用,尚未得到充分理解。本文透過將標準生成任務重新定義為二元分類問題,系統性地檢視越獄防禦,以評估模型對有害和良性查詢的拒絕傾向。我們找出兩種關鍵的防禦機制:安全轉移,這會提高所有查詢的拒絕率,以及危害區分,這會提升模型區分有害和良性輸入的能力。使用這些機制,我們開發出兩種整體防禦策略,機制間整體和機制內整體,以平衡安全性與有益性。在使用 LLaVA-1.5 模型的 MM-SafetyBench 和 MOSSBench 資料集上進行的實驗顯示,這些策略有效地提升了模型安全性,或最佳化了安全性與有益性之間的取捨。 + +##### **NLoRA: Nyström-Initiated Low-Rank Adaptation for Large Language Models** +2502.14482v1 by Chenlu Guo, Yuan Wu, Yi Chang + +Parameter-efficient fine-tuning (PEFT) is essential for adapting large +language models (LLMs), with low-rank adaptation (LoRA) being the most popular +approach. However, LoRA suffers from slow convergence, and some recent LoRA +variants, such as PiSSA, primarily rely on Singular Value Decomposition (SVD) +for initialization, leading to expensive computation. To mitigate these +problems, we use the Nyström method, which follows a three-matrix +manipulation. We first introduce StructuredLoRA (SLoRA), which investigates +adding a small intermediate matrix between the low-rank matrices A and B. +Secondly, we propose NyströmLoRA (NLoRA), which leverages Nyström-based +initialization for SLoRA to improve its effectiveness and efficiency. Finally, +we propose IntermediateTune (IntTune), which explores fine-tuning exclusively +on the intermediate matrix of NLoRA to further boost LLM efficiency. We +evaluate our methods on five natural language generation (NLG) tasks and eight +natural language understanding (NLU) tasks.
On GSM8K, SLoRA and NLoRA achieve +accuracies of 56.48% and 57.70%, surpassing LoRA by 33.52% and 36.41%, with +only 3.67 million additional trainable parameters. IntTune improves average NLG +performance over LoRA by 7.45% while using only 1.25% of its parameters. These +results demonstrate the efficiency and effectiveness of our approach in +enhancing model performance with minimal parameter overhead. + +摘要:參數高效微調 (PEFT) 對於調整大型語言模型 (LLM) 至關重要,其中低秩調整 (LoRA) 是最受歡迎的方法。然而,LoRA 存在收斂速度慢的問題,而一些最近的 LoRA 變體,例如 PiSSA,主要依賴奇異值分解 (SVD) 進行初始化,導致運算成本高昂。為了減輕這些問題,我們使用了 Nyström 方法,它遵循三矩陣操作。我們首先介紹 StructuredLoRA (SLoRA),它研究在低秩矩陣 A 和 B 之間添加一個小的中間矩陣。其次,我們提出了 NyströmLoRA (NLoRA),它利用基於 Nyström 的初始化方法為 SLoRA 提升其有效性和效率。最後,我們提出了 IntermediateTune (IntTune),它探討了僅對 NLoRA 的中間矩陣進行微調,以進一步提升 LLM 效率。我們在五項自然語言生成 (NLG) 任務和八項自然語言理解 (NLU) 任務上評估了我們的這些方法。在 GSM8K 上,SLoRA 和 NLoRA 分別達到了 56.48% 和 57.70% 的準確率,比 LoRA 高出 33.52% 和 36.41%,而僅增加了 367 萬個可訓練參數。IntTune 在僅使用 LoRA 1.25% 的參數的情況下,將平均 NLG 效能提升了 7.45%。這些結果證明了我們的方法在以最少的參數開銷提升模型效能方面的效率和有效性。 + +##### **Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression** +2502.14477v1 by Haoyu Wang, Tong Teng, Tianyu Guo, An Xiao, Duyu Tang, Hanting Chen, Yunhe Wang + +Handling long-context sequences efficiently remains a significant challenge +in large language models (LLMs). Existing methods for token selection in +sequence extrapolation either employ a permanent eviction strategy or select +tokens by chunk, which may lead to the loss of critical information. We propose +Efficient Selective Attention (ESA), a novel approach that extends context +length by efficiently selecting the most critical tokens at the token level to +compute attention. ESA reduces the computational complexity of token selection +by compressing query and key vectors into lower-dimensional representations. We +evaluate ESA on long sequence benchmarks with maximum lengths up to 256k using +open-source LLMs with context lengths of 8k and 32k.
ESA outperforms other +selective attention methods, especially in tasks requiring the retrieval of +multiple pieces of information, achieving comparable performance to +full-attention extrapolation methods across various tasks, with superior +results in certain tasks. + +摘要:在大型語言模型 (LLM) 中,有效處理長語境序列仍然是一項重大挑戰。現有的序列外推標記選擇方法採用永久驅逐策略或按塊選擇標記,這可能會導致關鍵資訊遺失。我們提出高效選擇性注意 (ESA),這是一種新穎的方法,它透過在標記層級有效選擇最關鍵的標記來計算注意,從而延伸語境長度。ESA 透過將查詢和關鍵向量壓縮成較低維度的表示,來降低標記選擇的運算複雜度。我們使用開放原始碼 LLM,在語境長度為 8k 和 32k 的情況下,對長序列基準進行評估,最大長度達 256k。ESA 的表現優於其他選擇性注意方法,特別是在需要擷取多條資訊的任務中,在各種任務中達到與全注意外推方法相當的效能,並且在某些任務中獲得更佳的結果。 + +##### **Argument-Based Comparative Question Answering Evaluation Benchmark** +2502.14476v1 by Irina Nikishina, Saba Anwar, Nikolay Dolgov, Maria Manina, Daria Ignatenko, Viktor Moskvoretskii, Artem Shelmanov, Tim Baldwin, Chris Biemann + +In this paper, we aim to solve the problems standing in the way of automatic +comparative question answering. To this end, we propose an evaluation framework +to assess the quality of comparative question answering summaries. We formulate +15 criteria for assessing comparative answers created using manual annotation +and annotation from 6 large language models and two comparative question +answering datasets. We perform our tests using several LLMs and manual +annotation under different settings and demonstrate the constituency of both +evaluations. Our results demonstrate that the Llama-3 70B Instruct model +demonstrates the best results for summary evaluation, while GPT-4 is the best +for answering comparative questions. All used data, code, and evaluation +results are publicly +available at https://anonymous.4open.science/r/cqa-evaluation-benchmark-4561/README.md.
+ +摘要:在本文中,我們旨在解決阻礙自動比較性問題解答的難題。為此,我們提出一個評估框架,用於評估比較性問題解答摘要的品質。我們制定了 15 項準則,用於評估使用手動標註和來自 6 個大型語言模型和兩個比較性問題解答資料集的標註所建立的比較性答案。我們在不同的設定下使用幾個 LLM 和手動標註執行測試,並展示兩種評估的組成。我們的結果表明,Llama-3 70B Instruct 模型在摘要評估中表現最佳,而 GPT-4 在回答比較性問題方面表現最佳。所有使用過的資料、程式碼和評估結果均公開可用,網址為 https://anonymous.4open.science/r/cqa-evaluation-benchmark-4561/README.md 。 + +##### **Enhancing Smart Environments with Context-Aware Chatbots using Large Language Models** +2502.14469v1 by Aurora Polo-Rodríguez, Laura Fiorini, Erika Rovini, Filippo Cavallo, Javier Medina-Quero + +This work presents a novel architecture for context-aware interactions within +smart environments, leveraging Large Language Models (LLMs) to enhance user +experiences. Our system integrates user location data obtained through UWB tags +and sensor-equipped smart homes with real-time human activity recognition (HAR) +to provide a comprehensive understanding of user context. This contextual +information is then fed to an LLM-powered chatbot, enabling it to generate +personalised interactions and recommendations based on the user's current +activity and environment. This approach moves beyond traditional static chatbot +interactions by dynamically adapting to the user's real-time situation. A case +study conducted from a real-world dataset demonstrates the feasibility and +effectiveness of our proposed architecture, showcasing its potential to create +more intuitive and helpful interactions within smart homes. The results +highlight the significant benefits of integrating LLM with real-time activity +and location data to deliver personalised and contextually relevant user +experiences.
+ +摘要:本研究提出了一種創新的架構,用於在智慧環境中進行情境感知互動,利用大型語言模型 (LLM) 來提升使用者體驗。我們的系統整合了透過超寬頻標籤取得的使用者位置資料,以及配備感測器的智慧家庭,並具備即時人類活動辨識 (HAR),以全面了解使用者的情境。接著,將這些情境資訊輸入 LLM 驅動的聊天機器人,讓它能根據使用者的當前活動和環境產生個人化的互動和建議。這種方法超越了傳統的靜態聊天機器人互動,能動態地適應使用者的即時狀況。從真實世界資料集進行的案例研究,展示了我們提出的架構的可行性和有效性,突顯出它在智慧家庭中創造更直覺且有用的互動的潛力。結果突顯了將 LLM 與即時活動和位置資料整合,以提供個人化且與情境相關的使用者體驗的顯著優點。 + +##### **Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing** +2502.14458v1 by Aviv Bick, Tobias Katsch, Nimit Sohoni, Arjun Desai, Albert Gu + +We introduce Llamba, a family of efficient recurrent language models +distilled from Llama-3.x into the Mamba architecture. The series includes +Llamba-1B, Llamba-3B, and Llamba-8B, which achieve higher inference throughput +and handle significantly larger batch sizes than Transformer-based models while +maintaining comparable benchmark performance. Furthermore, Llamba demonstrates +the effectiveness of cross-architecture distillation using MOHAWK (Bick et al., +2024), achieving these results with less than 0.1% of the training data +typically used for models of similar size. To take full advantage of their +efficiency, we provide an optimized implementation of Llamba for +resource-constrained devices such as smartphones and edge platforms, offering a +practical and memory-efficient alternative to Transformers. Overall, Llamba +improves the tradeoff between speed, memory efficiency, and performance, making +high-quality language models more accessible. 
+ +摘要:我們推出 Llamba,一種高效的遞迴語言模型家族,從 Llama-3.x 萃取到 Mamba 架構中。該系列包含 Llamba-1B、Llamba-3B 和 Llamba-8B,它們比基於 Transformer 的模型實現更高的推理吞吐量,並處理顯著更大的批次大小,同時保持可比較的基準效能。此外,Llamba 證明了使用 MOHAWK(Bick 等人,2024 年)進行跨架構萃取的有效性,在訓練資料不到類似大小模型通常使用的 0.1% 的情況下實現了這些結果。為了充分利用其效率,我們為 Llamba 提供了針對資源受限裝置(例如智慧型手機和邊緣平台)的最佳化實作,提供實用且記憶體效率高的 Transformer 替代方案。總體而言,Llamba 改善了速度、記憶體效率和效能之間的權衡,讓高品質語言模型更易於取得。 + +##### **Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization** +2502.14456v1 by Ran Ding, Ziyu Zhang, Ying Zhu, Ziqian Kong, Peilan Xu + +To enhance tourists' experiences and immersion, this paper proposes a +narrative-driven travel planning framework called NarrativeGuide, which +generates a geoculturally-grounded narrative script for travelers, offering a +novel, role-playing experience for their journey. In the initial stage, +NarrativeGuide constructs a knowledge graph for attractions within a city, then +configures the worldview, character setting, and exposition based on the +knowledge graph. Using this foundation, the knowledge graph is combined to +generate an independent scene unit for each attraction. During the itinerary +planning stage, NarrativeGuide models narrative-driven travel planning as an +optimization problem, utilizing a genetic algorithm (GA) to refine the +itinerary. Before evaluating the candidate itinerary, transition scripts are +generated for each pair of adjacent attractions, which, along with the scene +units, form a complete script. The weighted sum of script coherence, travel +time, and attraction scores is then used as the fitness value to update the +candidate solution set. Experimental results across four cities, i.e., Nanjing +and Yangzhou in China, Paris in France, and Berlin in Germany, demonstrate +significant improvements in narrative coherence and cultural fit, alongside a +notable reduction in travel time and an increase in the quality of visited +attractions. 
Our study highlights that incorporating external evolutionary +optimization effectively addresses the limitations of large language models in +travel planning. Our code is available at +https://github.com/Evan01225/Narrative-Driven-Travel-Planning. + +摘要:為了增強遊客的體驗和沉浸感,本文提出了一個名為 NarrativeGuide 的敘事驅動旅遊規劃框架,它會為旅客產生一個以地理文化為基礎的敘事腳本,為他們的旅程提供一個新穎的角色扮演體驗。在初始階段,NarrativeGuide 會為城市內的景點建立一個知識圖譜,然後根據知識圖譜配置世界觀、角色設定和說明。利用這個基礎,知識圖譜會與每個景點結合,為其產生一個獨立的場景單元。在行程規劃階段,NarrativeGuide 將敘事驅動的旅遊規劃建模為一個最佳化問題,利用遺傳演算法 (GA) 來優化行程。在評估候選行程之前,會為每對相鄰景點產生過場腳本,這些腳本會與場景單元一起形成一個完整的腳本。接著,將腳本連貫性、旅遊時間和景點分數的加權和用作適應值,以更新候選解集。在四個城市(即中國的南京和揚州、法國的巴黎和德國的柏林)進行的實驗結果顯示,敘事連貫性和文化契合度都有顯著的提升,同時旅遊時間大幅減少,且所參觀景點的品質也提升了。我們的研究強調,納入外部演化最佳化能有效解決大型語言模型在旅遊規劃中的限制。我們的程式碼可在 https://github.com/Evan01225/Narrative-Driven-Travel-Planning 取得。 + +##### **Optimal word order for non-causal text generation with Large Language Models: the Spanish case** +2502.14451v1 by Andrea Busto-Castiñeira, Silvia García-Méndez, Francisco de Arriba-Pérez, Francisco J. González-Castaño + +Natural Language Generation (NLG) popularity has increased owing to the +progress in Large Language Models (LLMs), with zero-shot inference +capabilities. However, most neural systems utilize decoder-only causal +(unidirectional) transformer models, which are effective for English but may +reduce the richness of languages with less strict word order, subject omission, +or different relative clause attachment preferences. This is the first work +that analytically addresses optimal text generation order for non-causal +language models. We present a novel Viterbi algorithm-based methodology for +maximum likelihood word order estimation. We analyze the non-causal +most-likelihood order probability for NLG in Spanish and, then, the probability +of generating the same phrases with Spanish causal NLG. This comparative +analysis reveals that causal NLG prefers English-like SVO structures.
We also +analyze the relationship between optimal generation order and causal +left-to-right generation order using Spearman's rank correlation. Our results +demonstrate that the ideal order predicted by the maximum likelihood estimator +is not closely related to the causal order and may be influenced by the +syntactic structure of the target sentence. + +摘要:自然語言生成 (NLG) 的普及歸功於大型語言模型 (LLM) 的進步,以及零次學習推論能力。然而,大多數神經系統使用僅解碼器因果 (單向) Transformer模型,這對英語很有效,但可能會減少語序較不嚴謹、省略主詞或相對從句附加偏好不同的語言的豐富性。這是第一個針對非因果語言模型分析性地解決最佳文字生成順序的研究。我們提出了一種基於維特比演算法的新方法,用於最大似然詞序估計。我們分析了西班牙語 NLG 的非因果最大似然順序機率,然後分析了使用西班牙語因果 NLG 生成相同短語的機率。這種比較分析顯示,因果 NLG 偏好英語式的 SVO 結構。我們還使用 Spearman 等級相關性分析最佳生成順序和因果從左到右生成順序之間的關係。我們的結果表明,最大似然估計器預測的理想順序與因果順序沒有密切關係,並且可能會受到目標句子的語法結構影響。 + + +### Knowledge Graphs +|Publish Date|Title|Authors|Homepage|Code| +| :---: | :---: | :---: | :---: | :---: | +|**2025-02-20**|**GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks**|Jianwen Luo et.al.|[2502.14848v1](http://arxiv.org/abs/2502.14848v1)|null| +|**2025-02-20**|**From RAG to Memory: Non-Parametric Continual Learning for Large Language Models**|Bernal Jiménez Gutiérrez et.al.|[2502.14802v1](http://arxiv.org/abs/2502.14802v1)|[link](https://github.com/osu-nlp-group/hipporag)| +|**2025-02-20**|**Plan-over-Graph: Towards Parallelable LLM Agent Schedule**|Shiqi Zhang et.al.|[2502.14563v1](http://arxiv.org/abs/2502.14563v1)|null| +|**2025-02-20**|**Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization**|Ran Ding et.al.|[2502.14456v1](http://arxiv.org/abs/2502.14456v1)|null| +|**2025-02-20**|**Fact or Guesswork? 
Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment**|Jiaxi Li et.al.|[2502.14275v1](http://arxiv.org/abs/2502.14275v1)|null| +|**2025-02-20**|**Mitigating Lost-in-Retrieval Problems in Retrieval Augmented Multi-Hop Question Answering**|Rongzhi Zhu et.al.|[2502.14245v1](http://arxiv.org/abs/2502.14245v1)|null| +|**2025-02-20**|**NLP-AKG: Few-Shot Construction of NLP Academic Knowledge Graph Based on LLM**|Jiayin Lan et.al.|[2502.14192v1](http://arxiv.org/abs/2502.14192v1)|null| +|**2025-02-19**|**Object-centric Binding in Contrastive Language-Image Pretraining**|Rim Assouel et.al.|[2502.14113v1](http://arxiv.org/abs/2502.14113v1)|null| +|**2025-02-19**|**Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning**|Cole Gawin et.al.|[2502.14086v1](http://arxiv.org/abs/2502.14086v1)|null| +|**2025-02-19**|**Neurosymbolic artificial intelligence via large language models and coherence-driven inference**|Steve Huntsman et.al.|[2502.13953v1](http://arxiv.org/abs/2502.13953v1)|null| +|**2025-02-19**|**Complex Ontology Matching with Large Language Model Embeddings**|Guilherme Sousa et.al.|[2502.13619v1](http://arxiv.org/abs/2502.13619v1)|null| +|**2025-02-19**|**Are Large Language Models In-Context Graph Learners?**|Jintang Li et.al.|[2502.13562v1](http://arxiv.org/abs/2502.13562v1)|null| +|**2025-02-19**|**Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs**|Yushi Feng et.al.|[2502.13555v1](http://arxiv.org/abs/2502.13555v1)|[link](https://github.com/ys-feng/DemoGraph)| +|**2025-02-19**|**PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference**|Burc Gokden et.al.|[2502.13502v1](http://arxiv.org/abs/2502.13502v1)|[link](https://github.com/burcgokden/PLDR-LLM-with-KVG-cache)| +|**2025-02-19**|**Explore-Construct-Filter: An Automated Framework for Rich and Reliable API Knowledge Graph Construction**|Yanbang Sun 
et.al.|[2502.13412v1](http://arxiv.org/abs/2502.13412v1)|null| +|**2025-02-19**|**Reducing Hallucinations in Language Model-based SPARQL Query Generation Using Post-Generation Memory Retrieval**|Aditya Sharma et.al.|[2502.13369v1](http://arxiv.org/abs/2502.13369v1)|null| +|**2025-02-19**|**Craw4LLM: Efficient Web Crawling for LLM Pretraining**|Shi Yu et.al.|[2502.13347v1](http://arxiv.org/abs/2502.13347v1)|[link](https://github.com/cxcscmu/crawl4llm)| +|**2025-02-18**|**K-Paths: Reasoning over Graph Paths for Drug Repurposing and Drug Interaction Prediction**|Tassallah Abdullahi et.al.|[2502.13344v1](http://arxiv.org/abs/2502.13344v1)|[link](https://github.com/rsinghlab/K-Paths)| +|**2025-02-18**|**Grounding LLM Reasoning with Knowledge Graphs**|Alfonso Amayuelas et.al.|[2502.13247v1](http://arxiv.org/abs/2502.13247v1)|null| +|**2025-02-18**|**Learning to Defer for Causal Discovery with Imperfect Experts**|Oscar Clivio et.al.|[2502.13132v1](http://arxiv.org/abs/2502.13132v1)|null| +|**2025-02-18**|**Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks**|Markus J. 
Buehler et.al.|[2502.13025v1](http://arxiv.org/abs/2502.13025v1)|null| +|**2025-02-18**|**Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge**|Mohammad Reza Rezaei et.al.|[2502.13010v1](http://arxiv.org/abs/2502.13010v1)|null| +|**2025-02-18**|**R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs**|Sumin Jo et.al.|[2502.12767v1](http://arxiv.org/abs/2502.12767v1)|null| +|**2025-02-18**|**Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research**|Xiang Liu et.al.|[2502.12669v1](http://arxiv.org/abs/2502.12669v1)|null| +|**2025-02-18**|**G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation**|Yuhan Li et.al.|[2502.12586v1](http://arxiv.org/abs/2502.12586v1)|[link](https://github.com/yuhan1i/g-refer)| +|**2025-02-17**|**A-MEM: Agentic Memory for LLM Agents**|Wujiang Xu et.al.|[2502.12110v1](http://arxiv.org/abs/2502.12110v1)|null| +|**2025-02-17**|**KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs**|Qi Zhao et.al.|[2502.12029v1](http://arxiv.org/abs/2502.12029v1)|null| +|**2025-02-17**|**Atom of Thoughts for Markov LLM Test-Time Scaling**|Fengwei Teng et.al.|[2502.12018v1](http://arxiv.org/abs/2502.12018v1)|null| +|**2025-02-17**|**Generating Text from Uniform Meaning Representation**|Emma Markle et.al.|[2502.11973v1](http://arxiv.org/abs/2502.11973v1)|[link](https://github.com/acnlplab/umr-text-gen)| +|**2025-02-17**|**GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs**|Yi Fang et.al.|[2502.11925v1](http://arxiv.org/abs/2502.11925v1)|null| +|**2025-02-17**|**Exploring LLM-based Student Simulation for Metacognitive Cultivation**|Haoxuan Li et.al.|[2502.11678v1](http://arxiv.org/abs/2502.11678v1)|null| +|**2025-02-17**|**Ontology-Guided Reverse Thinking Makes Large Language Models Stronger on Knowledge Graph Question 
Answering**|Runxuan Liu et.al.|[2502.11491v1](http://arxiv.org/abs/2502.11491v1)|null| +|**2025-02-17**|**GLTW: Joint Improved Graph Transformer and LLM via Three-Word Language for Knowledge Graph Completion**|Kangyang Luo et.al.|[2502.11471v1](http://arxiv.org/abs/2502.11471v1)|null| +|**2025-02-16**|**Large Language-Geometry Model: When LLM meets Equivariance**|Zongzhao Li et.al.|[2502.11149v2](http://arxiv.org/abs/2502.11149v2)|null| +|**2025-02-16**|**Beyond Pairwise: Global Zero-shot Temporal Graph Generation**|Alon Eirew et.al.|[2502.11114v1](http://arxiv.org/abs/2502.11114v1)|null| +|**2025-02-16**|**Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications**|Alexandru Lecu et.al.|[2502.11108v1](http://arxiv.org/abs/2502.11108v1)|[link](https://github.com/alexlecu/llmkgraph)| +|**2025-02-16**|**Beyond Similarity: A Gradient-based Graph Method for Instruction Tuning Data Selection**|Yang Zhao et.al.|[2502.11062v1](http://arxiv.org/abs/2502.11062v1)|null| +|**2025-02-16**|**CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models**|Yuefei Chen et.al.|[2502.11008v1](http://arxiv.org/abs/2502.11008v1)|null| +|**2025-02-16**|**RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation**|Pengcheng Jiang et.al.|[2502.10996v1](http://arxiv.org/abs/2502.10996v1)|[link](https://github.com/pat-jj/Retrieval-And-Structure)| +|**2025-02-15**|**Developing Conversational Speech Systems for Robots to Detect Speech Biomarkers of Cognition in People Living with Dementia**|Rohith Perumandla et.al.|[2502.10896v1](http://arxiv.org/abs/2502.10896v1)|null| +|**2025-02-15**|**Evaluating improvements on using Large Language Models (LLMs) for property extraction in the Open Research Knowledge Graph (ORKG)**|Sandra Schaftner et.al.|[2502.10768v1](http://arxiv.org/abs/2502.10768v1)|null| +|**2025-02-15**|**K-Edit: Language Model Editing with Contextual Knowledge Awareness**|Elan 
Markowitz et.al.|[2502.10626v1](http://arxiv.org/abs/2502.10626v1)|null| +|**2025-02-15**|**ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis**|Xueshen Li et.al.|[2502.10620v1](http://arxiv.org/abs/2502.10620v1)|null| +|**2025-02-14**|**GraphiT: Efficient Node Classification on Text-Attributed Graphs with Prompt Optimized LLMs**|Shima Khoshraftar et.al.|[2502.10522v1](http://arxiv.org/abs/2502.10522v1)|null| +|**2025-02-14**|**Do Large Language Models Reason Causally Like Us? Even Better?**|Hanna M. Dettki et.al.|[2502.10215v1](http://arxiv.org/abs/2502.10215v1)|null| +|**2025-02-14**|**Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages**|Daniil Gurgurov et.al.|[2502.10140v1](http://arxiv.org/abs/2502.10140v1)|null| +|**2025-02-14**|**Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models**|Chenrui Tie et.al.|[2502.10090v1](http://arxiv.org/abs/2502.10090v1)|null| +|**2025-02-14**|**Decision Information Meets Large Language Models: The Future of Explainable Operations Research**|Yansen Zhang et.al.|[2502.09994v1](http://arxiv.org/abs/2502.09994v1)|null| +|**2025-02-14**|**KGGen: Extracting Knowledge Graphs from Plain Text with Language Models**|Belinda Mo et.al.|[2502.09956v1](http://arxiv.org/abs/2502.09956v1)|null| +|**2025-02-14**|**ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation**|Shu Wang et.al.|[2502.09891v1](http://arxiv.org/abs/2502.09891v1)|null| +|**2025-02-13**|**Visual Graph Question Answering with ASP and LLMs for Language Parsing**|Jakob Johannes Bauer et.al.|[2502.09211v1](http://arxiv.org/abs/2502.09211v1)|null| +|**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null| 
+|**2025-02-12**|**Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy**|Ruizhan Xue et.al.|[2502.08353v1](http://arxiv.org/abs/2502.08353v1)|null| +|**2025-02-12**|**Graph Foundation Models for Recommendation: A Comprehensive Survey**|Bin Wu et.al.|[2502.08346v3](http://arxiv.org/abs/2502.08346v3)|null| +|**2025-02-12**|**Self-Evaluation for Job-Shop Scheduling**|Imanol Echeverria et.al.|[2502.08684v1](http://arxiv.org/abs/2502.08684v1)|null| +|**2025-02-12**|**Improving Existing Optimization Algorithms with LLMs**|Camilo Chacón Sartori et.al.|[2502.08298v1](http://arxiv.org/abs/2502.08298v1)|null| +|**2025-02-12**|**LLM4GNAS: A Large Language Model Based Toolkit for Graph Neural Architecture Search**|Yang Gao et.al.|[2502.10459v1](http://arxiv.org/abs/2502.10459v1)|null| +|**2025-02-12**|**ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning**|Vy Vo et.al.|[2502.08148v1](http://arxiv.org/abs/2502.08148v1)|null| +|**2025-02-12**|**Neuro-Conceptual Artificial Intelligence: Integrating OPM with Deep Learning to Enhance Question Answering Quality**|Xin Kang et.al.|[2502.09658v1](http://arxiv.org/abs/2502.09658v1)|null| +|**2025-02-12**|**GCoT: Chain-of-Thought Prompt Learning for Graphs**|Xingtong Yu et.al.|[2502.08092v1](http://arxiv.org/abs/2502.08092v1)|null| +|**2025-02-12**|**Linking Cryptoasset Attribution Tags to Knowledge Graph Entities: An LLM-based Approach**|Régnier Avice et.al.|[2502.10453v1](http://arxiv.org/abs/2502.10453v1)|[link](https://github.com/ravice234/cryptoasset-attribution-tag-linker)| +|**2025-02-11**|**Deep Semantic Graph Learning via LLM based Node Enhancement**|Chuanqi Shi et.al.|[2502.07982v1](http://arxiv.org/abs/2502.07982v1)|null| +|**2025-02-10**|**Cardiverse: Harnessing LLMs for Novel Card Game Prototyping**|Danrui Li et.al.|[2502.07128v1](http://arxiv.org/abs/2502.07128v1)|null| +|**2025-02-10**|**GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural 
Processing Units**|Arghadip Das et.al.|[2502.06921v2](http://arxiv.org/abs/2502.06921v2)|[link](https://github.com/arghadippurdue/GraNNite)| +|**2025-02-10**|**Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language**|Zhiqiang Zhong et.al.|[2502.06634v1](http://arxiv.org/abs/2502.06634v1)|null| +|**2025-02-10**|**KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment**|Yuxing Lu et.al.|[2502.06472v1](http://arxiv.org/abs/2502.06472v1)|[link](https://github.com/YuxingLu613/KARMA)| +|**2025-02-10**|**RoToR: Towards More Reliable Responses for Order-Invariant Inputs**|Soyoung Yoon et.al.|[2502.08662v1](http://arxiv.org/abs/2502.08662v1)|null| +|**2025-02-10**|**K-ON: Stacking Knowledge On the Head Layer of Large Language Model**|Lingbing Guo et.al.|[2502.06257v1](http://arxiv.org/abs/2502.06257v1)|null| +|**2025-02-10**|**LegalViz: Legal Text Visualization by Text To Diagram Generation**|Eri Onami et.al.|[2502.06147v2](http://arxiv.org/abs/2502.06147v2)|null| +|**2025-02-09**|**Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs**|Han Meng et.al.|[2502.06075v1](http://arxiv.org/abs/2502.06075v1)|null| +|**2025-02-09**|**LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification**|Shubham Kumar Nigam et.al.|[2502.05836v1](http://arxiv.org/abs/2502.05836v1)|null| +|**2025-02-08**|**LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning**|Hanqing Yang et.al.|[2502.05453v1](http://arxiv.org/abs/2502.05453v1)|null| +|**2025-02-08**|**SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation**|Xingtong Yu et.al.|[2502.05424v1](http://arxiv.org/abs/2502.05424v1)|null| +|**2025-02-08**|**Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints**|Ali Al-Lawati 
et.al.|[2502.05414v1](http://arxiv.org/abs/2502.05414v1)|null| +|**2025-02-08**|**Knowledge Graph-Guided Retrieval Augmented Generation**|Xiangrong Zhu et.al.|[2502.06864v1](http://arxiv.org/abs/2502.06864v1)|[link](https://github.com/nju-websoft/KG2RAG)| +|**2025-02-07**|**Can Large Language Models Understand Intermediate Representations?**|Hailong Jiang et.al.|[2502.06854v1](http://arxiv.org/abs/2502.06854v1)|null| +|**2025-02-07**|**GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?**|Yang Zhou et.al.|[2502.05252v1](http://arxiv.org/abs/2502.05252v1)|[link](https://github.com/Infini-AI-Lab/gsm_infinite)| +|**2025-02-07**|**Causality can systematically address the monsters under the bench(marks)**|Felix Leeb et.al.|[2502.05085v1](http://arxiv.org/abs/2502.05085v1)|null| +|**2025-02-07**|**Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures**|Tushar Pandey et.al.|[2502.05078v1](http://arxiv.org/abs/2502.05078v1)|[link](https://github.com/AgnostiqHQ/multi-agent-llm)| +|**2025-02-07**|**Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics**|Hussam Ghanem et.al.|[2502.05239v1](http://arxiv.org/abs/2502.05239v1)|null| +|**2025-02-07**|**Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research**|Junde Wu et.al.|[2502.04644v1](http://arxiv.org/abs/2502.04644v1)|[link](https://github.com/theworldofagents/agentic-reasoning)| +|**2025-02-07**|**Position-aware Automatic Circuit Discovery**|Tal Haklay et.al.|[2502.04577v1](http://arxiv.org/abs/2502.04577v1)|[link](https://github.com/technion-cs-nlp/peap)| +|**2025-02-06**|**Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems**|Shangbin Feng et.al.|[2502.04510v1](http://arxiv.org/abs/2502.04510v1)|null| +|**2025-02-06**|**MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for 
Healthcare Copilot**|Xuejiao Zhao et.al.|[2502.04413v1](http://arxiv.org/abs/2502.04413v1)|[link](https://github.com/snowteam2023/medrag)| +|**2025-02-06**|**Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering**|Longquan Jiang et.al.|[2502.03992v1](http://arxiv.org/abs/2502.03992v1)|[link](https://github.com/longquanjiang/ontoscprompt)| +|**2025-02-06**|**Multimodal Medical Code Tokenizer**|Xiaorui Su et.al.|[2502.04397v2](http://arxiv.org/abs/2502.04397v2)|null| +|**2025-02-06**|**Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents**|Chenyang Shao et.al.|[2502.04392v1](http://arxiv.org/abs/2502.04392v1)|null| +|**2025-02-06**|**Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models**|Rui Cai et.al.|[2502.03715v1](http://arxiv.org/abs/2502.03715v1)|null| +|**2025-02-05**|**A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)**|Yiye Chen et.al.|[2502.03450v1](http://arxiv.org/abs/2502.03450v1)|null| +|**2025-02-05**|**SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**|Ben Liu et.al.|[2502.03283v2](http://arxiv.org/abs/2502.03283v2)|null| +|**2025-02-05**|**Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**|Daniil Laptev et.al.|[2502.03032v2](http://arxiv.org/abs/2502.03032v2)|null| +|**2025-02-05**|**A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs**|Bradley P. 
Allen et.al.|[2502.02896v1](http://arxiv.org/abs/2502.02896v1)|null| +|**2025-02-05**|**Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization**|Chanhui Lee et.al.|[2502.02810v1](http://arxiv.org/abs/2502.02810v1)|null| +|**2025-02-05**|**Leveraging the true depth of LLMs**|Ramón Calvo González et.al.|[2502.02790v1](http://arxiv.org/abs/2502.02790v1)|null| +|**2025-02-04**|**Modular Training of Neural Networks aids Interpretability**|Satvik Golechha et.al.|[2502.02470v2](http://arxiv.org/abs/2502.02470v2)|null| +|**2025-02-04**|**Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs**|Sagnik Mukherjee et.al.|[2502.02362v3](http://arxiv.org/abs/2502.02362v3)|null| +|**2025-02-04**|**AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement**|Shivam Singh et.al.|[2502.02067v1](http://arxiv.org/abs/2502.02067v1)|[link](https://github.com/sssshivvvv/adaptbot)| +|**2025-02-03**|**On Bob Dylan: A Computational Perspective**|Prashant Garg et.al.|[2502.01772v1](http://arxiv.org/abs/2502.01772v1)|null| +|**2025-02-03**|**VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos**|Xubin Ren et.al.|[2502.01549v1](http://arxiv.org/abs/2502.01549v1)|null| +|**2025-02-03**|**Transformers trained on proteins can learn to attend to Euclidean distance**|Isaac Ellmen et.al.|[2502.01533v1](http://arxiv.org/abs/2502.01533v1)|[link](https://github.com/Ellmen/attending-to-distance)| + +#### Abstracts +##### **GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks** +2502.14848v1 by Jianwen Luo, Yiming Huang, Jinxiang Meng, Fangyu Lei, Shizhu He, Xiao Liu, Shanshan Jiang, Bin Dong, Jun Zhao, Kang Liu + +Large Language Models (LLMs) have shown great promise in tool-making, yet +existing frameworks often struggle to efficiently construct reliable toolsets +and are limited to single-task settings. 
To address these challenges, we +propose GATE (Graph-based Adaptive Tool Evolution), an adaptive framework that +dynamically constructs and evolves a hierarchical graph of reusable tools +across multiple scenarios. We evaluate GATE on open-ended tasks (Minecraft), +agent-based tasks (TextCraft, DABench), and code generation tasks (MATH, Date, +TabMWP). Our results show that GATE achieves up to 4.3x faster milestone +completion in Minecraft compared to the previous SOTA, and provides an average +improvement of 9.23% over existing tool-making methods in code generation tasks +and 10.03% in agent tasks. GATE demonstrates the power of adaptive evolution, +balancing tool quantity, complexity, and functionality while maintaining high +efficiency. Code and data are available at +\url{https://github.com/ayanami2003/GATE}. + +摘要:大型語言模型 (LLM) 在工具製作方面展現出極大的潛力,然而現有的框架經常難以有效地建構可靠的工具組,並且僅限於單一任務設定。為了應對這些挑戰,我們提出了 GATE(基於圖形的自適應工具演化),這是一個自適應框架,可跨多個場景動態建構和演化可重複使用的工具階層圖。我們在開放式任務(Minecraft)、基於代理的任務(TextCraft、DABench)和程式碼生成任務(MATH、Date、TabMWP)上評估了 GATE。我們的結果顯示,與先前的 SOTA 相比,GATE 在 Minecraft 中實現了高達 4.3 倍的里程碑完成速度,並且在程式碼生成任務中提供了比現有工具製作方法平均提升 9.23%,在代理任務中提升了 10.03%。GATE 展示了自適應演化的力量,在保持高效率的同時,平衡了工具數量、複雜性和功能性。程式碼和資料可在 \url{https://github.com/ayanami2003/GATE} 取得。 + +##### **From RAG to Memory: Non-Parametric Continual Learning for Large Language Models** +2502.14802v1 by Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, Yu Su + +Our ability to continuously acquire, organize, and leverage knowledge is a +key feature of human intelligence that AI systems must approximate to unlock +their full potential. Given the challenges in continual learning with large +language models (LLMs), retrieval-augmented generation (RAG) has become the +dominant way to introduce new information. However, its reliance on vector +retrieval hinders its ability to mimic the dynamic and interconnected nature of +human long-term memory. 
Recent RAG approaches augment vector embeddings with +various structures like knowledge graphs to address some of these gaps, namely +sense-making and associativity. However, their performance on more basic +factual memory tasks drops considerably below standard RAG. We address this +unintended deterioration and propose HippoRAG 2, a framework that outperforms +standard RAG comprehensively on factual, sense-making, and associative memory +tasks. HippoRAG 2 builds upon the Personalized PageRank algorithm used in +HippoRAG and enhances it with deeper passage integration and more effective +online use of an LLM. This combination pushes this RAG system closer to the +effectiveness of human long-term memory, achieving a 7% improvement in +associative memory tasks over the state-of-the-art embedding model while also +exhibiting superior factual knowledge and sense-making memory capabilities. +This work paves the way for non-parametric continual learning for LLMs. Our +code and data will be released at https://github.com/OSU-NLP-Group/HippoRAG. + +摘要:我們持續獲取、組織和利用知識的能力是人類智慧的一項關鍵特徵,而人工智慧系統必須近似於此才能發揮其全部潛力。由於大型語言模型 (LLM) 持續學習的挑戰,檢索增強生成 (RAG) 已成為引入新資訊的主流方式。然而,它依賴向量檢索阻礙了它模擬人類長期記憶的動態和相互連結的本質。最近的 RAG 方法用各種結構(如知識圖譜)增強向量嵌入,以解決其中一些差距,即意義建構和聯想性。然而,它們在更基本的實際記憶任務上的表現遠低於標準 RAG。我們解決了這種意外的惡化,並提出了 HippoRAG 2,這是一個在實際、意義建構和聯想記憶任務上全面優於標準 RAG 的框架。HippoRAG 2 建立在 HippoRAG 中使用的 Personalized PageRank 演算法之上,並透過更深入的段落整合和更有效的 LLM 線上使用來增強它。這種組合將此 RAG 系統推向更接近人類長期記憶的效能,在聯想記憶任務上比最先進的嵌入模型提升了 7%,同時也展現出優異的實際知識和意義建構記憶能力。這項工作為 LLM 的非參數持續學習鋪平了道路。我們的程式碼和資料將在 https://github.com/OSU-NLP-Group/HippoRAG 上發布。 + +##### **Plan-over-Graph: Towards Parallelable LLM Agent Schedule** +2502.14563v1 by Shiqi Zhang, Xinbei Ma, Zouying Cao, Zhuosheng Zhang, Hai Zhao + +Large Language Models (LLMs) have demonstrated exceptional abilities in +reasoning for task planning. However, challenges remain under-explored for +parallel schedules. 
This paper introduces a novel paradigm, plan-over-graph, in +which the model first decomposes a real-life textual task into executable +subtasks and constructs an abstract task graph. The model then understands this +task graph as input and generates a plan for parallel execution. To enhance the +planning capability of complex, scalable graphs, we design an automated and +controllable pipeline to generate synthetic graphs and propose a two-stage +training scheme. Experimental results show that our plan-over-graph method +significantly improves task performance on both API-based LLMs and trainable +open-sourced LLMs. By normalizing complex tasks as graphs, our method naturally +supports parallel execution, demonstrating global efficiency. The code and data +are available at https://github.com/zsq259/Plan-over-Graph. + +摘要:大型語言模型 (LLM) 已展現出在任務規劃推理方面的非凡能力。然而,對於並行時程表的挑戰仍未充分探討。本文介紹了一個新穎的範例,即圖形規劃,其中模型首先將現實生活中的文字任務分解為可執行的子任務,並建構一個抽象任務圖。然後,模型將此任務圖理解為輸入,並產生一個並行執行的計畫。為了增強複雜、可擴充圖形的規劃能力,我們設計了一個自動化且可控的管道來產生合成圖形,並提出了一個兩階段訓練方案。實驗結果表明,我們的圖形規劃方法顯著提升了基於 API 的 LLM 和可訓練的開源 LLM 的任務效能。透過將複雜任務標準化為圖形,我們的模型自然支援並行執行,展現出整體效率。程式碼和資料可在 https://github.com/zsq259/Plan-over-Graph 取得。 + +##### **Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization** +2502.14456v1 by Ran Ding, Ziyu Zhang, Ying Zhu, Ziqian Kong, Peilan Xu + +To enhance tourists' experiences and immersion, this paper proposes a +narrative-driven travel planning framework called NarrativeGuide, which +generates a geoculturally-grounded narrative script for travelers, offering a +novel, role-playing experience for their journey. In the initial stage, +NarrativeGuide constructs a knowledge graph for attractions within a city, then +configures the worldview, character setting, and exposition based on the +knowledge graph. Using this foundation, the knowledge graph is combined to +generate an independent scene unit for each attraction. 
During the itinerary +planning stage, NarrativeGuide models narrative-driven travel planning as an +optimization problem, utilizing a genetic algorithm (GA) to refine the +itinerary. Before evaluating the candidate itinerary, transition scripts are +generated for each pair of adjacent attractions, which, along with the scene +units, form a complete script. The weighted sum of script coherence, travel +time, and attraction scores is then used as the fitness value to update the +candidate solution set. Experimental results across four cities, i.e., Nanjing +and Yangzhou in China, Paris in France, and Berlin in Germany, demonstrate +significant improvements in narrative coherence and cultural fit, alongside a +notable reduction in travel time and an increase in the quality of visited +attractions. Our study highlights that incorporating external evolutionary +optimization effectively addresses the limitations of large language models in +travel planning. Our code is available at +https://github.com/Evan01225/Narrative-Driven-Travel-Planning. + +摘要:為了增強遊客的體驗和沉浸感,本文提出了一個名為 NarrativeGuide 的敘事驅動旅遊規劃框架,它會為旅客產生一個以地理文化為基礎的敘事腳本,為他們的旅程提供一個新穎的角色扮演體驗。在初始階段,NarrativeGuide 會為城市內的景點建立一個知識圖譜,然後根據知識圖譜配置世界觀、角色設定和說明。利用這個基礎,知識圖譜會與每個景點結合,為其產生一個獨立的場景單元。在行程規劃階段,NarrativeGuide 將敘事驅動的旅遊規劃建模為一個最佳化問題,利用遺傳演算法 (GA) 來優化行程。在評估候選行程之前,會為每對相鄰景點產生過場腳本,這些腳本會與場景單元一起形成一個完整的腳本。接著,將腳本連貫性、旅遊時間和景點分數的加權和用作適應值,以更新候選解集。在四個城市(即中國的南京和揚州、法國的巴黎和德國的柏林)進行的實驗結果顯示,敘事連貫性和文化契合度都有顯著的提升,同時旅遊時間大幅減少,且所參觀景點的品質也提升了。我們的研究強調,納入外部演化最佳化能有效解決大型語言模型在旅遊規劃中的限制。我們的程式碼可在 https://github.com/Evan01225/Narrative-Driven-Travel-Planning 取得。 + +##### **Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment** +2502.14275v1 by Jiaxi Li, Yiwei Wang, Kai Zhang, Yujun Cai, Bryan Hooi, Nanyun Peng, Kai-Wei Chang, Jin Lu + +Large language models (LLMs) have been widely adopted in various downstream +task domains. 
However, their ability to directly recall and apply factual +medical knowledge remains under-explored. Most existing medical QA benchmarks +assess complex reasoning or multi-hop inference, making it difficult to isolate +LLMs' inherent medical knowledge from their reasoning capabilities. Given the +high-stakes nature of medical applications, where incorrect information can +have critical consequences, it is essential to evaluate how well LLMs encode, +retain, and recall fundamental medical facts. + To bridge this gap, we introduce the Medical Knowledge Judgment, a dataset +specifically designed to measure LLMs' one-hop factual medical knowledge. MKJ +is constructed from the Unified Medical Language System (UMLS), a large-scale +repository of standardized biomedical vocabularies and knowledge graphs. We +frame knowledge assessment as a binary judgment task, requiring LLMs to verify +the correctness of medical statements extracted from reliable and structured +knowledge sources. + Our experiments reveal that LLMs struggle with factual medical knowledge +retention, exhibiting significant performance variance across different +semantic categories, particularly for rare medical conditions. Furthermore, +LLMs show poor calibration, often being overconfident in incorrect answers. To +mitigate these issues, we explore retrieval-augmented generation, demonstrating +its effectiveness in improving factual accuracy and reducing uncertainty in +medical decision-making. 
+ +摘要:大型語言模型 (LLM) 已廣泛應用於各種下游 +任務領域。然而,它們直接回憶和應用事實 +醫學知識的能力仍未得到充分探索。大多數現有的醫療問答基準 +評估複雜推理或多跳躍推論,這使得難以將 +LLM 內在的醫學知識從其推理能力中分離出來。鑑於 +醫療應用具有高風險,其中不正確的資訊可能會 +造成嚴重後果,因此評估 LLM 編碼、 +保留和回憶基本醫學事實的能力至關重要。 +為了彌合這一差距,我們引入了醫學知識判斷,這是一個專門設計用於測量 LLM 的一跳事實醫學知識的數據集。MKJ +是由統一醫學語言系統 (UMLS) 構建的,UMLS 是標準化生物醫學詞彙和知識圖譜的大型庫。我們 +將知識評估構建為二元判斷任務,要求 LLM 驗證從可靠且結構化的 +知識來源中提取的醫學陳述的正確性。 +我們的實驗表明,LLM 難以保留事實醫學知識,在不同的 +語義類別中表現出顯著的性能差異,特別是對於罕見的醫療狀況。此外, +LLM 表現出校準不佳,通常對不正確的答案過於自信。為了 +減輕這些問題,我們探索了檢索增強生成,證明了其在提高事實準確性和降低不確定性方面的有效性 +在醫療決策制定中。 + +##### **Mitigating Lost-in-Retrieval Problems in Retrieval Augmented Multi-Hop Question Answering** +2502.14245v1 by Rongzhi Zhu, Xiangyu Liu, Zequn Sun, Yiwei Wang, Wei Hu + +In this paper, we identify a critical problem, "lost-in-retrieval", in +retrieval-augmented multi-hop question answering (QA): the key entities are +missed in LLMs' sub-question decomposition. "Lost-in-retrieval" significantly +degrades the retrieval performance, which disrupts the reasoning chain and +leads to incorrect answers. To resolve this problem, we propose a +progressive retrieval and rewriting method, namely ChainRAG, which sequentially +handles each sub-question by completing missing key entities and retrieving +relevant sentences from a sentence graph for answer generation. Each step in +our retrieval and rewriting process builds upon the previous one, creating a +seamless chain that leads to accurate retrieval and answers. Finally, all +retrieved sentences and sub-question answers are integrated to generate a +comprehensive answer to the original question. We evaluate ChainRAG on three +multi-hop QA datasets (MuSiQue, 2Wiki, and +HotpotQA) using three large language models: GPT4o-mini, +Qwen2.5-72B, and GLM-4-Plus. Empirical results demonstrate that ChainRAG +consistently outperforms baselines in both effectiveness and efficiency. 
+ +摘要:在本文中,我們在檢索增強的多跳問答 (QA) 中發現了一個關鍵問題「檢索中遺失」,關鍵實體遺失在 LLM 的子問題分解中。「檢索中遺失」顯著降低檢索效能,這會中斷推理鏈並導致錯誤的答案。為了解決此問題,我們提出了一種漸進式檢索和重寫方法,即 ChainRAG,它通過完成遺失的關鍵實體並從句子圖中檢索相關句子來順序處理每個子問題以產生答案。我們檢索和重寫過程中每一步都建立在前一步之上,創造了一個無縫的鏈,導致準確的檢索和答案。最後,所有檢索到的句子和子問題答案都整合起來,以產生對原始問題的全面答案。我們在三個多跳問答資料集(MuSiQue、2Wiki 和 HotpotQA)上評估 ChainRAG,使用三個大型語言模型:GPT4o-mini、Qwen2.5-72B 和 GLM-4-Plus。實證結果表明,ChainRAG 在有效性和效率方面都持續優於基準。 + +##### **NLP-AKG: Few-Shot Construction of NLP Academic Knowledge Graph Based on LLM** +2502.14192v1 by Jiayin Lan, Jiaqi Li, Baoxin Wang, Ming Liu, Dayong Wu, Shijin Wang, Bing Qin + +Large language models (LLMs) have been widely applied in question answering +over scientific research papers. To enhance the professionalism and accuracy of +responses, many studies employ external knowledge augmentation. However, +existing structures of external knowledge in scientific literature often focus +solely on either paper entities or domain concepts, neglecting the intrinsic +connections between papers through shared domain concepts. This results in less +comprehensive and specific answers when addressing questions that combine +papers and concepts. To address this, we propose a novel knowledge graph +framework that captures deep conceptual relations between academic papers, +constructing a relational network via intra-paper semantic elements and +inter-paper citation relations. Using a few-shot knowledge graph construction +method based on LLM, we develop NLP-AKG, an academic knowledge graph for the +NLP domain, by extracting 620,353 entities and 2,271,584 relations from 60,826 +papers in ACL Anthology. Based on this, we propose a 'sub-graph community +summary' method and validate its effectiveness on three NLP scientific +literature question answering datasets. 
+ +摘要:大型语言模型 (LLM) 已广泛应用于科学研究论文的问答中。为了提高响应的专业性和准确性,许多研究采用外部知识增强。然而,科学文献中现有外部知识的结构通常仅关注论文实体或领域概念,而忽略了论文之间通过共享领域概念而形成的内在联系。这导致在解决结合论文和概念的问题时,答案不够全面和具体。为了解决这个问题,我们提出了一种新颖的知识图谱框架,该框架捕获了学术论文之间的深层概念关系,通过论文内部语义元素和论文之间的引用关系构建关系网络。我们使用基于 LLM 的少量知识图谱构建方法,从 ACL Anthology 中的 60,826 篇论文中提取了 620,353 个实体和 2,271,584 个关系,开发了 NLP 领域的学术知识图谱 NLP-AKG。在此基础上,我们提出了一种“子图社区摘要”方法,并在三个 NLP 科学文献问答数据集上验证了其有效性。 + +##### **Object-centric Binding in Contrastive Language-Image Pretraining** +2502.14113v1 by Rim Assouel, Pietro Astolfi, Florian Bordes, Michal Drozdzal, Adriana Romero-Soriano + +Recent advances in vision language models (VLM) have been driven by +contrastive models such as CLIP, which learn to associate visual information +with their corresponding text descriptions. However, these models have +limitations in understanding complex compositional scenes involving multiple +objects and their spatial relationships. To address these challenges, we +propose a novel approach that diverges from commonly used strategies, which +rely on the design of hard-negative augmentations. Instead, our work focuses on +integrating inductive biases into pre-trained CLIP-like models to improve their +compositional understanding without using any additional hard-negatives. To +that end, we introduce a binding module that connects a scene graph, derived +from a text description, with a slot-structured image representation, +facilitating a structured similarity assessment between the two modalities. We +also leverage relationships as text-conditioned visual constraints, thereby +capturing the intricate interactions between objects and their contextual +relationships more effectively. Our resulting model not only enhances the +performance of CLIP-based models in multi-object compositional understanding +but also paves the way towards more accurate and sample-efficient image-text +matching of complex scenes. 
+ +摘要:最近视觉语言模型 (VLM) 的进步是由对比模型(例如 CLIP)推动的,该模型学习将视觉信息与其对应的文本描述联系起来。然而,这些模型在理解涉及多个对象及其空间关系的复杂组合场景方面存在局限性。为了应对这些挑战,我们提出了一种新颖的方法,它偏离了常用的策略,即依赖于硬负增强设计。相反,我们的工作重点是将归纳偏差集成到预训练的类似 CLIP 的模型中,以提高其组合理解能力,而无需使用任何其他硬否定。为此,我们引入了一个绑定模块,它将从文本描述中派生的场景图与槽结构图像表示连接起来,从而促进了两种模式之间的结构化相似性评估。我们还利用关系作为文本条件的视觉约束,从而更有效地捕捉对象及其上下文关系之间的复杂交互。我们由此产生的模型不仅增强了基于 CLIP 的模型在多对象组合理解中的性能,而且还为复杂场景的更准确和样本高效的图像文本匹配铺平了道路。 + +##### **Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning** +2502.14086v1 by Cole Gawin, Yidan Sun, Mayank Kejriwal + +Large language models (LLMs) have achieved remarkable performance in +generating human-like text and solving reasoning tasks of moderate complexity, +such as question-answering and mathematical problem-solving. However, their +capabilities in tasks requiring deeper cognitive skills, such as common-sense +understanding and abstract reasoning, remain under-explored. In this paper, we +systematically evaluate abstract common-sense reasoning in LLMs using the +ConceptNet knowledge graph. We propose two prompting approaches: instruct +prompting, where models predict plausible semantic relationships based on +provided definitions, and few-shot prompting, where models identify relations +using examples as guidance. Our experiments with the gpt-4o-mini model show +that in instruct prompting, consistent performance is obtained when ranking +multiple relations but with substantial decline when the model is restricted to +predicting only one relation. In few-shot prompting, the model's accuracy +improves significantly when selecting from five relations rather than the full +set, although with notable bias toward certain relations. These results suggest +significant gaps still, even in commercially used LLMs' abstract common-sense +reasoning abilities, compared to human-level understanding. However, the +findings also highlight the promise of careful prompt engineering, based on +selective retrieval, for obtaining better performance. 
+ +摘要:大型語言模型 (LLM) 在生成類人文本和解決中等複雜度推理任務方面取得了顯著的成果,例如問答和數學問題解決。然而,它們在需要更深層認知技能的任務中的能力,例如常識理解和抽象推理,仍然處於探索不足的階段。在本文中,我們使用 ConceptNet 知識圖系統地評估了 LLM 中的抽象常識推理。我們提出了兩種提示方法:指導提示,其中模型根據提供的定義預測合理的語義關係,以及少次提示,其中模型使用示例作為指導來識別關係。我們使用 gpt-4o-mini 模型進行的實驗表明,在指導提示中,在對多個關係進行排名時獲得了一致的性能,但在模型僅限於預測一個關係時大幅下降。在少次提示中,模型在從五個關係中選擇而不是從完整集合中選擇時,其準確性顯著提高,儘管對某些關係存在顯著偏差。這些結果表明,與人類層面的理解相比,即使在商業使用的 LLM 中,抽象常識推理能力仍然存在顯著差距。然而,這些發現也強調了基於選擇性檢索的仔細提示工程的希望,以獲得更好的性能。 + +##### **Neurosymbolic artificial intelligence via large language models and coherence-driven inference** +2502.13953v1 by Steve Huntsman, Jewell Thomas + +We devise an algorithm to generate sets of propositions that objectively +instantiate graphs that support coherence-driven inference. We then benchmark +the ability of large language models (LLMs) to reconstruct coherence graphs +from (a straightforward transformation of) propositions expressed in natural +language, with promising results from a single prompt to models optimized for +reasoning. Combining coherence-driven inference with consistency evaluations by +neural models may advance the state of the art in machine cognition. + +摘要:我們設計一種演算法,用來產生命題集合,以客觀地實例化支援連貫性驅動推論的圖形。接著,我們基準化大型語言模型 (LLM) 從以自然語言表達的命題(經過直接轉換)重建連貫性圖形的能力,結果顯示,單一提示就能從最佳化用於推理的模型中獲得有希望的結果。將連貫性驅動推論與神經模型的一致性評估結合起來,可能會提升機器認知的現有技術。 + +##### **Complex Ontology Matching with Large Language Model Embeddings** +2502.13619v1 by Guilherme Sousa, Rinaldo Lima, Cassia Trojahn + +Ontology, and more broadly, Knowledge Graph Matching is a challenging task in +which expressiveness has not been fully addressed. Despite the increasing use +of embeddings and language models for this task, approaches for generating +expressive correspondences still do not take full advantage of these models, in +particular, large language models (LLMs). This paper proposes to integrate LLMs +into an approach for generating expressive correspondences based on alignment +need and ABox-based relation discovery. 
The generation of correspondences is +performed by matching similar surroundings of instance sub-graphs. The +integration of LLMs results in different architectural modifications, including +label similarity, sub-graph matching, and entity matching. The performance of word +embeddings, sentence embeddings, and LLM-based embeddings was compared. The +results demonstrate that integrating LLMs surpasses all other models, enhancing +the baseline version of the approach with a 45\% increase in F-measure. + +摘要:本体论,更广泛地说,知识图谱匹配是一项具有挑战性的任务,其中表达力尚未得到充分解决。尽管越来越多地使用嵌入和语言模型来完成此任务,但生成表达性对应关系的方法仍然没有充分利用这些模型,特别是大型语言模型 (LLM)。本文提出将 LLM 集成到一种基于对齐需求和基于 ABox 的关系发现来生成表达性对应关系的方法中。对应关系的生成是通过匹配实例子图的相似周围环境来执行的。LLM 的集成导致了不同的架构修改,包括标签相似性、子图匹配和实体匹配。比较了单词嵌入、句子嵌入和基于 LLM 的嵌入的性能。结果表明,集成 LLM 超越了所有其他模型,通过 F-measure 提高了 45% 的基准版本的方法。 + +##### **Are Large Language Models In-Context Graph Learners?** +2502.13562v1 by Jintang Li, Ruofan Wu, Yuchang Zhu, Huizhe Zhang, Liang Chen, Zibin Zheng + +Large language models (LLMs) have demonstrated remarkable in-context +reasoning capabilities across a wide range of tasks, particularly with +unstructured inputs such as language or images. However, LLMs struggle to +handle structured data, such as graphs, due to their lack of understanding of +non-Euclidean structures. As a result, without additional fine-tuning, their +performance significantly lags behind that of graph neural networks (GNNs) in +graph learning tasks. In this paper, we show that learning on graph data can be +conceptualized as a retrieval-augmented generation (RAG) process, where +specific instances (e.g., nodes or edges) act as queries, and the graph itself +serves as the retrieved context. Building on this insight, we propose a series +of RAG frameworks to enhance the in-context learning capabilities of LLMs for +graph learning tasks. 
Comprehensive evaluations demonstrate that our proposed +RAG frameworks significantly improve LLM performance on graph-based tasks, +particularly in scenarios where a pretrained LLM must be used without +modification or accessed via an API. + +摘要:大型語言模型 (LLM) 在廣泛的任務中展示了非凡的語境推理能力,特別是對於語言或影像等非結構化輸入。然而,LLM 難以處理結構化資料,例如圖形,因為它們無法理解非歐幾何結構。因此,在沒有額外微調的情況下,它們在圖形學習任務中的表現遠遠落後於圖形神經網路 (GNN)。在本文中,我們展示了在圖形資料上學習可以被概念化為檢索增強生成 (RAG) 過程,其中特定實例(例如,節點或邊)充當查詢,而圖形本身則作為檢索的語境。基於這個見解,我們提出了一系列 RAG 架構,以增強 LLM 在圖形學習任務中的語境學習能力。全面的評估表明,我們提出的 RAG 架構顯著提升了 LLM 在基於圖形的任務上的表現,特別是在預訓練的 LLM 必須在不修改或透過 API 存取的情況下使用的場景中。 + +##### **Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs** +2502.13555v1 by Yushi Feng, Tsai Hor Chan, Guosheng Yin, Lequan Yu + +Data augmentation is necessary for graph representation learning due to the +scarcity and noise present in graph data. Most of the existing augmentation +methods overlook the context information inherited from the dataset as they +rely solely on the graph structure for augmentation. Despite the success of +some large language model-based (LLM) graph learning methods, they are mostly +white-box which require access to the weights or latent features from the +open-access LLMs, making them difficult to be democratized for everyone as +existing LLMs are mostly closed-source for commercial considerations. To +overcome these limitations, we propose a black-box context-driven graph data +augmentation approach, with the guidance of LLMs -- DemoGraph. Leveraging the +text prompt as context-related information, we task the LLM with generating +knowledge graphs (KGs), which allow us to capture the structural interactions +from the text outputs. We then design a dynamic merging schema to +stochastically integrate the LLM-generated KGs into the original graph during +training. 
To control the sparsity of the augmented graph, we further devise a +granularity-aware prompting strategy and an instruction fine-tuning module, +which seamlessly generates text prompts according to different granularity +levels of the dataset. Extensive experiments on various graph learning tasks +validate the effectiveness of our method over existing graph data augmentation +methods. Notably, our approach excels in scenarios involving electronic health +records (EHRs), which validates its maximal utilization of contextual +knowledge, leading to enhanced predictive performance and interpretability. + +摘要:由於圖表資料的稀少性和雜訊,資料擴充對於圖表表示學習來說是必要的。現有的擴充方法大多忽略了從資料集中繼承的背景資訊,因為它們僅依賴於圖表的結構進行擴充。儘管一些大型語言模型 (LLM) 基於圖表學習方法獲得成功,但它們大多是白盒,需要存取開放式 LLM 的權重或潛在特徵,由於現有的 LLM 主要基於商業考量而封閉原始碼,因此難以讓所有人都能使用。為了克服這些限制,我們提出了一個黑盒背景驅動圖表資料擴充方法,在 LLM 的指導下——DemoGraph。利用文字提示作為與背景相關的資訊,我們讓 LLM 產生知識圖譜 (KG),這讓我們能夠從文字輸出中擷取結構化互動。然後,我們設計了一個動態合併模式,在訓練期間將 LLM 產生的 KG 隨機整合到原始圖表中。為了控制擴充圖表的稀疏性,我們進一步設計了一個粒度感知提示策略和一個指令微調模組,它可以根據資料集的不同粒度層級無縫產生文字提示。在各種圖表學習任務上的大量實驗驗證了我們的方法比現有的圖表資料擴充方法更有效。值得注意的是,我們的做法在涉及電子健康記錄 (EHR) 的場景中表現出色,這驗證了它對上下文知識的最大利用,從而提高了預測效能和可解釋性。 + +##### **PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference** +2502.13502v1 by Burc Gokden + +We show that Large Language Model from Power Law Decoder Representations +(PLDR-LLM) is a foundational model whose deductive outputs are invariant +tensors up to a small perturbation. PLDR-LLM learns a singularity condition for +the deductive outputs that enable the once-inferred energy-curvature tensor +$\mathbf{G}_{LM}$ to replace the deep neural network of power law graph +attention (PLGA) generating the deductive outputs at inference. We demonstrate +that a cache for $\mathbf{G}_{LM}$ (G-cache) and KV-cache can be implemented in +a straightforward manner to improve the inference time. 
The invariance and +generalizable nature of deductive outputs is at a very high fidelity where +deductive outputs have same RMSE and determinant values up to 15 decimal places +after caching, and zero-shot benchmark scores remain unchanged. Ablation +studies show that learned deductive outputs have distinct loss and accuracy +characteristics from models pretrained with transferred, randomly initialized +or identity tensors as a constant tensor operator and an LLM with scaled-dot +product attention (SDPA) is a special case of PLDR-LLM where $\mathbf{G}_{LM}$ +is predefined as identity. The observed invariance characteristic introduces a +novel asymmetry between training and inference phases with caching. We outline +observed common characteristics of the deductive outputs for the learned +singularity condition. We provide an implementation of a training and inference +framework for PLDR-LLM with KV-cache and G-cache. + +摘要:我們展示了來自冪律解碼器表示 (PLDR-LLM) 的大型語言模型是一個基礎模型,其演繹輸出是直到一個小擾動的不變張量。PLDR-LLM 學習演繹輸出的奇異條件,使曾經推斷出的能量曲率張量 $\mathbf{G}_{LM}$ 能夠取代產生演繹輸出的冪律圖注意力 (PLGA) 深度神經網路,進行推論。我們證明了 $\mathbf{G}_{LM}$ 快取 (G 快取) 和 KV 快取能夠以一種直接的方式實作,以改善推論時間。演繹輸出的不變性和可概化性質具有非常高的保真度,其中演繹輸出在快取後具有相同的 RMSE 和行列式值,直到小數點後 15 位,且零次學習基準分數保持不變。消融研究表明,學習的演繹輸出具有與使用轉移、隨機初始化或恆等張量作為常數張量算子和具有縮放點積注意力的 LLM 預先訓練的模型不同的損失和準確性特徵,並且 $\mathbf{G}_{LM}$ 被預先定義為恆等的 PLDR-LLM 的一個特例,其中 $\mathbf{G}_{LM}$ 被預先定義為恆等。觀察到的不變特徵引入了訓練和推論階段之間一個新的不對稱性,並帶有快取。我們概述了學習的奇異條件演繹輸出的觀察到的共同特徵。我們提供了一個具有 KV 快取和 G 快取的 PLDR-LLM 訓練和推論框架的實作。 + +##### **Explore-Construct-Filter: An Automated Framework for Rich and Reliable API Knowledge Graph Construction** +2502.13412v1 by Yanbang Sun, Qing Huang, Xiaoxue Ren, Zhenchang Xing, Xiaohong Li, Junjie Wang + +The API Knowledge Graph (API KG) is a structured network that models API +entities and their relations, providing essential semantic insights for tasks +such as API recommendation, code generation, and API misuse detection. 
However, +constructing a knowledge-rich and reliable API KG presents several challenges. +Existing schema-based methods rely heavily on manual annotations to design KG +schemas, leading to excessive manual overhead. On the other hand, schema-free +methods, due to the lack of schema guidance, are prone to introducing noise, +reducing the KG's reliability. To address these issues, we propose the +Explore-Construct-Filter framework, an automated approach for API KG +construction based on large language models (LLMs). This framework consists of +three key modules: 1) KG exploration: LLMs simulate the workflow of annotators +to automatically design a schema with comprehensive type triples, minimizing +human intervention; 2) KG construction: Guided by the schema, LLMs extract +instance triples to construct a rich yet unreliable API KG; 3) KG filtering: +Removing invalid type triples and suspicious instance triples to construct a +rich and reliable API KG. Experimental results demonstrate that our method +surpasses the state-of-the-art method, achieving a 25.2% improvement in F1 +score. Moreover, the Explore-Construct-Filter framework proves effective, with +the KG exploration module increasing KG richness by 133.6% and the KG filtering +module improving reliability by 26.6%. Finally, cross-model experiments confirm +the generalizability of our framework. 
+ +摘要:API 知識圖譜 (API KG) 是一個結構化網路,用於建模 API 實體及其關係,提供基本語義見解,以執行 API 建議、程式碼產生和 API 誤用偵測等任務。然而,建構一個知識豐富且可靠的 API KG 會產生若干挑戰。現有的基於架構的方法嚴重依賴手動註解來設計 KG 架構,導致過度的手動開銷。另一方面,由於缺乏架構指導,無架構的方法容易引入雜訊,降低 KG 的可靠性。為了解決這些問題,我們提出了探索建構過濾架構,這是一種基於大型語言模型 (LLM) 的自動化 API KG 建構方法。此架構包含三個關鍵模組:1) KG 探索:LLM 模擬註解者的工作流程,自動設計具有完整類型三元組的架構,將人為干預降至最低;2) KG 建構:在架構的指導下,LLM 提取實例三元組來建構豐富但不可靠的 API KG;3) KG 過濾:移除無效的類型三元組和可疑的實例三元組,以建構豐富且可靠的 API KG。實驗結果表明,我們的方法優於最先進的方法,在 F1 分數上提高了 25.2%。此外,探索建構過濾架構被證明是有效的,其中 KG 探索模組將 KG 豐富度提高了 133.6%,而 KG 過濾模組將可靠性提高了 26.6%。最後,跨模型實驗證實了我們架構的泛化性。 + +##### **Reducing Hallucinations in Language Model-based SPARQL Query Generation Using Post-Generation Memory Retrieval** +2502.13369v1 by Aditya Sharma, Luis Lara, Amal Zouaq, Christopher J. Pal + +The ability to generate SPARQL queries from natural language questions is +crucial for ensuring efficient and accurate retrieval of structured data from +knowledge graphs (KG). While large language models (LLMs) have been widely +adopted for SPARQL query generation, they are often susceptible to +hallucinations and out-of-distribution errors when producing KG elements like +Uniform Resource Identifiers (URIs) based on internal parametric knowledge. +This often results in content that appears plausible but is factually +incorrect, posing significant challenges for their use in real-world +information retrieval (IR) applications. This has led to increased research +aimed at detecting and mitigating such errors. In this paper, we introduce PGMR +(Post-Generation Memory Retrieval), a modular framework that incorporates a +non-parametric memory module to retrieve KG elements and enhance LLM-based +SPARQL query generation. Our experimental results indicate that PGMR +consistently delivers strong performance across diverse datasets, data +distributions, and LLMs. Notably, PGMR significantly mitigates URI +hallucinations, nearly eliminating the problem in several scenarios. 
+ +摘要:從自然語言問題中產生 SPARQL 查詢的能力對於確保從知識圖譜 (KG) 中有效率且準確地擷取結構化資料至關重要。儘管大型語言模型 (LLM) 已廣泛用於 SPARQL 查詢產生,但它們在根據內部參數化知識產生像統一資源識別碼 (URI) 等 KG 元素時,通常容易出現幻覺和分布外錯誤。這通常會導致內容看似合理,但事實上並不正確,對其在真實世界資訊檢索 (IR) 應用中的使用構成重大挑戰。這導致針對偵測和減輕此類錯誤的研究增加。在本文中,我們介紹 PGMR(後產生記憶體檢索),這是一個模組化架構,它結合了一個非參數記憶體模組來檢索 KG 元素並增強基於 LLM 的 SPARQL 查詢產生。我們的實驗結果表明,PGMR 在不同的資料集、資料分佈和 LLM 中始終提供強大的效能。值得注意的是,PGMR 大幅減輕了 URI 幻覺,在許多情況下幾乎消除了問題。 + +##### **Craw4LLM: Efficient Web Crawling for LLM Pretraining** +2502.13347v1 by Shi Yu, Zhiyuan Liu, Chenyan Xiong + +Web crawl is a main source of large language models' (LLMs) pretraining data, +but the majority of crawled web pages are discarded in pretraining due to low +data quality. This paper presents Crawl4LLM, an efficient web crawling method +that explores the web graph based on the preference of LLM pretraining. +Specifically, it leverages the influence of a webpage in LLM pretraining as the +priority score of the web crawler's scheduler, replacing the standard graph +connectivity based priority. Our experiments on a web graph containing 900 +million webpages from a commercial search engine's index demonstrate the +efficiency of Crawl4LLM in obtaining high-quality pretraining data. With just +21% URLs crawled, LLMs pretrained on Crawl4LLM data reach the same downstream +performances of previous crawls, significantly reducing the crawling waste and +alleviating the burdens on websites. Our code is publicly available at +https://github.com/cxcscmu/Crawl4LLM. 
+ +摘要:網路爬蟲是大型語言模型 (LLM) 預訓練資料的主要來源, +但大多數已爬取的網頁在預訓練中會因為資料品質低落而被捨棄。 +本文提出 Crawl4LLM,這是一種有效率的網路爬取方法, +它會根據 LLM 預訓練的偏好來探索網路圖。 +具體來說,它利用網頁在 LLM 預訓練中的影響力作為網路爬蟲排程器的優先分數, +取代標準的圖形連線優先順序。 +我們在一個包含來自商業搜尋引擎索引的 9 億個網頁的網路圖上進行的實驗, +證明了 Crawl4LLM 在取得高品質預訓練資料方面的效率。 +只爬取了 21% 的網址,以 Crawl4LLM 資料預訓練的 LLM 就達到了先前爬取的相同下游效能, +大幅減少了爬取浪費,並減輕了對網站的負擔。 +我們的程式碼已公開於 https://github.com/cxcscmu/Crawl4LLM。 + +##### **K-Paths: Reasoning over Graph Paths for Drug Repurposing and Drug Interaction Prediction** +2502.13344v1 by Tassallah Abdullahi, Ioanna Gemou, Nihal V. Nayak, Ghulam Murtaza, Stephen H. Bach, Carsten Eickhoff, Ritambhara Singh + +Drug discovery is a complex and time-intensive process that requires +identifying and validating new therapeutic candidates. Computational approaches +using large-scale biomedical knowledge graphs (KGs) offer a promising solution +to accelerate this process. However, extracting meaningful insights from +large-scale KGs remains challenging due to the complexity of graph traversal. +Existing subgraph-based methods are tailored to graph neural networks (GNNs), +making them incompatible with other models, such as large language models +(LLMs). We introduce K-Paths, a retrieval framework that extracts structured, +diverse, and biologically meaningful paths from KGs. Integrating these paths +enables LLMs and GNNs to effectively predict unobserved drug-drug and +drug-disease interactions. Unlike traditional path-ranking approaches, K-Paths +retrieves and transforms paths into a structured format that LLMs can directly +process, facilitating explainable reasoning. K-Paths employs a diversity-aware +adaptation of Yen's algorithm to retrieve the K shortest loopless paths between +entities in an interaction query, prioritizing biologically relevant and +diverse relationships. 
Our experiments on benchmark datasets show that K-Paths +improves the zero-shot performance of Llama 8.1B's F1-score by 12.45 points on +drug repurposing and 13.42 points on interaction severity prediction. We also +show that Llama 70B achieves F1-score gains of 6.18 and 8.46 points, +respectively. K-Paths also improves the supervised training efficiency of +EmerGNN, a state-of-the-art GNN, by reducing KG size by 90% while maintaining +strong predictive performance. Beyond its scalability and efficiency, K-Paths +uniquely bridges the gap between KGs and LLMs, providing explainable rationales +for predicted interactions. These capabilities show that K-Paths is a valuable +tool for efficient data-driven drug discovery. + +摘要:藥物發現是一個複雜且耗時的過程,需要識別和驗證新的治療候選藥物。使用大型生物醫學知識圖譜 (KG) 的計算方法提供了一個有希望的解決方案來加速這個過程。然而,由於圖形遍歷的複雜性,從大型 KG 中提取有意義的見解仍然具有挑戰性。現有的子圖方法是針對圖神經網路 (GNN) 量身打造的,這使得它們與其他模型(例如大型語言模型 (LLM))不兼容。我們介紹了 K-Paths,這是一個檢索框架,它從 KG 中提取結構化、多樣化且具有生物意義的路徑。整合這些路徑使 LLM 和 GNN 能夠有效預測未觀察到的藥物-藥物和藥物-疾病交互。與傳統的路徑排序方法不同,K-Paths 檢索路徑並將其轉換為 LLM 可以直接處理的結構化格式,從而促進可解釋的推理。K-Paths 採用了 Yen 演算法的多樣性感知適應,以檢索交互查詢中實體之間的 K 個最短無環路徑,優先考慮生物相關且多樣化的關係。我們在基準資料集上的實驗表明,K-Paths 將 Llama 8.1B 的 F1 分數在藥物再利用上提高了 12.45 分,在交互嚴重性預測上提高了 13.42 分。我們還表明,Llama 70B 分別獲得了 6.18 分和 8.46 分的 F1 分數增益。K-Paths 還提高了最先進的 GNN EmerGNN 的監督訓練效率,同時將 KG 大小減少了 90%,同時保持強大的預測性能。除了其可擴展性和效率之外,K-Paths 獨特地彌合了 KG 和 LLM 之間的差距,為預測的交互提供了可解釋的依據。這些功能表明,K-Paths 是用於高效資料驅動藥物發現的寶貴工具。 + +##### **Grounding LLM Reasoning with Knowledge Graphs** +2502.13247v1 by Alfonso Amayuelas, Joy Sain, Simerjot Kaur, Charese Smiley + +Knowledge Graphs (KGs) are valuable tools for representing relationships +between entities in a structured format. Traditionally, these knowledge bases +are queried to extract specific information. However, question-answering (QA) +over such KGs poses a challenge due to the intrinsic complexity of natural +language compared to the structured format and the size of these graphs. 
+Despite these challenges, the structured nature of KGs can provide a solid +foundation for grounding the outputs of Large Language Models (LLMs), offering +organizations increased reliability and control. + Recent advancements in LLMs have introduced reasoning methods at inference +time to improve their performance and maximize their capabilities. In this +work, we propose integrating these reasoning strategies with KGs to anchor +every step or "thought" of the reasoning chains in KG data. Specifically, we +evaluate both agentic and automated search methods across several reasoning +strategies, including Chain-of-Thought (CoT), Tree-of-Thought (ToT), and +Graph-of-Thought (GoT), using GRBench, a benchmark dataset for graph reasoning +with domain-specific graphs. Our experiments demonstrate that this approach +consistently outperforms baseline models, highlighting the benefits of +grounding LLM reasoning processes in structured KG data. + +摘要:知識圖譜 (KG) 是以結構化格式表示實體之間關係的寶貴工具。傳統上,這些知識庫會被查詢以萃取特定資訊。然而,由於自然語言與結構化格式之間的內在複雜性,以及這些圖譜的規模,在這些 KG 上進行問答 (QA) 會構成挑戰。儘管有這些挑戰,KG 的結構化特性可以為大型語言模型 (LLM) 的輸出提供穩固的基礎,為組織提供更高的可靠性和控制力。 +LLM 的最新進展在推論時間引入了推理方法,以提升其效能並最大化其能力。在這項工作中,我們建議將這些推理策略與 KG 整合,以將推理鏈的每一步或「思考」錨定在 KG 資料中。具體來說,我們在多種推理策略中評估代理和自動化搜尋方法,包括思考鏈 (CoT)、思考樹 (ToT) 和思考圖 (GoT),使用 GRBench,這是一個針對圖形推理的基準資料集,其中包含特定領域的圖形。我們的實驗證明,這種方法始終優於基準模型,突顯了將 LLM 推理過程建立在結構化 KG 資料中的好處。 + +##### **Learning to Defer for Causal Discovery with Imperfect Experts** +2502.13132v1 by Oscar Clivio, Divyat Mahajan, Perouz Taslakian, Sara Magliacane, Ioannis Mitliagkas, Valentina Zantedeschi, Alexandre Drouin + +Integrating expert knowledge, e.g. from large language models, into causal +discovery algorithms can be challenging when the knowledge is not guaranteed to +be correct. Expert recommendations may contradict data-driven results, and +their reliability can vary significantly depending on the domain or specific +query. 
Existing methods based on soft constraints or inconsistencies in +predicted causal relationships fail to account for these variations in +expertise. To remedy this, we propose L2D-CD, a method for gauging the +correctness of expert recommendations and optimally combining them with +data-driven causal discovery results. By adapting learning-to-defer (L2D) +algorithms for pairwise causal discovery (CD), we learn a deferral function +that selects whether to rely on classical causal discovery methods using +numerical data or expert recommendations based on textual meta-data. We +evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its +superior performance compared to both the causal discovery method and the +expert used in isolation. Moreover, our approach identifies domains where the +expert's performance is strong or weak. Finally, we outline a strategy for +generalizing this approach to causal discovery on graphs with more than two +variables, paving the way for further research in this area. + +摘要:整合專家知識,例如從大型語言模型中整合到因果發現演算法中,當知識無法保證正確時會很有挑戰性。專家建議可能會與資料驅動的結果相矛盾,而且他們的可靠性可能會根據領域或特定查詢而有顯著差異。現有的基於軟約束或預測因果關係中不一致的方法無法說明專業知識中的這些變化。為了補救這一點,我們提出了 L2D-CD,一種用於評估專家建議的正確性並將其與資料驅動的因果發現結果最佳結合的方法。透過調整學習延遲 (L2D) 演算法以進行成對因果發現 (CD),我們學習了一個延遲函數,用於選擇依賴使用數值資料的傳統因果發現方法或基於文字元資料的專家建議。我們在經典的 Tübingen 對資料集上評估 L2D-CD,並證明其與單獨使用的因果發現方法和專家相比具有優越的效能。此外,我們的做法識別出專家表現強或弱的領域。最後,我們概述了一種將此方法推廣到具有兩個以上變數的圖表上進行因果發現的策略,為此領域的進一步研究鋪平了道路。 + +##### **Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks** +2502.13025v1 by Markus J. Buehler + +We present an agentic, autonomous graph expansion framework that iteratively +structures and refines knowledge in situ. Unlike conventional knowledge graph +construction methods relying on static extraction or single-pass learning, our +approach couples a reasoning-native large language model with a continually +updated graph representation.
At each step, the system actively generates new +concepts and relationships, merges them into a global graph, and formulates +subsequent prompts based on its evolving structure. Through this +feedback-driven loop, the model organizes information into a scale-free network +characterized by hub formation, stable modularity, and bridging nodes that link +disparate knowledge clusters. Over hundreds of iterations, new nodes and edges +continue to appear without saturating, while centrality measures and shortest +path distributions evolve to yield increasingly distributed connectivity. Our +analysis reveals emergent patterns, such as the rise of highly connected 'hub' +concepts and the shifting influence of 'bridge' nodes, indicating that agentic, +self-reinforcing graph construction can yield open-ended, coherent knowledge +structures. Applied to materials design problems, we present compositional +reasoning experiments by extracting node-specific and synergy-level principles +to foster genuinely novel knowledge synthesis, yielding cross-domain ideas that +transcend rote summarization and strengthen the framework's potential for +open-ended scientific discovery. We discuss other applications in scientific +discovery and outline future directions for enhancing scalability and +interpretability. 
+ +摘要:我們提出一個能動的、自主的圖形擴展框架,它反覆地建構和精煉原位知識。與依賴靜態提取或單次學習的傳統知識圖形建構方法不同,我們的做法將一個推理原生的大語言模型與一個持續更新的圖形表示結合起來。在每一步中,系統主動產生新的概念和關係,將它們合併到一個全域圖形中,並根據其不斷演化的結構制定後續提示。透過這個回饋驅動的迴圈,模型將資訊組織成一個無標度網路,其特徵是樞紐形成、穩定的模組化以及連結不同知識群集的橋接節點。在數百次反覆運算中,新的節點和邊緣會持續出現,而不會飽和,同時中心性測量和最短路徑分佈會演化為產生越來越分散的連通性。我們的分析揭示了新興模式,例如高度連接的「樞紐」概念的興起和「橋樑」節點影響力的轉移,這表明能動的、自我強化的圖形建構可以產生開放式、連貫的知識結構。應用於材料設計問題,我們提出組合推理實驗,透過提取特定於節點的原則和協同效應層級原則,以促進真正新穎的知識綜合,產生超越死背式摘要並強化框架在開放式科學發現中潛力的跨領域想法。我們討論了在科學發現中的其他應用,並概述了增強可擴充性和可解釋性的未來方向。 + +##### **Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge** +2502.13010v1 by Mohammad Reza Rezaei, Reza Saadati Fard, Jayson Parker, Rahul G. Krishnan, Milad Lankarany + +Large Language Models (LLMs) have significantly advanced medical +question-answering by leveraging extensive clinical data and medical +literature. However, the rapid evolution of medical knowledge and the +labor-intensive process of manually updating domain-specific resources pose +challenges to the reliability of these systems. To address this, we introduce +Adaptive Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates +the construction and continuous updating of medical knowledge graphs, +integrates reasoning, and retrieves current external evidence, such as PubMed +and WikiSearch. By dynamically linking new findings and complex medical +concepts, AMG-RAG not only improves accuracy but also enhances interpretability +in medical queries. + Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness +of AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of +66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to +100 times larger. Notably, these improvements are achieved without increasing +computational overhead, highlighting the critical role of automated knowledge +graph generation and external evidence retrieval in delivering up-to-date, +trustworthy medical insights. 
+ +摘要:大型語言模型 (LLM) 透過利用廣泛的臨床資料和醫學文獻,大幅提升了醫療問題解答的進步。然而,醫療知識的快速演進和手動更新特定領域資源的繁複程序,對這些系統的可靠性構成挑戰。為了解決這個問題,我們引入了適應性醫療圖表 RAG (AMG-RAG),這是一個自動化建構和持續更新醫療知識圖表的綜合架構,整合推理並擷取 PubMed 和 WikiSearch 等最新的外部證據。透過動態連結新的發現和複雜的醫療概念,AMG-RAG 不僅提升了準確性,也增強了醫療查詢的可解釋性。在 MEDQA 和 MEDMCQA 基準上的評量證明了 AMG-RAG 的有效性,在 MEDQA 上達到了 74.1% 的 F1 分數,在 MEDMCQA 上達到了 66.34% 的準確度,優於其他同類模型以及那些大 10 到 100 倍的模型。值得注意的是,這些改進是在不增加運算負擔的情況下實現的,突顯了自動化知識圖表生成和外部證據擷取在提供最新、可信賴的醫療見解中扮演的重要角色。 + +##### **R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs** +2502.12767v1 by Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi + +Recent studies have combined Large Language Models (LLMs) with Knowledge +Graphs (KGs) to enhance reasoning, improving inference accuracy without +additional training while mitigating hallucination. However, existing +frameworks are often rigid, struggling to adapt to KG or task changes. They +also rely heavily on powerful LLMs for reliable (i.e., trustworthy) reasoning. +To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that +separates reasoning into two roles: an Operator (a low-capacity LLM) that +gathers evidence and a Supervisor (a high-capacity LLM) that makes final +judgments. This design is cost-efficient for LLM inference while still +maintaining strong reasoning accuracy. Additionally, R2-KG employs an +Abstention mechanism, generating answers only when sufficient evidence is +collected from KG, which significantly enhances reliability. Experiments across +multiple KG-based reasoning tasks show that R2-KG consistently outperforms +baselines in both accuracy and reliability, regardless of the inherent +capability of LLMs used as the Operator. Further experiments reveal that the +single-agent version of R2-KG, equipped with a strict self-consistency +strategy, achieves significantly higher-than-baseline reliability while +reducing inference cost. However, it also leads to a higher abstention rate in +complex KGs.
Our findings establish R2-KG as a flexible and cost-effective +solution for KG-based reasoning. It reduces reliance on high-capacity LLMs +while ensuring trustworthy inference. + +摘要:最近的研究结合了大型语言模型 (LLM) 与知识图谱 (KG) 以增强推理,在不额外训练的情况下提高推理准确性,同时减轻幻觉。然而,现有的框架通常很僵化,难以适应知识图谱或任务的变化。它们还严重依赖强大的 LLM 来进行可靠(即值得信赖)的推理。为了解决这个问题,我们引入了 R2-KG,这是一个即插即用、双代理框架,它将推理分为两个角色:一个收集证据的操作员(低容量 LLM)和一个做出最终判断的监督员(高容量 LLM)。这种设计在 LLM 推理方面具有成本效益,同时仍保持强大的推理准确性。此外,R2-KG 采用弃权机制,仅在从知识图谱收集到足够证据时才生成答案,这显著提高了可靠性。跨多个基于知识图谱的推理任务的实验表明,R2-KG 在准确性和可靠性方面始终优于基线,而与用作操作员的 LLM 的固有能力无关。进一步的实验表明,R2-KG 的单代理版本配备了严格的自一致性策略,实现了明显高于基线的可靠性,同时降低了推理成本。然而,它也导致了复杂知识图谱中更高的弃权率。我们的发现将 R2-KG 确立为一种灵活且经济高效的基于知识图谱的推理解决方案。它减少了对高容量 LLM 的依赖,同时确保了可信的推理。 + +##### **Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research** +2502.12669v1 by Xiang Liu, Penglei Sun, Shuyan Chen, Longhan Zhang, Peijie Dong, Huajie You, Yongqi Zhang, Chang Yan, Xiaowen Chu, Tong-yi Zhang + +The rapid advancement of perovskite solar cells (PSCs) has led to an +exponential growth in research publications, creating an urgent need for +efficient knowledge management and reasoning systems in this domain. We present +a comprehensive knowledge-enhanced system for PSCs that integrates three key +components. First, we develop Perovskite-KG, a domain-specific knowledge graph +constructed from 1,517 research papers, containing 23,789 entities and 22,272 +relationships. Second, we create two complementary datasets: Perovskite-Chat, +comprising 55,101 high-quality question-answer pairs generated through a novel +multi-agent framework, and Perovskite-Reasoning, containing 2,217 carefully +curated materials science problems. Third, we introduce two specialized large +language models: Perovskite-Chat-LLM for domain-specific knowledge assistance +and Perovskite-Reasoning-LLM for scientific reasoning tasks. 
Experimental +results demonstrate that our system significantly outperforms existing models +in both domain-specific knowledge retrieval and scientific reasoning tasks, +providing researchers with effective tools for literature review, experimental +design, and complex problem-solving in PSC research. + +摘要:由於 perovskite 太陽能電池 (PSC) 快速進展,導致研究出版物呈指數成長,迫切需要在這領域建立有效的知識管理和推理系統。我們提出一個結合三項關鍵元件的 PSC 全面知識增強系統。首先,我們開發出 Perovskite-KG,一個由 1,517 篇研究論文建構而成、包含 23,789 個實體和 22,272 個關係的領域特定知識圖譜。其次,我們建立兩個互補的資料集:Perovskite-Chat,包含透過一個新穎的多代理架構產生 55,101 個高品質問答配對;以及 Perovskite-Reasoning,包含 2,217 個仔細策展的材料科學問題。第三,我們推出兩個專門化大型語言模型:針對領域特定知識協助的 Perovskite-Chat-LLM,以及針對科學推理任務的 Perovskite-Reasoning-LLM。實驗結果顯示,我們的系統在領域特定知識擷取和科學推理任務上都明顯優於現有模型,為研究人員提供有效的工具,用於 PSC 研究中的文獻回顧、實驗設計和複雜問題解決。 + +##### **G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation** +2502.12586v1 by Yuhan Li, Xinni Zhang, Linhao Luo, Heng Chang, Yuxiang Ren, Irwin King, Jia Li + +Explainable recommendation has demonstrated significant advantages in +informing users about the logic behind recommendations, thereby increasing +system transparency, effectiveness, and trustworthiness. To provide +personalized and interpretable explanations, existing works often combine the +generation capabilities of large language models (LLMs) with collaborative +filtering (CF) information. CF information extracted from the user-item +interaction graph captures the user behaviors and preferences, which is crucial +for providing informative explanations. However, due to the complexity of graph +structure, effectively extracting the CF information from graphs still remains +a challenge. Moreover, existing methods often struggle with the integration of +extracted CF information with LLMs due to its implicit representation and the +modality gap between graph structures and natural language explanations. 
To +address these challenges, we propose G-Refer, a framework using graph +retrieval-augmented large language models (LLMs) for explainable +recommendation. Specifically, we first employ a hybrid graph retrieval +mechanism to retrieve explicit CF signals from both structural and semantic +perspectives. The retrieved CF information is explicitly formulated as +human-understandable text by the proposed graph translation and accounts for +the explanations generated by LLMs. To bridge the modality gap, we introduce +knowledge pruning and retrieval-augmented fine-tuning to enhance the ability of +LLMs to process and utilize the retrieved CF information to generate +explanations. Extensive experiments show that G-Refer achieves superior +performance compared with existing methods in both explainability and +stability. Codes and data are available at https://github.com/Yuhan1i/G-Refer. + +摘要:可解釋建議已證明在告知使用者建議背後的邏輯方面具有顯著優點,從而提高系統透明度、有效性和可信度。為了提供個人化且可解釋的說明,現有作品通常結合大型語言模型 (LLM) 的生成能力與協同過濾 (CF) 資訊。從使用者項目互動圖形中提取的 CF 資訊會擷取使用者行為和偏好,這對於提供資訊性說明至關重要。然而,由於圖形結構的複雜性,從圖形中有效提取 CF 資訊仍然是一個挑戰。此外,現有方法通常難以將提取的 CF 資訊與 LLM 整合,因為其隱含表示和圖形結構與自然語言說明之間的模式差距。為了應對這些挑戰,我們提出 G-Refer,一個使用圖形檢索增強型大型語言模型 (LLM) 的可解釋建議架構。具體來說,我們首先採用混合圖形檢索機制,從結構和語義角度檢索明確的 CF 訊號。檢索到的 CF 資訊由建議的圖形翻譯明確表述為人類可以理解的文字,並說明 LLM 生成的解釋。為了彌合模式差距,我們引入了知識修剪和檢索增強微調,以增強 LLM 處理和利用檢索到的 CF 資訊以產生解釋的能力。廣泛的實驗表明,與現有方法相比,G-Refer 在可解釋性和穩定性方面都取得了卓越的效能。程式碼和資料可在 https://github.com/Yuhan1i/G-Refer 取得。 + +##### **A-MEM: Agentic Memory for LLM Agents** +2502.12110v1 by Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang + +While large language model (LLM) agents can effectively use external tools +for complex real-world tasks, they require memory systems to leverage +historical experiences. Current memory systems enable basic storage and +retrieval but lack sophisticated memory organization, despite recent attempts +to incorporate graph databases. 
Moreover, these systems' fixed operations and +structures limit their adaptability across diverse tasks. To address this +limitation, this paper proposes a novel agentic memory system for LLM agents +that can dynamically organize memories in an agentic way. Following the basic +principles of the Zettelkasten method, we designed our memory system to create +interconnected knowledge networks through dynamic indexing and linking. When a +new memory is added, we generate a comprehensive note containing multiple +structured attributes, including contextual descriptions, keywords, and tags. +The system then analyzes historical memories to identify relevant connections, +establishing links where meaningful similarities exist. Additionally, this +process enables memory evolution - as new memories are integrated, they can +trigger updates to the contextual representations and attributes of existing +historical memories, allowing the memory network to continuously refine its +understanding. Our approach combines the structured organization principles of +Zettelkasten with the flexibility of agent-driven decision making, allowing for +more adaptive and context-aware memory management. Empirical experiments on six +foundation models show superior improvement against existing SOTA baselines. +The source code is available at https://github.com/WujiangXu/AgenticMemory. 
+ +摘要:大型語言模型 (LLM) 代理雖然能有效地使用外部工具來執行複雜的真實世界任務,但它們需要記憶體系統來利用歷史經驗。目前的記憶體系統能進行基本的儲存和檢索,但缺乏精密的記憶體組織,儘管最近嘗試納入圖形資料庫。此外,這些系統固定的運作和結構限制了它們在不同任務中的適應性。為了解決這個限制,本文提出了一種新的代理記憶體系統,供 LLM 代理動態地以代理的方式組織記憶體。遵循 Zettelkasten 方法的基本原則,我們設計我們的記憶體系統,透過動態索引和連結來建立相互連結的知識網路。當加入新的記憶體時,我們會產生包含多個結構化屬性的綜合筆記,包括脈絡描述、關鍵字和標籤。然後,系統會分析歷史記憶體以找出相關連結,在有意義的相似性時建立連結。此外,這個程序能讓記憶體演化,因為當整合新的記憶體時,它們會觸發對現有歷史記憶體的脈絡表示和屬性的更新,讓記憶體網路能持續精進它的理解。我們的做法結合了 Zettelkasten 的結構化組織原則和代理驅動決策制定的靈活性,能進行更具適應性和脈絡感知的記憶體管理。在六個基礎模型上的經驗實驗顯示出比現有的 SOTA 基準線有顯著的進步。原始碼可以在 https://github.com/WujiangXu/AgenticMemory 找到。 + +##### **KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs** +2502.12029v1 by Qi Zhao, Hongyu Yang, Qi Song, Xinwei Yao, Xiangyang Li + +Large language models (LLMs) have demonstrated remarkable capabilities in +various complex tasks, yet they still suffer from hallucinations. Introducing +external knowledge, such as knowledge graph, can enhance the LLMs' ability to +provide factual answers. LLMs have the ability to interactively explore +knowledge graphs. However, most approaches have been affected by insufficient +internal knowledge excavation in LLMs, limited generation of trustworthy +knowledge reasoning paths, and a vague integration between internal and +external knowledge. Therefore, we propose KnowPath, a knowledge-enhanced large +model framework driven by the collaboration of internal and external knowledge. +It relies on the internal knowledge of the LLM to guide the exploration of +interpretable directed subgraphs in external knowledge graphs, better +integrating the two knowledge sources for more accurate reasoning. Extensive +experiments on multiple real-world datasets confirm the superiority of +KnowPath. 
+ +摘要:大型語言模型 (LLM) 已在各種複雜任務中展現出卓越的能力,但仍會出現幻覺。引入外部知識(例如知識圖譜)可以增強 LLM 提供事實答案的能力。LLM 有能力互動式地探索知識圖譜。然而,大多數方法都受到 LLM 中內部知識挖掘不足、可信賴知識推理路徑生成受限,以及內部和外部知識之間的整合模糊的影響。因此,我們提出 KnowPath,這是一個由內部和外部知識的協作驅動的知識增強型大型模型框架。它依賴於 LLM 的內部知識來指導對外部知識圖譜中可解釋的有向子圖的探索,更好地整合兩個知識來源以進行更準確的推理。對多個真實世界資料集進行的大量實驗證實了 KnowPath 的優越性。 + +##### **Atom of Thoughts for Markov LLM Test-Time Scaling** +2502.12018v1 by Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, Yuyu Luo + +Large Language Models (LLMs) achieve superior performance through +training-time scaling, and test-time scaling further enhances their +capabilities by conducting effective reasoning during inference. However, as +the scale of reasoning increases, existing test-time scaling methods suffer +from accumulated historical information, which not only wastes computational +resources but also interferes with effective reasoning. To address this issue, +we observe that complex reasoning progress is often achieved by solving a +sequence of independent subquestions, each being self-contained and verifiable. +These subquestions are essentially atomic questions, relying primarily on their +current state rather than accumulated history, similar to the memoryless +transitions in a Markov process. Based on this observation, we propose Atom of +Thoughts (AoT), where each state transition in the reasoning process consists +of decomposing the current question into a dependency-based directed acyclic +graph and contracting its subquestions, forming a new atomic question state. +This iterative decomposition-contraction process continues until reaching +directly solvable atomic questions, naturally realizing Markov transitions +between question states. Furthermore, these atomic questions can be seamlessly +integrated into existing test-time scaling methods, enabling AoT to serve as a +plug-in enhancement for improving reasoning capabilities. 
Experiments across +six benchmarks demonstrate the effectiveness of AoT both as a standalone +framework and a plug-in enhancement. Notably, on HotpotQA, when applied to +gpt-4o-mini, AoT achieves an 80.6% F1 score, surpassing o3-mini by 3.4% and +DeepSeek-R1 by 10.6%. The code will be available at +https://github.com/qixucen/atom. + +摘要:大型語言模型 (LLM) 透過訓練時間擴充來達成卓越的效能,而測試時間擴充透過在推論期間進行有效的推理,進一步提升其能力。然而,隨著推理規模的擴大,現有的測試時間擴充方法會受到累積的歷史資訊影響,這不僅會浪費運算資源,還會干擾有效的推理。為了解決這個問題,我們觀察到複雜的推理進程通常是透過解決一系列獨立的子問題來達成,每個子問題都是獨立且可驗證的。這些子問題本質上是原子問題,主要依賴於它們的當前狀態,而不是累積的歷史,類似於馬可夫過程中的無記憶轉換。基於這個觀察,我們提出了思想原子 (AoT),其中推理過程中每個狀態轉換都包含將當前問題分解為基於依賴關係的有向無環圖,並收縮其子問題,形成新的原子問題狀態。這個反覆的分解收縮過程會持續進行,直到達到可直接解決的原子問題,自然地實現問題狀態之間的馬可夫轉換。此外,這些原子問題可以無縫整合到現有的測試時間擴充方法中,讓 AoT 可以作為外掛程式強化功能,以改善推理能力。橫跨六個基準的實驗證明了 AoT 作為獨立架構和外掛程式強化的有效性。值得注意的是,在 HotpotQA 上,當應用於 gpt-4o-mini 時,AoT 達到了 80.6% 的 F1 分數,比 o3-mini 高出 3.4%,比 DeepSeek-R1 高出 10.6%。程式碼將在 https://github.com/qixucen/atom 上提供。 + +##### **Generating Text from Uniform Meaning Representation** +2502.11973v1 by Emma Markle, Reihaneh Iranmanesh, Shira Wein + +Uniform Meaning Representation (UMR) is a recently developed graph-based +semantic representation, which expands on Abstract Meaning Representation (AMR) +in a number of ways, in particular through the inclusion of document-level +information and multilingual flexibility. In order to effectively adopt and +leverage UMR for downstream tasks, efforts must be placed toward developing a +UMR technological ecosystem. Though still limited amounts of UMR annotations +have been produced to date, in this work, we investigate the first approaches +to producing text from multilingual UMR graphs: (1) a pipeline conversion of +UMR to AMR, then using AMR-to-text generation models, (2) fine-tuning large +language models with UMR data, and (3) fine-tuning existing AMR-to-text +generation models with UMR data. 
Our best performing model achieves a +multilingual BERTscore of 0.825 for English and 0.882 for Chinese when compared +to the reference, which is a promising indication of the effectiveness of +fine-tuning approaches for UMR-to-text generation with even limited amounts of +UMR data. + +摘要:統一語意表示 (UMR) 是一種最近開發的基於圖形的語意表示,它在許多方面擴展了抽象語意表示 (AMR),特別是透過納入文件層級資訊和多語言靈活性。為了有效採用和利用下游任務的 UMR,必須投入精力開發 UMR 技術生態系統。雖然到目前為止產生的 UMR 標註數量仍然有限,但在這項工作中,我們探討了從多語言 UMR 圖形產生文字的第一種方法:(1) 將 UMR 轉換為 AMR 的管道,然後使用 AMR 轉文字生成模型,(2) 使用 UMR 資料微調大型語言模型,以及 (3) 使用 UMR 資料微調現有的 AMR 轉文字生成模型。與參考相比,我們效能最好的模型在英文中達到 0.825 的多語言 BERT 分數,在中文中達到 0.882,這表示使用 UMR 資料進行 UMR 轉文字生成的微調方法具有良好的效果,即使 UMR 資料數量有限。 + +##### **GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs** +2502.11925v1 by Yi Fang, Bowen Jin, Jiacheng Shen, Sirui Ding, Qiaoyu Tan, Jiawei Han + +The rapid development of Multimodal Large Language Models (MLLMs) has enabled +the integration of multiple modalities, including texts and images, within the +large language model (LLM) framework. However, texts and images are usually +interconnected, forming a multimodal attributed graph (MMAG). It is +underexplored how MLLMs can incorporate the relational information +(\textit{i.e.}, graph structure) and semantic information (\textit{i.e.,} texts +and images) on such graphs for multimodal comprehension and generation. In this +paper, we propose GraphGPT-o, which supports omni-multimodal understanding and +creation on MMAGs. We first comprehensively study linearization variants to +transform semantic and structural information as input for MLLMs. Then, we +propose a hierarchical aligner that enables deep graph encoding, bridging the +gap between MMAGs and MLLMs. Finally, we explore the inference choices, +adapting MLLM to interleaved text and image generation in graph scenarios. +Extensive experiments on three datasets from different domains demonstrate the +effectiveness of our proposed method. 
Datasets and codes will be open-sourced +upon acceptance. + +摘要:多模态大语言模型 (MLLM) 的快速发展,促进了文本和图像等多种模态在大型语言模型 (LLM) 框架内的整合。然而,文本和图像通常是相互关联的,形成多模态属性图 (MMAG)。对于 MLLM 如何整合此类图上的关系信息(即图结构)和语义信息(即文本和图像)以进行多模态理解和生成,目前仍未得到充分探索。在本文中,我们提出了 GraphGPT-o,它支持在 MMAG 上进行全方位多模态理解和创建。我们首先全面研究了线性化变体,以将语义和结构信息转换为 MLLM 的输入。然后,我们提出了一个分层对齐器,它支持深度图编码,弥合了 MMAG 和 MLLM 之间的差距。最后,我们探索了推理选择,使 MLLM 适应图场景中交错的文本和图像生成。来自不同领域的三组数据集上的大量实验表明了我们提出的方法的有效性。数据集和代码将在被接受后开源。 + +##### **Exploring LLM-based Student Simulation for Metacognitive Cultivation** +2502.11678v1 by Haoxuan Li, Jifan Yu, Xin Cong, Yang Dang, Yisi Zhan, Huiqin Liu, Zhiyuan Liu + +Metacognitive education plays a crucial role in cultivating students' +self-regulation and reflective thinking, providing essential support for those +with learning difficulties through academic advising. Simulating students with +insufficient learning capabilities using large language models offers a +promising approach to refining pedagogical methods without ethical concerns. +However, existing simulations often fail to authentically represent students' +learning struggles and face challenges in evaluation due to the lack of +reliable metrics and ethical constraints in data collection. To address these +issues, we propose a pipeline for automatically generating and filtering +high-quality simulated student agents. Our approach leverages a two-round +automated scoring system validated by human experts and employs a score +propagation module to obtain more consistent scores across the student graph. +Experimental results demonstrate that our pipeline efficiently identifies +high-quality student agents, and we discuss the traits that influence the +simulation's effectiveness. By simulating students with varying degrees of +learning difficulties, our work paves the way for broader applications in +personalized learning and educational assessment. 
+ +摘要:元認知教育在培養學生的自我調節和反思性思考中發揮著至關重要的作用,通過學術諮詢為有學習困難的人提供必要的支持。使用大型語言模型模擬學習能力不足的學生提供了一種有前途的方法,可以在沒有道德問題的情況下改進教學方法。然而,現有的模擬通常無法真實地反映學生的學習困難,並且由於缺乏可靠的指標和數據收集中的道德約束,在評估中面臨挑戰。為了解決這些問題,我們提出了一個自動生成和過濾高質量模擬學生代理的管道。我們的做法利用了由人類專家驗證的兩輪自動評分系統,並採用分數傳播模組來獲得跨學生圖表更一致的分數。實驗結果表明,我們的管道有效地識別了高質量的學生代理,並且我們討論了影響模擬效果的特質。通過模擬具有不同程度學習困難的學生,我們的研究為個性化學習和教育評估中的更廣泛應用鋪平了道路。 + +##### **Ontology-Guided Reverse Thinking Makes Large Language Models Stronger on Knowledge Graph Question Answering** +2502.11491v1 by Runxuan Liu, Bei Luo, Jiaqi Li, Baoxin Wang, Ming Liu, Dayong Wu, Shijin Wang, Bing Qin + +Large language models (LLMs) have shown remarkable capabilities in natural +language processing. However, in knowledge graph question answering tasks +(KGQA), there remains the issue of answering questions that require multi-hop +reasoning. Existing methods rely on entity vector matching, but the purpose of +the question is abstract and difficult to match with specific entities. As a +result, it is difficult to establish reasoning paths to the purpose, which +leads to information loss and redundancy. To address this issue, inspired by +human reverse thinking, we propose Ontology-Guided Reverse Thinking (ORT), a +novel framework that constructs reasoning paths from purposes back to +conditions. ORT operates in three key phases: (1) using LLM to extract purpose +labels and condition labels, (2) constructing label reasoning paths based on +the KG ontology, and (3) using the label reasoning paths to guide knowledge +retrieval. Experiments on the WebQSP and CWQ datasets show that ORT achieves +state-of-the-art performance and significantly enhances the capability of LLMs +for KGQA. 
+ +摘要:大型語言模型 (LLM) 在自然語言處理中展現出卓越的能力。然而,在知識圖譜問答任務 (KGQA) 中,仍然存在需要多跳推理才能回答問題的問題。現有方法依賴於實體向量匹配,但問題的目的是抽象的,難以與特定實體匹配。因此,很難建立推理路徑來達成目的,這會導致資訊遺失和冗餘。為了解決這個問題,在人類逆向思維的啟發下,我們提出了基於本體的逆向思維 (ORT),這是一個創新的架構,可以從目的建構推理路徑,再回推到條件。ORT 運作在三個關鍵階段:(1) 使用 LLM 萃取目的標籤和條件標籤,(2) 基於 KG 本體建構標籤推理路徑,以及 (3) 使用標籤推理路徑來引導知識擷取。在 WebQSP 和 CWQ 資料集上的實驗顯示,ORT 達到了最先進的效能,並顯著增強了 LLM 對 KGQA 的能力。 + +##### **GLTW: Joint Improved Graph Transformer and LLM via Three-Word Language for Knowledge Graph Completion** +2502.11471v1 by Kangyang Luo, Yuzhuo Bai, Cheng Gao, Shuzheng Si, Yingli Shen, Zhu Liu, Zhitong Wang, Cunliang Kong, Wenhao Li, Yufei Huang, Ye Tian, Xuantang Xiong, Lei Han, Maosong Sun + +Knowledge Graph Completion (KGC), which aims to infer missing or incomplete +facts, is a crucial task for KGs. However, integrating the vital structural +information of KGs into Large Language Models (LLMs) and outputting predictions +deterministically remains challenging. To address this, we propose a new method +called GLTW, which encodes the structural information of KGs and merges it with +LLMs to enhance KGC performance. Specifically, we introduce an improved Graph +Transformer (iGT) that effectively encodes subgraphs with both local and global +structural information and inherits the characteristics of language models, +bypassing training from scratch. Also, we develop a subgraph-based +multi-classification training objective, using all entities within KG as +classification objects, to boost learning efficiency. Importantly, we combine +iGT with an LLM that takes KG language prompts as input. Our extensive +experiments on various KG datasets show that GLTW achieves significant +performance gains compared to SOTA baselines. 
+ +摘要:知識圖譜補全 (KGC) 旨在推論遺失或不完整的 +事實,是 KGs 的一項關鍵任務。然而,將 KGs 的重要結構 +資訊整合至大型語言模型 (LLM),並確定性地輸出預測結果,仍然是一項挑戰。為了解決這個問題,我們提出了一種新的方法,稱為 GLTW,它編碼了 KGs 的結構資訊,並將其與 LLM 合併,以增強 KGC 的效能。具體來說,我們引進了一個改良的圖形轉換器 (iGT),它能有效地編碼具有局部和全域結構資訊的子圖,並繼承語言模型的特徵,繞過從頭開始的訓練。此外,我們開發了一個基於子圖的多分類訓練目標,使用 KG 中的所有實體作為 +分類物件,以提升學習效率。重要的是,我們將 iGT 與一個將 KG 語言提示作為輸入的 LLM 結合起來。我們在各種 KG 資料集上進行的廣泛實驗顯示,與 SOTA 基準線相比,GLTW 獲得了顯著的效能提升。 + +##### **Large Language-Geometry Model: When LLM meets Equivariance** +2502.11149v2 by Zongzhao Li, Jiacheng Cen, Bing Su, Wenbing Huang, Tingyang Xu, Yu Rong, Deli Zhao + +Accurately predicting 3D structures and dynamics of physical systems is +crucial in scientific applications. Existing approaches that rely on geometric +Graph Neural Networks (GNNs) effectively enforce $\mathrm{E}(3)$-equivariance, +but they often fall short in leveraging broader information. While direct +application of Large Language Models (LLMs) can incorporate external knowledge, +they lack the capability for spatial reasoning with guaranteed equivariance. In +this paper, we propose EquiLLM, a novel framework for representing 3D physical +systems that seamlessly integrates E(3)-equivariance with LLM capabilities. +Specifically, EquiLLM comprises four key components: geometry-aware prompting, +an equivariant encoder, an LLM, and an equivariant adaptor. Essentially, the +LLM guided by the instructive prompt serves as a sophisticated invariant +feature processor, while 3D directional information is exclusively handled by +the equivariant encoder and adaptor modules. Experimental results demonstrate +that EquiLLM delivers significant improvements over previous methods across +molecular dynamics simulation, human motion simulation, and antibody design, +highlighting its promising generalizability. 
+ +摘要:準確預測物理系統的 3D 結構和動力學在科學應用中至關重要。現有依賴於幾何圖神經網路 (GNN) 的方法有效地強制執行了 $\mathrm{E}(3)$-等變性,但它們通常無法利用廣泛的更廣泛資訊。儘管大型語言模型 (LLM) 的直接應用可以納入外部知識,但它們缺乏保證等變性的空間推理能力。在本文中,我們提出了 EquiLLM,一個用於表示 3D 物理系統的新框架,它將 E(3)-等變性與 LLM 能力無縫整合。具體來說,EquiLLM 包含四個關鍵組成部分:感知幾何的提示、等變編碼器、LLM 和等變適配器。從本質上講,由指導性提示引導的 LLM 作為一個複雜的不變特徵處理器,而 3D 方向資訊則由等變編碼器和適配器模組獨家處理。實驗結果表明,EquiLLM 在分子動力學模擬、人類運動模擬和抗體設計方面比以前的方法有了顯著的改進,突顯了其有希望的泛化能力。 + +##### **Beyond Pairwise: Global Zero-shot Temporal Graph Generation** +2502.11114v1 by Alon Eirew, Kfir Bar, Ido Dagan + +Temporal relation extraction (TRE) is a fundamental task in natural language +processing (NLP) that involves identifying the temporal relationships between +events in a document. Despite the advances in large language models (LLMs), +their application to TRE remains limited. Most existing approaches rely on +pairwise classification, in which event pairs are considered individually, +leading to computational inefficiency and a lack of global consistency in the +resulting temporal graph. In this work, we propose a novel zero-shot method for +TRE that generates a document's complete temporal graph at once, then applies +transitive constraints optimization to refine predictions and enforce temporal +consistency across relations. Additionally, we introduce OmniTemp, a new +dataset with complete annotations for all pairs of targeted events within a +document. Through experiments and analyses, we demonstrate that our method +significantly outperforms existing zero-shot approaches while achieving +competitive performance with supervised models. 
+ +摘要:時間關係抽取 (TRE) 是自然語言處理 (NLP) 中的一項基本任務,涉及識別文件中事件之間的時間關係。儘管大型語言模型 (LLM) 取得進展,但它們在 TRE 中的應用仍然有限。現有的大多數方法依賴於成對分類,其中事件對被單獨考慮,導致計算效率低下且在生成的時序圖中缺乏全局一致性。在這項工作中,我們提出了一種新穎的 TRE 零次學習方法,它可以一次生成文件的完整時序圖,然後應用遞移約束最佳化來優化預測並強制關係之間的時間一致性。此外,我們引入了 OmniTemp,這是一個新的數據集,其中包含文件內所有目標事件對的完整註解。通過實驗和分析,我們證明了我們的方法明顯優於現有的零次學習方法,同時實現了與監督模型相當的性能。 + +##### **Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications** +2502.11108v1 by Alexandru Lecu, Adrian Groza, Lezan Hawizy + +Large language models (LLMs) have significantly advanced the field of natural +language generation. However, they frequently generate unverified outputs, +which compromises their reliability in critical applications. In this study, we +propose an innovative framework that combines structured biomedical knowledge +with LLMs through a retrieval-augmented generation technique. Our system +develops a thorough knowledge graph by identifying and refining causal +relationships and named entities from medical abstracts related to age-related +macular degeneration (AMD). Using a vector-based retrieval process and a +locally deployed language model, our framework produces responses that are both +contextually relevant and verifiable, with direct references to clinical +evidence. Experimental results show that this method notably decreases +hallucinations, enhances factual precision, and improves the clarity of +generated responses, providing a robust solution for advanced biomedical +chatbot applications. 
+ +摘要:大型語言模型 (LLM) 已大幅推動自然語言生成的領域。然而,它們經常產生未經驗證的輸出,這會損害它們在關鍵應用中的可靠性。在本研究中,我們提出了一個創新的框架,透過檢索增強生成技術,將結構化的生物醫學知識與 LLM 結合。我們的系統透過識別和精煉與年齡相關性黃斑部病變 (AMD) 相關的醫學摘要中的因果關係和命名實體,開發一個徹底的知識圖譜。我們的框架使用基於向量的檢索流程和本地部署的語言模型,產生在脈絡上相關且可驗證的回應,並直接參考臨床證據。實驗結果顯示,此方法顯著減少了幻覺、增強了事實準確性,並改善了生成回應的清晰度,為先進的生物醫學聊天機器人應用程式提供了穩健的解決方案。 + +##### **Beyond Similarity: A Gradient-based Graph Method for Instruction Tuning Data Selection** +2502.11062v1 by Yang Zhao, Li Du, Xiao Ding, Yangou Ouyang, Hepeng Wang, Kai Xiong, Jinglong Gao, Zhouhao Sun, Dongliang Xu, Yang Qing, Dongchen Li, Bing Qin, Ting Liu + +Large language models (LLMs) have shown great potential across various +industries due to their remarkable ability to generalize through instruction +tuning. However, the limited availability of domain-specific data significantly +hampers their performance on specialized tasks. While existing methods +primarily focus on selecting training data from general datasets that are +similar to the target domain, they often fail to consider the joint +distribution of instructions, resulting in inefficient learning and suboptimal +knowledge transfer. To address these challenges, we introduce G2IS +(Gradient-based Graph Instruction Selection), a novel method that constructs a +mixed gradient-based instruction graph to capture the joint distribution and +interdependencies between instructions. By accounting for the relationships +between instructions, G2IS improves domain adaptation efficiency. Additionally, +we propose a gradient walk algorithm to refine the data selection process, +enhancing both training effectiveness and efficiency. Our experiments +demonstrate that G2IS outperforms traditional methods across various domain +adaptation tasks, yielding significant performance gains, particularly in +complex, data-scarce scenarios. These results underscore the potential of G2IS +in advancing the development of large, domain-specific models. 
+ +摘要:大型語言模型 (LLM) 因其透過指令微調而具備的卓越泛化能力,在各產業中展現出極大的潛力。然而,特定領域資料的取得有限,大幅影響其在專業任務上的表現。現有方法主要專注於從與目標領域類似的通用資料集中選取訓練資料,但它們通常未能考量指令的聯合分佈,導致學習效率不彰且知識傳遞不佳。為了應對這些挑戰,我們引進 G2IS(基於梯度的圖形指令選取),這是一種創新的方法,可建構一個混合的基於梯度的指令圖形,以擷取指令之間的聯合分佈和相互依賴性。透過考量指令之間的關係,G2IS 提升了領域適應的效率。此外,我們提出了一種梯度漫步演算法來優化資料選取程序,同時提升訓練效能和效率。我們的實驗證明,G2IS 在各種領域適應任務中優於傳統方法,產生顯著的效能提升,特別是在資料稀少的複雜場景中。這些結果突顯了 G2IS 在推動大型特定領域模型發展方面的潛力。 + +##### **CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models** +2502.11008v1 by Yuefei Chen, Vivek K. Singh, Jing Ma, Ruxiang Tang + +Counterfactual reasoning is widely recognized as one of the most challenging +and intricate aspects of causality in artificial intelligence. In this paper, +we evaluate the performance of large language models (LLMs) in counterfactual +reasoning. In contrast to previous studies that primarily focus on commonsense +causal reasoning, where LLMs often rely on prior knowledge for inference, we +specifically assess their ability to perform counterfactual inference using a +set of formal rules. To support this evaluation, we introduce a new benchmark +dataset, CounterBench, comprising 1K counterfactual reasoning questions. The +dataset is designed with varying levels of difficulty, diverse causal graph +structures, distinct types of counterfactual questions, and multiple +nonsensical name variants. Our experiments demonstrate that counterfactual +reasoning poses a significant challenge for LLMs, with most models performing +at levels comparable to random guessing. To enhance LLMs' counterfactual +reasoning ability, we propose a novel reasoning paradigm, CoIn, which guides +LLMs through iterative reasoning and backtracking to systematically explore +counterfactual solutions. Experimental results show that our method +significantly improves LLM performance on counterfactual reasoning tasks and +consistently enhances performance across different LLMs. Our dataset is +available at https://huggingface.co/datasets/CounterBench/CounterBench. 
+ +摘要:反事實推理被廣泛認為是人工智慧中因果關係最具挑戰性和複雜的面向之一。在本文中,我們評估大型語言模型 (LLM) 在反事實推理中的表現。與主要關注常識因果推理,其中 LLM 經常依賴先驗知識來進行推理的先前研究不同,我們特別評估它們使用一組形式規則執行反事實推理的能力。為了支持此評估,我們引入了一個新的基準資料集 CounterBench,其中包含 1K 個反事實推理問題。資料集的設計具有不同的難度等級、多樣化的因果圖結構、不同類型的反事實問題和多種無意義的名稱變體。我們的實驗表明,反事實推理對 LLM 構成重大挑戰,大多數模型的表現與隨機猜測相當。為了增強 LLM 的反事實推理能力,我們提出了一種新穎的推理範例 CoIn,它引導 LLM 透過反覆推理和回溯系統性地探索反事實解。實驗結果表明,我們的方法顯著提升 LLM 在反事實推理任務上的表現,並持續增強不同 LLM 的表現。我們的資料集可在 https://huggingface.co/datasets/CounterBench/CounterBench 取得。 + +##### **RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation** +2502.10996v1 by Pengcheng Jiang, Lang Cao, Ruike Zhu, Minhao Jiang, Yunyi Zhang, Jimeng Sun, Jiawei Han + +Retrieval-augmented language models often struggle with knowledge-intensive +tasks due to inefficient retrieval, unstructured knowledge integration, and +single-pass architectures. We present Retrieval-And-Structuring (RAS), a novel +framework that dynamically constructs and reasons over query-specific knowledge +graphs through iterative retrieval and structuring. RAS introduces four key +technical innovations: (1) a theme-scoped retrieval mechanism that efficiently +narrows the search space while maintaining retrieval quality, (2) an action +planning module that determines knowledge needs and generates focused +sub-queries, (3) a dynamic knowledge structuring approach that converts +retrieved text into an evolving knowledge graph, and (4) a graph-augmented +answering component that leverages the accumulated structured information. Our +framework achieves state-of-the-art performance, surpassing leading baselines +by 6.4% with open-source language models and 7.0% with proprietary models on +seven knowledge-intensive generation datasets across all evaluation metrics. +Detailed ablation studies verify the contribution of each technical component +to the overall system performance. 
+ +摘要:检索增强语言模型通常会因检索效率低、知识整合无结构和单次通过架构而难以胜任知识密集型任务。我们提出检索和结构化 (RAS),这是一个新颖的框架,通过迭代检索和结构化,动态构建和推理特定于查询的知识图谱。RAS 引入了四项关键技术创新:(1) 主题范围检索机制,在保持检索质量的同时有效缩小搜索空间,(2) 动作规划模块,确定知识需求并生成重点子查询,(3) 动态知识结构化方法,将检索到的文本转换为不断发展的知识图谱,以及 (4) 图谱增强型回答组件,利用累积的结构化信息。我们的框架实现了最先进的性能,在七个知识密集型生成数据集上,使用开源语言模型提高了 6.4%,使用专有模型提高了 7.0%,超越了领先的基线,且所有评估指标均如此。详细的消融研究验证了每个技术组件对整体系统性能的贡献。 + +##### **Developing Conversational Speech Systems for Robots to Detect Speech Biomarkers of Cognition in People Living with Dementia** +2502.10896v1 by Rohith Perumandla, Young-Ho Bae, Diego Izaguirre, Esther Hwang, Andrew Murphy, Long-Jing Hsu, Selma Sabanovic, Casey C. Bennett + +This study presents the development and testing of a conversational speech +system designed for robots to detect speech biomarkers indicative of cognitive +impairments in people living with dementia (PLwD). The system integrates a +backend Python WebSocket server and a central core module with a large language +model (LLM) fine-tuned for dementia to process user input and generate robotic +conversation responses in real-time in less than 1.5 seconds. The frontend user +interface, a Progressive Web App (PWA), displays information and biomarker +score graphs on a smartphone in real-time to human users (PLwD, caregivers, +clinicians). Six speech biomarkers based on the existing literature - Altered +Grammar, Pragmatic Impairments, Anomia, Disrupted Turn-Taking, Slurred +Pronunciation, and Prosody Changes - were developed for the robot conversation +system using two datasets, one that included conversations of PLwD with a human +clinician (DementiaBank dataset) and one that included conversations of PLwD +with a robot (Indiana dataset). We also created a composite speech biomarker +that combined all six individual biomarkers into a single score. The speech +system's performance was first evaluated on the DementiaBank dataset showing +moderate correlation with MMSE scores, with the composite biomarker score +outperforming individual biomarkers. 
Analysis of the Indiana dataset revealed +higher and more variable biomarker scores, suggesting potential differences due +to study populations (e.g. severity of dementia) and the conversational +scenario (human-robot conversations are different from human-human). The +findings underscore the need for further research on the impact of +conversational scenarios on speech biomarkers and the potential clinical +applications of robotic speech systems. + +摘要:本研究展示了對話式語音系統的開發和測試,該系統專為機器人設計,用於偵測失智症患者(PLwD)認知障礙的語言生物標記。該系統整合了後端 Python WebSocket 伺服器和一個中央核心模組,其中包含針對失智症微調的大語言模型(LLM),以處理使用者輸入並在不到 1.5 秒的時間內產生機器人對話回應。前端使用者介面(漸進式網路應用程式,PWA)會在智慧型手機上即時向人類使用者(PLwD、照護者、臨床醫生)顯示資訊和生物標記評分圖表。根據現有文獻,針對機器人對話系統開發了六個語言生物標記:語法改變、實用障礙、失語症、輪流中斷、發音不清和韻律變化,使用了兩個資料集,一個包含 PLwD 與人類臨床醫生對話(DementiaBank 資料集),另一個包含 PLwD 與機器人對話(Indiana 資料集)。我們還建立了一個複合語言生物標記,將所有六個個別生物標記組合成一個單一評分。語言系統的效能首先在 DementiaBank 資料集上進行評估,顯示與 MMSE 評分有中等相關性,複合生物標記評分優於個別生物標記。對 Indiana 資料集的分析顯示出較高且變異性較大的生物標記評分,這表明由於研究族群(例如失智症的嚴重程度)和對話情境(人機對話與人際對話不同)而產生潛在差異。研究結果強調需要進一步研究對話情境對語言生物標記的影響,以及機器人語言系統的潛在臨床應用。 + +##### **Evaluating improvements on using Large Language Models (LLMs) for property extraction in the Open Research Knowledge Graph (ORKG)** +2502.10768v1 by Sandra Schaftner + +Current research highlights the great potential of Large Language Models +(LLMs) for constructing Scholarly Knowledge Graphs (SKGs). One particularly +complex step in this process is relation extraction, aimed at identifying +suitable properties to describe the content of research. This study builds +directly on previous research of three Open Research Knowledge Graph (ORKG) +team members who assessed the readiness of LLMs such as GPT-3.5, Llama 2, and +Mistral for property extraction in scientific literature. Given the moderate +performance observed, the previous work concluded that fine-tuning is needed to +improve these models' alignment with scientific tasks and their emulation of +human expertise. 
Expanding on this prior experiment, this study evaluates the +impact of advanced prompt engineering techniques and demonstrates that these +techniques can significantly enhance the results. Additionally, this +study extends the property extraction process to include property matching to +existing ORKG properties, which are retrieved via the API. The evaluation +reveals that results generated through advanced prompt engineering achieve a +higher proportion of matches with ORKG properties, further emphasizing the +enhanced alignment achieved. Moreover, this lays the groundwork for addressing +challenges such as the inconsistency of ORKG properties, an issue highlighted +in prior studies. By assigning unique URIs and using standardized terminology, +this work increases the consistency of the properties, fulfilling a crucial +aspect of Linked Data and FAIR principles - core commitments of ORKG. This, in +turn, significantly enhances the applicability of ORKG content for subsequent +tasks such as comparisons of research publications. Finally, the study +concludes with recommendations for future improvements in the overall property +extraction process. 
+ +摘要:目前的調查強調大語言模型 (LLM) 在建構學術知識圖譜 (SKG) 上的巨大潛力。此過程中特別複雜的步驟是關係萃取,目標是找出合適的屬性來描述研究內容。本研究直接建立在三位開放研究知識圖譜 (ORKG) 團隊成員先前研究的基礎上,他們評估了 GPT-3.5、Llama 2 和 Mistral 等 LLM 在科學文獻中萃取屬性的準備情況。鑑於觀察到的表現中等,先前的研究結論是需要微調,以改善這些模型與科學任務的一致性,以及它們對人類專業知識的模擬。本研究擴展了先前的實驗,評估了進階提示工程技術的影響,並證明這些技術可以大幅顯著地提升結果。此外,本研究將屬性萃取流程擴展到包含與現有 ORKG 屬性的屬性比對,這些屬性是透過 API 擷取的。評估結果顯示,透過進階提示工程產生的結果與 ORKG 屬性有更高的比對比例,進一步強調所達成的進階一致性。此外,這也為了解決先前的研究中強調的問題,例如 ORKG 屬性的不一致性,奠定了基礎。透過指定唯一的 URI 並使用標準化的術語,本研究增加了屬性的相容性,達成了連結資料和 FAIR 原則的重要層面,這是 ORKG 的核心承諾。這反過來大幅提升了 ORKG 內容在後續任務中的適用性,例如研究出版品的比較。最後,本研究以針對整體屬性萃取流程未來改進的建議作為結論。 + +##### **K-Edit: Language Model Editing with Contextual Knowledge Awareness** +2502.10626v1 by Elan Markowitz, Anil Ramakrishna, Ninareh Mehrabi, Charith Peris, Rahul Gupta, Kai-Wei Chang, Aram Galstyan + +As the world changes, we need to be able to update our models and correct +false information without costly retraining. Knowledge-based model editing +enables precise modifications to the weights of large language models in order +to modify the information encoded within. Recent approaches have seen success +in enabling recall of edited information for thousands of edits at once. +However, these approaches fail to produce edits that account for associated +contextual information. We present K-Edit, an effective approach to generating +contextually consistent knowledge edits. By using knowledge graphs, which +maintain contextual consistency when an edge is edited, we are able to generate +additional \textit{contextual edits} that ensure consistency of related +information in the language model. Our experiments demonstrate significant +improvements in multi-hop question answering while maintaining the general +effectiveness and scalability of model edits. 
+ +摘要:隨著世界變化,我們需要能夠更新我們的模型,並在不進行昂貴的重新訓練的情況下更正錯誤資訊。基於知識的模型編輯能夠對大型語言模型的權重進行精確修改,以便修改其中編碼的資訊。最近的方法在一次啟用數千次編輯的編輯資訊的召回方面取得了成功。然而,這些方法無法產生考慮相關上下文資訊的編輯。我們提出 K-Edit,這是一種產生上下文一致的知識編輯的有效方法。通過使用知識圖,在編輯邊緣時保持上下文一致性,我們能夠產生額外的「上下文編輯」,以確保語言模型中相關資訊的一致性。我們的實驗證明了多跳問題回答的顯著改進,同時保持了模型編輯的一般有效性和可擴充性。 + +##### **ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis** +2502.10620v1 by Xueshen Li, Xinlong Hou, Ziyi Huang, Yu Gan + +Recent advancements in large language models (LLMs) have demonstrated +extraordinary comprehension capabilities with remarkable breakthroughs on +various vision-language tasks. However, the application of LLMs in generating +reliable medical diagnostic reports remains in the early stages. Currently, +medical LLMs typically feature a passive interaction model where doctors +respond to patient queries with little or no involvement in analyzing medical +images. In contrast, some ChatBots simply respond to predefined queries based +on visual inputs, lacking interactive dialogue or consideration of medical +history. As such, there is a gap between LLM-generated patient-ChatBot +interactions and those occurring in actual patient-doctor consultations. To +bridge this gap, we develop an LLM-based dialogue system, namely proactive +multi-round vision-language interactions for computer-aided diagnosis +(ProMRVL-CAD), to generate patient-friendly disease diagnostic reports. The +proposed ProMRVL-CAD system allows proactive dialogue to provide patients with +constant and reliable medical access via an integration of knowledge graph into +a recommendation system. Specifically, we devise two generators: a Proactive +Question Generator (Pro-Q Gen) to generate proactive questions that guide the +diagnostic procedure and a Multi-Vision Patient-Text Diagnostic Report +Generator (MVP-DR Gen) to produce high-quality diagnostic reports. 
Evaluated on +two real-world, publicly available datasets, MIMIC-CXR and IU-Xray, our model +generates medical reports of better quality. We further demonstrate that +ProMRVL remains robust in scenarios with low image +quality. Moreover, we have created a synthetic medical dialogue dataset that +simulates proactive diagnostic interactions between patients and doctors, +serving as a valuable resource for training LLMs. + +摘要:大型語言模型 (LLM) 最近的進展已展現出非凡的理解能力,在各種視覺語言任務中取得了顯著的突破。然而,LLM 在產生可靠的醫療診斷報告中的應用仍處於早期階段。目前,醫療 LLM 通常採用被動互動模式,醫生對患者的疑問做出回應,但很少或根本不參與分析醫療影像。相比之下,有些聊天機器人僅根據視覺輸入回應預先定義的查詢,缺乏互動對話或對病史的考量。因此,LLM 產生的患者聊天機器人互動與實際患者醫生諮詢之間存在差距。為了彌合這一差距,我們開發了一個基於 LLM 的對話系統,即主動多輪視覺語言互動,用於電腦輔助診斷 (ProMRVL-CAD),以產生對患者友善的疾病診斷報告。建議的 ProMRVL-CAD 系統允許主動對話,透過將知識圖譜整合到推薦系統中,為患者提供持續且可靠的醫療管道。具體來說,我們設計了兩個產生器:主動問題產生器 (Pro-Q Gen),用於產生引導診斷程序的主動問題,以及多視覺患者文字診斷報告產生器 (MVP-DR Gen),用於產生高品質的診斷報告。評估兩個真實世界公開可用的資料集,MIMIC-CXR 和 IU-Xray,我們的模型在產生醫療報告方面品質較佳。我們進一步證明 ProMRVL 的效能,在影像品質低的情況下仍能穩健運行。此外,我們建立了一個模擬患者和醫生之間主動診斷互動的合成醫療對話資料集,作為訓練 LLM 的寶貴資源。 + +##### **GraphiT: Efficient Node Classification on Text-Attributed Graphs with Prompt Optimized LLMs** +2502.10522v1 by Shima Khoshraftar, Niaz Abedini, Amir Hajian + +The application of large language models (LLMs) to graph data has attracted a +lot of attention recently. LLMs allow us to use deep contextual embeddings from +pretrained models in text-attributed graphs, where shallow embeddings are often +used for the text attributes of nodes. However, it is still challenging to +efficiently encode the graph structure and features into a sequential form for +use by LLMs. In addition, the performance of an LLM alone is highly dependent +on the structure of the input prompt, which limits its effectiveness as a +reliable approach and often requires iterative manual adjustments that could be +slow, tedious and difficult to replicate programmatically. 
In this paper, we +propose GraphiT (Graphs in Text), a framework for encoding graphs into a +textual format and optimizing LLM prompts for graph prediction tasks. Here we +focus on node classification for text-attributed graphs. We encode the graph +data for every node and its neighborhood into a concise text to enable LLMs to +better utilize the information in the graph. We then further programmatically +optimize the LLM prompts using the DSPy framework to automate this step and +make it more efficient and reproducible. GraphiT outperforms our LLM-based +baselines on three datasets and we show how the optimization step in GraphiT +leads to measurably better results without manual prompt tweaking. We also +demonstrate that our graph encoding approach is competitive with other graph +encoding methods while being less expensive because it uses significantly fewer +tokens for the same task. + +摘要:大型語言模型 (LLM) 在圖表資料的應用最近備受關注。LLM 讓我們能夠在文字標記圖表中使用預訓練模型的深度脈絡嵌入,其中淺層嵌入通常用於節點的文字屬性。然而,要有效率地將圖表結構和特徵編碼成序列形式供 LLM 使用,仍然是一項挑戰。此外,單獨 LLM 的效能高度依賴輸入提示的結構,這限制了它們作為可靠方法的有效性,而且通常需要反覆的人工調整,這可能會緩慢、繁瑣且難以透過程式複製。在本文中,我們提出 GraphiT(文字中的圖表),一個用於將圖表編碼成文字格式並最佳化 LLM 提示以進行圖表預測任務的架構。在這裡,我們專注於文字標記圖表的節點分類。我們將每個節點及其鄰域的圖表資料編碼成簡潔的文字,讓 LLM 能夠更好地利用圖表中的資訊。然後,我們進一步透過程式最佳化 LLM 提示,使用 DSPy 架構自動化這個步驟,並使其更有效率且可複製。GraphiT 在三個資料集上優於我們的基於 LLM 的基準,我們展示了 GraphiT 中的最佳化步驟如何導致顯著更好的結果,而無需手動調整提示。我們還證明了我們的圖表編碼方法與其他圖表編碼方法具有競爭力,同時成本更低,因為它在相同的任務中使用了顯著更少的標記。 + +##### **Do Large Language Models Reason Causally Like Us? Even Better?** +2502.10215v1 by Hanna M. Dettki, Brenden M. Lake, Charley M. Wu, Bob Rehder + +Causal reasoning is a core component of intelligence. Large language models +(LLMs) have shown impressive capabilities in generating human-like text, +raising questions about whether their responses reflect true understanding or +statistical patterns. 
We compared causal reasoning in humans and four LLMs +using tasks based on collider graphs, rating the likelihood of a query variable +occurring given evidence from other variables. We find that LLMs reason +causally along a spectrum from human-like to normative inference, with +alignment shifting based on model, context, and task. Overall, GPT-4o and +Claude showed the most normative behavior, including "explaining away", whereas +Gemini-Pro and GPT-3.5 did not. Although all agents deviated from the expected +independence of causes - Claude the least - they exhibited strong associative +reasoning and predictive inference when assessing the likelihood of the effect +given its causes. These findings underscore the need to assess AI biases as +they increasingly assist human decision-making. + +摘要:因果推理是智能的核心組成部分。大型語言模型 (LLM) 在生成類人文本方面展現了令人印象深刻的能力,引發了關於它們的回應是否反映真實理解或統計模式的疑問。我們使用基於碰撞圖的任務比較了人類和四個 LLM 中的因果推理,根據其他變數的證據評估查詢變數發生的可能性。我們發現 LLM 沿著從類人到規範推論的光譜進行因果推理,對齊會根據模型、上下文和任務而改變。總體而言,GPT-4o 和 Claude 表現出最規範的行為,包括「解釋」,而 Gemini-Pro 和 GPT-3.5 則沒有。儘管所有代理都偏離了預期的原因獨立性 - Claude 最不偏離 - 但它們在評估給定原因的效果可能性時表現出強烈的關聯推理和預測推論。這些發現強調了評估 AI 偏差的必要性,因為它們越來越協助人類決策。 + +##### **Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages** +2502.10140v1 by Daniil Gurgurov, Ivan Vykopal, Josef van Genabith, Simon Ostermann + +Low-resource languages (LRLs) face significant challenges in natural language +processing (NLP) due to limited data. While current state-of-the-art large +language models (LLMs) still struggle with LRLs, smaller multilingual models +(mLMs) such as mBERT and XLM-R offer greater promise due to a better fit of +their capacity to low training data sizes. This study systematically +investigates parameter-efficient adapter-based methods for adapting mLMs to +LRLs, evaluating three architectures: Sequential Bottleneck, Invertible +Bottleneck, and Low-Rank Adaptation. 
Using unstructured text from GlotCC and +structured knowledge from ConceptNet, we show that small adaptation datasets +(e.g., up to 1 GB of free-text or a few MB of knowledge graph data) yield gains +in intrinsic (masked language modeling) and extrinsic tasks (topic +classification, sentiment analysis, and named entity recognition). We find that +Sequential Bottleneck adapters excel in language modeling, while Invertible +Bottleneck adapters slightly outperform other methods on downstream tasks due +to better embedding alignment and larger parameter counts. Adapter-based +methods match or outperform full fine-tuning while using far fewer parameters, +and smaller mLMs prove more effective for LRLs than massive LLMs like LLaMA-3, +GPT-4, and DeepSeek-R1-based distilled models. While adaptation improves +performance, pre-training data size remains the dominant factor, especially for +languages with extensive pre-training coverage. + +摘要:低資源語言 (LRL) 由於資料有限,在自然語言處理 (NLP) 中面臨重大挑戰。雖然當前最先進的大型語言模型 (LLM) 仍難以處理 LRL,但較小的多語言模型 (mLMS),例如 mBERT 和 XLM-R,由於其容量更適合低訓練資料大小,因此提供了更大的希望。本研究系統性地探討了基於參數效率適配器的適配方法,以將 mLMS 適配到 LRL,評估了三種架構:順序瓶頸、可逆瓶頸和低秩適配。使用來自 GlotCC 的非結構化文本和來自 ConceptNet 的結構化知識,我們表明小型適配資料集(例如,高達 1 GB 的自由文本或幾 MB 的知識圖譜資料)在內在(遮蔽語言模型)和外在任務(主題分類、情緒分析和命名實體識別)中產生增益。我們發現順序瓶頸適配器在語言模型中表現出色,而可逆瓶頸適配器由於更好的嵌入對齊和更大的參數數量,在下游任務上略勝於其他方法。基於適配器的方法在使用更少參數的同時,可以匹配或優於完全微調,而較小的 mLM 被證明比 LLaMA-3、GPT-4 和基於 DeepSeek-R1 的蒸餾模型等大型 LLM 更適合 LRL。雖然適配可以提高效能,但預訓練資料大小仍然是主要因素,特別是對於預訓練覆蓋範圍廣泛的語言。 + +##### **Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models** +2502.10090v1 by Chenrui Tie, Shengxiang Sun, Jinxuan Zhu, Yiwei Liu, Jingxiang Guo, Yue Hu, Haonan Chen, Junting Chen, Ruihai Wu, Lin Shao + +Humans possess an extraordinary ability to understand and execute complex +manipulation tasks by interpreting abstract instruction manuals. 
For robots, +however, this capability remains a substantial challenge, as they cannot +interpret abstract instructions and translate them into executable actions. In +this paper, we present Manual2Skill, a novel framework that enables robots to +perform complex assembly tasks guided by high-level manual instructions. Our +approach leverages a Vision-Language Model (VLM) to extract structured +information from instructional images and then uses this information to +construct hierarchical assembly graphs. These graphs represent parts, +subassemblies, and the relationships between them. To facilitate task +execution, a pose estimation model predicts the relative 6D poses of components +at each assembly step. At the same time, a motion planning module generates +actionable sequences for real-world robotic implementation. We demonstrate the +effectiveness of Manual2Skill by successfully assembling several real-world +IKEA furniture items. This application highlights its ability to manage +long-horizon manipulation tasks with both efficiency and precision, +significantly enhancing the practicality of robot learning from instruction +manuals. This work marks a step forward in advancing robotic systems capable of +understanding and executing complex manipulation tasks in a manner akin to +human capabilities. 
+ +摘要:人類擁有理解並執行複雜操作任務的非凡能力,方法是詮釋抽象的說明手冊。然而,對機器人來說,這項能力仍然是一項重大的挑戰,因為它們無法詮釋抽象的指令並將其轉換為可執行的動作。在本文中,我們提出了 Manual2Skill,這是一個新穎的框架,使機器人能夠在高階手冊說明的指導下執行複雜的組裝任務。我們的做法利用視覺語言模型 (VLM) 從教學圖片中提取結構化資訊,然後使用此資訊來建構階層式組裝圖。這些圖表示零件、子組件以及它們之間的關係。為了促進任務執行,姿勢估計模型會預測每個組裝步驟中組件的相對 6D 姿勢。同時,動作規劃模組會產生適用於實際機器人實作的可操作順序。我們透過成功組裝幾個真實世界的 IKEA 家具來展示 Manual2Skill 的有效性。此應用程式突顯了它以高效率和高精準度管理長時程操作任務的能力,大幅提升機器人從說明手冊中學習的實用性。這項工作標誌著機器人系統在理解和執行複雜操作任務方面向前邁進了一步,其方式類似於人類的能力。 + +##### **Decision Information Meets Large Language Models: The Future of Explainable Operations Research** +2502.09994v1 by Yansen Zhang, Qingcan Kang, Wing Yin Yu, Hailei Gong, Xiaojin Fu, Xiongwei Han, Tao Zhong, Chen Ma + +Operations Research (OR) is vital for decision-making in many industries. +While recent OR methods have seen significant improvements in automation and +efficiency through integrating Large Language Models (LLMs), they still +struggle to produce meaningful explanations. This lack of clarity raises +concerns about transparency and trustworthiness in OR applications. To address +these challenges, we propose a comprehensive framework, Explainable Operations +Research (EOR), emphasizing actionable and understandable explanations +accompanying optimization. The core of EOR is the concept of Decision +Information, which emerges from what-if analysis and focuses on evaluating the +impact of complex constraints (or parameters) changes on decision-making. +Specifically, we utilize bipartite graphs to quantify the changes in the OR +model and adopt LLMs to improve the explanation capabilities. Additionally, we +introduce the first industrial benchmark to rigorously evaluate the +effectiveness of explanations and analyses in OR, establishing a new standard +for transparency and clarity in the field. 
+ +摘要:作業研究 (OR) 對許多產業的決策制定至關重要。雖然近期的 OR 方法已透過整合大型語言模型 (LLM) 在自動化和效率方面取得顯著的進步,但它們在產生有意義的解釋方面仍面臨挑戰。這種缺乏明確性的情況會對 OR 應用中的透明度和可信度造成疑慮。為了應對這些挑戰,我們提出一個全面的架構,即可解釋作業研究 (EOR),強調在最佳化過程中提供可操作且易於理解的解釋。EOR 的核心是決策資訊的概念,它源自假設分析,並專注於評估複雜約束條件 (或參數) 變更對決策制定的影響。具體來說,我們利用二部圖量化 OR 模型的變化,並採用 LLM 來改善解釋能力。此外,我們引入了第一個產業基準,以嚴格評估 OR 中解釋和分析的有效性,為該領域的透明度和清晰度建立新的標準。 + +##### **KGGen: Extracting Knowledge Graphs from Plain Text with Language Models** +2502.09956v1 by Belinda Mo, Kyssen Yu, Joshua Kazdan, Proud Mpala, Lisa Yu, Chris Cundy, Charilaos Kanatsoulis, Sanmi Koyejo + +Recent interest in building foundation models for KGs has highlighted a +fundamental challenge: knowledge-graph data is relatively scarce. The +best-known KGs are primarily human-labeled, created by pattern-matching, or +extracted using early NLP techniques. While human-generated KGs are in short +supply, automatically extracted KGs are of questionable quality. We present a +solution to this data scarcity problem in the form of a text-to-KG generator +(KGGen), a package that uses language models to create high-quality graphs from +plaintext. Unlike other KG extractors, KGGen clusters related entities to +reduce sparsity in extracted KGs. KGGen is available as a Python library +(`pip install kg-gen`), making it accessible to everyone. Along with +KGGen, we release the first benchmark, Measure of Information in Nodes and +Edges (MINE), that tests an extractor's ability to produce a useful KG from +plain text. We benchmark our new tool against existing extractors and +demonstrate far superior performance.
+ +摘要:最近对于构建知识图谱基础模型的兴趣凸显了一个基本挑战:知识图谱数据相对稀缺。最知名的知识图谱主要为人标注,由模式匹配创建,或使用早期自然语言处理技术提取。虽然人生成的知识图谱供不应求,但自动提取的知识图谱质量堪忧。我们以文本到知识图谱生成器 (KGGen) 的形式为这一数据稀缺问题提供了一个解决方案,这是一个使用语言模型从纯文本创建高质量图表的包。与其他知识图谱提取器不同,KGGen 对相关实体进行聚类以减少提取的知识图谱中的稀疏性。KGGen 可用作 Python 库(\texttt{pip install kg-gen}),使其所有人都能访问。除了 KGGen,我们还发布了第一个基准测试,即节点和边信息度量 (MINE),它测试了提取器从纯文本生成有用知识图谱的能力。我们针对现有提取器对我们的新工具进行基准测试,并展示了远超其性能。 + +##### **ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation** +2502.09891v1 by Shu Wang, Yixiang Fang, Yingli Zhou, Xilin Liu, Yuchi Ma + +Retrieval-Augmented Generation (RAG) has proven effective in integrating +external knowledge into large language models (LLMs) for question-answer (QA) +tasks. The state-of-the-art RAG approaches often use the graph data as the +external data since they capture the rich semantic information and link +relationships between entities. However, existing graph-based RAG approaches +cannot accurately identify the relevant information from the graph and also +consume large numbers of tokens in the online retrieval process. To address +these issues, we introduce a novel graph-based RAG approach, called Attributed +Community-based Hierarchical RAG (ArchRAG), by augmenting the question using +attributed communities, and also introducing a novel LLM-based hierarchical +clustering method. To retrieve the most relevant information from the graph for +the question, we build a novel hierarchical index structure for the attributed +communities and develop an effective online retrieval method. Experimental +results demonstrate that ArchRAG outperforms existing methods in terms of both +accuracy and token cost. 
+ +摘要:檢索增強生成 (RAG) 已證明可將外部知識整合到大型語言模型 (LLM),用於問答 (QA) 任務。最先進的 RAG 方法通常使用圖形資料作為外部資料,因為它們擷取了豐富的語意資訊和實體之間的連結關係。然而,現有的基於圖形的 RAG 方法無法準確識別圖形中的相關資訊,而且在線上檢索過程中也會消耗大量的符號。為了解決這些問題,我們提出了一種新穎的基於圖形的 RAG 方法,稱為基於屬性社群的分層 RAG (ArchRAG),透過使用屬性社群來擴充問題,並引入一種新穎的基於 LLM 的分層聚類方法。為了從圖形中檢索與問題最相關的資訊,我們為屬性社群建立了一個新穎的分層索引結構,並開發了一種有效的線上檢索方法。實驗結果證明,ArchRAG 在準確性和符號成本方面都優於現有方法。 + +##### **Visual Graph Question Answering with ASP and LLMs for Language Parsing** +2502.09211v1 by Jakob Johannes Bauer, Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch + +Visual Question Answering (VQA) is a challenging problem that requires +processing multimodal input. Answer-Set Programming (ASP) has shown great +potential in this regard to add interpretability and explainability to modular +VQA architectures. In this work, we address the problem of how to integrate ASP +with modules for vision and natural language processing to solve a new and +demanding VQA variant that is concerned with images of graphs (not graphs in +symbolic form). Images containing graph-based structures are a ubiquitous and +popular form of visualisation. Here, we deal with the particular problem of +graphs inspired by transit networks, and we introduce a novel dataset that +amends an existing one by adding images of graphs that resemble metro lines. +Our modular neuro-symbolic approach combines optical graph recognition for +graph parsing, a pretrained optical character recognition neural network for +parsing labels, Large Language Models (LLMs) for language processing, and ASP +for reasoning. This method serves as a first baseline and achieves an overall +average accuracy of 73% on the dataset. Our evaluation provides further +evidence of the potential of modular neuro-symbolic systems, in particular with +pretrained models that do not involve any further training and logic +programming for reasoning, to solve complex VQA tasks.
+ +摘要:視覺問答(VQA)是一項具有挑戰性的問題,需要處理多模態輸入。答案集程式設計(ASP)在這方面顯示出巨大的潛力,可以為模組化 VQA 架構增加可解釋性和說明性。在這項工作中,我們探討如何將 ASP 與視覺和自然語言處理模組整合,以解決一個新的且要求嚴格的 VQA 變體,該變體與圖形影像(而非符號形式的圖形)有關。包含圖形結構的影像是一種普遍且流行的可視化形式。在這裡,我們處理受交通網路啟發的圖形特定問題,並引入一個新的資料集,透過新增類似地鐵路線的圖形影像來修正現有資料集。我們的模組化神經符號方法結合光學圖形辨識進行圖形解析、預先訓練的光學字元辨識神經網路進行標籤解析、大型語言模型(LLM)進行語言處理,以及 ASP 進行推理。此方法作為第一個基準,在資料集上達到 73% 的整體平均準確度。我們的評估進一步證明了模組化神經符號系統的潛力,特別是預先訓練的模型,這些模型不涉及任何進一步的訓練和邏輯程式設計進行推理,以解決複雜的 VQA 任務。 + +##### **Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data** +2502.08547v1 by Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai + +The adoption of EHRs has expanded opportunities to leverage data-driven +algorithms in clinical care and research. A major bottleneck in effectively +conducting multi-institutional EHR studies is the data heterogeneity across +systems with numerous codes that either do not exist or represent different +clinical concepts across institutions. The need for data privacy further limits +the feasibility of including multi-institutional patient-level data required to +study similarities and differences across patient subgroups. To address these +challenges, we developed the GAME algorithm. 
Tested and validated across 7 +institutions and 2 languages, GAME integrates data at several levels: (1) at +the institutional level with knowledge graphs to establish relationships +between codes and existing knowledge sources, providing the medical context for +standard codes and their relationship to each other; (2) between institutions, +leveraging language models to determine the relationships between +institution-specific codes with established standard codes; and (3) quantifying +the strength of the relationships between codes using a graph attention +network. Jointly trained embeddings are created using transfer and federated +learning to preserve data privacy. In this study, we demonstrate the +applicability of GAME in selecting relevant features as inputs for AI-driven +algorithms in a range of conditions, e.g., heart failure, rheumatoid arthritis. +We then highlight the application of GAME-harmonized multi-institutional EHR +data in a study of Alzheimer's disease outcomes and suicide risk among patients +with mental health disorders, without sharing patient-level data outside +individual institutions. + +摘要:電子健康紀錄的採用擴大了在臨床照護和研究中利用資料驅動演算法的機會。在有效進行多機構電子健康紀錄研究時,一個主要的瓶頸是系統間資料異質性,其中有許多代碼在機構間不存在或表示不同的臨床概念。資料隱私的需求進一步限制了納入多機構患者層級資料的可行性,而這些資料對於研究患者亞群之間的相似性和差異性是必要的。為了應對這些挑戰,我們開發了 GAME 演算法。GAME 已在 7 個機構和 2 種語言中進行測試和驗證,它整合了多個層級的資料:(1) 在機構層級,使用知識圖表來建立代碼和現有知識來源之間的關係,為標準代碼及其彼此之間的關係提供醫療背景;(2) 在機構之間,利用語言模型來確定機構特定代碼與已建立的標準代碼之間的關係;(3) 使用圖形注意網路量化代碼之間關係的強度。使用遷移和聯合學習建立聯合訓練的嵌入,以保護資料隱私。在本研究中,我們展示了 GAME 在選擇相關特徵作為 AI 驅動演算法輸入時的適用性,適用於各種情況,例如心臟衰竭、類風濕性關節炎。然後,我們重點介紹了 GAME 和諧化多機構電子健康紀錄資料在阿茲海默症疾病結果和精神疾病患者自殺風險研究中的應用,而無需在個別機構之外共享患者層級資料。 + +##### **Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy** +2502.08353v1 by Ruizhan Xue, Huimin Deng, Fang He, Maojun Wang, Zeyu Zhang + +With the extensive application of Graph Neural Networks (GNNs) across various +domains, their trustworthiness has emerged as a focal point of research.
Some +existing studies have shown that the integration of large language models +(LLMs) can improve the semantic understanding and generation capabilities of +GNNs, which in turn improves the trustworthiness of GNNs from various aspects. +Our review introduces a taxonomy that offers researchers a clear framework for +comprehending the principles and applications of different methods and helps +clarify the connections and differences among various approaches. Then we +systematically survey representative approaches along the four categories of +our taxonomy. Through our taxonomy, researchers can understand the applicable +scenarios, potential advantages, and limitations of each approach for the +trusted integration of GNNs with LLMs. Finally, we present some promising +directions of work and future trends for the integration of LLMs and GNNs to +improve model trustworthiness. + +摘要:隨著圖神經網路 (GNN) 在各種領域的廣泛應用,其可信度已成為研究的焦點。一些現有研究表明,整合大型語言模型 (LLM) 可以提升 GNN 的語意理解和生成能力,進而從各方面提升 GNN 的可信度。我們的評論介紹了一種分類法,為研究人員提供了一個清晰的架構,用於理解不同方法的原理和應用,並有助於釐清各種方法之間的關聯和差異。然後,我們系統性地針對分類法的四個類別進行代表性方法的調查。研究人員透過我們的分類法,可以了解每種方法在 GNN 與 LLM 的可信整合中適用的場景、潛在優點和限制。最後,我們提出 LLM 與 GNN 整合的一些有前景的工作方向和未來趨勢,以提升模型的可信度。 + +##### **Graph Foundation Models for Recommendation: A Comprehensive Survey** +2502.08346v3 by Bin Wu, Yihang Wang, Yuanhao Zeng, Jiawei Liu, Jiashu Zhao, Cheng Yang, Yawen Li, Long Xia, Dawei Yin, Chuan Shi + +Recommender systems (RS) serve as a fundamental tool for navigating the vast +expanse of online information, with deep learning advancements playing an +increasingly important role in improving ranking accuracy. Among these, graph +neural networks (GNNs) excel at extracting higher-order structural information, +while large language models (LLMs) are designed to process and comprehend +natural language, making both approaches highly effective and widely adopted.
+Recent research has focused on graph foundation models (GFMs), which integrate +the strengths of GNNs and LLMs to model complex RS problems more efficiently by +leveraging the graph-based structure of user-item relationships alongside +textual understanding. In this survey, we provide a comprehensive overview of +GFM-based RS technologies by introducing a clear taxonomy of current +approaches, diving into methodological details, and highlighting key challenges +and future directions. By synthesizing recent advancements, we aim to offer +valuable insights into the evolving landscape of GFM-based recommender systems. + +摘要:推薦系統 (RS) 是用於導航廣闊的線上資訊的基本工具,深度學習的進步在提升排名準確度方面扮演著日益重要的角色。其中,圖形神經網路 (GNN) 擅長萃取高階結構資訊,而大型語言模型 (LLM) 則設計用於處理和理解自然語言,這使得這兩種方法都非常有效且廣泛採用。最近的研究專注於圖形基礎模型 (GFM),它整合了 GNN 和 LLM 的優點,透過利用使用者與項目關係的圖形化結構以及文字理解,更有效率地建構複雜的 RS 問題模型。在這項調查中,我們透過介紹當前方法的明確分類、深入探討方法論細節,以及強調關鍵挑戰和未來方向,提供了 GFM 為基礎的 RS 技術的全面概觀。透過綜合最近的進展,我們旨在提供對 GFM 為基礎的推薦系統不斷演變的版圖的寶貴見解。 + +##### **Self-Evaluation for Job-Shop Scheduling** +2502.08684v1 by Imanol Echeverria, Maialen Murua, Roberto Santana + +Combinatorial optimization problems, such as scheduling and route planning, +are crucial in various industries but are computationally intractable due to +their NP-hard nature. Neural Combinatorial Optimization methods leverage +machine learning to address these challenges but often depend on sequential +decision-making, which is prone to error accumulation as small mistakes +propagate throughout the process. Inspired by self-evaluation techniques in +Large Language Models, we propose a novel framework that generates and +evaluates subsets of assignments, moving beyond traditional stepwise +approaches. Applied to the Job-Shop Scheduling Problem, our method integrates a +heterogeneous graph neural network with a Transformer to build a policy model +and a self-evaluation function. 
Experimental validation on challenging, +well-known benchmarks demonstrates the effectiveness of our approach, +surpassing state-of-the-art methods. + +摘要:組合優化問題,例如排程和路線規劃,在各行各業中至關重要,但由於它們的 NP 難度,在計算上難以處理。神經組合優化方法利用機器學習來解決這些挑戰,但通常依賴於序貫決策制定,而序貫決策制定容易發生錯誤累積,因為小錯誤會在整個過程中傳播。受大型語言模型中的自我評估技術啟發,我們提出了一個新的框架,可生成和評估作業子集,超越傳統的分步方法。應用於工作車間排程問題,我們的方法將異質圖神經網路與 Transformer 整合在一起,以建立策略模型和自我評估函數。在具有挑戰性的著名基準上的實驗驗證證明了我們方法的有效性,超越了最先進的方法。 + +##### **Improving Existing Optimization Algorithms with LLMs** +2502.08298v1 by Camilo Chacón Sartori, Christian Blum + +The integration of Large Language Models (LLMs) into optimization has created +a powerful synergy, opening exciting research opportunities. This paper +investigates how LLMs can enhance existing optimization algorithms. Using their +pre-trained knowledge, we demonstrate their ability to propose innovative +heuristic variations and implementation strategies. To evaluate this, we +applied a non-trivial optimization algorithm, Construct, Merge, Solve and Adapt +(CMSA) -- a hybrid metaheuristic for combinatorial optimization problems that +incorporates a heuristic in the solution construction phase. Our results show +that an alternative heuristic proposed by GPT-4o outperforms the +expert-designed heuristic of CMSA, with the performance gap widening on larger +and denser graphs. 
Project URL: https://imp-opt-algo-llms.surge.sh/ + +摘要:大型语言模型 (LLM) 与优化相结合,创造了一种强大的协同作用,开启了令人兴奋的研究机会。本文探讨了 LLM 如何增强现有的优化算法。利用其预先训练的知识,我们展示了它们提出创新启发式变体和实施策略的能力。为了评估这一点,我们应用了一种非平凡的优化算法,构建、合并、求解和适应 (CMSA)——一种用于组合优化问题的混合元启发式算法,它在求解构建阶段纳入了启发式算法。我们的结果表明,GPT-4o 提出的替代启发式算法优于 CMSA 的专家设计的启发式算法,并且随着图形变得更大、更密集,性能差距也在扩大。项目网址:https://imp-opt-algo-llms.surge.sh/ + +##### **LLM4GNAS: A Large Language Model Based Toolkit for Graph Neural Architecture Search** +2502.10459v1 by Yang Gao, Hong Yang, Yizhi Chen, Junxian Wu, Peng Zhang, Haishuai Wang + +Graph Neural Architecture Search (GNAS) facilitates the automatic design of +Graph Neural Networks (GNNs) tailored to specific downstream graph learning +tasks. However, existing GNAS approaches often require manual adaptation to new +graph search spaces, necessitating substantial code optimization and +domain-specific knowledge. To address this challenge, we present LLM4GNAS, a +toolkit for GNAS that leverages the generative capabilities of Large Language +Models (LLMs). LLM4GNAS includes an algorithm library for graph neural +architecture search algorithms based on LLMs, enabling the adaptation of GNAS +methods to new search spaces through the modification of LLM prompts. This +approach reduces the need for manual intervention in algorithm adaptation and +code modification. The LLM4GNAS toolkit is extensible and robust, incorporating +LLM-enhanced graph feature engineering, LLM-enhanced graph neural architecture +search, and LLM-enhanced hyperparameter optimization. Experimental results +indicate that LLM4GNAS outperforms existing GNAS methods on tasks involving +both homogeneous and heterogeneous graphs. 
+ +摘要:圖形神經架構搜尋 (GNAS) 促進圖形神經網路 (GNN) 的自動設計,以符合特定下游圖形學習任務。然而,現有的 GNAS 方法通常需要手動調整至新的圖形搜尋空間,這需要大量的程式碼最佳化和領域特定知識。為了應對這項挑戰,我們提出 LLM4GNAS,一個利用大型語言模型 (LLM) 的生成能力的 GNAS 工具包。LLM4GNAS 包含一個基於 LLM 的圖形神經架構搜尋演算法函式庫,讓 GNAS 方法能夠透過修改 LLM 提示來適應新的搜尋空間。這種方法減少了演算法適應和程式碼修改中手動介入的需要。LLM4GNAS 工具包具有可擴充性和穩健性,整合了 LLM 增強的圖形特徵工程、LLM 增強的圖形神經架構搜尋和 LLM 增強的超參數最佳化。實驗結果表明,LLM4GNAS 在涉及同質和異質圖形的任務上優於現有的 GNAS 方法。 + +##### **ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning** +2502.08148v1 by Vy Vo, Lizhen Qu, Tao Feng, Yuncheng Hua, Xiaoxi Kang, Songhai Fan, Tim Dwyer, Lay-Ki Soon, Gholamreza Haffari + +Identifying cause-and-effect relationships is critical to understanding +real-world dynamics and ultimately causal reasoning. Existing methods for +identifying event causality in NLP, including those based on Large Language +Models (LLMs), exhibit difficulties in out-of-distribution settings due to the +limited scale and heavy reliance on lexical cues within available benchmarks. +Modern benchmarks, inspired by probabilistic causal inference, have attempted +to construct causal graphs of events as a robust representation of causal +knowledge, where \texttt{CRAB} \citep{romanou2023crab} is one such recent +benchmark along this line. In this paper, we introduce \texttt{ACCESS}, a +benchmark designed for discovery and reasoning over abstract causal events. +Unlike existing resources, \texttt{ACCESS} focuses on causality of everyday +life events on the abstraction level. We propose a pipeline for identifying +abstractions for event generalizations from \texttt{GLUCOSE} +\citep{mostafazadeh-etal-2020-glucose}, a large-scale dataset of implicit +commonsense causal knowledge, from which we subsequently extract $1,4$K causal +pairs. Our experiments highlight the ongoing challenges of using statistical +methods and/or LLMs for automatic abstraction identification and causal +discovery in NLP. 
Nonetheless, we demonstrate that the abstract causal +knowledge provided in \texttt{ACCESS} can be leveraged for enhancing QA +reasoning performance in LLMs. + +摘要:找出因果關係對於理解現實世界的動態和最終的因果推理至關重要。現有的 NLP 事件因果關係識別方法,包括基於大型語言模型 (LLM) 的方法,由於規模有限且過度依賴於可用基準中的詞彙線索,在分佈外環境中表現出困難。受機率因果推論啟發的現代基準已嘗試建構事件的因果圖,作為因果知識的強健表示,其中 \texttt{CRAB} \citep{romanou2023crab} 是這條路徑上最近的一個基準。在本文中,我們介紹 \texttt{ACCESS},一個專門設計來探索和推理抽象因果事件的基準。與現有資源不同,\texttt{ACCESS} 專注於抽象層面上日常生活事件的因果關係。我們提出一個管道,用於從 \texttt{GLUCOSE} \citep{mostafazadeh-etal-2020-glucose} 找出事件概括的抽象,\texttt{GLUCOSE} 是隱含常識因果知識的大規模資料集,我們隨後從中萃取出 1,4K 因果對。我們的實驗突顯出使用統計方法和/或 LLM 進行 NLP 中的自動抽象識別和因果發現的持續挑戰。儘管如此,我們證明了 \texttt{ACCESS} 中提供的抽象因果知識可用於增強 LLM 中的問答推理效能。 + +##### **Neuro-Conceptual Artificial Intelligence: Integrating OPM with Deep Learning to Enhance Question Answering Quality** +2502.09658v1 by Xin Kang, Veronika Shteingardt, Yuhan Wang, Dov Dori + +Knowledge representation and reasoning are critical challenges in Artificial +Intelligence (AI), particularly in integrating neural and symbolic approaches +to achieve explainable and transparent AI systems. Traditional knowledge +representation methods often fall short of capturing complex processes and +state changes. We introduce Neuro-Conceptual Artificial Intelligence (NCAI), a +specialization of the neuro-symbolic AI approach that integrates conceptual +modeling using Object-Process Methodology (OPM) ISO 19450:2024 with deep +learning to enhance question-answering (QA) quality. By converting natural +language text into OPM models using in-context learning, NCAI leverages the +expressive power of OPM to represent complex OPM elements (processes, objects, +and states) beyond what traditional triplet-based knowledge graphs can easily +capture. This rich structured knowledge representation improves reasoning +transparency and answer accuracy in an OPM-QA system.
We further propose +transparency evaluation metrics to quantitatively measure how faithfully the +predicted reasoning aligns with OPM-based conceptual logic. Our experiments +demonstrate that NCAI outperforms traditional methods, highlighting its +potential for advancing neuro-symbolic AI by providing rich knowledge +representations, measurable transparency, and improved reasoning. + +摘要:知識表徵與推理是人工智慧 (AI) 中的重大挑戰,特別是在整合神經與符號方法以實現可解釋且透明的人工智慧系統時。傳統的知識表徵方法通常無法捕捉複雜的流程和狀態變化。我們引入了神經概念人工智慧 (NCAI),一種神經符號 AI 方法的專門化,它將使用物件流程方法 (OPM) ISO 19450:2024 的概念建模與深度學習整合在一起,以提升問答 (QA) 的品質。透過使用情境學習將自然語言文字轉換為 OPM 模型,NCAI 充分利用 OPM 的表達能力來表徵複雜的 OPM 元素(流程、物件和狀態),超越傳統的三元組知識圖表容易捕捉的範圍。這種豐富的結構化知識表徵改善了 OPM-QA 系統中的推理透明度和答案準確度。我們進一步提出了透明度評估指標,以量化測量預測推理與基於 OPM 的概念邏輯的吻合程度。我們的實驗證明,NCAI 優於傳統方法,突顯了它在透過提供豐富的知識表徵、可測量的透明度和改善的推理來推進神經符號 AI 的潛力。 + +##### **GCoT: Chain-of-Thought Prompt Learning for Graphs** +2502.08092v1 by Xingtong Yu, Chang Zhou, Zhongwei Kuai, Xinming Zhang, Yuan Fang + +Chain-of-thought (CoT) prompting has achieved remarkable success in natural +language processing (NLP). However, its vast potential remains largely +unexplored for graphs. This raises an interesting question: How can we design +CoT prompting for graphs to guide graph models to learn step by step? On one +hand, unlike natural languages, graphs are non-linear and characterized by +complex topological structures. On the other hand, many graphs lack textual +data, making it difficult to formulate language-based CoT prompting. In this +work, we propose the first CoT prompt learning framework for text-free graphs, +GCoT. Specifically, we decompose the adaptation process for each downstream +task into a series of inference steps, with each step consisting of +prompt-based inference, ``thought'' generation, and thought-conditioned prompt +learning. While the steps mimic CoT prompting in NLP, the exact mechanism +differs significantly. 
Specifically, at each step, an input graph, along with a +prompt, is first fed into a pre-trained graph encoder for prompt-based +inference. We then aggregate the hidden layers of the encoder to construct a +``thought'', which captures the working state of each node in the current step. +Conditioned on this thought, we learn a prompt specific to each node based on +the current state. These prompts are fed into the next inference step, +repeating the cycle. To evaluate and analyze the effectiveness of GCoT, we +conduct comprehensive experiments on eight public datasets, which demonstrate +the advantage of our approach. + +摘要:鏈式思考 (CoT) 提示在自然語言處理 (NLP) 中取得了顯著的成功。然而,其龐大的潛力在圖形方面仍未得到充分探索。這提出了一個有趣的問題:我們如何設計圖形的 CoT 提示來指導圖形模型逐步學習?一方面,與自然語言不同,圖形是非線性的,並且具有複雜的拓撲結構。另一方面,許多圖形缺乏文本數據,這使得難以制定基於語言的 CoT 提示。在這項工作中,我們提出了第一個適用於無文本圖形的 CoT 提示學習框架 GCoT。具體來說,我們將每個下游任務的適應過程分解為一系列推理步驟,每個步驟都包含基於提示的推理、「思想」生成以及基於思想的提示學習。雖然這些步驟模擬了 NLP 中的 CoT 提示,但具體機制卻有很大不同。具體來說,在每一步中,一個輸入圖形連同一個提示首先被輸入到一個預訓練的圖形編碼器中進行基於提示的推理。然後,我們聚合編碼器的隱藏層以構建一個「思想」,它捕獲了當前步驟中每個節點的工作狀態。基於這個思想,我們根據當前狀態學習一個特定於每個節點的提示。這些提示被輸入到下一個推理步驟中,重複這個循環。為了評估和分析 GCoT 的有效性,我們對八個公共數據集進行了全面的實驗,這證明了我們方法的優勢。 + +##### **Linking Cryptoasset Attribution Tags to Knowledge Graph Entities: An LLM-based Approach** +2502.10453v1 by Régnier Avice, Bernhard Haslhofer, Zhidong Li, Jianlong Zhou + +Attribution tags form the foundation of modern cryptoasset forensics. +However, inconsistent or incorrect tags can mislead investigations and even +result in false accusations. To address this issue, we propose a novel +computational method based on Large Language Models (LLMs) to link attribution +tags with well-defined knowledge graph concepts. We implemented this method in +an end-to-end pipeline and conducted experiments showing that our approach +outperforms baseline methods by up to 37.4% in F1-score across three publicly +available attribution tag datasets. 
By integrating concept filtering and +blocking procedures, we generate candidate sets containing five knowledge graph +entities, achieving a recall of 93% without the need for labeled data. +Additionally, we demonstrate that local LLM models can achieve F1-scores of +90%, comparable to remote models which achieve 94%. We also analyze the +cost-performance trade-offs of various LLMs and prompt templates, showing that +selecting the most cost-effective configuration can reduce costs by 90%, with +only a 1% decrease in performance. Our method not only enhances attribution tag +quality but also serves as a blueprint for fostering more reliable forensic +evidence. + +摘要:歸因標籤構成現代加密資產鑑識的基礎。 +然而,不一致或不正確的標籤會誤導調查,甚至導致錯誤的指控。為了解決這個問題,我們提出了一種基於大型語言模型 (LLM) 的新型計算方法,將歸因標籤與定義明確的知識圖譜概念連結起來。我們在端到端管道中實施了這種方法,並進行了實驗,結果顯示我們的做法在三個公開可用的歸因標籤資料集中,F1 分數比基線方法高出 37.4%。透過整合概念過濾和封鎖程序,我們生成了包含五個知識圖譜實體的候選集,在不需要標籤資料的情況下,達到了 93% 的召回率。 +此外,我們證明了本機 LLM 模型可以達到 90% 的 F1 分數,與達到 94% 的遠端模型相當。我們也分析了各種 LLM 和提示範本的成本效益權衡,結果顯示選擇最具成本效益的設定可以將成本降低 90%,而效能只下降 1%。我們的做法不僅提升了歸因標籤的品質,也作為促進更可靠鑑識證據的藍圖。 + +##### **Deep Semantic Graph Learning via LLM based Node Enhancement** +2502.07982v1 by Chuanqi Shi, Yiyi Tao, Hang Zhang, Lun Wang, Shaoshuai Du, Yixian Shen, Yanxin Shen + +Graph learning has attracted significant attention due to its widespread +real-world applications. Current mainstream approaches rely on text node +features and obtain initial node embeddings through shallow embedding learning +using GNNs, which shows limitations in capturing deep textual semantics. Recent +advances in Large Language Models (LLMs) have demonstrated superior +capabilities in understanding text semantics, transforming traditional text +feature processing. This paper proposes a novel framework that combines Graph +Transformer architecture with LLM-enhanced node features. 
Specifically, we +leverage LLMs to generate rich semantic representations of text nodes, which +are then processed by a multi-head self-attention mechanism in the Graph +Transformer to capture both local and global graph structural information. Our +model utilizes the Transformer's attention mechanism to dynamically aggregate +neighborhood information while preserving the semantic richness provided by LLM +embeddings. Experimental results demonstrate that the LLM-enhanced node +features significantly improve the performance of graph learning models on node +classification tasks. This approach shows promising results across multiple +graph learning tasks, offering a practical direction for combining graph +networks with language models. + +摘要:圖形學習因其廣泛的現實世界應用而備受關注。目前的熱門方法依賴於文本節點特徵,並通過使用 GNN 的淺層嵌入學習來獲取初始節點嵌入,這在捕捉深度文本語義方面表現出局限性。大語言模型 (LLM) 的最新進展已證明在理解文本語義方面具有優越的能力,轉換了傳統的文本特徵處理。本文提出了一種新的框架,將圖形轉換器架構與 LLM 增強的節點特徵相結合。具體來說,我們利用 LLM 來生成文本節點的豐富語義表示,然後在圖形轉換器中由多頭自我注意機制處理,以捕捉局部和全局圖形結構信息。我們的模型利用 Transformer 的注意機制來動態聚合鄰域信息,同時保留 LLM 嵌入提供的語義豐富性。實驗結果表明,LLM 增強的節點特徵顯著提高了圖形學習模型在節點分類任務上的性能。這種方法在多個圖形學習任務中顯示出有希望的結果,為將圖形網絡與語言模型相結合提供了實用的方向。 + +##### **Cardiverse: Harnessing LLMs for Novel Card Game Prototyping** +2502.07128v1 by Danrui Li, Sen Zhang, Sam S. Sohn, Kaidong Hu, Muhammad Usman, Mubbasir Kapadia + +The prototyping of computer games, particularly card games, requires +extensive human effort in creative ideation and gameplay evaluation. Recent +advances in Large Language Models (LLMs) offer opportunities to automate and +streamline these processes. However, it remains challenging for LLMs to design +novel game mechanics beyond existing databases, generate consistent gameplay +environments, and develop scalable gameplay AI for large-scale evaluations. +This paper addresses these challenges by introducing a comprehensive automated +card game prototyping framework. 
The approach highlights a graph-based indexing +method for generating novel game designs, an LLM-driven system for consistent +game code generation validated by gameplay records, and a gameplay AI +constructing method that uses an ensemble of LLM-generated action-value +functions optimized through self-play. These contributions aim to accelerate +card game prototyping, reduce human labor, and lower barriers to entry for game +developers. + +摘要:電腦遊戲,尤其是卡牌遊戲的原型製作,需要大量的人力在創意構思和遊戲玩法評估上。大型語言模型 (LLM) 的最新進展提供了自動化和簡化這些流程的機會。然而,LLM 在設計超越現有資料庫的新穎遊戲機制、生成一致的遊戲環境,以及開發用於大規模評估的可擴充遊戲 AI 方面仍然面臨挑戰。本文通過引入一個全面的自動化卡牌遊戲原型製作框架來應對這些挑戰。該方法強調了一種基於圖表的索引方法,用於生成新穎的遊戲設計,一個由 LLM 驅動的系統,用於一致的遊戲程式碼生成,並由遊戲記錄驗證,以及一個遊戲 AI 構建方法,該方法使用由 LLM 生成的動作值函數的集合,通過自我對弈進行最佳化。這些貢獻旨在加速卡牌遊戲原型製作,減少人力,並降低遊戲開發人員的進入門檻。 + +##### **GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural Processing Units** +2502.06921v2 by Arghadip Das, Shamik Kundu, Arnab Raha, Soumendu Ghosh, Deepak Mathaikutty, Vijay Raghunathan + +Graph Neural Networks (GNNs) are vital for learning from graph-structured +data, enabling applications in network analysis, recommendation systems, and +speech analytics. Deploying them on edge devices like client PCs and laptops +enhances real-time processing, privacy, and cloud independence. GNNs aid +Retrieval-Augmented Generation (RAG) for Large Language Models (LLMs) and +enable event-based vision tasks. However, irregular memory access, sparsity, +and dynamic structures cause high latency and energy overhead on +resource-constrained devices. While modern edge processors integrate CPUs, +GPUs, and NPUs, NPUs designed for data-parallel tasks struggle with irregular +GNN computations. 
We introduce GraNNite, the first hardware-aware framework +optimizing GNN execution on commercial-off-the-shelf (COTS) SOTA DNN +accelerators via a structured three-step methodology: (1) enabling NPU +execution, (2) optimizing performance, and (3) trading accuracy for efficiency +gains. Step 1 employs GraphSplit for workload distribution and StaGr for static +aggregation, while GrAd and NodePad handle dynamic graphs. Step 2 boosts +performance using EffOp for control-heavy tasks and GraSp for sparsity +exploitation. Graph Convolution optimizations PreG, SymG, and CacheG reduce +redundancy and memory transfers. Step 3 balances quality versus efficiency, +where QuantGr applies INT8 quantization, and GrAx1, GrAx2, and GrAx3 accelerate +attention, broadcast-add, and SAGE-max aggregation. On Intel Core Ultra AI PCs, +GraNNite achieves 2.6X to 7.6X speedups over default NPU mappings and up to +8.6X energy gains over CPUs and GPUs, delivering 10.8X and 6.7X higher +performance than CPUs and GPUs, respectively, across GNN models. 
+ +摘要:圖形神經網路 (GNN) 對於從圖形結構資料中學習至關重要,能應用於網路分析、推薦系統和語音分析。將其部署在邊緣裝置(例如用戶端電腦和筆電)上可增強即時處理、隱私和雲端獨立性。GNN 協助大型語言模型 (LLM) 的檢索增強生成 (RAG),並支援基於事件的視覺任務。然而,不規則的記憶體存取、稀疏性和動態結構會導致資源受限裝置上的高延遲和能源負擔。儘管現代邊緣處理器整合了 CPU、GPU 和 NPU,但針對資料平行任務所設計的 NPU 難以處理不規則的 GNN 計算。我們引入了 GraNNite,這是第一個硬體感知框架,透過結構化的三步驟方法最佳化商用現成 (COTS) SOTA DNN 加速器上的 GNN 執行:(1) 啟用 NPU 執行,(2) 最佳化效能,以及 (3) 以準確度換取效率提升。步驟 1 使用 GraphSplit 進行工作負載分配,並使用 StaGr 進行靜態聚合,而 GrAd 和 NodePad 則處理動態圖形。步驟 2 使用 EffOp 提升控制密集型任務的效能,並使用 GraSp 進行稀疏性利用。圖形卷積最佳化 PreG、SymG 和 CacheG 減少了冗餘和記憶體傳輸。步驟 3 平衡品質與效率,其中 QuantGr 適用 INT8 量化,而 GrAx1、GrAx2 和 GrAx3 則加速注意力、廣播加法和 SAGE-max 聚合。在 Intel Core Ultra AI PC 上,GraNNite 在預設 NPU 映射上實現了 2.6X 到 7.6X 的加速,在 CPU 和 GPU 上實現了高達 8.6X 的能源增益,在 GNN 模型中分別提供了比 CPU 和 GPU 高出 10.8X 和 6.7X 的效能。 + +##### **Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language** +2502.06634v1 by Zhiqiang Zhong, Simon Sataa-Yu Larsen, Haoyu Guo, Tao Tang, Kuangyu Zhou, Davide Mottin + +Recent advancements in AI for biological research focus on integrating +molecular data with natural language to accelerate drug discovery. However, the +scarcity of high-quality annotations limits progress in this area. This paper +introduces LA$^3$, a Language-based Automatic Annotation Augmentation framework +that leverages large language models to augment existing datasets, thereby +improving AI training. We demonstrate the effectiveness of LA$^3$ by creating +an enhanced dataset, LaChEBI-20, where we systematically rewrite the +annotations of molecules from an established dataset. These rewritten +annotations preserve essential molecular information while providing more +varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5 +based on a benchmark architecture to learn the mapping between molecular +representations and augmented annotations. + Experimental results on text-based *de novo* molecule generation and molecule +captioning demonstrate that LaMolT5 outperforms state-of-the-art models. 
+Notably, incorporating LA$^3$ leads to improvements of up to 301% over the +benchmark architecture. Furthermore, we validate the effectiveness of LA$^3$ +in notable applications across *image*, *text* and *graph* tasks, affirming its +versatility and utility. + +摘要:人工智慧在生物研究上的最新進展,專注於將分子資料與自然語言整合,以加速藥物發現。然而,高品質註解的稀少限制了此領域的進展。這篇論文介紹了 LA$^3$,一個基於語言的自動註解擴充框架,它利用大型語言模型來擴充現有的資料集,進而改善人工智慧訓練。我們透過建立一個增強的資料集 LaChEBI-20 來展示 LA$^3$ 的有效性,我們系統性地改寫了一個既定資料集中分子的註解。這些改寫的註解保留了重要的分子資訊,同時提供了更多樣化的句子結構和詞彙。使用 LaChEBI-20,我們在基於基準架構上訓練 LaMolT5,以學習分子表示和擴充註解之間的對應。 +在基於文字的 *從頭開始* 分子生成和分子標題上的實驗結果表明,LaMolT5 優於最先進的模型。值得注意的是,納入 LA$^3$ 可讓基準架構的改進幅度高達 301%。此外,我們驗證了 LA$^3$ 在 *影像*、*文字* 和 *圖形* 任務中的有效性,肯定了它的多功能性和實用性。 + +##### **KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment** +2502.06472v1 by Yuxing Lu, Jinzhuo Wang + +Maintaining comprehensive and up-to-date knowledge graphs (KGs) is critical +for modern AI systems, but manual curation struggles to scale with the rapid +growth of scientific literature. This paper presents KARMA, a novel framework +employing multi-agent large language models (LLMs) to automate KG enrichment +through structured analysis of unstructured text. Our approach employs nine +collaborative agents, spanning entity discovery, relation extraction, schema +alignment, and conflict resolution that iteratively parse documents, verify +extracted knowledge, and integrate it into existing graph structures while +adhering to domain-specific schema. Experiments on 1,200 PubMed articles from +three different domains demonstrate the effectiveness of KARMA in knowledge +graph enrichment, with the identification of up to 38,230 new entities while +achieving 83.1\% LLM-verified correctness and reducing conflict edges by 18.6\% +through multi-layer assessments. 
+ +摘要:維護全面且最新的知識圖譜 (KG) 對現代 AI 系統至關重要,但手動策劃難以隨著科學文獻的快速增長而擴展。本文提出了 KARMA,一個採用多代理大型語言模型 (LLM) 的新框架,透過對非結構化文本的結構化分析來自動化 KG 豐富化。我們的做法採用九個協作代理,涵蓋實體發現、關係提取、架構比對和衝突解決,這些代理會反覆分析文件、驗證提取的知識,並將其整合到現有的圖結構中,同時遵守特定領域的架構。針對來自三個不同領域的 1,200 篇 PubMed 文章進行的實驗證明了 KARMA 在知識圖譜豐富化方面的有效性,識別出多達 38,230 個新實體,同時達到 83.1% 的 LLM 驗證正確性,並透過多層評估將衝突邊緣降低了 18.6%。 + +##### **RoToR: Towards More Reliable Responses for Order-Invariant Inputs** +2502.08662v1 by Soyoung Yoon, Dongha Ahn, Youngwon Lee, Minkyu Jung, HyungJoo Jang, Seung-won Hwang + +Mitigating positional bias of language models (LMs) for listwise inputs is a +well-known and important problem (e.g., lost-in-the-middle). While zero-shot +order-invariant LMs have been proposed to solve this issue, their success on +practical listwise problems has been limited. In this work, as a first +contribution, we identify and overcome two limitations to make zero-shot +invariant LMs more practical: (1) training and inference distribution mismatch +arising from modifying positional ID assignments to enforce invariance, and (2) +failure to adapt to a mixture of order-invariant and sensitive inputs in +practical listwise problems. To overcome, we propose (1) RoToR, a zero-shot +invariant LM for genuinely order-invariant inputs with minimal modifications of +positional IDs, and (2) Selective Routing, an adaptive framework that handles +both order-invariant and order-sensitive inputs in listwise tasks. On the Lost +in the middle (LitM), Knowledge Graph Question Answering (KGQA), and MMLU +benchmarks, we show that RoToR with Selective Routing can effectively handle +practical listwise input tasks in a zero-shot manner. 
+ +摘要:語言模型 (LM) 的位置偏差緩解對於列表輸入來說是一個廣為人知且重要的問題(例如,迷失在中間)。雖然已經提出零次學習順序不變的 LM 來解決這個問題,但它們在實際列表問題上的成功卻很有限。在這項工作中,作為第一個貢獻,我們找出並克服了兩個限制,讓零次學習不變的 LM 更有實用性:(1) 訓練和推論分布不匹配,這是由於修改位置 ID 分配以強制不變性所造成的,以及 (2) 無法適應實際列表問題中不變和敏感輸入的組合。為了克服這些問題,我們提出 (1) RoToR,一個零次學習不變的 LM,用於真正不變的輸入,並對位置 ID 進行最小的修改,以及 (2) 選擇性路由,一個自適應框架,用於處理列表任務中不變和敏感的輸入。在迷失在中間 (LitM)、知識圖譜問答 (KGQA) 和 MMLU 基準測試中,我們展示了 RoToR 與選擇性路由可以有效地以零次學習的方式處理實際的列表輸入任務。 + +##### **K-ON: Stacking Knowledge On the Head Layer of Large Language Model** +2502.06257v1 by Lingbing Guo, Yichi Zhang, Zhongpu Bo, Zhuo Chen, Mengshu Sun, Zhiqiang Zhang, Wen Zhang, Huajun Chen + +Recent advancements in large language models (LLMs) have significantly +improved various natural language processing (NLP) tasks. Typically, LLMs are +trained to predict the next token, aligning well with many NLP tasks. However, +in knowledge graph (KG) scenarios, entities are the fundamental units and +identifying an entity requires at least several tokens. This leads to a +granularity mismatch between KGs and natural languages. To address this issue, +we propose K-ON, which integrates KG knowledge into the LLM by employing +multiple head layers for next k-step prediction. K-ON can not only generate +entity-level results in one step, but also enables contrastive loss against +entities, which is the most powerful tool in KG representation learning. +Experimental results show that K-ON outperforms state-of-the-art methods that +incorporate text and even the other modalities. 
+ +摘要:大型語言模型 (LLM) 的最新進展顯著提升了各種自然語言處理 (NLP) 任務。通常,LLM 會接受訓練以預測下一個符號,這與許多 NLP 任務非常吻合。然而,在知識圖譜 (KG) 場景中,實體是基本單位,而識別實體至少需要幾個符號。這導致 KG 和自然語言之間的粒度不匹配。為了解決這個問題,我們提出了 K-ON,它透過採用多個頭部層進行下一個 k 步預測,將 KG 知識整合到 LLM 中。K-ON 不僅可以在一個步驟中產生實體層級的結果,還能針對實體啟用對比損失,這是 KG 表示學習中最有力的工具。實驗結果顯示,K-ON 優於將文字甚至其他方式納入考量的最新方法。 + +##### **LegalViz: Legal Text Visualization by Text To Diagram Generation** +2502.06147v2 by Eri Onami, Taiki Miyanishi, Koki Maeda, Shuhei Kurita + +Legal documents including judgments and court orders require highly +sophisticated legal knowledge for understanding. To disclose expert knowledge +for non-experts, we explore the problem of visualizing legal texts with +easy-to-understand diagrams and propose a novel dataset of LegalViz with 23 +languages and 7,010 cases of legal document and visualization pairs, using the +DOT graph description language of Graphviz. LegalViz provides a simple diagram +from a complicated legal corpus identifying legal entities, transactions, legal +sources, and statements at a glance, that are essential in each judgment. In +addition, we provide new evaluation metrics for the legal diagram visualization +by considering graph structures, textual similarities, and legal contents. We +conducted empirical studies on few-shot and finetuning large language models +for generating legal diagrams and evaluated them with these metrics, including +legal content-based evaluation within 23 languages. Models trained with +LegalViz outperform existing models including GPTs, confirming the +effectiveness of our dataset. 
+ +摘要:法律文件,包括判決和法院命令,需要高度專業的法律知識才能理解。為了向非專家揭露專家知識,我們探討了使用易於理解的圖表將法律文本視覺化的問題,並提出了一個新的 LegalViz 數據集,其中包含 23 種語言和 7,010 個法律文件和視覺化配對,使用 Graphviz 的 DOT 圖形描述語言。LegalViz 從複雜的法律語料庫中提供了一個簡單的圖表,可以一目了然地識別法律實體、交易、法律來源和陳述,這些在每項判決中都是必不可少的。此外,我們通過考慮圖形結構、文本相似性和法律內容,為法律圖表視覺化提供了新的評估指標。我們對少次學習和微調大型語言模型進行了實證研究,以生成法律圖表,並使用這些指標對它們進行了評估,包括在 23 種語言中基於法律內容的評估。使用 LegalViz 訓練的模型優於現有的模型,包括 GPT,證實了我們數據集的有效性。 + +##### **Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs** +2502.06075v1 by Han Meng, Renwen Zhang, Ganyi Wang, Yitian Yang, Peinuan Qin, Jungup Lee, Yi-Chieh Lee + +Mental-illness stigma is a persistent social problem, hampering both +treatment-seeking and recovery. Accordingly, there is a pressing need to +understand it more clearly, but analyzing the relevant data is highly +labor-intensive. Therefore, we designed a chatbot to engage participants in +conversations; coded those conversations qualitatively with AI assistance; and, +based on those coding results, built causal knowledge graphs to decode stigma. +The results we obtained from 1,002 participants demonstrate that conversation +with our chatbot can elicit rich information about people's attitudes toward +depression, while our AI-assisted coding was strongly consistent with +human-expert coding. Our novel approach combining large language models (LLMs) +and causal knowledge graphs uncovered patterns in individual responses and +illustrated the interrelationships of psychological constructs in the dataset +as a whole. The paper also discusses these findings' implications for HCI +researchers in developing digital interventions, decomposing human +psychological constructs, and fostering inclusive attitudes. 
+ +摘要:精神疾病的污名化是一個持續存在的社會問題,阻礙了尋求治療和康復。因此,迫切需要更清楚地了解它,但分析相關數據非常費力。因此,我們設計了一個聊天機器人,讓參與者參與對話;使用 AI 協助對這些對話進行定性編碼;並根據這些編碼結果,構建因果知識圖譜來破譯污名化。我們從 1,002 名參與者那裡獲得的結果表明,與我們的聊天機器人的對話可以引出人們對憂鬱症的豐富資訊,而我們 AI 輔助的編碼與人類專家編碼非常一致。我們將大型語言模型 (LLM) 和因果知識圖譜相結合的新方法揭示了個別反應中的模式,並說明了資料集中心理建構之間的相互關係。本文還討論了這些發現對 HCI 研究人員在開發數位介入措施、分解人類心理建構和培養包容態度方面的影響。 + +##### **LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification** +2502.05836v1 by Shubham Kumar Nigam, Tanmay Dubey, Govind Sharma, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya + +In this paper, we address the task of semantic segmentation of legal +documents through rhetorical role classification, with a focus on Indian legal +judgments. We introduce LegalSeg, the largest annotated dataset for this task, +comprising over 7,000 documents and 1.4 million sentences, labeled with 7 +rhetorical roles. To benchmark performance, we evaluate multiple +state-of-the-art models, including Hierarchical BiLSTM-CRF, +TransformerOverInLegalBERT (ToInLegalBERT), Graph Neural Networks (GNNs), and +Role-Aware Transformers, alongside an exploratory RhetoricLLaMA, an +instruction-tuned large language model. Our results demonstrate that models +incorporating broader context, structural relationships, and sequential +sentence information outperform those relying solely on sentence-level +features. Additionally, we conducted experiments using surrounding context and +predicted or actual labels of neighboring sentences to assess their impact on +classification accuracy. Despite these advancements, challenges persist in +distinguishing between closely related roles and addressing class imbalance. +Our work underscores the potential of advanced techniques for improving legal +document understanding and sets a strong foundation for future research in +legal NLP. 
+ +摘要:在本文中,我們通過修辭角色分類來探討法律文件的語義分段任務,重點關注印度法律判決。我們引入了 LegalSeg,這是此任務中最大的註釋資料集,包含超過 7,000 份文件和 140 萬個句子,並標記了 7 個修辭角色。為了評量效能,我們評估了多個最先進的模型,包括分層 BiLSTM-CRF、TransformerOverInLegalBERT (ToInLegalBERT)、圖神經網路 (GNN) 和角色感知Transformer,以及探索性的 RhetoricLLaMA,一種經過指令調整的大型語言模型。我們的結果表明,結合廣泛背景、結構關係和順序句子資訊的模型,表現優於僅依賴句子層級特徵的模型。此外,我們使用周圍的背景和鄰近句子的預測或實際標籤進行實驗,以評估它們對分類精度的影響。儘管有這些進展,但在區分密切相關的角色和解決類別不平衡方面仍存在挑戰。我們的研究強調了先進技術在改善法律文件理解方面的潛力,並為法律自然語言處理的未來研究奠定了堅實的基礎。 + +##### **LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning** +2502.05453v1 by Hanqing Yang, Jingdi Chen, Marie Siew, Tania Lorido-Botran, Carlee Joe-Wong + +Developing intelligent agents for long-term cooperation in dynamic open-world +scenarios is a major challenge in multi-agent systems. Traditional Multi-agent +Reinforcement Learning (MARL) frameworks like centralized training +decentralized execution (CTDE) struggle with scalability and flexibility. They +require centralized long-term planning, which is difficult without custom +reward functions, and face challenges in processing multi-modal data. CTDE +approaches also assume fixed cooperation strategies, making them impractical in +dynamic environments where agents need to adapt and plan independently. To +address decentralized multi-agent cooperation, we propose Decentralized +Adaptive Knowledge Graph Memory and Structured Communication System (DAMCS) in +a novel Multi-agent Crafter environment. Our generative agents, powered by +Large Language Models (LLMs), are more scalable than traditional MARL agents by +leveraging external knowledge and language for long-term planning and +reasoning. Instead of fully sharing information from all past experiences, +DAMCS introduces a multi-modal memory system organized as a hierarchical +knowledge graph and a structured communication protocol to optimize agent +cooperation. This allows agents to reason from past interactions and share +relevant information efficiently. 
Experiments on novel multi-agent open-world +tasks show that DAMCS outperforms both MARL and LLM baselines in task +efficiency and collaboration. Compared to single-agent scenarios, the two-agent +scenario achieves the same goal with 63% fewer steps, and the six-agent +scenario with 74% fewer steps, highlighting the importance of adaptive memory +and structured communication in achieving long-term goals. We publicly release +our project at: https://happyeureka.github.io/damcs. + +摘要:在動態開放世界情境中開發用於長期合作的智慧代理是多重代理系統中的一項重大挑戰。傳統的多重代理強化學習 (MARL) 框架,例如集中式訓練去中心化執行 (CTDE),在可擴充性和靈活性方面面臨困難。它們需要集中式長期規劃,這在沒有自訂獎勵函數的情況下很難執行,並且在處理多模式數據時會面臨挑戰。CTDE 方法還假設固定的合作策略,這使得它們在代理需要獨立適應和規劃的動態環境中不切實際。為了解決分散式多重代理合作問題,我們在一個新穎的多重代理工匠環境中提出了分散式自適應知識圖譜記憶體和結構化通訊系統 (DAMCS)。我們的生成代理由大型語言模型 (LLM) 提供支援,透過利用外部知識和語言進行長期規劃和推理,比傳統的 MARL 代理更具可擴充性。DAMCS 沒有完全分享來自所有過去經驗的資訊,而是引入了多模式記憶體系統,該系統組織成階層式知識圖譜和結構化通訊協定,以最佳化代理合作。這允許代理根據過去的互動進行推理並有效地分享相關資訊。在新的多重代理開放世界任務上的實驗表明,DAMCS 在任務效率和協作方面優於 MARL 和 LLM 基準。與單一代理情境相比,雙重代理情境以少 63% 的步驟達成相同的目標,而六重代理情境則以少 74% 的步驟達成目標,突顯了自適應記憶體和結構化通訊在達成長期目標中的重要性。我們公開發布我們的專案於:https://happyeureka.github.io/damcs。 + +##### **SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation** +2502.05424v1 by Xingtong Yu, Zechuan Gong, Chang Zhou, Yuan Fang, Hui Zhang + +Graphs are able to model interconnected entities in many online services, +supporting a wide range of applications on the Web. This raises an important +question: How can we train a graph foundational model on multiple source +domains and adapt to an unseen target domain? A major obstacle is that graphs +from different domains often exhibit divergent characteristics. Some studies +leverage large language models to align multiple domains based on textual +descriptions associated with the graphs, limiting their applicability to +text-attributed graphs. 
For text-free graphs, a few recent works attempt to +align different feature distributions across domains, while generally +neglecting structural differences. In this work, we propose a novel Structure +Alignment framework for text-free Multi-domain Graph Pre-Training and +cross-domain adaptation (SAMGPT). It is designed to learn multi-domain +knowledge from graphs originating in multiple source domains, which can then be +adapted to address applications in an unseen target domain. Specifically, we +introduce a set of structure tokens to harmonize structure-based aggregation +across source domains during the pre-training phase. Next, for cross-domain +adaptation, we design dual prompts, namely, holistic prompts and specific +prompts, which adapt unified multi-domain structural knowledge and +fine-grained, domain-specific information, respectively, to a target domain. +Finally, we conduct comprehensive experiments on seven public datasets to +evaluate and analyze the effectiveness of SAMGPT. + +摘要:圖表能夠在許多線上服務中對相互關聯的實體進行建模, +支援網路上廣泛的應用程式。這提出了重要的問題:我們如何針對多個來源網域訓練圖表基礎模型,並適應未見過的目標網域?一個主要的障礙是,來自不同網域的圖表通常表現出不同的特性。一些研究利用大型語言模型,根據與圖表相關的文字描述,對齊多個網域,限制其適用性於有文字屬性的圖表。對於沒有文字的圖表,最近的一些作品嘗試對齊跨網域的不同特徵分佈,同時通常忽略結構上的差異。在這項工作中,我們提出了一個新的結構對齊框架,用於無文字多網域圖表預訓練和跨網域適應 (SAMGPT)。它被設計為從起源於多個來源網域的圖表中學習多網域知識,然後可以適應於未見過的目標網域中的應用程式。具體來說,我們引入了一組結構化代碼,以在預訓練階段,調和跨來源網域的基於結構的聚合。接下來,對於跨網域適應,我們設計了雙重提示,即整體提示和具體提示,分別將統一的多網域結構知識和細緻的、特定於網域的資訊適應到目標網域。最後,我們在七個公共資料集上進行了全面的實驗,以評估和分析 SAMGPT 的有效性。 + +##### **Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints** +2502.05414v1 by Ali Al-Lawati, Jason Lucas, Zhiwei Zhang, Prasenjit Mitra, Suhang Wang + +In-context learning (ICL) effectively conditions large language models (LLMs) +for molecular tasks, such as property prediction and molecule captioning, by +embedding carefully selected demonstration examples into the input prompt. This +approach avoids the computational overhead of extensive pretraining and +fine-tuning. 
However, current prompt retrieval methods for molecular tasks have +relied on molecule feature similarity, such as Morgan fingerprints, which do +not adequately capture the global molecular and atom-binding relationships. As +a result, these methods fail to represent the full complexity of molecular +structures during inference. Moreover, small-to-medium-sized LLMs, which offer +simpler deployment requirements in specialized systems, have remained largely +unexplored in the molecular ICL literature. To address these gaps, we propose a +self-supervised learning technique, GAMIC (Graph-Aligned Molecular In-Context +learning), which aligns global molecular structures, represented by graph neural +networks (GNNs), with textual captions (descriptions) while leveraging local +feature similarity through Morgan fingerprints. In addition, we introduce a +Maximum Marginal Relevance (MMR) based diversity heuristic during retrieval to +optimize input prompt demonstration samples. Our experimental findings using +diverse benchmark datasets show GAMIC outperforms simple Morgan-based ICL +retrieval methods across all tasks by up to 45%. + +摘要:情境學習 (ICL) 有效地調整大型語言模型 (LLM),以執行分子任務,例如屬性預測和分子標題,方法是將仔細挑選的示範範例嵌入輸入提示中。這種方法避免了廣泛預訓練和微調的計算開銷。然而,目前針對分子任務的提示檢索方法依賴於分子特徵相似性,例如 Morgan 指紋,而無法充分捕捉全局分子和原子鍵結關係。因此,這些方法無法在推理過程中表示分子結構的完整複雜性。此外,在專業系統中提供更簡單部署需求的小到中型的 LLM,在分子 ICL 文獻中仍未得到充分探索。為了解決這些差距,我們提出了一種自我監督學習技術,GAMIC(圖形對齊分子情境學習),它將由圖形神經網路 (GNN) 表示的全局分子結構與文字標題(描述)對齊,同時透過 Morgan 指紋利用局部特徵相似性。此外,我們在檢索過程中引入了一個基於最大邊際相關性 (MMR) 的多樣性啟發法,以最佳化輸入提示示範樣本。我們使用不同的基準資料集進行的實驗結果顯示,GAMIC 在所有任務中都優於基於 Morgan 的簡單 ICL 檢索方法,最多可達 45%。 + +##### **Knowledge Graph-Guided Retrieval Augmented Generation** +2502.06864v1 by Xiangrong Zhu, Yuexiang Xie, Yi Liu, Yaliang Li, Wei Hu + +Retrieval-augmented generation (RAG) has emerged as a promising technology +for addressing hallucination issues in the responses generated by large +language models (LLMs). 
Existing studies on RAG primarily focus on applying +semantic-based approaches to retrieve isolated relevant chunks, which ignore +their intrinsic relationships. In this paper, we propose a novel Knowledge +Graph-Guided Retrieval Augmented Generation (KG$^2$RAG) framework that utilizes +knowledge graphs (KGs) to provide fact-level relationships between chunks, +improving the diversity and coherence of the retrieved results. Specifically, +after performing a semantic-based retrieval to provide seed chunks, KG$^2$RAG +employs a KG-guided chunk expansion process and a KG-based chunk organization +process to deliver relevant and important knowledge in well-organized +paragraphs. Extensive experiments conducted on the HotpotQA dataset and its +variants demonstrate the advantages of KG$^2$RAG compared to existing RAG-based +approaches, in terms of both response quality and retrieval quality. + +摘要:檢索增強生成 (RAG) 已成為一項有前途的技術,用於解決大型語言模型 (LLM) 所產生回應中的幻覺問題。現有關於 RAG 的研究主要專注於應用基於語義的方法來檢索孤立相關的區塊,而忽略它們的內在關係。在本文中,我們提出了一個新穎的知識圖表引導檢索增強生成 (KG$^2$RAG) 框架,它利用知識圖表 (KG) 來提供區塊之間的事實層級關係,從而提高檢索結果的多樣性和一致性。具體來說,在執行基於語義的檢索以提供種子區塊後,KG$^2$RAG 採用 KG 引導的區塊擴充程序和基於 KG 的區塊組織程序,以在組織良好的段落中傳達相關且重要的知識。在 HotpotQA 資料集及其變體上進行的大量實驗證明了 KG$^2$RAG 在回應品質和檢索品質方面優於現有的基於 RAG 的方法。 + +##### **Can Large Language Models Understand Intermediate Representations?** +2502.06854v1 by Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang, Hongyu Zhang, Ruoming Jin, Qiang Guan + +Intermediate Representations (IRs) are essential in compiler design and +program analysis, yet their comprehension by Large Language Models (LLMs) +remains underexplored. This paper presents a pioneering empirical study to +investigate the capabilities of LLMs, including GPT-4, GPT-3, Gemma 2, LLaMA +3.1, and Code Llama, in understanding IRs. We analyze their performance across +four tasks: Control Flow Graph (CFG) reconstruction, decompilation, code +summarization, and execution reasoning. 
Our results indicate that while LLMs +demonstrate competence in parsing IR syntax and recognizing high-level +structures, they struggle with control flow reasoning, execution semantics, and +loop handling. Specifically, they often misinterpret branching instructions, +omit critical IR operations, and rely on heuristic-based reasoning, leading to +errors in CFG reconstruction, IR decompilation, and execution reasoning. The +study underscores the necessity for IR-specific enhancements in LLMs, +recommending fine-tuning on structured IR datasets and integration of explicit +control flow models to augment their comprehension and handling of IR-related +tasks. + +摘要:中間表徵 (IR) 在編譯器設計和程式分析中至關重要,但大型語言模型 (LLM) 對其理解仍未得到充分探討。本文提出了一項開創性的實證研究,以探討 LLM(包括 GPT-4、GPT-3、Gemma 2、LLaMA 3.1 和 Code Llama)理解 IR 的能力。我們分析了它們在四項任務中的表現:控制流程圖 (CFG) 重建、反編譯、程式碼摘要和執行推理。我們的結果表明,儘管 LLM 在解析 IR 語法和識別高階結構方面表現出能力,但它們在控制流程推理、執行語義和迴圈處理方面存在困難。具體而言,它們經常誤解分支指令、省略關鍵 IR 操作,並依賴於基於啟發式的推理,導致 CFG 重建、IR 反編譯和執行推理出現錯誤。這項研究強調了 LLM 中對 IR 特定的增強的必要性,建議對結構化的 IR 資料集進行微調,並整合明確的控制流程模型,以增強其對 IR 相關任務的理解和處理。 + +##### **GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?** +2502.05252v1 by Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, Beidi Chen + +Long-context large language models (LLMs) have recently shown strong +performance in information retrieval and long-document QA. However, to tackle +the most challenging intellectual problems, LLMs must reason effectively in +long and complex contexts (e.g., frontier mathematical research). Studying how +LLMs handle increasing reasoning complexity and context length is essential, +yet existing benchmarks lack a solid basis for quantitative evaluation. 
+Inspired by the abstraction of GSM-8K problems as computational graphs, and the +ability to introduce noise by adding unnecessary nodes and edges, we develop a +grade school math problem generator capable of producing arithmetic problems +with infinite difficulty and context length under fine-grained control. Using +our newly synthesized GSM-Infinite benchmark, we comprehensively evaluate +existing LLMs. We find a consistent sigmoid decline in reasoning performance as +complexity increases, along with a systematic inference scaling trend: +exponentially increasing inference computation yields only linear performance +gains. These findings underscore the fundamental limitations of current +long-context LLMs and the key challenges in scaling reasoning capabilities. Our +GSM-Infinite benchmark provides a scalable and controllable testbed for +systematically studying and advancing LLM reasoning in long and complex +contexts. + +摘要:長文本大型語言模型 (LLM) 最近在資訊檢索和長文件問答中展示了強大的效能。然而,若要解決最具挑戰性的智力問題,LLM 必須在長且複雜的脈絡中有效推理(例如,前沿數學研究)。研究 LLM 如何處理增加的推理複雜性和脈絡長度至關重要,但現有的基準缺乏定量評估的穩固基礎。受到 GSM-8K 問題抽象化為計算圖形的啟發,以及透過加入不必要的節點和邊緣來引入雜訊的能力,我們開發了一個小學數學問題產生器,能夠在細緻的控制下產生具有無限難度和脈絡長度的算術問題。使用我們新合成的 GSM-Infinite 基準,我們全面評估現有的 LLM。我們發現推理效能會隨著複雜性的增加而持續呈 S 形下降,並伴隨著系統性的推論縮放趨勢:指數增加的推論計算僅產生線性的效能增益。這些發現強調了當前長脈絡 LLM 的基本限制,以及擴展推理能力的主要挑戰。我們的 GSM-Infinite 基準提供了一個可擴充且可控的測試平台,用於系統性地研究和提升 LLM 在長且複雜脈絡中的推理能力。 + +##### **Causality can systematically address the monsters under the bench(marks)** +2502.05085v1 by Felix Leeb, Zhijing Jin, Bernhard Schölkopf + +Effective and reliable evaluation is essential for advancing empirical +machine learning. However, the increasing accessibility of generalist models +and the progress towards ever more complex, high-level tasks make systematic +evaluation more challenging. Benchmarks are plagued by various biases, +artifacts, or leakage, while models may behave unreliably due to poorly +explored failure modes. 
Haphazard treatments and inconsistent formulations of +such "monsters" can contribute to a duplication of efforts, a lack of trust in +results, and unsupported inferences. In this position paper, we argue causality +offers an ideal framework to systematically address these challenges. By making +causal assumptions in an approach explicit, we can faithfully model phenomena, +formulate testable hypotheses with explanatory power, and leverage principled +tools for analysis. To make causal model design more accessible, we identify +several useful Common Abstract Topologies (CATs) in causal graphs which help +gain insight into the reasoning abilities in large language models. Through a +series of case studies, we demonstrate how the precise yet pragmatic language +of causality clarifies the strengths and limitations of a method and inspires +new approaches for systematic progress. + +摘要:有效的、可靠的評估對於推進經驗機器學習至關重要。然而,一般化模型的可及性日益提高,以及朝著更複雜、更高級別任務的進展,使得系統評估更具挑戰性。基準測試受到各種偏差、人工製品或洩漏的困擾,而模型由於探索不充分的故障模式而可能表現得不可靠。隨意處理和不一致的表述等「怪物」可能會導致重複工作、對結果缺乏信任以及不支援的推論。在本文中,我們論證因果關係提供了一個系統性解決這些挑戰的理想框架。通過在方法中明確因果假設,我們可以忠實地模擬現象,制定具有解釋力的可測試假設,並利用原則性的分析工具。為了使因果模型設計更易於使用,我們在因果圖中識別出幾個有用的通用抽象拓撲 (CAT),有助於深入了解大型語言模型中的推理能力。通過一系列案例研究,我們展示了因果關係的精確但務實的語言如何釐清方法的優缺點,並激發系統進展的新方法。 + +##### **Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures** +2502.05078v1 by Tushar Pandey, Ara Ghukasyan, Oktay Goktas, Santosh Kumar Radha + +Large Language Models (LLMs) have demonstrated impressive reasoning +capabilities, yet their performance is highly dependent on the prompting +strategy and model scale. While reinforcement learning and fine-tuning have +been deployed to boost reasoning, these approaches incur substantial +computational and data overhead. In this work, we introduce Adaptive Graph of +Thoughts (AGoT), a dynamic, graph-based inference framework that enhances LLM +reasoning solely at test time. 
Rather than relying on fixed-step methods like +Chain of Thought (CoT) or Tree of Thoughts (ToT), AGoT recursively decomposes +complex queries into structured subproblems, forming a dynamic directed +acyclic graph (DAG) of interdependent reasoning steps. By selectively expanding +only those subproblems that require further analysis, AGoT unifies the +strengths of chain, tree, and graph paradigms into a cohesive framework that +allocates computation where it is most needed. We validate our approach on +diverse benchmarks spanning multi-hop retrieval, scientific reasoning, and +mathematical problem-solving, achieving up to 46.2% improvement on scientific +reasoning tasks (GPQA) - comparable to gains achieved through computationally +intensive reinforcement learning approaches and outperforming state-of-the-art +iterative approaches. These results suggest that dynamic decomposition and +structured recursion offer a scalable, cost-effective alternative to +post-training modifications, paving the way for more robust, general-purpose +reasoning in LLMs. + +摘要:大型語言模型 (LLM) 已展現令人印象深刻的推理能力,但其效能高度依賴於提示策略和模型規模。雖然強化學習和微調已被用於提升推理,但這些方法會造成大量的運算和資料開銷。在這項工作中,我們引入了「適應性思考圖」(AGoT),一個動態的、基於圖形的推論架構,它僅在測試時就能增強 LLM 推理。AGoT 並非依賴於鏈式思考 (CoT) 或樹狀思考 (ToT) 等固定步驟方法,而是遞迴地將複雜的查詢分解成結構化的子問題,形成一個由相互依賴的推理步驟所組成的動態有向無環圖 (DAG)。透過選擇性地僅擴充那些需要進一步分析的子問題,AGoT 將鏈式、樹狀和圖形範例的優勢統一到一個緊密的架構中,將運算分配到最需要的地方。我們在跨越多重跳躍檢索、科學推理和數學問題解決等多樣基準上驗證了我們的做法,在科學推理任務 (GPQA) 上達到了高達 46.2% 的改進,這與透過運算密集的強化學習方法所獲得的增益相當,並且優於最先進的迭代方法。這些結果表明,動態分解和結構化遞迴提供了一個可擴充、具成本效益的替代方案,用於訓練後修改,為 LLM 中更強健、更通用的推理鋪平了道路。 + +##### **Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics** +2502.05239v1 by Hussam Ghanem, Christophe Cruz + +Recent advancements in large language models have demonstrated significant +potential in the automated construction of knowledge graphs from unstructured +text. 
This paper builds upon our previous work [16], which evaluated various +models using metrics like precision, recall, F1 score, triple matching, and +graph matching, and introduces a refined approach to address the critical +issues of hallucination and omission. We propose an enhanced evaluation +framework incorporating BERTScore for graph similarity, setting a practical +threshold of 95% for graph matching. Our experiments focus on the Mistral +model, comparing its original and fine-tuned versions in zero-shot and few-shot +settings. We further extend our experiments using examples from the KELM-sub +training dataset, illustrating that the fine-tuned model significantly improves +knowledge graph construction accuracy while reducing the exact hallucination +and omission. However, our findings also reveal that the fine-tuned models +perform worse in generalization tasks on the KELM-sub dataset. This study +underscores the importance of comprehensive evaluation metrics in advancing the +state-of-the-art in knowledge graph construction from textual data. + +摘要:大型語言模型的最新進展已證明在從非結構化文字自動建構知識圖譜方面具有顯著的潛力。本文建立在我們先前的研究 [16] 之上,該研究使用準確度、召回率、F1 分數、三元組匹配和圖形匹配等指標評估各種模型,並引入了一種改進的方法來解決幻覺和遺漏的關鍵問題。我們提出一個增強的評估框架,結合 BERTScore 來進行圖形相似性,並將圖形匹配的實際閾值設定為 95%。我們的實驗重點在 Mistral 模型上,比較其原始版本和微調版本在零次學習和少量學習的設定中。我們進一步使用 KELM-sub 訓練資料集中的範例來擴展我們的實驗,說明微調後的模型顯著提高了知識圖譜建構的準確度,同時減少了精確的幻覺和遺漏。然而,我們的研究結果也顯示,微調後的模型在 KELM-sub 資料集上的泛化任務表現較差。這項研究強調了全面評估指標在推進從文字資料建構知識圖譜的最新技術方面的重要性。 + +##### **Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research** +2502.04644v1 by Junde Wu, Jiayuan Zhu, Yuyuan Liu + +We introduce Agentic Reasoning, a framework that enhances large language +model (LLM) reasoning by integrating external tool-using agents. 
Unlike +conventional LLM-based reasoning approaches, which rely solely on internal +inference, Agentic Reasoning dynamically engages web search, code execution, +and structured reasoning-context memory to solve complex problems requiring +deep research and multi-step logical deduction. Our framework introduces the +Mind Map agent, which constructs a structured knowledge graph to track logical +relationships, improving deductive reasoning. Additionally, the integration of +web-search and coding agents enables real-time retrieval and computational +analysis, enhancing reasoning accuracy and decision-making. Evaluations on +PhD-level scientific reasoning (GPQA) and domain-specific deep research tasks +demonstrate that our approach significantly outperforms existing models, +including leading retrieval-augmented generation (RAG) systems and +closed-source LLMs. Moreover, our results indicate that agentic reasoning +improves expert-level knowledge synthesis, test-time scalability, and +structured problem-solving. The code is at: +https://github.com/theworldofagents/Agentic-Reasoning. + +摘要:我們引入了代理推理,一個透過整合外部工具使用代理來增強大型語言模型 (LLM) 推理的框架。與僅依賴於內部推論的傳統基於 LLM 的推理方法不同,代理推理動態地運用網路搜尋、程式碼執行和結構化推理情境記憶來解決需要深入研究和多步驟邏輯推論的複雜問題。我們的框架引入了心智圖代理,它建立一個結構化的知識圖譜來追蹤邏輯關係,改善演繹推理。此外,整合網路搜尋和編碼代理能進行即時擷取和運算分析,增強推理準確度和決策制定。在博士等級科學推理 (GPQA) 和特定領域的深入研究任務上的評估顯示,我們的做法明顯優於現有模型,包括領先的檢索增強生成 (RAG) 系統和封閉原始碼 LLM。此外,我們的結果顯示,代理推理改進了專家級知識綜合、測試時間可擴充性和結構化問題解決。程式碼在:https://github.com/theworldofagents/Agentic-Reasoning。 + +##### **Position-aware Automatic Circuit Discovery** +2502.04577v1 by Tal Haklay, Hadas Orgad, David Bau, Aaron Mueller, Yonatan Belinkov + +A widely used strategy to discover and understand language model mechanisms +is circuit analysis. A circuit is a minimal subgraph of a model's computation +graph that executes a specific task. 
We identify a gap in existing circuit +discovery methods: they assume circuits are position-invariant, treating model +components as equally relevant across input positions. This limits their +ability to capture cross-positional interactions or mechanisms that vary across +positions. To address this gap, we propose two improvements to incorporate +positionality into circuits, even on tasks containing variable-length examples. +First, we extend edge attribution patching, a gradient-based method for circuit +discovery, to differentiate between token positions. Second, we introduce the +concept of a dataset schema, which defines token spans with similar semantics +across examples, enabling position-aware circuit discovery in datasets with +variable length examples. We additionally develop an automated pipeline for +schema generation and application using large language models. Our approach +enables fully automated discovery of position-sensitive circuits, yielding +better trade-offs between circuit size and faithfulness compared to prior work. + +摘要:廣泛用於發現和了解語言模型機制的策略是電路分析。電路是模型計算圖的最小子圖,可執行特定任務。我們找出電路發現方法中的一個缺口:它們假設電路與位置無關,將模型組件視為在輸入位置中同樣相關。這限制了它們捕捉跨位置互動或在不同位置中變化的機制的能力。為了解決這個缺口,我們提出兩項改進,將位置性納入電路中,即使在包含變長範例的任務中也是如此。首先,我們擴充邊緣屬性修補,一種基於梯度的電路發現方法,以區分符號位置。其次,我們引入了資料集架構的概念,它定義了在範例中具有類似語義的符號跨距,使我們可以在具有變長範例的資料集中進行與位置相關的電路發現。此外,我們開發了一個自動化管線,用於使用大型語言模型進行架構生成和應用。我們的做法能讓位置敏感電路的發現完全自動化,與先前的研究相比,在電路大小和忠實度之間產生了更好的權衡。 + +##### **Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems** +2502.04510v1 by Shangbin Feng, Zifeng Wang, Palash Goyal, Yike Wang, Weijia Shi, Huang Xia, Hamid Palangi, Luke Zettlemoyer, Yulia Tsvetkov, Chen-Yu Lee, Tomas Pfister + +We propose Heterogeneous Swarms, an algorithm to design multi-LLM systems by +jointly optimizing model roles and weights. We represent multi-LLM systems as +directed acyclic graphs (DAGs) of LLMs with topological message passing for +collaborative generation. 
Given a pool of LLM experts and a utility function, +Heterogeneous Swarms employs two iterative steps: role-step and weight-step. +For role-step, we interpret model roles as learning a DAG that specifies the +flow of inputs and outputs between LLMs. Starting from a swarm of random +continuous adjacency matrices, we decode them into discrete DAGs, call the LLMs +in topological order, evaluate on the utility function (e.g. accuracy on a +task), and optimize the adjacency matrices with particle swarm optimization +based on the utility score. For weight-step, we assess the contribution of +individual LLMs in the multi-LLM systems and optimize model weights with swarm +intelligence. We propose JFK-score to quantify the individual contribution of +each LLM in the best-found DAG of the role-step, then optimize model weights +with particle swarm optimization based on the JFK-score. Experiments +demonstrate that Heterogeneous Swarms outperforms 15 role- and/or weight-based +baselines by 18.5% on average across 12 tasks. Further analysis reveals that +Heterogeneous Swarms discovers multi-LLM systems with heterogeneous model roles +and substantial collaborative gains, and benefits from the diversity of +language models. 
+ +摘要:我們提出異質群體,一種演算法,透過共同最佳化模型角色和權重來設計多 LLM 系統。我們將多 LLM 系統表示為 LLM 的有向非循環圖 (DAG),並透過拓撲訊息傳遞進行協作產生。給定一組 LLM 專家和一個效用函數,異質群體使用兩個反覆步驟:角色步驟和權重步驟。對於角色步驟,我們將模型角色解釋為學習一個 DAG,它指定 LLM 之間輸入和輸出的流動。從一組隨機連續鄰接矩陣開始,我們將它們解碼為離散 DAG,以拓撲順序呼叫 LLM,根據效用函數(例如任務的準確度)進行評估,並根據效用分數使用粒子群最佳化最佳化鄰接矩陣。對於權重步驟,我們評估個別 LLM 在多 LLM 系統中的貢獻,並使用群體智慧最佳化模型權重。我們提出 JFK 分數來量化每個 LLM 在角色步驟中找到的最佳 DAG 中的個別貢獻,然後根據 JFK 分數使用粒子群最佳化最佳化模型權重。實驗表明,異質群體在 12 項任務中平均比 15 個基於角色和/或權重的基線高出 18.5%。進一步的分析表明,異質群體發現具有異質模型角色和大量協作收益的多 LLM 系統,並受益於語言模型的多樣性。 + +##### **MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot** +2502.04413v1 by Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao + +Retrieval-augmented generation (RAG) is a well-suited technique for +retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a +key module of the healthcare copilot, helping reduce misdiagnosis for +healthcare practitioners and patients. However, the diagnostic accuracy and +specificity of existing heuristic-based RAG models used in the medical domain +are inadequate, particularly for diseases with similar manifestations. This +paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited +reasoning for the medical domain that retrieves diagnosis and treatment +recommendations based on manifestations. MedRAG systematically constructs a +comprehensive four-tier hierarchical diagnostic KG encompassing critical +diagnostic differences of various diseases. These differences are dynamically +integrated with similar EHRs retrieved from an EHR database, and reasoned +within a large language model. This process enables more accurate and specific +decision support, while also proactively providing follow-up questions to +enhance personalized medical decision-making. 
MedRAG is evaluated on both a +public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD) +collected from Tan Tock Seng Hospital, and its performance is compared against +various existing RAG methods. Experimental results show that, leveraging the +information integration and relational abilities of the KG, our MedRAG provides +more specific diagnostic insights and outperforms state-of-the-art models in +reducing misdiagnosis rates. Our code will be available at +https://github.com/SNOWTEAM2023/MedRAG + +摘要:檢索增強生成 (RAG) 是一種適用於檢索隱私敏感的電子健康記錄 (EHR) 的技術。它可以作為醫療保健副駕駛的一個關鍵模組,協助減少醫療保健從業人員和患者的誤診。然而,在醫療領域中使用的現有基於啟發法的 RAG 模型的診斷準確性和特異性不足,特別是對於具有類似表現的疾病。本文提出 MedRAG,一種由知識圖譜 (KG) 引發的推理增強的 RAG 模型,用於醫療領域,它根據表現檢索診斷和治療建議。MedRAG 系統性地構建了一個全面的四層階層式診斷 KG,涵蓋各種疾病的關鍵診斷差異。這些差異與從 EHR 資料庫中檢索到的類似 EHR 動態整合,並在大型語言模型中進行推理。這個過程可以實現更準確和具體的決策支援,同時主動提供後續問題,以增強個人化醫療決策制定。MedRAG 在公共資料集 DDXPlus 和從陳篤生醫院收集的私人慢性疼痛診斷資料集 (CPDD) 上進行評估,並將其效能與各種現有 RAG 方法進行比較。實驗結果顯示,利用 KG 的資訊整合和關係能力,我們的 MedRAG 提供了更具體的診斷見解,並在降低誤診率方面優於最先進的模型。我們的程式碼將在 https://github.com/SNOWTEAM2023/MedRAG 提供 + +##### **Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering** +2502.03992v1 by Longquan Jiang, Junbo Huang, Cedric Möller, Ricardo Usbeck + +Most existing Knowledge Graph Question Answering (KGQA) approaches are +designed for a specific KG, such as Wikidata, DBpedia or Freebase. Due to the +heterogeneity of the underlying graph schema, topology and assertions, most +KGQA systems cannot be transferred to unseen Knowledge Graphs (KGs) without +resource-intensive training data. We present OntoSCPrompt, a novel Large +Language Model (LLM)-based KGQA approach with a two-stage architecture that +separates semantic parsing from KG-dependent interactions. OntoSCPrompt first +generates a SPARQL query structure (including SPARQL keywords such as SELECT, +ASK, WHERE and placeholders for missing tokens) and then fills them with +KG-specific information. 
To enhance the understanding of the underlying KG, we +present an ontology-guided, hybrid prompt learning strategy that integrates KG +ontology into the learning process of hybrid prompts (e.g., discrete and +continuous vectors). We also present several task-specific decoding strategies +to ensure the correctness and executability of generated SPARQL queries in both +stages. Experimental results demonstrate that OntoSCPrompt performs as well as +SOTA approaches without retraining on a number of KGQA datasets such as CWQ, +WebQSP and LC-QuAD 1.0 in a resource-efficient manner and can generalize well +to unseen domain-specific KGs like DBLP-QuAD and CoyPu KG. Code: +https://github.com/LongquanJiang/OntoSCPrompt + +摘要:現有的知識圖譜問答(KGQA)方法大多是為特定 KG 而設計的,例如 Wikidata、DBpedia 或 Freebase。由於底層圖形模式、拓撲和斷言的異質性,大多數 KGQA 系統無法在沒有資源密集型訓練資料的情況下轉移到未見過的知識圖譜(KG)。我們提出 OntoSCPrompt,這是一種基於大型語言模型(LLM)的新型 KGQA 方法,採用兩階段架構,將語義解析與依賴 KG 的互動分開。OntoSCPrompt 首先生成 SPARQL 查詢結構(包括 SPARQL 關鍵字,例如 SELECT、ASK、WHERE 和缺失令牌的佔位符),然後用 KG 特定的資訊填寫它們。為了增強對底層 KG 的理解,我們提出了一種由本體指導的混合提示學習策略,將 KG 本體整合到混合提示(例如,離散和連續向量)的學習過程中。我們還提出了多種特定任務的解碼策略,以確保在兩個階段中生成的 SPARQL 查詢的正確性和可執行性。實驗結果表明,OntoSCPrompt 在 CWQ、WebQSP 和 LC-QuAD 1.0 等多個 KGQA 資料集上執行時,效能與 SOTA 方法一樣好,且資源使用效率高,並且可以很好地概括到未見過的特定領域 KG,例如 DBLP-QuAD 和 CoyPu KG。程式碼: +https://github.com/LongquanJiang/OntoSCPrompt + +##### **Multimodal Medical Code Tokenizer** +2502.04397v2 by Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik + +Foundation models trained on patient electronic health records (EHRs) require +tokenizing medical data into sequences of discrete vocabulary items. Existing +tokenizers treat medical codes from EHRs as isolated textual tokens. 
However, +each medical code is defined by its textual description, its position in +ontological hierarchies, and its relationships to other codes, such as disease +co-occurrences and drug-treatment associations. Medical vocabularies contain +more than 600,000 codes with critical information for clinical reasoning. We +introduce MedTok, a multimodal medical code tokenizer that uses the text +descriptions and relational context of codes. MedTok processes text using a +language model encoder and encodes the relational structure with a graph +encoder. It then quantizes both modalities into a unified token space, +preserving modality-specific and cross-modality information. We integrate +MedTok into five EHR models and evaluate it on operational and clinical tasks +across in-patient and out-patient datasets, including outcome prediction, +diagnosis classification, drug recommendation, and risk stratification. +Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR +models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with +the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate +using MedTok tokenizer with medical QA systems. Our results demonstrate the +potential of MedTok as a unified tokenizer for medical codes, improving +tokenization for medical foundation models. 
+ +摘要:在患者电子健康记录 (EHR) 上训练的基础模型需要将医学数据标记为离散词汇项序列。现有的标记器将 EHR 中的医学代码视为孤立的文本标记。然而,每个医学代码都由其文本描述、在本体层次结构中的位置以及与其他代码的关系(例如疾病共现和药物治疗关联)来定义。医学词汇表包含超过 600,000 个代码,这些代码包含临床推理的关键信息。我们引入了 MedTok,这是一种多模态医学代码标记器,它使用文本描述和代码的关系上下文。MedTok 使用语言模型编码器处理文本,并使用图编码器对关系结构进行编码。然后,它将这两种模态量化为一个统一的标记空间,保留特定于模态和跨模态的信息。我们将 MedTok 集成到五个 EHR 模型中,并在住院和门诊数据集(包括结果预测、诊断分类、药物推荐和风险分层)上对其实施操作和临床任务进行评估。用 MedTok 替换标准 EHR 标记器可提高所有 EHR 模型的 AUPRC,在 MIMIC-III 上提高 4.10%,在 MIMIC-IV 上提高 4.78%,在 EHRShot 上提高 11.30%,其中药物推荐的增益最大。除了 EHR 建模之外,我们还演示了将 MedTok 标记器与医学问答系统结合使用。我们的结果证明了 MedTok 作为医学代码的统一标记器的潜力,改进了医学基础模型的标记化。 + +##### **Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents** +2502.04392v1 by Chenyang Shao, Xinyuan Hu, Yutang Lin, Fengli Xu + +The rapid expansion of web content has made on-device AI assistants +indispensable for helping users manage the increasing complexity of online +tasks. The emergent reasoning ability in large language models offers a +promising path for next-generation on-device AI agents. However, deploying +full-scale Large Language Models (LLMs) on resource-limited local devices is +challenging. In this paper, we propose Division-of-Thoughts (DoT), a +collaborative reasoning framework leveraging the synergy between locally +deployed Smaller-scale Language Models (SLMs) and cloud-based LLMs. DoT +leverages a Task Decomposer to elicit the inherent planning abilities in +language models to decompose user queries into smaller sub-tasks, which allows +hybrid language models to fully exploit their respective strengths. In addition, +DoT employs a Task Scheduler to analyze the pair-wise dependency of sub-tasks +and create a dependency graph, facilitating parallel reasoning of sub-tasks and +the identification of key steps. To allocate the appropriate model based on the +difficulty of sub-tasks, DoT leverages a Plug-and-Play Adapter, which is an +additional task head attached to the SLM that does not alter the SLM's +parameters. 
To boost the adapter's task allocation capability, we propose a +self-reinforced training method that relies solely on task execution feedback. +Extensive experiments on various benchmarks demonstrate that our DoT +significantly reduces LLM costs while maintaining competitive reasoning +accuracy. Specifically, DoT reduces the average reasoning time and API costs by +66.12% and 83.57%, respectively, while achieving comparable reasoning accuracy with the best +baseline methods. + +摘要:網頁內容快速擴充,使得行動裝置上的 AI 助理在協助使用者管理日益複雜的線上工作上變得不可或缺。大型語言模型中浮現的推理能力為新一代行動裝置上的 AI 代理提供了一條有希望的途徑。然而,在資源有限的本機裝置上部署全規模的大型語言模型 (LLM) 是一項挑戰。在本文中,我們提出了思想分工 (DoT),一個協作推理框架,利用了本地部署的小型語言模型 (SLM) 與雲端 LLM 之間的協同效應。DoT 利用任務分解器引出語言模型中固有的規劃能力,將使用者查詢分解成較小的子任務,這允許混合語言模型充分發揮其各自的優勢。此外,DoT 雇用了一個任務排程器來分析子任務的成對依賴性並建立一個依賴性圖,促進子任務的並行推理和關鍵步驟的識別。為了根據子任務的難度分配適當的模型,DoT 利用了即插即用適配器,這是一個附加在 SLM 上的任務頭,不會改變 SLM 的參數。為了提升適配器的任務分配能力,我們提出了一種自我強化訓練方法,它僅依賴於任務執行回饋。在各種基準上的廣泛實驗表明,我們的 DoT 大幅降低了 LLM 成本,同時維持了有競爭力的推理準確度。具體來說,DoT 將平均推理時間和 API 成本分別降低了 66.12% 和 83.57%,同時達到了與最佳基準方法相當的推理準確度。 + +##### **Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models** +2502.03715v1 by Rui Cai, Chao Wang, Qianyi Cai, Dazhong Shen, Hui Xiong + +Knowledge Graph-based recommendations have gained significant attention due +to their ability to leverage rich semantic relationships. However, constructing +and maintaining Knowledge Graphs (KGs) is resource-intensive, and the accuracy +of KGs can suffer from noisy, outdated, or irrelevant triplets. Recent +advancements in Large Language Models (LLMs) offer a promising way to improve +the quality and relevance of KGs for recommendation tasks. Despite this, +integrating LLMs into KG-based systems presents challenges, such as efficiently +augmenting KGs, addressing hallucinations, and developing effective joint +learning methods. 
In this paper, we propose the Confidence-aware KG-based +Recommendation Framework with LLM Augmentation (CKG-LLMA), a novel framework +that combines KGs and LLMs for recommendation task. The framework includes: (1) +an LLM-based subgraph augmenter for enriching KGs with high-quality +information, (2) a confidence-aware message propagation mechanism to filter +noisy triplets, and (3) a dual-view contrastive learning method to integrate +user-item interactions and KG data. Additionally, we employ a confidence-aware +explanation generation process to guide LLMs in producing realistic +explanations for recommendations. Finally, extensive experiments demonstrate +the effectiveness of CKG-LLMA across multiple public datasets. + +摘要:基於知識圖譜的推薦因其利用豐富語義關係的能力而備受關注。然而,構建和維護知識圖譜 (KG) 是一項資源密集型任務,而 KG 的準確性可能會受到雜訊、過時或無關的三元組的影響。大型語言模型 (LLM) 的最新進展為提高 KG 在推薦任務中的品質和相關性提供了一種有前途的方法。儘管如此,將 LLM 整合到基於 KG 的系統中會帶來挑戰,例如有效擴充 KG、處理幻覺,以及開發有效的聯合學習方法。在本文中,我們提出具有 LLM 擴充的信心感知型基於 KG 的推薦框架 (CKG-LLMA),這是一個結合 KG 和 LLM 進行推薦任務的新穎框架。該框架包括:(1) 一個基於 LLM 的子圖擴充器,用於使用高品質資訊豐富 KG,(2) 一個信心感知型訊息傳播機制,用於過濾雜訊三元組,以及 (3) 一個雙視圖對比學習方法,用於整合使用者-項目互動和 KG 資料。此外,我們採用一個信心感知型解釋產生程序,以引導 LLM 為推薦產生逼真的解釋。最後,大量的實驗證明了 CKG-LLMA 在多個公開資料集中的有效性。 + +##### **A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)** +2502.03450v1 by Yiye Chen, Harpreet Sawhney, Nicholas Gydé, Yanan Jian, Jack Saunders, Patricio Vela, Ben Lundell + +Scene graphs have emerged as a structured and serializable environment +representation for grounded spatial reasoning with Large Language Models +(LLMs). In this work, we propose SG-RwR, a Schema-Guided Retrieve-while-Reason +framework for reasoning and planning with scene graphs. Our approach employs +two cooperative, code-writing LLM agents: a (1) Reasoner for task planning and +information queries generation, and a (2) Retriever for extracting +corresponding graph information following the queries. 
The two agents collaborate +iteratively, enabling sequential reasoning and adaptive attention to graph +information. Unlike prior works, both agents are prompted only with the scene +graph schema rather than the full graph data, which reduces hallucination +by limiting input tokens and drives the Reasoner to generate an abstract +reasoning trace. Following the trace, the Retriever programmatically queries the scene +graph data based on the schema understanding, allowing dynamic and global +attention on the graph that enhances alignment between reasoning and retrieval. +Through experiments in multiple simulation environments, we show that our +framework surpasses existing LLM-based approaches in numerical Q&A and +planning tasks, and can benefit from task-level few-shot examples, even in the +absence of agent-level demonstrations. Project code will be released. + +摘要:場景圖表已成為大型語言模型 (LLM) 以基礎空間推理為基礎的結構化且可序列化的環境表徵。在這項工作中,我們提出 SG-RwR,一個以綱要為導向的檢索與推理框架,用於場景圖表的推理和規劃。我們的做法採用了兩個協作的、編寫程式碼的 LLM 代理:一個 (1) 推論器,用於任務規劃和資訊查詢產生,以及一個 (2) 檢索器,用於根據查詢提取對應的圖形資訊。兩個代理反覆合作,實現對圖形資訊的順序推理和適應性關注。與先前的作品不同,兩個代理僅提示場景圖表綱要,而不是完整的圖形資料,這透過限制輸入代碼減少了幻覺,並驅使推論器抽象地產生推理軌跡。根據軌跡,檢索器根據綱要理解以程式化方式查詢場景圖形資料,允許對圖形進行動態和整體關注,增強推理和檢索之間的一致性。透過在多個模擬環境中的實驗,我們表明我們的框架在數值問答和規劃任務中超越了現有的基於 LLM 的方法,並且可以受益於任務級別的少次範例,即使在沒有代理級別示範的情況下也是如此。專案程式碼將會釋出。 + +##### **SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs** +2502.03283v2 by Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin + +Recent advancements have highlighted that Large Language Models (LLMs) are +prone to hallucinations when solving complex reasoning problems, leading to +erroneous results. To tackle this issue, researchers incorporate Knowledge +Graphs (KGs) to improve the reasoning ability of LLMs. 
However, existing +methods face two limitations: 1) they typically assume that all answers to the +questions are contained in KGs, neglecting the incompleteness issue of KGs, and +2) they treat the KG as a static repository and overlook the implicit logical +reasoning structures inherent in KGs. In this paper, we introduce SymAgent, an +innovative neural-symbolic agent framework that achieves collaborative +augmentation between KGs and LLMs. We conceptualize KGs as dynamic environments +and transform complex reasoning tasks into a multi-step interactive process, +enabling KGs to participate deeply in the reasoning process. SymAgent consists +of two modules: Agent-Planner and Agent-Executor. The Agent-Planner leverages +LLM's inductive reasoning capability to extract symbolic rules from KGs, +guiding efficient question decomposition. The Agent-Executor autonomously +invokes predefined action tools to integrate information from KGs and external +documents, addressing the issues of KG incompleteness. Furthermore, we design a +self-learning framework comprising online exploration and offline iterative +policy updating phases, enabling the agent to automatically synthesize +reasoning trajectories and improve performance. Experimental results +demonstrate that SymAgent with weak LLM backbones (i.e., 7B series) yields +better or comparable performance compared to various strong baselines. Further +analysis reveals that our agent can identify missing triples, facilitating +automatic KG updates. 
+ +摘要:最近的進展強調出,大型語言模型 (LLM) 在解決複雜推理問題時容易出現幻覺,導致錯誤的結果。為了解決這個問題,研究人員結合知識圖譜 (KG) 來改善 LLM 的推理能力。然而,現有方法面臨兩個限制:1) 它們通常假設問題的所有答案都包含在 KG 中,忽略了 KG 的不完整性問題,以及 2) 它們將 KG 視為一個靜態儲存庫,而忽略了 KG 中固有的隱式邏輯推理結構。在本文中,我們介紹了 SymAgent,一個創新的神經符號代理架構,它在 KG 和 LLM 之間實現了協作擴充。我們將 KG 概念化為動態環境,並將複雜的推理任務轉化為一個多步驟的互動過程,使 KG 能夠深入參與推理過程。SymAgent 包含兩個模組:代理規劃器和代理執行器。代理規劃器利用 LLM 的歸納推理能力從 KG 中提取符號規則,指導有效的問題分解。代理執行器自主地調用預定義的動作工具來整合來自 KG 和外部文件的資訊,解決 KG 不完整性的問題。此外,我們設計了一個自學習框架,包括線上探索和離線反覆的政策更新階段,使代理能夠自動合成推理軌跡並改善效能。實驗結果表明,具有弱 LLM 主幹的 SymAgent(例如,7B 系列)與各種強大的基線相比,產生了更好或相當的效能。進一步的分析表明,我們的代理可以識別遺失的三元組,促進自動 KG 更新。 + +##### **Analyze Feature Flow to Enhance Interpretation and Steering in Language Models** +2502.03032v2 by Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov + +We introduce a new approach to systematically map features discovered by +sparse autoencoder across consecutive layers of large language models, +extending earlier work that examined inter-layer feature links. By using a +data-free cosine similarity technique, we trace how specific features persist, +transform, or first appear at each stage. This method yields granular flow +graphs of feature evolution, enabling fine-grained interpretability and +mechanistic insights into model computations. Crucially, we demonstrate how +these cross-layer feature maps facilitate direct steering of model behavior by +amplifying or suppressing chosen features, achieving targeted thematic control +in text generation. Together, our findings highlight the utility of a causal, +cross-layer interpretability framework that not only clarifies how features +develop through forward passes but also provides new means for transparent +manipulation of large language models. 
+ +摘要:我們提出了一種新方法,用於系統性地繪製大型語言模型連續層中稀疏自動編碼器發現的功能,擴展了先前研究層間特徵連結的工作。透過使用無資料餘弦相似性技術,我們追蹤特定特徵在每個階段如何持續、轉換或首次出現。此方法產生了特徵演化的細粒度流程圖,實現了細粒度的可解釋性和對模型運算的機制見解。至關重要的是,我們展示了這些跨層特徵圖如何透過放大或抑制所選特徵來促進模型行為的直接引導,在文字生成中實現目標主題控制。我們的研究結果共同突出了因果、跨層可解釋性框架的效用,不僅闡明了特徵如何透過前向傳遞發展,還提供了新的方法來透明地操作大型語言模型。 + +##### **A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs** +2502.02896v1 by Bradley P. Allen, Paul T. Groth + +Evaluating large language models (LLMs) for tasks like fact extraction in +support of knowledge graph construction frequently involves computing accuracy +metrics using a ground truth benchmark based on a knowledge graph (KG). These +evaluations assume that errors represent factual disagreements. However, human +discourse frequently features metalinguistic disagreement, where agents differ +not on facts but on the meaning of the language used to express them. Given the +complexity of natural language processing and generation using LLMs, we ask: do +metalinguistic disagreements occur between LLMs and KGs? Based on an +investigation using the T-REx knowledge alignment dataset, we hypothesize that +metalinguistic disagreement does in fact occur between LLMs and KGs, with +potential relevance for the practice of knowledge graph engineering. We propose +a benchmark for evaluating the detection of factual and metalinguistic +disagreements between LLMs and KGs. An initial proof of concept of such a +benchmark is available on Github. 
+ +摘要:評估大型語言模型 (LLM) 執行知識圖譜建構支援事實萃取等任務時,通常會使用基於知識圖譜 (KG) 的基準事實計算準確度指標。這些評估假設錯誤代表事實上的分歧。然而,人類話語經常出現元語言分歧,其中代理人之間的差異不在於事實,而在於用於表達事實的語言的含義。鑑於使用 LLM 處理和產生自然語言的複雜性,我們提出疑問:LLM 和 KG 之間是否會發生元語言分歧?根據使用 T-REx 知識比對資料集進行的調查,我們假設元語言分歧確實會發生在 LLM 和 KG 之間,並可能與知識圖譜工程實務有關。我們提出一個基準,用於評估 LLM 和 KG 之間的事實和元語言分歧的偵測。此基準的初步概念驗證可在 Github 上取得。 + +##### **Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization** +2502.02810v1 by Chanhui Lee, Yuheon Song, YongJun Jeong, Hanbum Ko, Rodrigo Hormazabal, Sehui Han, Kyunghoon Bae, Sungbin Lim, Sungwoong Kim + +Recent advances in Large Language Models (LLMs) have motivated the +development of general LLMs for molecular tasks. While several studies have +demonstrated that fine-tuned LLMs can achieve impressive benchmark +performances, they are far from genuine generalist molecular LLMs due to a lack +of fundamental understanding of molecular structure. Specifically, when given +molecular task instructions, LLMs trained with naive next-token prediction +training assign similar likelihood scores to both original and negatively +corrupted molecules, revealing their lack of molecular structure understanding +that is crucial for reliable and general molecular LLMs. To overcome this +limitation and obtain a true generalist molecular LLM, we introduce a novel +multi-modal training method based on a thorough multi-modal instruction tuning +as well as a molecular structure preference optimization between chosen and +rejected graphs. On various molecular benchmarks, the proposed generalist +molecular LLM, called Mol-LLM, achieves state-of-the-art performances among +generalist LLMs on most tasks, at the same time, surpassing or comparable to +state-of-the-art specialist LLMs. Moreover, Mol-LLM also shows superior +generalization performances in reaction prediction tasks, demonstrating the +effect of the molecular structure understanding for generalization perspective. 
+ +摘要:大型語言模型 (LLM) 的近期進展激勵了針對分子任務開發通用 LLM。雖然多項研究已證明微調 LLM 可實現令人印象深刻的基準效能,但由於缺乏對分子結構的基本理解,它們遠非真正的通才分子 LLM。具體來說,當給予分子任務說明時,使用天真的下一個符號預測訓練訓練的 LLM 會將類似的可能性評分分配給原始分子和負面損壞分子,這顯示出它們缺乏對分子結構的理解,而這對於可靠且通用的分子 LLM 至關重要。為了克服這個限制並獲得真正的通才分子 LLM,我們引入了一種新穎的多模態訓練方法,該方法基於徹底的多模態說明調整以及在所選和拒絕圖形之間的分子結構偏好最佳化。在各種分子基準測試中,所提出的通才分子 LLM(稱為 Mol-LLM)在多數任務中實現了通才 LLM 中的最新效能,同時超越或與最新的專家 LLM 相當。此外,Mol-LLM 在反應預測任務中也展現出優異的泛化效能,證明了分子結構理解對泛化觀點的影響。 + +##### **Leveraging the true depth of LLMs** +2502.02790v1 by Ramón Calvo González, Daniele Paliotta, Matteo Pagliardini, Martin Jaggi, François Fleuret + +Large Language Models demonstrate remarkable capabilities at the cost of high +compute requirements. While recent research has shown that intermediate layers +can be removed or have their order shuffled without impacting performance +significantly, these findings have not been employed to reduce the +computational cost of inference. We investigate several potential ways to +reduce the depth of pre-trained LLMs without significantly affecting +performance. Leveraging our insights, we present a novel approach that exploits +this decoupling between layers by grouping some of them into pairs that can be +evaluated in parallel. + This modification of the computational graph -- through better parallelism -- +results in an average improvement of around 1.20x on the number of tokens +generated per second, without re-training nor fine-tuning, while retaining +95%-99% of the original accuracy. Empirical evaluation demonstrates that this +approach significantly improves serving efficiency while maintaining model +performance, offering a practical improvement for large-scale LLM deployment. 
+ +摘要:大型语言模型展示了其强大的功能,但代价是较高的计算需求。虽然最近的研究表明,中间层可以被移除或重新排列其顺序,而不会显著影响性能,但这些发现尚未被用来降低推理的计算成本。我们研究了几种潜在的方法来减少预训练 LLM 的深度,而不会显著影响性能。利用我们的见解,我们提出了一种新颖的方法,该方法通过将其中一些分组为可以并行评估的成对来利用层之间的这种解耦。 +通过更好的并行性对计算图进行修改,平均而言,每秒生成的令牌数量提高了约 1.20 倍,而无需重新训练或微调,同时保留了 95%-99% 的原始准确性。经验评估表明,这种方法显著提高了服务效率,同时保持了模型性能,为大规模 LLM 部署提供了实际改进。 + +##### **Modular Training of Neural Networks aids Interpretability** +2502.02470v2 by Satvik Golechha, Maheep Chaudhary, Joan Velja, Alessandro Abate, Nandi Schoots + +An approach to improve neural network interpretability is via clusterability, +i.e., splitting a model into disjoint clusters that can be studied +independently. We define a measure for clusterability and show that pre-trained +models form highly enmeshed clusters via spectral graph clustering. We thus +train models to be more modular using a "clusterability loss" function that +encourages the formation of non-interacting clusters. Using automated +interpretability techniques, we show that our method can help train models that +are more modular and learn different, disjoint, and smaller circuits. We +investigate CNNs trained on MNIST and CIFAR, small transformers trained on +modular addition, and language models. Our approach provides a promising +direction for training neural networks that learn simpler functions and are +easier to interpret. + +摘要:一種改善神經網路可解釋性的方法是透過群集性, +也就是將模型分割成可獨立研究的不相交群集。我們定義一個群集性的度量,並顯示預訓練的 +模型透過光譜圖形群集形成高度糾纏的群集。因此,我們使用「群集性損失」函數訓練模型,使其更具模組化, +這鼓勵形成非交互群集。使用自動化可解釋性技術,我們顯示我們的模型可以幫助訓練更具模組化的模型,並學習不同、不相交且較小的電路。我們 +研究了在 MNIST 和 CIFAR 上訓練的 CNN,在模組化加法上訓練的小型Transformer,以及語言模型。我們的做法為訓練學習更簡單函數且更容易解釋的神經網路提供了有希望的方向。 + +##### **Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs** +2502.02362v3 by Sagnik Mukherjee, Abhinav Chinta, Takyoung Kim, Tarun Anoop Sharma, Dilek Hakkani-Tür + +Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large +language models (LLMs) by enabling detailed step-by-step solutions. 
However, +due to the verbosity of LLMs, the resulting reasoning chains can be long, +making it harder to verify the reasoning steps and trace issues resulting from +dependencies between the steps that may be farther away in the sequence of +steps. Importantly, mathematical reasoning allows each step to be derived from +a small set of premises, which are a subset of the preceding steps in the +reasoning chain. In this paper, we present a framework that identifies the +premises for each step, to improve the evaluation of reasoning. We restructure +conventional linear reasoning chains into Premise Augmented Reasoning Chains +(PARC) by introducing premise links, resulting in a directed acyclic graph +where the nodes are the steps and the edges are the premise links. Through +experiments with a PARC-based dataset that we built, namely PERL (Premises and +ERrors identification in LLMs), we demonstrate that LLMs can reliably identify +premises within complex reasoning chains. In particular, even open-source LLMs +achieve 90% recall in premise identification. We also show that PARC helps to +identify errors in reasoning chains more reliably. The accuracy of error +identification improves by 6% to 16% absolute when step-by-step verification is +carried out in PARC under the premises. Our findings highlight the utility of +premise-centric representations in addressing complex problem-solving tasks and +open new avenues for improving the reliability of LLM-based reasoning +evaluations. 
+ +摘要:思考鏈(CoT)提示透過提供詳細的逐步解法,增強大型語言模型(LLM)的數學推理能力。然而,由於 LLM 的冗長,產生的推理鏈可能很長,這使得驗證推理步驟和追蹤由步驟之間相依關係所產生的問題變得更加困難,而這些步驟可能在步驟順序中相距較遠。重要的是,數學推理允許每個步驟從一組小的前提中推導出來,這些前提是推理鏈中前一個步驟的子集。在本文中,我們提出了一個框架,用於識別每個步驟的前提,以改進推理評估。我們透過引入前提連結,將傳統的線性推理鏈重組為前提擴充推理鏈(PARC),產生一個有向無環圖,其中節點是步驟,而邊緣是前提連結。透過我們建立的基於 PARC 的資料集(即 PERL(LLM 中的前提和錯誤識別))進行的實驗,我們證明 LLM 能夠在複雜的推理鏈中可靠地識別前提。特別是,即使是開源 LLM 在前提識別中也能達到 90% 的召回率。我們還表明,PARC 有助於更可靠地識別推理鏈中的錯誤。在前提下於 PARC 中執行逐步驗證時,錯誤識別的準確度提高了 6% 到 16%。我們的研究結果突顯了以前提為中心的表示在解決複雜問題解決任務中的效用,並為改進基於 LLM 的推理評估的可靠性開闢了新途徑。 + +##### **AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement** +2502.02067v1 by Shivam Singh, Karthik Swaminathan, Nabanita Dash, Ramandeep Singh, Snehasis Banerjee, Mohan Sridharan, Madhava Krishna + +Embodied agents assisting humans are often asked to complete a new task in a +new scenario. An agent preparing a particular dish in the kitchen based on a +known recipe may be asked to prepare a new dish or to perform cleaning tasks in +the storeroom. There may not be sufficient resources, e.g., time or labeled +examples, to train the agent for these new situations. Large Language Models +(LLMs) trained on considerable knowledge across many domains are able to +predict a sequence of abstract actions for such new tasks and scenarios, +although it may not be possible for the agent to execute this action sequence +due to task-, agent-, or domain-specific constraints. Our framework addresses +these challenges by leveraging the generic predictions provided by LLM and the +prior domain-specific knowledge encoded in a Knowledge Graph (KG), enabling an +agent to quickly adapt to new tasks and scenarios. The robot also solicits and +uses human input as needed to refine its existing knowledge. 
Based on +experimental evaluation over cooking and cleaning tasks in simulation domains, +we demonstrate that the interplay between LLM, KG, and human input leads to +substantial performance gains compared with just using the LLM output. + +摘要:具身代理协助人类时,通常需要在新的情境中完成新的任务。基于已知食谱在厨房准备特定菜肴的代理可能会被要求准备新菜肴或在储藏室执行清洁任务。可能没有足够资源(例如时间或标记的示例)来训练代理以应对这些新情况。在许多领域接受大量知识训练的大型语言模型 (LLM) 能够预测此类新任务和情境的抽象动作序列,尽管代理可能无法执行此动作序列,因为任务、代理或特定于域的约束。我们的框架通过利用 LLM 提供的通用预测和知识图 (KG) 中编码的先前特定于域的知识来应对这些挑战,使代理能够快速适应新任务和情境。该机器人还会根据需要征求并使用人类输入来完善其现有知识。基于在模拟域中对烹饪和清洁任务的实验评估,我们证明了 LLM、KG 和人类输入之间的相互作用与仅使用 LLM 输出相比带来了巨大的性能提升。 + +##### **On Bob Dylan: A Computational Perspective** +2502.01772v1 by Prashant Garg + +Cass Sunstein's essay 'On Bob Dylan' describes Dylan's 'dishabituating' style +-- a constant refusal to conform to expectation and a penchant for reinventing +his musical and lyrical identity. In this paper, I extend Sunstein's +observations through a large-scale computational analysis of Dylan's lyrics +from 1962 to 2012. Using o3-mini-high (a large language model), I extract +concept-to-concept relationships from the lyrics and construct directed +knowledge graphs that capture Dylan's thematic structure. I then quantify +shifts in sentiment, metaphorical expression, thematic diversity, and network +complexity over time. The results indicate that Dylan's lyrics increasingly +rely on metaphor, display an evolving sentiment profile, and exhibit heightened +dishabituation -- measured here as a growing variance in the network centrality +of key concepts. I also find that references to movement, protest, and mythic +imagery fluctuate in ways that align with well-known phases of Dylan's career, +reflecting the dynamic and unpredictable quality of his art. These findings not +only deepen our empirical understanding of Sunstein's thesis but also introduce +a novel computational method for analyzing an artist's evolution, offering +broader applicability to the study of cultural and creative change. 
+ +摘要:卡斯·桑斯坦的論文「論鮑伯·迪倫」描述了迪倫「去習慣化」的風格 +-- 這種風格不斷拒絕符合預期,並熱衷於重新塑造他的音樂和歌詞認同。在本文中,我透過對迪倫 1962 年至 2012 年歌詞進行大規模的運算分析,來延伸桑斯坦的觀察。使用 o3-mini-high(一個大型語言模型),我從歌詞中提取概念對概念的關係,並建構有向知識圖,以捕捉迪倫的主題結構。然後,我量化情緒、隱喻表達、主題多樣性和網路複雜性隨時間的變化。結果顯示,迪倫的歌詞越來越依賴隱喻,展現出不斷演化的情緒輪廓,並表現出高度的去習慣化 -- 在這裡測量為關鍵概念的網路中心性的變異增加。我也發現,對運動、抗議和神話意象的引用,會以與迪倫職業生涯中眾所周知階段一致的方式波動,反映了他藝術的動態和不可預測的品質。這些發現不僅加深了我們對桑斯坦論文的經驗理解,也引入了分析藝術家演變的新穎運算方法,為文化和創造性變化的研究提供了更廣泛的適用性。 + +##### **VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos** +2502.01549v1 by Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang + +Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in +enhancing Large Language Models (LLMs) through external knowledge integration, +yet its application has primarily focused on textual content, leaving the rich +domain of multi-modal video knowledge predominantly unexplored. This paper +introduces VideoRAG, the first retrieval-augmented generation framework +specifically designed for processing and understanding extremely long-context +videos. Our core innovation lies in its dual-channel architecture that +seamlessly integrates (i) graph-based textual knowledge grounding for capturing +cross-video semantic relationships, and (ii) multi-modal context encoding for +efficiently preserving visual features. This novel design empowers VideoRAG to +process unlimited-length videos by constructing precise knowledge graphs that +span multiple videos while maintaining semantic dependencies through +specialized multi-modal retrieval paradigms. Through comprehensive empirical +evaluation on our proposed LongerVideos benchmark, comprising over 160 videos +totaling 134+ hours across lecture, documentary, and entertainment +categories, VideoRAG demonstrates substantial performance compared to existing +RAG alternatives and long video understanding methods. 
The source code of +VideoRAG implementation and the benchmark dataset are openly available at: +https://github.com/HKUDS/VideoRAG. + +摘要:檢索增強生成 (RAG) 已證明在透過外部知識整合增強大型語言模型 (LLM) 方面取得顯著成功,但其應用主要集中在文字內容上,而豐富的多模態影片知識領域則鮮少被探索。本文介紹 VideoRAG,這是第一個檢索增強生成架構,專門設計用於處理和理解極長語境的影片。我們的核心創新在於其雙通道架構,它無縫整合 (i) 基於圖形文字知識基礎,用於擷取跨影片語義關係,以及 (ii) 多模態語境編碼,用於有效保留視覺特徵。這個新穎的設計讓 VideoRAG 能夠透過建構跨越多個影片的精確知識圖譜來處理長度不限的影片,同時透過專門的多模態檢索範例來維持語義依賴性。透過我們提出的 LongerVideos 基準的全面經驗評估,該基準包含超過 160 部影片,總時數超過 134 小時,涵蓋演講、紀錄片和娛樂類別,VideoRAG 與現有的 RAG 替代方案和長影片理解方法相比,展現出顯著的效能。VideoRAG 實作的原始碼和基準資料集已公開於:https://github.com/HKUDS/VideoRAG。 + +##### **Transformers trained on proteins can learn to attend to Euclidean distance** +2502.01533v1 by Isaac Ellmen, Constantin Schneider, Matthew I. J. Raybould, Charlotte M. Deane + +While conventional Transformers generally operate on sequence data, they can +be used in conjunction with structure models, typically SE(3)-invariant or +equivariant graph neural networks (GNNs), for 3D applications such as protein +structure modelling. These hybrids typically involve either (1) +preprocessing/tokenizing structural features as input for Transformers or (2) +taking Transformer embeddings and processing them within a structural +representation. However, there is evidence that Transformers can learn to +process structural information on their own, such as the AlphaFold3 structural +diffusion model. In this work we show that Transformers can function +independently as structure models when passed linear embeddings of coordinates. +We first provide a theoretical explanation for how Transformers can learn to +filter attention as a 3D Gaussian with learned variance. We then validate this +theory using both simulated 3D points and in the context of masked token +prediction for proteins. Finally, we show that pre-training protein Transformer +encoders with structure improves performance on a downstream task, yielding +better performance than custom structural models. 
Together, this work provides +a basis for using standard Transformers as hybrid structure-language models. + +摘要:雖然傳統的 Transformer 通常處理序列資料,但它們可用於結構模型,通常是 SE(3) 不變式或等變式圖神經網路 (GNN),用於蛋白質結構建模等 3D 應用。這些混合模型通常包含 (1) 將結構特徵預處理/標記化為 Transformer 的輸入或 (2) 取用 Transformer 嵌入並在結構表示中處理它們。然而,有證據表明 Transformer 可以自行學習處理結構資訊,例如 AlphaFold3 結構擴散模型。在這項工作中,我們展示了 Transformer 在傳遞座標的線性嵌入時,可以獨立作為結構模型運作。我們首先提供了 Transformer 如何學習將注意力濾波為具有學習變異的 3D 高斯的理論解釋。然後我們使用模擬 3D 點和在蛋白質遮罩標記預測的背景下驗證此理論。最後,我們展示了使用結構預訓練蛋白質 Transformer 編碼器會改善下游任務的效能,產生比自訂結構模型更好的效能。綜合來說,這項工作提供了使用標準 Transformer 作為混合結構語言模型的基礎。 + + +### Medical explainable AI +|Publish Date|Title|Authors|Homepage|Code| +| :---: | :---: | :---: | :---: | :---: | +|**2025-02-19**|**Towards a perturbation-based explanation for medical AI as differentiable programs**|Takeshi Abe et.al.|[2502.14001v1](http://arxiv.org/abs/2502.14001v1)|null| +|**2025-02-14**|**3D ReX: Causal Explanations in 3D Neuroimaging Classification**|Melane Navaratnarajah et.al.|[2502.12181v1](http://arxiv.org/abs/2502.12181v1)|null| +|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null| +|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null| +|**2025-01-27**|**An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases**|Shaheer Ahmad Khan et.al.|[2501.15969v1](http://arxiv.org/abs/2501.15969v1)|null| +|**2025-01-23**|**Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data**|Frederik Pahde et.al.|[2501.13818v1](http://arxiv.org/abs/2501.13818v1)|[link](https://github.com/frederikpahde/medical-ai-safety)| +|**2025-01-19**|**Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model**|Mohaiminul Islam Bhuiyan 
et.al.|[2501.11094v1](http://arxiv.org/abs/2501.11094v1)|null| +|**2025-01-17**|**SEANN: A Domain-Informed Neural Network for Epidemiological Insights**|Jean-Baptiste Guimbaud et.al.|[2501.10273v1](http://arxiv.org/abs/2501.10273v1)|null| +|**2025-01-16**|**Artificial Intelligence-Driven Clinical Decision Support Systems**|Muhammet Alkan et.al.|[2501.09628v2](http://arxiv.org/abs/2501.09628v2)|null| +|**2025-01-12**|**MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis**|Sadia Kamal et.al.|[2501.06887v1](http://arxiv.org/abs/2501.06887v1)|null| +|**2025-01-06**|**Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis**|Mary Ogbuka Kenneth et.al.|[2501.02891v1](http://arxiv.org/abs/2501.02891v1)|null| +|**2024-12-28**|**The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support**|Alessandro De Grandi et.al.|[2412.20068v1](http://arxiv.org/abs/2412.20068v1)|null| +|**2024-12-27**|**A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation**|Jana Zakall et.al.|[2412.19688v1](http://arxiv.org/abs/2412.19688v1)|null| +|**2024-12-23**|**Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models**|Badaru I. 
Olumuyiwa et.al.|[2412.17527v1](http://arxiv.org/abs/2412.17527v1)|null| +|**2024-12-20**|**Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG**|Hasan Md Tusfiqur Alam et.al.|[2412.16086v2](http://arxiv.org/abs/2412.16086v2)|[link](https://github.com/tifat58/irr-with-cbm-rag)| +|**2024-12-20**|**Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models**|Shamus Sim et.al.|[2412.15748v1](http://arxiv.org/abs/2412.15748v1)|null| +|**2024-12-18**|**Cognition Chain for Explainable Psychological Stress Detection on Social Media**|Xin Wang et.al.|[2412.14009v1](http://arxiv.org/abs/2412.14009v1)|null| +|**2024-11-30**|**2-Factor Retrieval for Improved Human-AI Decision Making in Radiology**|Jim Solomon et.al.|[2412.00372v1](http://arxiv.org/abs/2412.00372v1)|null| +|**2024-11-28**|**Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance**|Philipp Brauner et.al.|[2411.19356v1](http://arxiv.org/abs/2411.19356v1)|null| +|**2024-11-26**|**Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset**|Yujie Dai et.al.|[2411.17645v2](http://arxiv.org/abs/2411.17645v2)|null| +|**2024-11-18**|**Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care**|Jeffrey N. 
Clark et.al.|[2411.11774v1](http://arxiv.org/abs/2411.11774v1)|null| +|**2024-11-15**|**Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning**|Mohammed Yaseen Jabarulla et.al.|[2411.10255v1](http://arxiv.org/abs/2411.10255v1)|null| +|**2024-11-01**|**Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering**|Mehdi Hosseini Chagahi et.al.|[2411.00916v2](http://arxiv.org/abs/2411.00916v2)|null| +|**2024-10-25**|**A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection**|Muath Alsuhaibani et.al.|[2410.19898v1](http://arxiv.org/abs/2410.19898v1)|null| +|**2024-10-23**|**An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems**|Shruthi Chari et.al.|[2410.17504v1](http://arxiv.org/abs/2410.17504v1)|[link](https://github.com/tetherless-world/metaexplainer)| +|**2024-10-22**|**Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study**|Lukas Hughes-Noehrer et.al.|[2410.16879v1](http://arxiv.org/abs/2410.16879v1)|null| +|**2024-10-19**|**Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer**|Gesa Mittmann et.al.|[2410.15012v1](http://arxiv.org/abs/2410.15012v1)|null| +|**2024-10-15**|**Explainable AI Methods for Multi-Omics Analysis: A Survey**|Ahmad Hussein et.al.|[2410.11910v1](http://arxiv.org/abs/2410.11910v1)|null| +|**2024-10-14**|**Study on the Helpfulness of Explainable Artificial Intelligence**|Tobias Labarta et.al.|[2410.11896v1](http://arxiv.org/abs/2410.11896v1)|[link](https://github.com/tlabarta/helpfulnessofxai)| +|**2024-10-12**|**Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health**|Abdullah Mamun 
et.al.|[2410.09635v1](http://arxiv.org/abs/2410.09635v1)|[link](https://github.com/ab9mamun/aimen)| +|**2024-10-10**|**Artificial intelligence techniques in inherited retinal diseases: A review**|Han Trinh et.al.|[2410.09105v1](http://arxiv.org/abs/2410.09105v1)|null| +|**2024-10-07**|**CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures**|Ekaterina Sviridova et.al.|[2410.05235v2](http://arxiv.org/abs/2410.05235v2)|[link](https://github.com/ixa-ehu/antidote-casimedicos)| +|**2024-10-01**|**Explainable Diagnosis Prediction through Neuro-Symbolic Integration**|Qiuhao Lu et.al.|[2410.01855v2](http://arxiv.org/abs/2410.01855v2)|null| +|**2024-10-01**|**Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare**|Prasenjit Maji et.al.|[2410.00366v1](http://arxiv.org/abs/2410.00366v1)|null| +|**2024-09-20**|**Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study**|Tirtha Chanda et.al.|[2409.13476v1](http://arxiv.org/abs/2409.13476v1)|null| +|**2024-09-19**|**Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data**|Suryansh Vidya et.al.|[2409.15374v1](http://arxiv.org/abs/2409.15374v1)|null| +|**2024-09-19**|**Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition**|Daniel Flores-Araiza et.al.|[2409.12883v1](http://arxiv.org/abs/2409.12883v1)|null| +|**2024-09-18**|**Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques**|Yubo Li et.al.|[2409.12087v3](http://arxiv.org/abs/2409.12087v3)|null| +|**2024-09-13**|**Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases**|Mercy Asiedu et.al.|[2409.09201v3](http://arxiv.org/abs/2409.09201v3)|null| +|**2024-09-09**|**Explainable AI: Definition and attributes of a good explanation 
for health AI**|Evangelia Kyrimi et.al.|[2409.15338v1](http://arxiv.org/abs/2409.15338v1)|null| +|**2024-08-30**|**Exploring the Effect of Explanation Content and Format on User Comprehension and Trust in Healthcare**|Antonio Rago et.al.|[2408.17401v2](http://arxiv.org/abs/2408.17401v2)|null| +|**2024-08-29**|**A Survey for Large Language Models in Biomedicine**|Chong Wang et.al.|[2409.00133v1](http://arxiv.org/abs/2409.00133v1)|null| +|**2024-08-27**|**Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis**|Francesco Sovrano et.al.|[2408.15121v1](http://arxiv.org/abs/2408.15121v1)|null| +|**2024-08-24**|**Towards Case-based Interpretability for Medical Federated Learning**|Laura Latorre et.al.|[2408.13626v1](http://arxiv.org/abs/2408.13626v1)|null| +|**2024-08-22**|**AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines**|Douwe J. Spaanderman et.al.|[2408.12491v1](http://arxiv.org/abs/2408.12491v1)|null| +|**2024-08-14**|**Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy**|Kimji N. 
Pellano et.al.|[2409.00001v1](http://arxiv.org/abs/2409.00001v1)|null| +|**2024-08-06**|**MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy**|Hanchen David Wang et.al.|[2408.11837v1](http://arxiv.org/abs/2408.11837v1)|null| +|**2024-08-05**|**The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development**|Joshua Morriss et.al.|[2408.05239v1](http://arxiv.org/abs/2408.05239v1)|null| +|**2024-08-05**|**Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns**|Chi Him Ng et.al.|[2408.02709v1](http://arxiv.org/abs/2408.02709v1)|null| +|**2024-08-05**|**Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability**|Masoud Muhammed Hassan et.al.|[2408.02706v1](http://arxiv.org/abs/2408.02706v1)|null| +|**2024-07-26**|**MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI**|Shyam Dongre et.al.|[2407.20284v1](http://arxiv.org/abs/2407.20284v1)|null| +|**2024-07-25**|**Introducing δ-XAI: a novel sensitivity-based method for local AI explanations**|Alessandro De Carlo et.al.|[2407.18343v2](http://arxiv.org/abs/2407.18343v2)|null| +|**2024-07-24**|**Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population**|Nikolaos Ntampakis et.al.|[2407.17324v2](http://arxiv.org/abs/2407.17324v2)|null| +|**2024-07-24**|**Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition**|Michele Fiori et.al.|[2408.06352v1](http://arxiv.org/abs/2408.06352v1)|null| +|**2024-07-21**|**Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions**|Naseem Khan 
et.al.|[2408.03335v1](http://arxiv.org/abs/2408.03335v1)|null| +|**2024-07-18**|**A Comparative Study on Automatic Coding of Medical Letters with Explainability**|Jamie Glen et.al.|[2407.13638v1](http://arxiv.org/abs/2407.13638v1)|[link](https://github.com/Glenj01/Medical-Coding)| +|**2024-07-09**|**Explainable AI for Enhancing Efficiency of DL-based Channel Estimation**|Abdul Karim Gizzini et.al.|[2407.07009v1](http://arxiv.org/abs/2407.07009v1)|null| +|**2024-07-07**|**Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification**|P. N. Karthikayan et.al.|[2407.05440v2](http://arxiv.org/abs/2407.05440v2)|null| +|**2024-07-03**|**A Survey on Trustworthiness in Foundation Models for Medical Image Analysis**|Congzhen Shi et.al.|[2407.15851v2](http://arxiv.org/abs/2407.15851v2)|null| +|**2024-07-01**|**The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data**|Ximing Wen et.al.|[2407.06206v1](http://arxiv.org/abs/2407.06206v1)|null| +|**2024-06-28**|**Can GPT-4 Help Detect Quit Vaping Intentions? 
An Exploration of Automatic Data Annotation Approach**|Sai Krishna Revanth Vuruma et.al.|[2407.00167v1](http://arxiv.org/abs/2407.00167v1)|null| +|**2024-06-25**|**Towards Compositional Interpretability for XAI**|Sean Tull et.al.|[2406.17583v1](http://arxiv.org/abs/2406.17583v1)|null| +|**2024-06-17**|**Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods**|Vincent Olesen et.al.|[2406.12142v2](http://arxiv.org/abs/2406.12142v2)|[link](https://github.com/volesen/slicing-through-bias)| +|**2024-06-11**|**Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health**|Fatemeh Ebrahimzadeh et.al.|[2406.07114v2](http://arxiv.org/abs/2406.07114v2)|null| +|**2024-06-10**|**AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI**|K M Tawsik Jawad et.al.|[2406.06728v2](http://arxiv.org/abs/2406.06728v2)|null| +|**2024-06-10**|**Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook**|Yusif Ibrahimov et.al.|[2406.05984v1](http://arxiv.org/abs/2406.05984v1)|null| +|**2024-06-09**|**Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance**|Zhan Zhang et.al.|[2406.05746v1](http://arxiv.org/abs/2406.05746v1)|null| +|**2024-06-07**|**Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability**|Faseela Abdullakutty et.al.|[2406.12897v1](http://arxiv.org/abs/2406.12897v1)|null| +|**2024-06-04**|**Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection**|Dinuka Sandun Udayantha et.al.|[2406.16908v3](http://arxiv.org/abs/2406.16908v3)|[link](https://github.com/dinuka-1999/braineocare)| +|**2024-06-01**|**Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques**|Samita Bai 
et.al.|[2406.00532v1](http://arxiv.org/abs/2406.00532v1)|null| +|**2024-06-01**|**Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition**|Alaa Nfissi et.al.|[2406.01624v2](http://arxiv.org/abs/2406.01624v2)|[link](https://github.com/alaanfissi/unveiling-hidden-factors-explainable-ai-for-feature-boosting-in-speech-emotion-recognition)| +|**2024-05-31**|**The Explanation Necessity for Healthcare AI**|Michail Mamalakis et.al.|[2406.00216v1](http://arxiv.org/abs/2406.00216v1)|null| +|**2024-05-29**|**Interdisciplinary Expertise to Advance Equitable Explainable AI**|Chloe R. Bennett et.al.|[2406.18563v1](http://arxiv.org/abs/2406.18563v1)|null| +|**2024-05-27**|**"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts**|Hubert D. Zając et.al.|[2407.11978v1](http://arxiv.org/abs/2407.11978v1)|null| +|**2024-05-26**|**Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making**|Min Hun Lee et.al.|[2405.16424v1](http://arxiv.org/abs/2405.16424v1)|null| +|**2024-05-26**|**Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach**|Ziming Liu et.al.|[2405.17502v1](http://arxiv.org/abs/2405.17502v1)|null| +|**2024-05-24**|**Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone**|Catalina Gomez et.al.|[2407.11974v1](http://arxiv.org/abs/2407.11974v1)|null| +|**2024-05-23**|**Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery**|Yingying Fang et.al.|[2406.18552v1](http://arxiv.org/abs/2406.18552v1)|null| +|**2024-05-21**|**The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach**|Mohsen Jozani et.al.|[2405.13099v1](http://arxiv.org/abs/2405.13099v1)|null| +|**2024-05-17**|**ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education**|Harris Bin Munawar 
et.al.|[2405.10645v1](http://arxiv.org/abs/2405.10645v1)|null| +|**2024-05-13**|**Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data**|Camelia Oprea et.al.|[2405.07590v1](http://arxiv.org/abs/2405.07590v1)|null| +|**2024-05-10**|**XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare**|Fatemeh Nazary et.al.|[2405.06270v3](http://arxiv.org/abs/2405.06270v3)|null| +|**2024-05-09**|**To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems**|Miquel Miró-Nicolau et.al.|[2405.05766v1](http://arxiv.org/abs/2405.05766v1)|null| +|**2024-05-05**|**Region-specific Risk Quantification for Interpretable Prognosis of COVID-19**|Zhusi Zhong et.al.|[2405.02815v1](http://arxiv.org/abs/2405.02815v1)|[link](https://github.com/zzs95/RSP_COVID)| +|**2024-04-26**|**Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics**|Francesco Prinzi et.al.|[2405.02334v2](http://arxiv.org/abs/2405.02334v2)|null| +|**2024-04-25**|**Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability**|Yunfei Ge et.al.|[2404.16957v1](http://arxiv.org/abs/2404.16957v1)|null| +|**2024-04-19**|**Explainable AI for Fair Sepsis Mortality Predictive Model**|Chia-Hsuan Chang et.al.|[2404.13139v1](http://arxiv.org/abs/2404.13139v1)|null| +|**2024-04-19**|**Multi Class Depression Detection Through Tweets using Artificial Intelligence**|Muhammad Osama Nusrat et.al.|[2404.13104v1](http://arxiv.org/abs/2404.13104v1)|[link](https://github.com/mnusrat786/masters-thesis)| +|**2024-04-19**|**COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images**|Dmytro Shvetsov et.al.|[2404.12832v2](http://arxiv.org/abs/2404.12832v2)|[link](https://github.com/dmytro-shvetsov/counterfactual-search)| +|**2024-04-15**|**Hybrid Intelligence for Digital 
Humanities**|Victor de Boer et.al.|[2406.15374v1](http://arxiv.org/abs/2406.15374v1)|null| +|**2024-04-14**|**Ethical Framework for Responsible Foundational Models in Medical Imaging**|Abhijit Das et.al.|[2406.11868v1](http://arxiv.org/abs/2406.11868v1)|null| +|**2024-04-09**|**Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis**|Milad Yousefi et.al.|[2404.07239v1](http://arxiv.org/abs/2404.07239v1)|null| +|**2024-04-06**|**Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI**|Taminul Islam et.al.|[2404.04686v1](http://arxiv.org/abs/2404.04686v1)|null| +|**2024-04-05**|**Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI**|Maryam Ahmed et.al.|[2404.03892v3](http://arxiv.org/abs/2404.03892v3)|null| +|**2024-03-30**|**Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives**|Xingrui Gu et.al.|[2404.00320v2](http://arxiv.org/abs/2404.00320v2)|null| +|**2024-03-26**|**Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach**|Andrea Ferrario et.al.|[2403.17873v1](http://arxiv.org/abs/2403.17873v1)|null| +|**2024-03-26**|**Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification**|Han Yuan et.al.|[2403.18871v1](http://arxiv.org/abs/2403.18871v1)|[link](https://github.com/han-yuan-med/template-explanation)| +|**2024-03-03**|**Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures**|Séamus Lankford et.al.|[2403.01580v1](http://arxiv.org/abs/2403.01580v1)|null| +|**2024-02-28**|**Cause and Effect: Can Large Language Models Truly Understand Causality?**|Swagata Ashwani et.al.|[2402.18139v3](http://arxiv.org/abs/2402.18139v3)|null| 
+|**2024-02-28**|**Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina**|Yasin Sadeghi Bazargani et.al.|[2402.18600v1](http://arxiv.org/abs/2402.18600v1)|null| + +#### Abstracts +##### **Towards a perturbation-based explanation for medical AI as differentiable programs** +2502.14001v1 by Takeshi Abe, Yoshiyuki Asai + +Recent advances in machine learning algorithms have reached a point where +medical devices can be equipped with artificial intelligence (AI) models for +diagnostic support and routine automation in clinical settings. In medicine and +healthcare, there is a particular demand for sufficient and objective +explainability of the outcome generated by AI models. However, AI models are +generally considered black boxes due to their complexity, and the +computational process leading to their response is often opaque. Although +several methods have been proposed to explain the behavior of models by +evaluating the importance of each feature in discrimination and prediction, +they may suffer from biases and opacities arising from the scale and sampling +protocol of the dataset used for training or testing. To overcome the +shortcomings of existing methods, we explore an alternative approach to provide +an objective explanation of AI models that can be defined independently of the +learning process and does not require additional data. As a preliminary study +for this direction of research, this work examines the numerical availability of +the Jacobian matrix of deep learning models, which measures how stably a model +responds to small perturbations added to the input. The indicator, if +available, is calculated from a trained AI model for a given target input. +This is a first step towards a perturbation-based explanation, which will +assist medical practitioners in understanding and interpreting the response of +the AI model in its clinical application. 
+ +摘要:機器學習演算法的最新進展已達到一個階段,醫療裝置可以配備人工智慧 (AI) 模型,以在臨床環境中提供診斷支援和例行自動化。在醫學和保健領域,對於 AI 模型產生的結果有足夠且客觀的可解釋性有特別的需求。然而,由於 AI 模型的複雜性,它們通常被視為黑盒子,而導致其反應的運算過程通常是不透明的。儘管已經提出多種方法來解釋模型的行為,方法是評估每個特徵在判別和預測中的重要性,但它們可能會受到訓練或測試所用資料集的規模和抽樣協定的偏差和不透明性的影響。為了克服現有方法的缺點,我們探索一種替代方法,以提供 AI 模型的客觀解釋,這種方法可以獨立於學習過程定義,而且不需要額外的資料。作為這個研究方向的初步研究,這項工作探討了深度學習模型的雅可比矩陣的數值可用性,它衡量了模型對輸入中新增的小擾動的穩定反應程度。如果可用,指標會從訓練好的 AI 模型計算得出,以取得給定的目標輸入。這是基於擾動的解釋的第一步,它將協助醫療從業人員了解和詮釋 AI 模型在其臨床應用中的反應。 + +##### **3D ReX: Causal Explanations in 3D Neuroimaging Classification** +2502.12181v1 by Melane Navaratnarajah, Sophie A. Martin, David A. Kelly, Nathan Blake, Hana Chocker + +Explainability remains a significant problem for AI models in medical +imaging, making it challenging for clinicians to trust AI-driven predictions. +We introduce 3D ReX, the first causality-based post-hoc explainability tool for +3D models. 3D ReX uses the theory of actual causality to generate +responsibility maps which highlight the regions most crucial to the model's +decision. We test 3D ReX on a stroke detection model, providing insight into +the spatial distribution of features relevant to stroke. + +摘要:解釋性仍然是醫療影像中 AI 模型的一大問題,這使得臨床醫生難以信任 AI 驅動的預測。 +我們引入了 3D ReX,這是第一個用於 3D 模型的基於因果關係的事後解釋性工具。3D ReX 使用實際因果關係理論來生成責任圖,該圖突出了對模型決策至關重要的區域。我們在中風檢測模型上測試了 3D ReX,提供了與中風相關特徵的空間分佈的見解。 + +##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration** +2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano + +This paper presents a complete explainable system that interprets a set of +data, abstracts the underlying features and describes them in a natural +language of choice. The system relies on two crucial stages: (i) identifying +emerging properties from data and transforming them into abstract concepts, and +(ii) converting these concepts into natural language. 
Despite the impressive +natural language generation capabilities demonstrated by Large Language Models, +their statistical nature and the intricacy of their internal mechanism still +force us to employ these techniques as black boxes, forgoing trustworthiness. +Developing an explainable pipeline for data interpretation would +facilitate its use in safety-critical environments like processing medical +information and allow non-experts and visually impaired people to access +narrated information. To this end, we believe that the fields of knowledge +representation and automated reasoning research could present a valid +alternative. Expanding on prior research that tackled the first stage (i), we +focus on the second stage, named Concept2Text. Being explainable, data +translation is easily modeled through logic-based rules, once again emphasizing +the role of declarative programming in achieving AI explainability. This paper +explores a Prolog/CLP-based rewriting system to interpret concepts, articulated +in terms of classes and relations, plus common knowledge, derived from a generic +ontology, generating natural language text. Its main features include +hierarchical tree rewritings, modular multilingual generation, support for +equivalent variants across semantic, grammar, and lexical levels, and a +transparent rule-based system. We outline the architecture and demonstrate its +flexibility through some examples capable of generating numerous diverse and +equivalent rewritings based on the input concept. 
+ +摘要:這篇論文提出了一個完整的可解釋系統,它可以解釋一組資料,抽象出基礎特徵,並以選擇的自然語言描述它們。系統依賴兩個關鍵階段:(i) 從資料中識別新興屬性,並將它們轉換為抽象概念,以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力,但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子,放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它,例如處理醫療資訊,並允許非專家和視障人士存取敘述資訊。為此,我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上,我們專注於第二階段,稱為 Concept2Text。由於具有可解釋性,資料翻譯很容易透過基於邏輯的規則建模,再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統,以解釋概念,這些概念以類別和關係的形式表達,再加上從通用本体衍生的常識,產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體,以及一個透明的基於規則的系統。我們概述了架構,並透過一些範例展示了它的靈活性,這些範例能夠根據輸入概念生成許多不同的等效重寫。 + +##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation** +2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek + +We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS), +an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS +predicts future PHTs using transformer-based architectures. The Adaptive Risk +Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk +probabilities for clinician-defined critical events. ARES incorporates a +personalized explainability module that identifies key clinical factors +influencing risk estimates for individual patients. ARES was evaluated on the +MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its +performance against traditional early warning systems and machine learning +models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs, +with 60% including hospital admissions. The dataset contained over 357 million +tokens. ETHOS outperformed benchmark models in predicting hospital admissions, +ICU admissions, and prolonged hospital stays, achieving superior AUC scores. 
+ETHOS-based risk estimates demonstrated robustness across demographic subgroups +with strong model reliability, confirmed via calibration curves. The +personalized explainability module provides insights into patient-specific +factors contributing to risk. ARES, powered by ETHOS, advances predictive +healthcare AI by providing dynamic, real-time, and personalized risk estimation +with patient-specific explainability to enhance clinician trust. Its +adaptability and superior accuracy position it as a transformative tool for +clinical decision-making, potentially improving patient outcomes and resource +allocation in emergency and inpatient settings. We release the full code at +github.com/ipolharvard/ethos-ares to facilitate future research. + +摘要:我們開發了增強型健康結果模擬轉換器 (ETHOS), +一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS +使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組,可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估,並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT,其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型,並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性,並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估,以及患者特定的可解釋性來增強臨床醫生的信任,從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具,有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼,以利未來的研究。 + +##### **An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases** +2501.15969v1 by Shaheer Ahmad Khan, Muhammad Usamah Shahid, Ahmad Abdullah, Ibrahim Hashmat, Muddassar Farooq + +This study addresses a critical gap in the healthcare system by developing a +clinically meaningful, practical, and explainable disease surveillance system +for multiple chronic diseases, utilizing routine EHR data from multiple U.S. +practices integrated with CureMD's EMR/EHR system. 
Unlike traditional +systems--using AI models that rely on features from patients' labs--our +approach focuses on routinely available data, such as medical history, vitals, +diagnoses, and medications, to preemptively assess the risks of chronic +diseases in the next year. We trained three distinct models for each chronic +disease: prediction models that forecast the risk of a disease 3, 6, and 12 +months before a potential diagnosis. We developed Random Forest models, which +were internally validated using F1 scores and AUROC as performance metrics and +further evaluated by a panel of expert physicians for clinical relevance based +on inferences grounded in medical knowledge. Additionally, we discuss our +implementation of integrating these models into a practical EMR system. Beyond +using Shapley attributes and surrogate models for explainability, we also +introduce a new rule-engineering framework to enhance the intrinsic +explainability of Random Forests. + +摘要:本研究透過開發一個臨床有意義、實用且可解釋的多重慢性疾病疾病監測系統,來解決醫療保健系統中的重大缺口,利用整合 CureMD 的 EMR/EHR 系統,來自多個美國實務的例行 EHR 資料。與傳統系統不同的是,我們的做法著重在例行可得的資料,例如病歷、生命徵象、診斷和藥物,以預先評估未來一年慢性疾病的風險,而非仰賴病患實驗室特徵的 AI 模型。我們針對每種慢性疾病訓練了三個不同的模型:預測模型,用以預測在潛在診斷前 3、6 和 12 個月的疾病風險。我們開發了隨機森林模型,並使用 F1 分數和 AUROC 作為效能指標,進行內部驗證,並進一步由專家醫師小組根據植基於醫學知識的推論,評估其臨床相關性。此外,我們討論了將這些模型整合到實用 EMR 系統中的實作方式。除了使用 Shapley 屬性和代理模型來解釋外,我們還引進了一個新的規則工程架構,以增強隨機森林的內在可解釋性。 + +##### **Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data** +2501.13818v1 by Frederik Pahde, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek + +Deep neural networks are increasingly employed in high-stakes medical +applications, despite their tendency for shortcut learning in the presence of +spurious correlations, which can have potentially fatal consequences in +practice. Detecting and mitigating shortcut behavior is a challenging task that +often requires significant labeling efforts from domain experts. 
To alleviate +this problem, we introduce a semi-automated framework for the identification of +spurious behavior from both data and model perspective by leveraging insights +from eXplainable Artificial Intelligence (XAI). This allows the retrieval of +spurious data points and the detection of model circuits that encode the +associated prediction rules. Moreover, we demonstrate how these shortcut +encodings can be used for XAI-based sample- and pixel-level data annotation, +providing valuable information for bias mitigation methods to unlearn the +undesired shortcut behavior. We show the applicability of our framework using +four medical datasets across two modalities, featuring controlled and +real-world spurious correlations caused by data artifacts. We successfully +identify and mitigate these biases in VGG16, ResNet50, and contemporary Vision +Transformer models, ultimately increasing their robustness and applicability +for real-world medical tasks. + +摘要:深度神经网络越来越多地用于高风险医疗应用中,尽管它们在存在虚假相关性的情况下倾向于捷径学习,这在实践中可能产生致命的后果。检测和缓解捷径行为是一项艰巨的任务,通常需要领域专家的大量标记工作。为了缓解这个问题,我们引入了一个半自动框架,用于从数据和模型的角度识别虚假行为,方法是利用可解释人工智能 (XAI) 的见解。这允许检索虚假数据点并检测对关联预测规则进行编码的模型电路。此外,我们演示了如何使用这些捷径编码进行基于 XAI 的样本和像素级数据注释,为偏差缓解方法提供有价值的信息,以消除不需要的捷径行为。我们使用跨越两种方式的四个医学数据集展示了我们框架的适用性,这些数据集具有由数据伪像引起的受控和真实世界虚假相关性。我们成功地识别并减轻了 VGG16、ResNet50 和当代 Vision Transformer 模型中的这些偏差,最终提高了它们的鲁棒性和在真实世界医疗任务中的适用性。 + +##### **Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model** +2501.11094v1 by Mohaiminul Islam Bhuiyan, Nur Shazwani Kamarudin, Nur Hafieza Ismail + +Suicidal ideation detection is crucial for preventing suicides, a leading +cause of death worldwide. Many individuals express suicidal thoughts on social +media, offering a vital opportunity for early detection through advanced +machine learning techniques. 
The identification of suicidal ideation in social +media text is improved by utilising a hybrid framework that integrates +Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory +(BiLSTM), enhanced with an attention mechanism. To enhance the interpretability +of the model's predictions, Explainable AI (XAI) methods are applied, with a +particular focus on SHapley Additive exPlanations (SHAP). At +first, the model managed to reach an accuracy of 92.81%. By applying +fine-tuning and early stopping techniques, the accuracy improved to 94.29%. The +SHAP analysis revealed key features influencing the model's predictions, such +as terms related to mental health struggles. This level of transparency boosts +the model's credibility while helping mental health professionals understand +and trust the predictions. This work highlights the potential for improving the +accuracy and interpretability of detecting suicidal tendencies, making a +valuable contribution to the progress of mental health monitoring systems. It +emphasizes the significance of blending powerful machine learning methods with +explainability to develop reliable and impactful mental health solutions. 
+ +摘要:自殺意念偵測對於預防自殺至關重要,而自殺是全球主要的死亡原因。許多人在社群媒體上表達自殺念頭,這提供了透過進階機器學習技術進行早期偵測的重要機會。透過整合卷積神經網路 (CNN) 和雙向長短期記憶 (BiLSTM) 的混合架構,並加入注意力機制,可以提升在社群媒體文字中辨識自殺意念的能力。為了加強模型預測的可解釋性,我們採用可解釋人工智慧 (XAI) 方法,特別著重於 SHapley 加法解釋 (SHAP)。一開始,模型成功達到 92.81% 的準確度。透過套用微調和早期停止技術,準確度提升至 94.29%。SHAP 分析揭露了影響模型預測的關鍵特徵,例如與心理健康困境相關的詞彙。這種透明度提升了模型的可信度,同時協助心理健康專業人員理解和信賴預測結果。這項工作突顯了提升偵測自殺傾向的準確度和可解釋性的潛力,為心理健康監控系統的進展做出寶貴的貢獻。它強調了將強大的機器學習方法與可解釋性相結合以開發可靠且有影響力的心理健康解決方案的重要性。 + +##### **SEANN: A Domain-Informed Neural Network for Epidemiological Insights** +2501.10273v1 by Jean-Baptiste Guimbaud, Marc Plantevit, Léa Maître, Rémy Cazabet + +In epidemiology, traditional statistical methods such as logistic regression, +linear regression, and other parametric models are commonly employed to +investigate associations between predictors and health outcomes. However, +non-parametric machine learning techniques, such as deep neural networks +(DNNs), coupled with explainable AI (XAI) tools, offer new opportunities for +this task. Despite their potential, these methods face challenges due to the +limited availability of high-quality, high-quantity data in this field. To +address these challenges, we introduce SEANN, a novel approach for informed +DNNs that leverages a prevalent form of domain-specific knowledge: Pooled +Effect Sizes (PES). PESs are commonly found in published Meta-Analysis studies, +in different forms, and represent a quantitative form of a scientific +consensus. By direct integration within the learning procedure using a custom +loss, we experimentally demonstrate significant improvements in the +generalizability of predictive performances and the scientific plausibility of +extracted relationships compared to a domain-knowledge agnostic neural network +in a scarce and noisy data setting. 
+ +摘要:在流行病學中,傳統的統計方法,例如邏輯迴歸、線性迴歸和其他參數模型通常用於調查預測因子與健康結果之間的關聯。然而,非參數機器學習技術,例如深度神經網路 (DNN),結合可解釋的 AI (XAI) 工具,為這項任務提供了新的機會。儘管這些方法具有潛力,但由於該領域缺乏高品質、高數量資料,因此這些方法面臨挑戰。為了應對這些挑戰,我們引入了 SEANN,這是一種新穎的方法,用於獲取知識的 DNN,它利用了一種流行的領域特定知識形式:彙總效應量 (PES)。PES 通常以不同的形式出現在已發表的 Meta 分析研究中,並代表科學共識的量化形式。通過使用自訂損失函數直接整合在學習程序中,我們以實驗方式證明了預測效能的概括性以及與從缺乏領域知識的神經網路中提取的關係相比,科學合理性的顯著提升,且是在稀少且有雜訊的資料設定中。 + +##### **Artificial Intelligence-Driven Clinical Decision Support Systems** +2501.09628v2 by Muhammet Alkan, Idris Zakariyya, Samuel Leighton, Kaushik Bhargav Sivangi, Christos Anagnostopoulos, Fani Deligianni + +As artificial intelligence (AI) becomes increasingly embedded in healthcare +delivery, this chapter explores the critical aspects of developing reliable and +ethical Clinical Decision Support Systems (CDSS). Beginning with the +fundamental transition from traditional statistical models to sophisticated +machine learning approaches, this work examines rigorous validation strategies +and performance assessment methods, including the crucial role of model +calibration and decision curve analysis. The chapter emphasizes that creating +trustworthy AI systems in healthcare requires more than just technical +accuracy; it demands careful consideration of fairness, explainability, and +privacy. The challenge of ensuring equitable healthcare delivery through AI is +stressed, discussing methods to identify and mitigate bias in clinical +predictive models. The chapter then delves into explainability as a cornerstone +of human-centered CDSS. This focus reflects the understanding that healthcare +professionals must not only trust AI recommendations but also comprehend their +underlying reasoning. The discussion advances in an analysis of privacy +vulnerabilities in medical AI systems, from data leakage in deep learning +models to sophisticated attacks against model explanations. 
The text explores +privacy-preservation strategies such as differential privacy and federated +learning, while acknowledging the inherent trade-offs between privacy +protection and model performance. This progression, from technical validation +to ethical considerations, reflects the multifaceted challenges of developing +AI systems that can be seamlessly and reliably integrated into daily clinical +practice while maintaining the highest standards of patient care and data +protection. + +摘要:隨著人工智慧(AI)在醫療保健服務中日益普及,本章探討了開發可靠且符合道德的臨床決策支援系統 (CDSS) 的關鍵面向。從傳統統計模型轉變到複雜機器學習方法的基本原理開始,這項工作探討了嚴謹的驗證策略和效能評估方法,包括模型校準和決策曲線分析的關鍵角色。本章強調,在醫療保健中建立值得信賴的 AI 系統不僅需要技術準確性;它需要仔細考量公平性、可解釋性和隱私。本章強調了透過 AI 確保公平醫療保健服務的挑戰,並討論了識別和減輕臨床預測模型中偏差的方法。接著,本章深入探討可解釋性作為以人為中心的 CDSS 的基石。這種關注反映了對醫療保健專業人員不僅必須信任 AI 建議,還必須理解其背後推理的理解。討論進展到對醫療 AI 系統中隱私漏洞的分析,從深度學習模型中的資料外洩到針對模型解釋的複雜攻擊。本文探討了隱私保護策略,例如差分隱私和聯合學習,同時承認隱私保護和模型效能之間的固有權衡。從技術驗證到道德考量,這種進展反映了開發 AI 系統的多方面挑戰,這些系統可以無縫且可靠地整合到日常臨床實務中,同時維持最高標準的患者照護和資料保護。 + +##### **MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis** +2501.06887v1 by Sadia Kamal, Tim Oates + +As deep learning models gain traction in medical data, ensuring transparent +and trustworthy decision-making is essential. In skin cancer diagnosis, while +advancements in lesion detection and classification have improved accuracy, the +black-box nature of these methods poses challenges in understanding their +decision processes, leading to trust issues among physicians. This study +leverages the CLIP (Contrastive Language-Image Pretraining) model, trained on +different skin lesion datasets, to capture meaningful relationships between +visual features and diagnostic criteria terms. To further enhance transparency, +we propose a method called MedGrad E-CLIP, which builds on gradient-based +E-CLIP by incorporating a weighted entropy mechanism designed for complex +medical imaging like skin lesions. 
This approach highlights critical image +regions linked to specific diagnostic descriptions. The developed integrated +pipeline not only classifies skin lesions by matching corresponding +descriptions but also adds an essential layer of explainability developed +especially for medical data. By visually explaining how different features in +an image relate to diagnostic criteria, this approach demonstrates the +potential of advanced vision-language models in medical image analysis, +ultimately improving transparency, robustness, and trust in AI-driven +diagnostic systems. + +摘要:随着深度学习模型在医学数据中获得关注,确保透明且值得信赖的决策至关重要。在皮肤癌诊断中,虽然病灶检测和分类的进步提高了准确性,但这些方法的黑盒性质对理解其决策过程构成了挑战,导致医生之间的信任问题。本研究利用在不同皮肤病变数据集上训练的 CLIP(对比语言图像预训练)模型,以捕捉视觉特征和诊断标准术语之间的有意义关系。为了进一步提高透明度,我们提出了一种名为 MedGrad E-CLIP 的方法,该方法通过结合专为皮肤病变等复杂医学影像设计的加权熵机制,建立在基于梯度的 E-CLIP 之上。此方法突出了与特定诊断描述相关联的关键图像区域。开发的集成管道不仅通过匹配相应的描述对皮肤病变进行分类,还添加了一层专门为医学数据开发的基本可解释性。通过直观地解释图像中不同特征与诊断标准的关系,这种方法展示了高级视觉语言模型在医学图像分析中的潜力,最终提高了透明度、稳健性和对人工智能驱动的诊断系统的信任。 + +##### **Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis** +2501.02891v1 by Mary Ogbuka Kenneth, Foaad Khosmood, Abbas Edalat + +Humour styles can have either a negative or a positive impact on well-being. +Given the importance of these styles to mental health, significant research has +been conducted on their automatic identification. However, the automated +machine learning models used for this purpose are black boxes, making their +prediction decisions opaque. Clarity and transparency are vital in the field of +mental health. This paper presents an explainable AI (XAI) framework for +understanding humour style classification, building upon previous work in +computational humour analysis. Using the best-performing single model +(ALI+XGBoost) from prior research, we apply comprehensive XAI techniques to +analyse how linguistic, emotional, and semantic features contribute to humour +style classification decisions. 
Our analysis reveals distinct patterns in how +different humour styles are characterised and misclassified, with particular +emphasis on the challenges in distinguishing affiliative humour from other +styles. Through detailed examination of feature importance, error patterns, and +misclassification cases, we identify key factors influencing model decisions, +including emotional ambiguity, context misinterpretation, and target +identification. The framework demonstrates significant utility in understanding +model behaviour, achieving interpretable insights into the complex interplay of +features that define different humour styles. Our findings contribute to both +the theoretical understanding of computational humour analysis and practical +applications in mental health, content moderation, and digital humanities +research. + +摘要:幽默風格對幸福感可能產生負面或正面的影響。 +鑑於這些風格對心理健康的重要性,已經對其自動識別進行了大量研究。然而,用於此目的的自動機器學習模型是黑盒子,使得其預測決策不透明。清晰度和透明度在心理健康領域至關重要。本文提出了一個可解釋的 AI (XAI) 框架,用於理解幽默風格分類,建立在計算幽默分析的先前工作之上。使用先前研究中表現最好的單一模型 (ALI+XGBoost),我們應用全面的 XAI 技術來分析語言、情緒和語義特徵如何影響幽默風格分類決策。我們的分析揭示了不同幽默風格如何被表徵和錯誤分類的不同模式,特別強調了區分聯屬幽默與其他風格的挑戰。通過仔細檢查特徵重要性、錯誤模式和錯誤分類案例,我們確定了影響模型決策的關鍵因素,包括情緒模糊、情境誤解和目標識別。該框架展示了在理解模型行為方面的顯著效用,實現了對定義不同幽默風格的特徵之間複雜相互作用的可解釋見解。我們的發現有助於計算幽默分析的理論理解和心理健康、內容審核和數字人文研究中的實際應用。 + +##### **The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support** +2412.20068v1 by Alessandro De Grandi, Federico Ravenda, Andrea Raballo, Fabio Crestani + +The increasing demand for mental health services has highlighted the need for +innovative solutions, particularly in the realm of psychological conversational +AI, where the availability of sensitive data is scarce. 
In this work, we +explored the development of a system tailored for mental health support with a +novel approach to psychological assessment based on explainable emotional +profiles in combination with empathetic conversational models, offering a +promising tool for augmenting traditional care, particularly where immediate +expertise is unavailable. Our work can be divided into two main parts, +intrinsically connected to each other. First, we present RACLETTE, a +conversational system that demonstrates superior emotional accuracy compared to +state-of-the-art benchmarks in both understanding users' emotional states and +generating empathetic responses during conversations, while progressively +building an emotional profile of the user through their interactions. Second, +we show how the emotional profiles of a user can be used as interpretable +markers for mental health assessment. These profiles can be compared with +characteristic emotional patterns associated with different mental disorders, +providing a novel approach to preliminary screening and support. + +摘要:隨著對心理健康服務需求的增加,凸顯了創新解決方案的需求,特別是在心理對話式人工智慧領域,那裡缺乏敏感資料。在這項工作中,我們探索了開發一個針對心理健康支持的系統,採用一種基於可解釋的情緒特徵的新方法進行心理評估,結合同理心對話模式,提供了一個有前途的工具,用於擴充傳統照護,特別是在無法立即獲得專業知識的情況下。我們的工作可以分為兩個主要部分,彼此內在相關。首先,我們展示了 RACLETTE,一個對話系統,與最先進的基準相比,在理解使用者情緒狀態和在對話中產生同理心回應方面表現出優越的情緒準確性,同時透過他們的互動逐漸建立使用者的情緒特徵。其次,我們展示了使用者的情緒特徵如何可用作心理健康評估的可解釋標記。這些特徵可以與與不同心理疾病相關的典型情緒模式進行比較,提供了一種初步篩選和支持的新方法。 + +##### **A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation** +2412.19688v1 by Jana Zakall, Birgit Pohn, Antonia Graf, Daniel Kovatchki, Arezoo Borji, Ragib Shahriar Islam, Hossam Haick, Heinz Strohmer, Sepideh Hatamikia + +Artificial intelligence (AI) has emerged as a powerful tool to enhance +decision-making and optimize treatment protocols in in vitro fertilization +(IVF). In particular, AI shows significant promise in supporting +decision-making during the ovarian stimulation phase of the IVF process. 
This +review evaluates studies focused on the applications of AI combined with +medical imaging in ovarian stimulation, examining methodologies, outcomes, and +current limitations. Our analysis of 13 studies on this topic reveals that +while AI algorithms demonstrated notable potential in predicting +optimal hormonal dosages, trigger timing, and oocyte retrieval outcomes, the +medical imaging data utilized predominantly came from two-dimensional (2D) +ultrasound, which mainly involved basic quantifications, such as follicle size +and number, with limited use of direct feature extraction or advanced image +analysis techniques. This points to an underexplored opportunity where advanced +image analysis approaches, such as deep learning, and more diverse imaging +modalities, like three-dimensional (3D) ultrasound, could unlock deeper +insights. Additionally, the lack of explainable AI (XAI) in most studies raises +concerns about the transparency and traceability of AI-driven decisions - key +factors for clinical adoption and trust. Furthermore, many studies relied on +single-center designs and small datasets, which limit the generalizability of +their findings. This review highlights the need for integrating advanced +imaging analysis techniques with explainable AI methodologies, as well as the +importance of leveraging multicenter collaborations and larger datasets. +Addressing these gaps has the potential to enhance ovarian stimulation +management, paving the way for efficient, personalized, and data-driven +treatment pathways that improve IVF outcomes. 
+ +摘要:人工智慧(AI)已成為增強體外受精(IVF)決策制定和優化治療方案的強大工具。特別是,AI 在支持 IVF 過程中卵巢刺激階段的決策制定方面顯示出顯著的前景。本綜述評估了專注於 AI 結合卵巢刺激中的醫學影像應用、檢驗方法、結果和當前限制的研究。我們對 13 項關於此主題的研究分析顯示,雖然 AI 演算法在預測最佳荷爾蒙劑量、觸發時機和卵子取出結果方面表現出顯著的潛力,但所利用的醫學影像數據主要來自於二次元(2D)超音波,而二次元超音波主要涉及基本量化,例如濾泡大小和數量,且有限使用直接特徵提取或進階影像分析技術。這指向一個尚未探索的機會,例如深度學習等進階影像分析方法,以及更多元的影像模式,例如三維(3D)超音波,可以解鎖更深入的見解。此外,大多數研究缺乏可解釋 AI(XAI),這引起了人們對 AI 驅動決策的透明度和可追溯性的擔憂,而透明度和可追溯性是臨床採用和信任的關鍵因素。此外,許多研究依賴於單中心設計和小型數據集,這限制了其發現的普遍性。本綜述強調了將進階影像分析技術與可解釋 AI 方法整合起來的必要性,以及利用多中心合作和大型數據集的重要性。解決這些差距有可能增強卵巢刺激管理,為有效、個人化和數據驅動的治療途徑鋪平道路,進而改善 IVF 結果。 + +##### **Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models** +2412.17527v1 by Badaru I. Olumuyiwa, The Anh Han, Zia U. Shamszaman + +This research presents an innovative approach to cancer diagnosis and +prediction using explainable Artificial Intelligence (XAI) and deep learning +techniques. With cancer causing nearly 10 million deaths globally in 2020, +early and accurate diagnosis is crucial. Traditional methods often face +challenges in cost, accuracy, and efficiency. Our study develops an AI model +that provides precise outcomes and clear insights into its decision-making +process, addressing the "black box" problem of deep learning models. By +employing XAI techniques, we enhance interpretability and transparency, +building trust among healthcare professionals and patients. Our approach +leverages neural networks to analyse extensive datasets, identifying patterns +for cancer detection. This model has the potential to revolutionise diagnosis +by improving accuracy, accessibility, and clarity in medical decision-making, +possibly leading to earlier detection and more personalised treatment +strategies. Furthermore, it could democratise access to high-quality +diagnostics, particularly in resource-limited settings, contributing to global +health equity. 
The model's applications extend beyond cancer diagnosis, +potentially transforming various aspects of medical decision-making and saving +millions of lives worldwide. + +摘要:本研究提出了一個創新的癌症診斷和預測方法,使用可解釋的人工智慧 (XAI) 和深度學習技術。由於癌症在 2020 年造成全球近 1,000 萬人死亡,因此早期準確的診斷至關重要。傳統方法通常面臨成本、準確性和效率方面的挑戰。我們的研究開發了一個 AI 模型,它提供精確的結果並清楚地了解其決策過程,解決了深度學習模型的「黑箱」問題。通過採用 XAI 技術,我們增強了解釋性和透明度,在醫療專業人員和患者之間建立信任。我們的做法利用神經網路分析廣泛的數據集,識別癌症檢測模式。這個模型有可能通過提高醫療決策的準確性、可及性和清晰度來革新診斷,可能導致更早的檢測和更個性化的治療策略。此外,它可以使更多人獲得高品質的診斷,特別是在資源有限的環境中,有助於全球健康公平。該模型的應用範圍不僅限於癌症診斷,還可能轉變醫療決策的各個方面,並拯救全球數百萬人的生命。 + +##### **Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG** +2412.16086v2 by Hasan Md Tusfiqur Alam, Devansh Srivastav, Md Abdul Kadir, Daniel Sonntag + +Deep learning has advanced medical image classification, but interpretability +challenges hinder its clinical adoption. This study enhances interpretability +in Chest X-ray (CXR) classification by using concept bottleneck models (CBMs) +and a multi-agent Retrieval-Augmented Generation (RAG) system for report +generation. By modeling relationships between visual features and clinical +concepts, we create interpretable concept vectors that guide a multi-agent RAG +system to generate radiology reports, enhancing clinical relevance, +explainability, and transparency. Evaluation of the generated reports using an +LLM-as-a-judge confirmed the interpretability and clinical utility of our +model's outputs. On the COVID-QU dataset, our model achieved 81% classification +accuracy and demonstrated robust report generation performance, with five key +metrics ranging between 84% and 90%. This interpretable multi-agent framework +bridges the gap between high-performance AI and the explainability required for +reliable AI-driven CXR analysis in clinical settings. Our code is available at +https://github.com/tifat58/IRR-with-CBM-RAG.git. 
+ +摘要:深度學習已提升醫學影像分類,但可解釋性挑戰阻礙其臨床應用。本研究透過使用概念瓶頸模型 (CBM) 和多代理檢索增強生成 (RAG) 系統進行報告生成,來增強胸部 X 光 (CXR) 分類的可解釋性。透過建模視覺特徵與臨床概念之間的關係,我們建立可解釋的概念向量,引導多代理 RAG 系統生成放射報告,增強臨床相關性、可解釋性和透明度。使用 LLM 作為評審員對生成報告進行評估,確認了我們模型輸出的可解釋性和臨床效用。在 COVID-QU 資料集上,我們的模型達到了 81% 的分類準確率,並展示了穩健的報告生成效能,五項關鍵指標介於 84% 至 90% 之間。這個可解釋的多代理架構彌合了高性能 AI 與臨床環境中可靠的 AI 驅動 CXR 分析所需的解釋性之間的差距。我們的程式碼可於 https://github.com/tifat58/IRR-with-CBM-RAG.git 取得。 + +##### **Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models** +2412.15748v1 by Shamus Sim, Tyrone Chen + +Background: Despite the current ubiquity of Large Language Models (LLMs) +across the medical domain, there is a surprising lack of studies which address +their reasoning behaviour. We emphasise the importance of understanding +reasoning behaviour as opposed to high-level prediction accuracies, since it is +equivalent to explainable AI (XAI) in this context. In particular, achieving +XAI in medical LLMs used in the clinical domain will have a significant impact +across the healthcare sector. Results: Therefore, we define the concept of +reasoning behaviour in the specific context of medical LLMs. We then categorise +and discuss the current state of the art of methods which evaluate reasoning +behaviour in medical LLMs. Finally, we propose theoretical frameworks which can +empower medical professionals or machine learning engineers to gain insight +into the low-level reasoning operations of these previously obscure models. 
+Conclusion: The subsequent increased transparency and trust in medical machine +learning models by clinicians as well as patients will accelerate the +integration, application as well as further development of medical AI for the +healthcare system as a whole. + +摘要:背景:儘管大型語言模型 (LLM) 目前在醫療領域無所不在,但令人驚訝的是,探討其推理行為的研究卻相當缺乏。我們強調了解推理行為而非高層級的預測準確度非常重要,因為在這種情況下,這等同於可解釋 AI (XAI)。尤其是在臨床領域中使用的醫療 LLM 中實現 XAI,將對整個醫療保健產業產生重大影響。結果:因此,我們在醫療 LLM 的特定背景下定義了推理行為的概念。接著我們分類並探討當前評估醫療 LLM 中推理行為的方法的最新技術。最後,我們提出理論架構,讓醫療專業人員或機器學習工程師得以深入了解這些先前模糊模型的低層級推理運算。結論:臨床醫生和患者對醫療機器學習模型的透明度和信任度隨之提升,將加速醫療 AI 在整個醫療保健系統中的整合、應用和進一步發展。 + +##### **Cognition Chain for Explainable Psychological Stress Detection on Social Media** +2412.14009v1 by Xin Wang, Boyan Gao, Yi Dai, Lei Cao, Liang Zhao, Yibo Yang, David Clifton + +Stress is a pervasive global health issue that can lead to severe mental +health problems. Early detection offers timely intervention and prevention of +stress-related disorders. The current early detection models perform "black +box" inference suffering from limited explainability and trust which blocks the +real-world clinical application. Thanks to the generative properties introduced +by the Large Language Models (LLMs), the decision and the prediction from such +models are semi-interpretable through the corresponding description. However, +the existing LLMs are mostly trained for general purposes without the guidance +of psychological cognitive theory. To this end, we first highlight the +importance of prior theory with the observation of performance boosted by the +chain-of-thoughts tailored for stress detection. This method termed Cognition +Chain explicates the generation of stress through a step-by-step cognitive +perspective based on cognitive appraisal theory with a progress pipeline: +Stimulus $\rightarrow$ Evaluation $\rightarrow$ Reaction $\rightarrow$ Stress +State, guiding LLMs to provide comprehensive reasoning explanations. 
We further +study the benefits brought by the proposed Cognition Chain format by utilising +it as a synthetic dataset generation template for LLMs instruction-tuning and +introduce CogInstruct, an instruction-tuning dataset for stress detection. This +dataset is developed using a three-stage self-reflective annotation pipeline +that enables LLMs to autonomously generate and refine instructional data. By +instruction-tuning Llama3 with CogInstruct, we develop CogLLM, an explainable +stress detection model. Evaluations demonstrate that CogLLM achieves +outstanding performance while enhancing explainability. Our work contributes a +novel approach by integrating cognitive theories into LLM reasoning processes, +offering a promising direction for future explainable AI research. + +摘要:壓力是一個普遍的全球性健康問題,可能會導致嚴重的精神 +健康問題。早期發現提供及時的干預和預防 +壓力相關疾病。目前的早期發現模型執行「黑 +盒子」推論,存在可解釋性和信任度有限的問題,阻礙了 +現實世界的臨床應用。多虧了大型語言模型 (LLM) 引入的生成屬性,此類 +模型的決策和預測通過對應描述具有半可解釋性。然而, +現有的 LLM 主要針對一般用途進行訓練,沒有心理認知理論的指導。為此,我們首先強調 +先驗理論的重要性,並觀察到針對壓力檢測量身定制的思想鏈提升了性能。這種方法稱為認知 +鏈通過基於認知評估理論的循序漸進的認知視角闡明了壓力的產生,並具有進度管道: +刺激 $\rightarrow$ 評估 $\rightarrow$ 反應 $\rightarrow$ 壓力 +狀態,指導 LLM 提供全面的推理解釋。我們進一步 +通過將其用作 LLM 指令調整的合成數據集生成模板來研究所提出的認知鏈格式帶來的優點,並介紹 CogInstruct,這是一個針對壓力檢測的指令調整數據集。這個 +數據集是使用一個三階段的自省標註管道開發的,使 LLM 能夠自主生成和優化指令數據。通過 +使用 CogInstruct 對 Llama3 進行指令調整,我們開發了 CogLLM,這是一個可解釋的 +壓力檢測模型。評估表明,CogLLM 在提高可解釋性的同時實現了出色的性能。我們的研究通過將認知理論整合到 LLM 推理過程中,提出了一種新穎的方法, +為未來的可解釋人工智能研究提供了一個有希望的方向。 + +##### **2-Factor Retrieval for Improved Human-AI Decision Making in Radiology** +2412.00372v1 by Jim Solomon, Laleh Jalilian, Alexander Vilesov, Meryl Mathew, Tristan Grogan, Arash Bedayat, Achuta Kadambi + +Human-machine teaming in medical AI requires us to understand to what degree +a trained clinician should weigh AI predictions. 
While previous work has shown +the potential of AI assistance at improving clinical predictions, existing +clinical decision support systems either provide no explainability of their +predictions or use techniques like saliency and Shapley values, which do not +allow for physician-based verification. To address this gap, this study +compares previously used explainable AI techniques with a newly proposed +technique termed '2-factor retrieval (2FR)', which is a combination of +interface design and search retrieval that returns similarly labeled data +without processing this data. This results in a 2-factor security blanket +where: (a) correct images need to be retrieved by the AI; and (b) humans should +associate the retrieved images with the current pathology under test. We find +that when tested on chest X-ray diagnoses, 2FR leads to increases in clinician +accuracy, with particular improvements when clinicians are radiologists and +have low confidence in their decision. Our results highlight the importance of +understanding how different modes of human-AI decision making may impact +clinician accuracy in clinical decision support systems. 
+ +摘要:人機協作在醫療 AI 中,需要我們理解受過訓練的臨床醫生在多大程度上應重視 AI 預測。雖然先前的研究顯示 AI 輔助在改善臨床預測方面的潛力,但現有的臨床決策支援系統,要不就沒有提供預測的可解釋性,要不就是使用像顯著性和 Shapley 值之類的技術,這些技術不允許基於醫生的驗證。為了解決這個差距,本研究將先前使用的可解釋 AI 技術與一種新提出的稱為「2 因子檢索 (2FR)」的技術進行比較,後者是一種介面設計和搜尋檢索的組合,它會傳回標籤相似的資料,而不會處理這些資料。這會產生一個 2 因子安全機制,其中:(a) 正確的影像需要由 AI 檢索;(b) 人類應將檢索的影像與正在測試中的病理聯想起來。我們發現,當在胸部 X 光診斷上進行測試時,2FR 會提高臨床醫生的準確度,特別是在臨床醫生是放射科醫生且對其決策信心不足時,會有顯著的改善。我們的結果強調了理解人機決策的不同模式如何影響臨床醫生在臨床決策支援系統中的準確性的重要性。 + +##### **Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance** +2411.19356v1 by Philipp Brauner, Felix Glawe, Gian Luca Liehner, Luisa Vervier, Martina Ziefle + +Understanding public perception of artificial intelligence (AI) and the +tradeoffs between potential risks and benefits is crucial, as these perceptions +might shape policy decisions, influence innovation trajectories for successful +market strategies, and determine individual and societal acceptance of AI +technologies. Using a representative sample of 1100 participants from Germany, +this study examines mental models of AI. Participants quantitatively evaluated +71 statements about AI's future capabilities (e.g., autonomous driving, medical +care, art, politics, warfare, and societal divides), assessing the expected +likelihood of occurrence, perceived risks, benefits, and overall value. We +present rankings of these projections alongside visual mappings illustrating +public risk-benefit tradeoffs. While many scenarios were deemed likely, +participants often associated them with high risks, limited benefits, and low +overall value. Across all scenarios, 96.4% ($r^2=96.4\%$) of the variance in +value assessment can be explained by perceived risks ($\beta=-.504$) and +perceived benefits ($\beta=+.710$), with no significant relation to expected +likelihood. 
Demographics and personality traits influenced perceptions of +risks, benefits, and overall evaluations, underscoring the importance of +increasing AI literacy and tailoring public information to diverse user needs. +These findings provide actionable insights for researchers, developers, and +policymakers by highlighting critical public concerns and individual factors +essential to align AI development with individual values. + +摘要:了解公眾對人工智慧 (AI) 的認知以及潛在風險與好處之間的權衡至關重要,因為這些認知可能會影響政策決策、影響成功市場策略的創新軌跡,並決定個人和社會對 AI 技術的接受度。本研究使用來自德國的 1100 名參與者的代表性樣本,探討了 AI 的心智模型。參與者對 71 項關於 AI 未來能力的陳述(例如,自動駕駛、醫療保健、藝術、政治、戰爭和社會分歧)進行了定量評估,評估預期的發生可能性、感知風險、好處和整體價值。我們展示了這些預測的排名,並附上視覺化映射,說明了公眾的風險收益權衡。儘管許多場景被認為是可能的,但參與者通常將它們與高風險、有限的好處和低整體價值聯繫起來。在所有場景中,96.4% ($r^2=96.4\%$) 的價值評估差異可以用感知風險 ($\beta=-.504$) 和感知好處 ($\beta=+.710$) 來解釋,與預期的可能性沒有顯著關係。人口統計和人格特質影響了對風險、好處和整體評估的看法,這凸顯了提高 AI 素養和根據不同的使用者需求調整公共資訊的重要性。這些發現通過強調關鍵的公共關注和與個人價值觀一致的 AI 開發必不可少的個人因素,為研究人員、開發人員和政策制定者提供了可行的見解。 + +##### **Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset** +2411.17645v2 by Yujie Dai, Brian Sullivan, Axel Montout, Amy Dillon, Chris Waller, Peter Acs, Rachel Denholm, Philip Williams, Alastair D Hay, Raul Santos-Rodriguez, Andrew Dowsey + +The use of machine learning and AI on electronic health records (EHRs) holds +substantial potential for clinical insight. However, this approach faces +challenges due to data heterogeneity, sparsity, temporal misalignment, and +limited labeled outcomes. In this context, we leverage a linked EHR dataset of +approximately one million de-identified individuals from Bristol, North +Somerset, and South Gloucestershire, UK, to characterize urinary tract +infections (UTIs). We implemented a data pre-processing and curation pipeline +that transforms the raw EHR data into a structured format suitable for +developing predictive models focused on data fairness, accountability and +transparency. 
Given the limited availability and biases of ground truth UTI +outcomes, we introduce a UTI risk estimation framework informed by clinical +expertise to estimate UTI risk across individual patient timelines. Pairwise +XGBoost models are trained using this framework to differentiate UTI risk +categories with explainable AI techniques applied to identify key predictors +and support interpretability. Our findings reveal differences in clinical and +demographic predictors across risk groups. While this study highlights the +potential of AI-driven insights to support UTI clinical decision-making, +further investigation of patient sub-strata and extensive validation are needed +to ensure robustness and applicability in clinical practice. + +摘要:電子健康紀錄 (EHR) 中機器學習和 AI 的使用對於臨床見解具有相當大的潛力。然而,由於資料異質性、稀疏性、時間錯位和標籤結果有限,此方法面臨挑戰。在此背景下,我們利用來自英國布里斯托、北薩默塞特和南格洛斯特郡約一百萬名去識別個人連結的 EHR 資料集,來描述尿路感染 (UTI)。我們實施了將原始 EHR 資料轉換為結構化格式的資料前處理和整理管線,適合開發專注於資料公平性、問責制和透明度的預測模型。鑑於 UTI 真實結果的可用性有限和偏差,我們引入了由臨床專業知識告知的 UTI 風險評估架構,以估計個別患者時間軸上的 UTI 風險。成對的 XGBoost 模型使用此架構進行訓練,以區分 UTI 風險類別,並應用可解釋的 AI 技術來識別關鍵預測因子並支持可解釋性。我們的研究結果揭示了不同風險群組在臨床和人口統計預測因子上的差異。雖然這項研究強調了 AI 驅動見解在支援 UTI 臨床決策制定方面的潛力,但仍需要進一步調查患者子群體和廣泛驗證,以確保在臨床實務中的穩健性和適用性。 + +##### **Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care** +2411.11774v1 by Jeffrey N. Clark, Matthew Wragg, Emily Nielsen, Miquel Perello-Nieto, Nawid Keshtmand, Michael Ambler, Shiv Sharma, Christopher P. Bourdeaux, Amberly Brigden, Raul Santos-Rodriguez + +There is a growing need to understand how digital systems can support +clinical decision-making, particularly as artificial intelligence (AI) models +become increasingly complex and less human-interpretable. This complexity +raises concerns about trustworthiness, impacting safe and effective adoption of +such technologies. 
Improved understanding of decision-making processes and +requirements for explanations coming from decision support tools is a vital +component in providing effective explainable solutions. This is particularly +relevant in the data-intensive, fast-paced environments of intensive care units +(ICUs). To explore these issues, group interviews were conducted with seven ICU +clinicians, representing various roles and experience levels. Thematic analysis +revealed three core themes: (T1) ICU decision-making relies on a wide range of +factors, (T2) the complexity of patient state is challenging for shared +decision-making, and (T3) requirements and capabilities of AI decision support +systems. We include design recommendations from clinical input, providing +insights to inform future AI systems for intensive care. + +摘要:隨著人工智慧 (AI) 模型變得越來越複雜,且越來越難以被人理解,了解數位系統如何支援臨床決策的需求也日益增加。這種複雜性引發了對可信度的疑慮,影響了此類技術的安全且有效採用。改善對決策制定流程的理解,以及對決策支援工具所提供說明的要求,是提供有效可解釋解決方案的重要組成部分。這在資料密集、快節奏的加護病房 (ICU) 環境中特別相關。為了探討這些問題,對七位 ICU 臨床醫師進行了小組訪談,這些醫師代表了不同的角色和經驗層級。主題分析揭露了三個核心主題:(T1) ICU 決策制定依賴於廣泛的因素,(T2) 病患狀態的複雜性對共同決策制定構成挑戰,以及 (T3) AI 決策支援系統的要求和能力。我們納入了臨床輸入的設計建議,提供見解以提供資訊給未來用於加護的 AI 系統。 + +##### **Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning** +2411.10255v1 by Mohammed Yaseen Jabarulla, Theodor Uden, Thomas Jack, Philipp Beerbaum, Steffen Oeltze-Jafra + +Pediatric heart diseases present a broad spectrum of congenital and acquired +diseases. More complex congenital malformations require a differentiated and +multimodal decision-making process, usually including echocardiography as a +central imaging method. Artificial intelligence (AI) offers considerable +promise for clinicians by facilitating automated interpretation of pediatric +echocardiography data. 
However, adapting AI technologies for pediatric +echocardiography analysis has challenges such as limited public data +availability, data privacy, and AI model transparency. Recently, researchers +have focused on disruptive technologies, such as federated learning (FL) and +explainable AI (XAI), to improve automatic diagnostic and decision support +workflows. This study offers a comprehensive overview of the limitations and +opportunities of AI in pediatric echocardiography, emphasizing the synergistic +workflow and role of XAI and FL, identifying research gaps, and exploring +potential future developments. Additionally, three relevant clinical use cases +demonstrate the functionality of XAI and FL with a focus on (i) view +recognition, (ii) disease classification, (iii) segmentation of cardiac +structures, and (iv) quantitative assessment of cardiac function. + +摘要:小兒心臟疾病呈現先天性與後天性疾病的廣泛光譜。較複雜的先天性畸形需要一個差異化且多模式的決策過程,通常包括超音波檢查作為主要的影像方法。人工智慧 (AI) 為臨床醫生提供了相當大的希望,因為它可以促進小兒超音波檢查資料的自動化解讀。然而,將人工智慧技術應用於小兒超音波檢查分析有許多挑戰,例如有限的公開資料可用性、資料隱私和人工智慧模型透明度。最近,研究人員專注於破壞性技術,例如聯合學習 (FL) 和可解釋人工智慧 (XAI),以改善自動診斷和決策支援工作流程。本研究提供了人工智慧在小兒超音波檢查中的限制和機會的全面概述,強調了 XAI 和 FL 的協同工作流程和角色,找出研究差距並探討潛在的未來發展。此外,三個相關的臨床使用案例展示了 XAI 和 FL 的功能,重點在於 (i) 檢視辨識、(ii) 疾病分類、(iii) 心臟結構分割和 (iv) 心臟功能的量化評估。 + +##### **Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering** +2411.00916v2 by Mehdi Hosseini Chagahi, Saeed Mohammadi Dashtaki, Niloufar Delfan, Nadia Mohammadi, Alireza Samari, Behzad Moshiri, Md. Jalil Piran, Oliver Faust + +Osteoporosis is a common condition that increases fracture risk, especially +in older adults. Early diagnosis is vital for preventing fractures, reducing +treatment costs, and preserving mobility. However, healthcare providers face +challenges like limited labeled data and difficulties in processing medical +images. 
This study presents a novel multi-modal learning framework that +integrates clinical and imaging data to improve diagnostic accuracy and model +interpretability. The model utilizes three pre-trained networks (VGG19, +InceptionV3, and ResNet50) to extract deep features from X-ray images. These +features are transformed using PCA to reduce dimensionality and focus on the +most relevant components. A clustering-based selection process identifies the +most representative components, which are then combined with preprocessed +clinical data and processed through a fully connected network (FCN) for final +classification. A feature importance plot highlights key variables, showing +that Medical History, BMI, and Height were the main contributors, emphasizing +the significance of patient-specific data. While imaging features were +valuable, they had lower importance, indicating that clinical data are crucial +for accurate predictions. This framework promotes precise and interpretable +predictions, enhancing transparency and building trust in AI-driven diagnoses +for clinical integration. + +摘要:骨質疏鬆症是一種常見的疾病,會增加骨折的風險,特別是老年人。早期診斷對於預防骨折、降低治療成本和維持行動能力至關重要。然而,醫療保健提供者面臨著標記數據有限和處理醫學影像困難等挑戰。本研究提出了一個新穎的多模式學習框架,該框架整合了臨床和影像數據,以提高診斷準確性和模型可解釋性。該模型利用三個預訓練的網路,VGG19、InceptionV3 和 ResNet50,從 X 射線影像中提取深度特徵。這些特徵使用 PCA 轉換以降低維度並專注於最相關的組成部分。基於聚類的選擇過程識別出最具代表性的組成部分,然後將這些組成部分與預處理的臨床數據結合,並通過全連接網路 (FCN) 進行最終分類。特徵重要性圖突出了關鍵變數,表明病史、BMI 和身高是主要貢獻因素,強調了患者特定數據的重要性。雖然影像特徵很有價值,但它們的重要性較低,這表明臨床數據對於準確預測至關重要。此框架促進了準確且可解釋的預測,提高了透明度,並建立了對 AI 驅動診斷在臨床整合中的信任。 + +##### **A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection** +2410.19898v1 by Muath Alsuhaibani, Ali Pourramezan Fard, Jian Sun, Farida Far Poor, Peter S. Pressman, Mohammad H. Mahoor + +This review paper explores recent advances in deep learning approaches for +non-invasive cognitive impairment detection.
We examine various non-invasive +indicators of cognitive decline, including speech and language, facial, and +motoric mobility. The paper provides an overview of relevant datasets, +feature-extracting techniques, and deep-learning architectures applied to this +domain. We have analyzed the performance of different methods across modalities +and observed that speech and language-based methods generally achieved the +highest detection performance. Studies combining acoustic and linguistic +features tended to outperform those using a single modality. Facial analysis +methods showed promise for visual modalities but were less extensively studied. +Most papers focused on binary classification (impaired vs. non-impaired), with +fewer addressing multi-class or regression tasks. Transfer learning and +pre-trained language models emerged as popular and effective techniques, +especially for linguistic analysis. Despite significant progress, several +challenges remain, including data standardization and accessibility, model +explainability, longitudinal analysis limitations, and clinical adaptation. +Lastly, we propose future research directions, such as investigating +language-agnostic speech analysis methods, developing multi-modal diagnostic +systems, and addressing ethical considerations in AI-assisted healthcare. By +synthesizing current trends and identifying key obstacles, this review aims to +guide further development of deep learning-based cognitive impairment detection +systems to improve early diagnosis and ultimately patient outcomes. 
+ +摘要:本篇評論探討了深度學習方法在非侵入式認知功能障礙檢測上的最新進展。我們檢視了各種非侵入式的認知衰退指標,包括語音和語言、面部和運動機能。本文概述了與此領域相關的資料集、特徵提取技術和深度學習架構。我們分析了不同方法在不同方式上的表現,並觀察到基於語音和語言的方法通常能達到最高的檢測表現。結合聲學和語言特徵的研究往往優於使用單一方式的研究。面部分析方法顯示出視覺方式的潛力,但研究較少。大多數論文專注於二元分類(受損與未受損),較少探討多類或回歸任務。遷移學習和預訓練語言模型已成為流行且有效的技術,特別是對於語言分析。儘管取得了重大進展,但仍存在一些挑戰,包括資料標準化和可及性、模型可解釋性、縱向分析限制和臨床適應性。最後,我們提出了未來的研究方向,例如調查與語言無關的語音分析方法、開發多模式診斷系統,以及解決人工智慧輔助醫療保健中的倫理考量。透過綜合目前的趨勢和找出關鍵障礙,本篇評論旨在引導深度學習為基礎的認知功能障礙檢測系統的進一步發展,以改善早期診斷,並最終改善患者的治療結果。 + +##### **An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems** +2410.17504v1 by Shruthi Chari + +Explainable Artificial Intelligence (AI) focuses on helping humans understand +the working of AI systems or their decisions and has been a cornerstone of AI +for decades. Recent research in explainability has focused on explaining the +workings of AI models or model explainability. There have also been several +position statements and review papers detailing the needs of end-users for +user-centered explainability but fewer implementations. Hence, this thesis +seeks to bridge some gaps between model and user-centered explainability. We +create an explanation ontology (EO) to represent literature-derived explanation +types via their supporting components. We implement a knowledge-augmented +question-answering (QA) pipeline to support contextual explanations in a +clinical setting. Finally, we are implementing a system to combine explanations +from different AI methods and data modalities. Within the EO, we can represent +fifteen different explanation types, and we have tested these representations +in six exemplar use cases. We find that knowledge augmentations improve the +performance of base large language models in the contextualized QA, and the +performance is variable across disease groups. In the same setting, clinicians +also indicated that they prefer to see actionability as one of the main foci in +explanations.
In our explanations combination method, we plan to use similarity +metrics to determine the similarity of explanations in a chronic disease +detection setting. Overall, through this thesis, we design methods that can +support knowledge-enabled explanations across different use cases, accounting +for the methods in today's AI era that can generate the supporting components +of these explanations and domain knowledge sources that can enhance them. + +摘要:可解釋人工智慧(AI)專注於協助人類了解 AI 系統運作或其決策,數十年來一直是 AI 的基石。最近的可解釋性研究專注於解釋 AI 模型或模型可解釋性的運作。也有幾份立場聲明和評論論文詳細說明了最終使用者對以使用者為中心的可解釋性的需求,但實作較少。因此,本論文旨在彌補模型和以使用者為中心的可解釋性之間的一些差距。我們建立一個解釋本體(EO)以透過其支援元件來表示從文獻中衍生的解釋類型。我們實作一個知識增強的問答(QA)管線,以在臨床環境中支援情境解釋。最後,我們正在實作一個系統,以結合來自不同 AI 方法和資料模式的解釋。在 EO 中,我們可以表示 15 種不同的解釋類型,並且我們已在六個範例使用案例中測試這些表示。我們發現,知識增強改善了基礎大型語言模型在情境化 QA 中的效能,並且效能因疾病群組而異。在相同的環境中,臨床醫生也表示他們希望將可操作性視為解釋中的主要焦點之一。在我們的解釋組合方法中,我們計畫使用相似性指標來確定慢性病偵測環境中解釋的相似性。總體而言,透過本論文,我們設計了可以在不同使用案例中支援知識啟用解釋的方法,考量到當今 AI 時代中可以產生這些解釋的支援元件和可以增強這些解釋的領域知識來源的方法。 + +##### **Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study** +2410.16879v1 by Lukas Hughes-Noehrer, Leda Channer, Gabriel Strain, Gregory Yates, Richard Body, Caroline Jay + +Objectives: To investigate clinicians' attitudes towards current automated +interpretation of ECG and novel AI technologies and their perception of +computer-assisted interpretation. Materials and Methods: We conducted a series +of interviews with clinicians in the UK. Our study: (i) explores the potential +for AI, specifically future 'human-like' computing approaches, to facilitate +ECG interpretation and support clinical decision making, and (ii) elicits their +opinions about the importance of explainability and trustworthiness of AI +algorithms. 
Results: We performed inductive thematic analysis on interview +transcriptions from 23 clinicians and identified the following themes: (i) a +lack of trust in current systems, (ii) positive attitudes towards future AI +applications and requirements for these, (iii) the relationship between the +accuracy and explainability of algorithms, and (iv) opinions on education, +possible deskilling, and the impact of AI on clinical competencies. Discussion: +Clinicians do not trust current computerised methods, but welcome future 'AI' +technologies. Where clinicians trust future AI interpretation to be accurate, +they are less concerned that it is explainable. They also preferred ECG +interpretation that demonstrated the results of the algorithm visually. Whilst +clinicians do not fear job losses, they are concerned about deskilling and the +need to educate the workforce to use AI responsibly. Conclusion: Clinicians are +positive about the future application of AI in clinical decision-making. +Accuracy is a key factor of uptake and visualisations are preferred over +current computerised methods. This is viewed as a potential means of training +and upskilling, in contrast to the deskilling that automation might be +perceived to bring. + +摘要:目的:調查臨床醫生對目前自動化心電圖解讀和新的人工智慧技術的態度,以及他們對電腦輔助解讀的看法。材料和方法:我們對英國的臨床醫生進行了一系列訪談。我們的研究:(i) 探討人工智慧的潛力,特別是未來的「類人類」運算方法,以促進心電圖解讀並支持臨床決策制定,以及 (ii) 徵求他們對人工智慧演算法的可解釋性和可信度的看法。結果:我們對 23 位臨床醫生的訪談記錄進行了歸納主題分析,並找出以下主題:(i) 對目前系統缺乏信任,(ii) 對未來人工智慧應用和對這些應用的要求持正面態度,(iii) 演算法的準確性和可解釋性之間的關係,以及 (iv) 對教育、可能的技能退化,以及人工智慧對臨床能力的影響的看法。討論:臨床醫生不信任目前的電腦化方法,但歡迎未來的「人工智慧」技術。在臨床醫生相信未來的 AI 解讀準確的情況下,他們不太擔心它是否可解釋。他們也比較喜歡能以視覺方式呈現演算法結果的心電圖解讀。雖然臨床醫生不害怕失業,但他們擔心技能退化,以及需要教育員工負責任地使用人工智慧。結論:臨床醫生對人工智慧在臨床決策制定中的未來應用持正面態度。準確性是採用人工智慧的一個關鍵因素,而視覺化比目前的電腦化方法更受青睞。這被視為一種潛在的培訓和提升技能的方法,與自動化可能帶來的技能退化形成對比。 + +##### **Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer** +2410.15012v1 by Gesa Mittmann, Sara Laiouar-Pedari, Hendrik A. 
Mehrtens, Sarah Haggenmüller, Tabea-Clara Bucher, Tirtha Chanda, Nadine T. Gaisa, Mathias Wagner, Gilbert Georg Klamminger, Tilman T. Rau, Christina Neppl, Eva Maria Compérat, Andreas Gocht, Monika Hämmerle, Niels J. Rupp, Jula Westhoff, Irene Krücken, Maximillian Seidl, Christian M. Schürch, Marcus Bauer, Wiebke Solass, Yu Chun Tam, Florian Weber, Rainer Grobholz, Jaroslaw Augustyniak, Thomas Kalinski, Christian Hörner, Kirsten D. Mertz, Constanze Döring, Andreas Erbersdobler, Gabriele Deubler, Felix Bremmer, Ulrich Sommer, Michael Brodhun, Jon Griffin, Maria Sarah L. Lenon, Kiril Trpkov, Liang Cheng, Fei Chen, Angelique Levi, Guoping Cai, Tri Q. Nguyen, Ali Amin, Alessia Cimadamore, Ahmed Shabaik, Varsha Manucha, Nazeel Ahmad, Nidia Messias, Francesca Sanguedolce, Diana Taheri, Ezra Baraban, Liwei Jia, Rajal B. Shah, Farshid Siadat, Nicole Swarbrick, Kyung Park, Oudai Hassan, Siamak Sakhaie, Michelle R. Downes, Hiroshi Miyamoto, Sean R. Williamson, Tim Holland-Letz, Carolin V. Schneider, Jakob Nikolas Kather, Yuri Tolkach, Titus J. Brinker + +The aggressiveness of prostate cancer, the most common cancer in men +worldwide, is primarily assessed based on histopathological data using the +Gleason scoring system. While artificial intelligence (AI) has shown promise in +accurately predicting Gleason scores, these predictions often lack inherent +explainability, potentially leading to distrust in human-machine interactions. +To address this issue, we introduce a novel dataset of 1,015 tissue microarray +core images, annotated by an international group of 54 pathologists. The +annotations provide detailed localized pattern descriptions for Gleason grading +in line with international guidelines. Utilizing this dataset, we develop an +inherently explainable AI system based on a U-Net architecture that provides +predictions leveraging pathologists' terminology. 
This approach circumvents +post-hoc explainability methods while maintaining or exceeding the performance +of methods trained directly for Gleason pattern segmentation (Dice score: 0.713 +$\pm$ 0.003 trained on explanations vs. 0.691 $\pm$ 0.010 trained on Gleason +patterns). By employing soft labels during training, we capture the intrinsic +uncertainty in the data, yielding strong results in Gleason pattern +segmentation even in the context of high interobserver variability. With the +release of this dataset, we aim to encourage further research into segmentation +in medical tasks with high levels of subjectivity and to advance the +understanding of pathologists' reasoning processes. + +摘要:前列腺癌是全球男性最常見的癌症,其惡性程度主要根據 Gleason 評分系統使用組織病理學數據進行評估。雖然人工智慧 (AI) 在準確預測 Gleason 評分方面已展現潛力,但這些預測通常缺乏內在的可解釋性,可能會導致對人機互動的不信任。為了解決這個問題,我們引進了一個由 54 位病理學家組成的國際團隊註解的 1,015 個組織微陣列核心影像的新穎資料集。這些註解提供了詳細的局部模式描述,用於符合國際準則的 Gleason 分級。利用這個資料集,我們開發了一個基於 U-Net 架構的內在可解釋 AI 系統,該系統提供了利用病理學家術語進行預測。這種方法規避了事後可解釋性方法,同時維持或超越了直接訓練用於 Gleason 模式分割的方法的效能(Dice 分數:0.713 ± 0.003,訓練於解釋,相對於 0.691 ± 0.010,訓練於 Gleason 模式)。透過在訓練期間採用軟標籤,我們捕捉了資料中的內在不確定性,即使在觀察者間變異性高的情況下,也能在 Gleason 模式分割中產生強大的結果。透過釋出這個資料集,我們旨在鼓勵進一步研究主觀性高的醫療任務中的分割,並增進對病理學家推理過程的理解。 + +##### **Explainable AI Methods for Multi-Omics Analysis: A Survey** +2410.11910v1 by Ahmad Hussein, Mukesh Prasad, Ali Braytee + +Advancements in high-throughput technologies have led to a shift from +traditional hypothesis-driven methodologies to data-driven approaches. +Multi-omics refers to the integrative analysis of data derived from multiple +'omes', such as genomics, proteomics, transcriptomics, metabolomics, and +microbiomics. This approach enables a comprehensive understanding of biological +systems by capturing different layers of biological information. Deep learning +methods are increasingly utilized to integrate multi-omics data, offering +insights into molecular interactions and enhancing research into complex +diseases. 
However, these models, with their numerous interconnected layers and +nonlinear relationships, often function as black boxes, lacking transparency in +decision-making processes. To overcome this challenge, explainable artificial +intelligence (xAI) methods are crucial for creating transparent models that +allow clinicians to interpret and work with complex data more effectively. This +review explores how xAI can improve the interpretability of deep learning +models in multi-omics research, highlighting its potential to provide +clinicians with clear insights, thereby facilitating the effective application +of such models in clinical settings. + +摘要:高通量技術的進步導致從傳統的假設驅動方法轉變為資料驅動的方法。多組學是指整合分析來自多個「組學」的資料,例如基因組學、蛋白質組學、轉錄組學、代謝組學和微生物組學。此方法透過擷取生物資訊的不同層面,能全面了解生物系統。深度學習方法愈來愈常被用於整合多組學資料,提供分子交互作用的洞察力,並加強對複雜疾病的研究。然而,這些模型具有許多相互連接的層級和非線性關係,通常會像黑盒子一樣運作,缺乏決策過程的透明度。為了克服此挑戰,可解釋人工智慧 (xAI) 方法對於建立透明模型至關重要,讓臨床醫生可以更有效地解釋和處理複雜資料。此評論探討 xAI 如何能改善多組學研究中深度學習模型的可解釋性,強調其提供臨床醫生明確見解的潛力,進而促進此類模型在臨床環境中的有效應用。 + +##### **Study on the Helpfulness of Explainable Artificial Intelligence** +2410.11896v1 by Tobias Labarta, Elizaveta Kulicheva, Ronja Froelian, Christian Geißler, Xenia Melman, Julian von Klitzing + +Explainable Artificial Intelligence (XAI) is essential for building advanced +machine learning-powered applications, especially in critical domains such as +medical diagnostics or autonomous driving. Legal, business, and ethical +requirements motivate using effective XAI, but the increasing number of +different methods makes it challenging to pick the right ones. Further, as +explanations are highly context-dependent, measuring the effectiveness of XAI +methods without users can only reveal a limited amount of information, +excluding human factors such as the ability to understand it. We propose to +evaluate XAI methods via the user's ability to successfully perform a proxy +task, designed such that a good performance is an indicator for the explanation +to provide helpful information. 
In other words, we address the helpfulness of +XAI for human decision-making. Further, a user study on state-of-the-art +methods was conducted, showing differences in their ability to generate trust +and skepticism and the ability to judge the rightfulness of an AI decision +correctly. Based on the results, we highly recommend using and extending this +approach for more objective-based human-centered user studies to measure XAI +performance in an end-to-end fashion. + +摘要:可解釋人工智慧 (XAI) 對於建構先進的機器學習驅動應用程式至關重要,特別是在醫療診斷或自動駕駛等關鍵領域。法律、商業和倫理要求促使使用有效的 XAI,但數量日益增加的不同方法使得挑選正確的方法具有挑戰性。此外,由於解釋高度依賴於背景,在沒有使用者的情況下衡量 XAI 方法的有效性只能揭示有限的資訊,排除人類因素,例如理解它的能力。我們建議透過使用者成功執行代理任務的能力來評估 XAI 方法,設計使得良好的執行表現是解釋提供有用資訊的指標。換句話說,我們探討 XAI 對人類決策制定的幫助。此外,對最先進的方法進行使用者研究,顯示出它們在產生信任和懷疑的能力以及正確判斷 AI 決策是否正確的能力方面存在差異。根據結果,我們強烈建議使用和擴充這種方法,以進行更多以目標為基礎的人為中心使用者研究,以終端到終端的方式衡量 XAI 效能。 + +##### **Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health** +2410.09635v1 by Abdullah Mamun, Lawrence D. Devoe, Mark I. Evans, David W. Britt, Judith Klein-Seetharaman, Hassan Ghasemzadeh + +Early detection of intrapartum risk enables interventions to potentially +prevent or mitigate adverse labor outcomes such as cerebral palsy. Currently, +there is no accurate automated system to predict such events to assist with +clinical decision-making. To fill this gap, we propose "Artificial Intelligence +(AI) for Modeling and Explaining Neonatal Health" (AIMEN), a deep learning +framework that not only predicts adverse labor outcomes from maternal, fetal, +obstetrical, and intrapartum risk factors but also provides the model's +reasoning behind the predictions made. The latter can provide insights into +what modifications in the input variables of the model could have changed the +predicted outcome. 
We address the challenges of imbalance and small datasets by +synthesizing additional training data using Adaptive Synthetic Sampling +(ADASYN) and Conditional Tabular Generative Adversarial Networks (CTGAN). AIMEN +uses an ensemble of fully-connected neural networks as the backbone for its +classification with the data augmentation supported by either ADASYN or CTGAN. +AIMEN, supported by CTGAN, outperforms AIMEN supported by ADASYN in +classification. AIMEN can predict a high risk for adverse labor outcomes with +an average F1 score of 0.784. It also provides counterfactual explanations that +can be achieved by changing 2 to 3 attributes on average. Resources available: +https://github.com/ab9mamun/AIMEN. + +摘要:產程中風險的早期偵測有助於進行干預措施,以預防或減輕不利的生產結果,例如腦性麻痺。目前,沒有準確的自動化系統可以預測此類事件,以協助臨床決策。為了填補這一空白,我們提出「用於建模和解釋新生兒健康的人工智慧」(AIMEN),這是一個深度學習架構,它不僅可以根據孕產婦、胎兒、產科和產程風險因素預測不利的生產結果,還能提供模型做出預測背後的原因。後者可以提供見解,說明模型輸入變數中的哪些修改可能會改變預測結果。我們透過使用適應性合成抽樣 (ADASYN) 和條件表格生成對抗網路 (CTGAN) 來合成額外的訓練資料,以解決不平衡和小型資料集的挑戰。AIMEN 使用全連接神經網路的集合作為其分類的骨幹,並透過 ADASYN 或 CTGAN 支援資料擴充。由 CTGAN 支援的 AIMEN 在分類方面優於由 ADASYN 支援的 AIMEN。AIMEN 可以預測不利的生產結果的高風險,平均 F1 分數為 0.784。它還提供反事實解釋,可透過平均變更 2 至 3 個屬性來達成。可用資源:https://github.com/ab9mamun/AIMEN。 + +##### **Artificial intelligence techniques in inherited retinal diseases: A review** +2410.09105v1 by Han Trinh, Jordan Vice, Jason Charng, Zahra Tajbakhsh, Khyber Alam, Fred K. Chen, Ajmal Mian + +Inherited retinal diseases (IRDs) are a diverse group of genetic disorders +that lead to progressive vision loss and are a major cause of blindness in +working-age adults. The complexity and heterogeneity of IRDs pose significant +challenges in diagnosis, prognosis, and management. Recent advancements in +artificial intelligence (AI) offer promising solutions to these challenges. +However, the rapid development of AI techniques and their varied applications +have led to fragmented knowledge in this field. 
This review consolidates +existing studies, identifies gaps, and provides an overview of AI's potential +in diagnosing and managing IRDs. It aims to structure pathways for advancing +clinical applications by exploring AI techniques like machine learning and deep +learning, particularly in disease detection, progression prediction, and +personalized treatment planning. Special focus is placed on the effectiveness +of convolutional neural networks in these areas. Additionally, the integration +of explainable AI is discussed, emphasizing its importance in clinical settings +to improve transparency and trust in AI-based systems. The review addresses the +need to bridge existing gaps in focused studies on AI's role in IRDs, offering +a structured analysis of current AI techniques and outlining future research +directions. It concludes with an overview of the challenges and opportunities +in deploying AI for IRDs, highlighting the need for interdisciplinary +collaboration and the continuous development of robust, interpretable AI models +to advance clinical applications. + +摘要:遺傳性視網膜疾病 (IRD) 是一組多樣化的遺傳疾病, +會導致視力逐漸喪失,是工作年齡成人失明的主要原因。IRD 的複雜性和異質性對診斷、預後和管理提出了重大挑戰。最近人工智能 (AI) 的進步為這些挑戰提供了有希望的解決方案。 +然而,AI 技術的快速發展及其多種應用導致了該領域的知識分散。本綜述整合了現有研究,找出差距,並概述了 AI 在診斷和管理 IRD 中的潛力。它旨在通過探索機器學習和深度學習等 AI 技術,特別是在疾病檢測、進程預測和個性化治療計劃中,為推進臨床應用構建途徑。特別關注這些領域中卷積神經網路的有效性。此外,討論了可解釋 AI 的整合,強調了其在臨床環境中提高透明度和對基於 AI 的系統的信任的重要性。該綜述解決了彌合 AI 在 IRD 中作用的重點研究中現有差距的必要性,提供了對當前 AI 技術的結構化分析,並概述了未來的研究方向。最後概述了在 IRD 中部署 AI 的挑戰和機遇,強調了跨學科合作和持續開發強大、可解釋的 AI 模型以推進臨床應用的必要性。 + +##### **CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures** +2410.05235v2 by Ekaterina Sviridova, Anar Yeginbergen, Ainara Estarrona, Elena Cabrio, Serena Villata, Rodrigo Agerri + +Explaining Artificial Intelligence (AI) decisions is a major challenge +nowadays in AI, in particular when applied to sensitive scenarios like medicine +and law. 
However, the need to explain the rationale behind decisions is a main +issue also for human-based deliberation as it is important to justify +*why* a certain decision has been taken. Resident medical doctors for +instance are required not only to provide a (possibly correct) diagnosis, but +also to explain how they reached a certain conclusion. Developing new tools to +aid residents to train their explanation skills is therefore a central +objective of AI in education. In this paper, we follow this direction, and we +present, to the best of our knowledge, the first multilingual dataset for +Medical Question Answering where correct and incorrect diagnoses for a clinical +case are enriched with a natural language explanation written by doctors. These +explanations have been manually annotated with argument components (i.e., +premise, claim) and argument relations (i.e., attack, support), resulting in +the Multilingual CasiMedicos-Arg dataset which consists of 558 clinical cases +in four languages (English, Spanish, French, Italian) with explanations, where +we annotated 5021 claims, 2313 premises, 2431 support relations, and 1106 +attack relations. We conclude by showing how competitive baselines perform over +this challenging dataset for the argument mining task. + +摘要:解釋人工智慧 (AI) 的決策是現在 AI 的一項重大挑戰,特別是應用於像醫學和法律等敏感情境時。然而,解釋決策背後理由的需求也是基於人類的考量的一個主要問題,因為有必要證明為什麼做出某個決策。例如,住院醫師不僅需要提供(可能是正確的)診斷,還需要解釋他們如何達成某個結論。因此,開發新的工具來幫助住院醫師訓練他們的解釋技巧是教育中 AI 的一項核心目標。在本文中,我們遵循這個方向,並且根據我們的了解,提出第一個多語言醫學問答資料集,其中臨床病例的正確和不正確診斷都附有由醫生撰寫的自然語言解釋。這些解釋已使用論證組成(即前提、主張)和論證關係(即攻擊、支持)進行手動註解,產生多語言 CasiMedicos-Arg 資料集,其中包含 558 個具有解釋的四種語言(英語、西班牙語、法語、義大利語)的臨床病例,我們註解了 5021 個主張、2313 個前提、2431 個支持關係和 1106 個攻擊關係。我們最後展示了競爭基準如何針對論證探勘任務執行此具挑戰性的資料集。 + +##### **Explainable Diagnosis Prediction through Neuro-Symbolic Integration** +2410.01855v2 by Qiuhao Lu, Rui Li, Elham Sagheb, Andrew Wen, Jinlian Wang, Liwei Wang, Jungwei W.
Fan, Hongfang Liu + +Diagnosis prediction is a critical task in healthcare, where timely and +accurate identification of medical conditions can significantly impact patient +outcomes. Traditional machine learning and deep learning models have achieved +notable success in this domain but often lack interpretability which is a +crucial requirement in clinical settings. In this study, we explore the use of +neuro-symbolic methods, specifically Logical Neural Networks (LNNs), to develop +explainable models for diagnosis prediction. Essentially, we design and +implement LNN-based models that integrate domain-specific knowledge through +logical rules with learnable thresholds. Our models, particularly +$M_{\text{multi-pathway}}$ and $M_{\text{comprehensive}}$, demonstrate superior +performance over traditional models such as Logistic Regression, SVM, and +Random Forest, achieving higher accuracy (up to 80.52\%) and AUROC scores (up +to 0.8457) in the case study of diabetes prediction. The learned weights and +thresholds within the LNN models provide direct insights into feature +contributions, enhancing interpretability without compromising predictive +power. These findings highlight the potential of neuro-symbolic approaches in +bridging the gap between accuracy and explainability in healthcare AI +applications. By offering transparent and adaptable diagnostic models, our work +contributes to the advancement of precision medicine and supports the +development of equitable healthcare solutions. Future research will focus on +extending these methods to larger and more diverse datasets to further validate +their applicability across different medical conditions and populations. 
+ +摘要:診斷預測是醫療保健中的關鍵任務,及時且準確地識別醫療狀況會顯著影響患者的結果。傳統的機器學習和深度學習模型已在這個領域取得顯著成功,但通常缺乏可解釋性,這在臨床環境中是一項關鍵要求。在本研究中,我們探討了神經符號方法的應用,特別是邏輯神經網路 (LNN),以開發用於診斷預測的可解釋模型。基本上,我們設計並實作了基於 LNN 的模型,這些模型透過具有可學習閾值的邏輯規則整合領域特定知識。我們的模型,特別是 $M_{\text{multi-pathway}}$ 和 $M_{\text{comprehensive}}$,表現出優於傳統模型(例如邏輯迴歸、SVM 和隨機森林)的優異效能,在糖尿病預測的案例研究中達到了更高的準確度(高達 80.52%)和 AUROC 分數(高達 0.8457)。LNN 模型中學習到的權重和閾值提供了對特徵貢獻的直接見解,增強了可解釋性,同時不影響預測能力。這些發現突顯了神經符號方法在彌合醫療保健 AI 應用中準確性和可解釋性差距方面的潛力。透過提供透明且適應性強的診斷模型,我們的研究有助於推進精準醫療,並支援公平醫療保健解決方案的開發。未來的研究將專注於將這些方法擴展到更大且更多樣化的資料集,以進一步驗證其在不同醫療狀況和人群中的適用性。 + +##### **Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare** +2410.00366v1 by Prasenjit Maji, Amit Kumar Mondal, Hemanta Kumar Mondal, Saraju P. Mohanty + +The rapid advancements in artificial intelligence (AI) have revolutionized +smart healthcare, driving innovations in wearable technologies, continuous +monitoring devices, and intelligent diagnostic systems. However, security, +explainability, robustness, and performance optimization challenges remain +critical barriers to widespread adoption in clinical environments. This +research presents an innovative algorithmic method using the Adaptive Feature +Evaluator (AFE) algorithm to improve feature selection in healthcare datasets +and overcome these barriers. By integrating Genetic Algorithms (GA), Explainable +Artificial Intelligence (XAI), and Permutation Combination Techniques (PCT), +AFE optimizes Clinical Decision Support Systems (CDSS), thereby +enhancing predictive accuracy and interpretability. The proposed method is +validated across three diverse healthcare datasets using six distinct machine +learning algorithms, demonstrating its robustness and superiority over +conventional feature selection techniques. The results underscore the +transformative potential of AFE in smart healthcare, enabling personalized and +transparent patient care.
Notably, the AFE algorithm, when combined with a +Multi-layer Perceptron (MLP), achieved an accuracy of up to 98.5%, highlighting +its capability to improve clinical decision-making processes in real-world +healthcare applications. + +摘要:人工智慧 (AI) 的快速進展徹底改變了智慧醫療保健,推動了可穿戴技術、持續監控裝置和智慧診斷系統的創新。然而,安全性、可解釋性、穩健性和效能最佳化挑戰仍然是臨床環境中廣泛採用的關鍵障礙。本研究提出一個創新的演算法方法,使用自適應特徵評估器 (AFE) 演算法來改善醫療保健資料集中的特徵選取並克服問題。AFE 整合了遺傳演算法 (GA)、可解釋人工智慧 (XAI) 和排列組合技術 (PCT),該演算法最佳化了臨床決策支援系統 (CDSS),從而提高了預測準確性和可解釋性。所提出的方法使用六種不同的機器學習演算法驗證了三個不同的醫療保健資料集,證明了其穩健性和優於傳統特徵選取技術。結果強調了 AFE 在智慧醫療保健中的轉變潛力,實現了個人化和透明的患者照護。值得注意的是,AFE 演算法與多層感知器 (MLP) 結合使用時,準確度高達 98.5%,突顯了其改善實際醫療保健應用中臨床決策制定流程的能力。 + +##### **Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study** +2409.13476v1 by Tirtha Chanda, Sarah Haggenmueller, Tabea-Clara Bucher, Tim Holland-Letz, Harald Kittler, Philipp Tschandl, Markus V. Heppt, Carola Berking, Jochen S. Utikal, Bastian Schilling, Claudia Buerger, Cristian Navarrete-Dechent, Matthias Goebeler, Jakob Nikolas Kather, Carolin V. Schneider, Benjamin Durani, Hendrike Durani, Martin Jansen, Juliane Wacker, Joerg Wacker, Reader Study Consortium, Titus J. Brinker + +Artificial intelligence (AI) systems have substantially improved +dermatologists' diagnostic accuracy for melanoma, with explainable AI (XAI) +systems further enhancing clinicians' confidence and trust in AI-driven +decisions. Despite these advancements, there remains a critical need for +objective evaluation of how dermatologists engage with both AI and XAI tools. +In this study, 76 dermatologists participated in a reader study, diagnosing 16 +dermoscopic images of melanomas and nevi using an XAI system that provides +detailed, domain-specific explanations. Eye-tracking technology was employed to +assess their interactions. Diagnostic performance was compared with that of a +standard AI system lacking explanatory features. 
Our findings reveal that XAI +systems improved balanced diagnostic accuracy by 2.8 percentage points relative +to standard AI. Moreover, diagnostic disagreements with AI/XAI systems and +complex lesions were associated with elevated cognitive load, as evidenced by +increased ocular fixations. These insights have significant implications for +clinical practice, the design of AI tools for visual tasks, and the broader +development of XAI in medical diagnostics. + +摘要:人工智慧 (AI) 系統已大幅改善皮膚科醫師對黑色素瘤的診斷準確度,而可解釋 AI (XAI) 系統進一步提升臨床醫師對 AI 驅動決策的信心與信賴。儘管有這些進展,對於皮膚科醫師如何使用 AI 和 XAI 工具,仍有客觀評估的迫切需求。在這項研究中,76 位皮膚科醫師參與了一項讀者研究,使用 XAI 系統診斷 16 張黑色素瘤和痣的皮膚鏡影像,該系統提供詳細的領域特定說明。採用眼球追蹤技術來評估他們的互動。將診斷表現與缺乏說明功能的標準 AI 系統進行比較。我們的研究結果顯示,XAI 系統相較於標準 AI,將平衡診斷準確度提升了 2.8 個百分點。此外,與 AI/XAI 系統的診斷分歧和複雜的病灶與認知負擔升高有關,這由增加的眼睛注視次數所證實。這些見解對臨床實務、視覺任務 AI 工具的設計和醫學診斷中 XAI 的廣泛發展具有重大意義。 + +##### **Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data** +2409.15374v1 by Suryansh Vidya, Kush Gupta, Amir Aly, Andy Wills, Emmanuel Ifeachor, Rohit Shankar + +Early diagnosis and intervention for Autism Spectrum Disorder (ASD) have been +shown to significantly improve the quality of life of autistic individuals. +However, diagnostic methods for ASD rely on assessments based on clinical +presentation that are prone to bias, which can make it challenging to arrive at an +early diagnosis. There is a need for objective biomarkers of ASD which can help +improve diagnostic accuracy. Deep learning (DL) has achieved outstanding +performance in diagnosing diseases and conditions from medical imaging data. +Extensive research has been conducted on creating models that classify ASD +using resting-state functional Magnetic Resonance Imaging (fMRI) data. However, +existing models lack interpretability. This research aims to improve the +accuracy and interpretability of ASD diagnosis by creating a DL model that can +not only accurately classify ASD but also provide explainable insights into its +working.
The dataset used is a preprocessed version of the Autism Brain Imaging +Data Exchange (ABIDE) with 884 samples. Our findings show a model that can +accurately classify ASD and highlight critical brain regions differing between +ASD and typical controls, with potential implications for early diagnosis and +understanding of the neural basis of ASD. These findings are validated by +studies in the literature that use different datasets and modalities, +confirming that the model actually learned characteristics of ASD and not just +the dataset. This study advances the field of explainable AI in medical imaging +by providing a robust and interpretable model, thereby contributing to a future +with objective and reliable ASD diagnostics. + +摘要:自閉症譜系障礙 (ASD) 的早期診斷和介入已被證實能顯著改善自閉症患者的生活品質。然而,ASD 的診斷方法依賴於基於臨床表現的評估,容易產生偏見,且可能難以做出早期診斷。有必要找出 ASD 的客觀生物標記,以幫助提高診斷準確性。深度學習 (DL) 在從醫學影像資料診斷疾病和病症方面取得傑出的表現。已經針對建立使用靜態功能性磁振造影 (fMRI) 資料對 ASD 進行分類的模型進行廣泛的研究。然而,現有的模型缺乏可解釋性。本研究旨在透過建立一個不僅能準確分類 ASD,還能提供可解釋見解說明其運作原理的 DL 模型,來改善 ASD 診斷的準確性和可解釋性。所使用的資料集是自閉症大腦影像資料交換 (ABIDE) 的預處理版本,包含 884 個樣本。我們的研究結果顯示,該模型能準確分類 ASD,並強調 ASD 與典型對照組之間存在差異的關鍵腦區,對於 ASD 的早期診斷和神經基礎的理解具有潛在的意義。這些研究結果已由使用不同資料集和方式的文獻研究驗證,證實該模型實際上學習了 ASD 的特徵,而不僅僅是資料集。本研究透過提供一個強健且可解釋的模型,推動了醫學影像中可解釋 AI 的領域,從而為未來提供客觀且可靠的 ASD 診斷做出貢獻。 + +##### **Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition** +2409.12883v1 by Daniel Flores-Araiza, Francisco Lopez-Tiro, Clément Larose, Salvador Hinojosa, Andres Mendez-Vazquez, Miguel Gonzalez-Mendoza, Gilberto Ochoa-Ruiz, Christian Daul + +The in-vivo identification of the kidney stone types during an ureteroscopy +would be a major medical advance in urology, as it could reduce the time of the +tedious renal calculi extraction process, while diminishing infection risks. +Furthermore, such an automated procedure would make possible to prescribe +anti-recurrence treatments immediately. 
Nowadays, only a few experienced +urologists are able to recognize the kidney stone types in the images of the +videos displayed on a screen during the endoscopy. Thus, several deep learning +(DL) models have recently been proposed to automatically recognize the kidney +stone types using ureteroscopic images. However, these DL models are of black +box nature which limits their applicability in clinical settings. This +contribution proposes a case-based reasoning DL model which uses prototypical +parts (PPs) and generates local and global descriptors. The PPs encode for each +class (i.e., kidney stone type) visual feature information (hue, saturation, +intensity and textures) similar to that used by biologists. The PPs are +optimally generated due to a new loss function used during the model training. +Moreover, the local and global descriptors of PPs allow to explain the +decisions ("what" information, "where in the images") in an understandable way +for biologists and urologists. The proposed DL model has been tested on a +database including images of the six most widespread kidney stone types. The +overall average classification accuracy was 90.37. When comparing these results +with those of the eight other DL models of the kidney stone state-of-the-art, it +can be seen that the valuable gain in explainability was not reached at the +expense of accuracy which was even slightly increased with respect to that +(88.2) of the best method of the literature. These promising and interpretable +results also encourage urologists to put their trust in AI-based solutions. 
+ +摘要:尿路鏡檢查中腎結石類型的體內識別將是泌尿科的一項重大進展,因為它可以減少繁瑣的腎結石取出過程的時間,同時降低感染風險。此外,這種自動化程序將使立即開立抗復發治療成為可能。如今,只有少數經驗豐富的泌尿科醫生能夠在內視鏡檢查期間屏幕上顯示的視頻圖像中識別腎結石類型。因此,最近已提出多種深度學習 (DL) 模型,以使用輸尿管鏡圖像自動識別腎結石類型。然而,這些 DL 模型本質上是黑盒子,這限制了它們在臨床環境中的應用性。本文提出了一個基於案例推理的 DL 模型,它使用原型部分 (PP) 並生成局部和全局描述符。PP 為每種類型(即腎結石類型)編碼視覺特徵信息(色調、飽和度、強度和紋理),類似於生物學家使用的信息。由於在模型訓練期間使用的新損失函數,PP 得到了最佳生成。此外,PP 的局部和全局描述符允許以生物學家和泌尿科醫生可以理解的方式解釋決策(“什麼”信息,“圖像中的什麼位置”)。所提出的 DL 模型已在一個包含六種最廣泛的腎結石類型圖像的數據庫上進行了測試。總體平均分類準確率為 90.37。將此結果與腎結石最先進的八個其他 DL 模型的結果進行比較時,可以看出,可解釋性的寶貴增益並未以準確性為代價,甚至略有增加與文獻中最好的方法 (88.2) 相比。這些有希望且可解釋的結果也鼓勵泌尿科醫生相信基於人工智能的解決方案。 + +##### **Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques** +2409.12087v3 by Yubo Li, Saba Al-Sayouri, Rema Padman + +This study explores the potential of utilizing administrative claims data, +combined with advanced machine learning and deep learning techniques, to +predict the progression of Chronic Kidney Disease (CKD) to End-Stage Renal +Disease (ESRD). We analyze a comprehensive, 10-year dataset provided by a major +health insurance organization to develop prediction models for multiple +observation windows using traditional machine learning methods such as Random +Forest and XGBoost as well as deep learning approaches such as Long Short-Term +Memory (LSTM) networks. Our findings demonstrate that the LSTM model, +particularly with a 24-month observation window, exhibits superior performance +in predicting ESRD progression, outperforming existing models in the +literature. We further apply SHapley Additive exPlanations (SHAP) analysis to +enhance interpretability, providing insights into the impact of individual +features on predictions at the individual patient level. This study underscores +the value of leveraging administrative claims data for CKD management and +predicting ESRD progression. 
+ +摘要:本研究探討利用行政申報資料,結合先進機器學習與深度學習技術,預測慢性腎臟病 (CKD) 進展至末期腎臟疾病 (ESRD) 的可能性。我們分析一家大型健康保險組織提供的 10 年綜合資料集,使用傳統機器學習方法(例如隨機森林和 XGBoost)以及深度學習方法(例如長期短期記憶 (LSTM) 網路)開發多個觀察視窗的預測模型。我們的研究結果顯示,LSTM 模型(尤其是 24 個月觀察視窗)在預測 ESRD 進展方面表現優異,優於文獻中的現有模型。我們進一步應用 SHapley 可加性解釋 (SHAP) 分析以增強可解釋性,深入了解個別特徵對個別患者層級預測的影響。本研究強調了利用行政申報資料進行 CKD 管理和預測 ESRD 進展的價值。 + +##### **Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases** +2409.09201v3 by Mercy Asiedu, Nenad Tomasev, Chintan Ghate, Tiya Tiyasirichokchai, Awa Dieng, Oluwatosin Akande, Geoffrey Siwo, Steve Adudans, Sylvanus Aitkins, Odianosen Ehiakhamen, Eric Ndombi, Katherine Heller + +While large language models (LLMs) have shown promise for medical question +answering, there is limited work focused on tropical and infectious +disease-specific exploration. We build on an opensource tropical and infectious +diseases (TRINDs) dataset, expanding it to include demographic and semantic +clinical and consumer augmentations yielding 11000+ prompts. We evaluate LLM +performance on these, comparing generalist and medical LLMs, as well as LLM +outcomes to human experts. We demonstrate through systematic experimentation, +the benefit of contextual information such as demographics, location, gender, +risk factors for optimal LLM response. Finally we develop a prototype of +TRINDs-LM, a research tool that provides a playground to navigate how context +impacts LLM outputs for health. + +摘要:儘管大型語言模型 (LLM) 在醫療問題解答方面展現出前景,但專注於熱帶和傳染病特定探索的研究有限。我們建立在一個開放原始碼熱帶和傳染病 (TRINDs) 資料集上,並將其擴展為納入人口統計和語義臨床和消費者擴充,產生超過 11000 個提示。我們評估了 LLM 在這些方面的效能,比較了通才和醫療 LLM,以及 LLM 結果與人類專家的比較。我們透過系統性實驗證明了背景資訊(例如人口統計、位置、性別、最佳 LLM 回應的風險因素)的好處。最後,我們開發了 TRINDs-LM 的原型,這是一個研究工具,提供一個探索背景如何影響 LLM 健康輸出的平台。 + +##### **Explainable AI: Definition and attributes of a good explanation for health AI** +2409.15338v1 by Evangelia Kyrimi, Scott McLachlan, Jared M Wohlgemut, Zane B Perkins, David A. 
Lagnado, William Marsh, the ExAIDSS Expert Group + +Proposals of artificial intelligence (AI) solutions based on increasingly +complex and accurate predictive models are becoming ubiquitous across many +disciplines. As the complexity of these models grows, transparency and users' +understanding often diminish. This suggests that accurate prediction alone is +insufficient for making an AI-based solution truly useful. In the development +of healthcare systems, this introduces new issues related to accountability and +safety. Understanding how and why an AI system makes a recommendation may +require complex explanations of its inner workings and reasoning processes. +Although research on explainable AI (XAI) has significantly increased in recent +years and there is high demand for XAI in medicine, defining what constitutes a +good explanation remains ad hoc, and providing adequate explanations continues +to be challenging. To fully realize the potential of AI, it is critical to +address two fundamental questions about explanations for safety-critical AI +applications, such as health-AI: (1) What is an explanation in health-AI? and +(2) What are the attributes of a good explanation in health-AI? In this study, +we examined published literature and gathered expert opinions through a +two-round Delphi study. The research outputs include (1) a definition of what +constitutes an explanation in health-AI and (2) a comprehensive list of +attributes that characterize a good explanation in health-AI. 
+ +摘要:隨著越來越複雜且準確的預測模型,基於人工智慧 (AI) 解決方案的提案在許多領域中變得無處不在。隨著這些模型複雜性的增加,透明度和使用者的理解力往往會降低。這表示僅有準確的預測並不足以讓 AI 解決方案真正有用。在醫療保健系統的開發中,這引入了與問責制和安全性相關的新問題。瞭解 AI 系統如何以及為何提出建議可能需要對其內部運作和推理過程進行複雜的說明。儘管近年來對可解釋 AI (XAI) 的研究已大幅增加,且醫學領域對 XAI 有很高的需求,但定義什麼構成一個好的解釋仍是臨時性的,而提供適當的解釋仍然具有挑戰性。為了充分發揮 AI 的潛力,對於安全關鍵型 AI 應用(例如健康 AI)的解釋,探討兩個基本問題至關重要:(1) 什麼是健康 AI 中的解釋?以及 (2) 健康 AI 中一個好的解釋有哪些屬性?在本研究中,我們檢視了已發表的文獻,並透過兩輪德爾菲研究收集了專家意見。研究成果包括:(1) 健康 AI 中什麼構成解釋的定義,以及 (2) 健康 AI 中一個好解釋的屬性清單。 + +##### **Exploring the Effect of Explanation Content and Format on User Comprehension and Trust in Healthcare** +2408.17401v2 by Antonio Rago, Bence Palfi, Purin Sukpanichnant, Hannibal Nabli, Kavyesh Vivek, Olga Kostopoulou, James Kinross, Francesca Toni + +AI-driven tools for healthcare are widely acknowledged as potentially +beneficial to health practitioners and patients, e.g. the QCancer regression +tool for cancer risk prediction. However, for these tools to be trusted, they +need to be supplemented with explanations. We examine how explanations' content +and format affect user comprehension and trust when explaining QCancer's +predictions. Regarding content, we deploy SHAP and Occlusion-1. Regarding +format, we present SHAP explanations, conventionally, as charts (SC) and +Occlusion-1 explanations as charts (OC) as well as text (OT), to which their +simpler nature lends itself. We conduct experiments with two sets of +stakeholders: the general public (representing patients) and medical students +(representing healthcare practitioners). Our experiments showed higher +subjective comprehension and trust for Occlusion-1 over SHAP explanations based +on content. However, when controlling for format, only OT outperformed SC, +suggesting this trend is driven by preferences for text. Other findings +corroborated that explanation format, rather than content, is often the +critical factor. 
+ +摘要:由 AI 驅動的醫療保健工具被廣泛認為對醫療從業者和患者有潛在好處,例如用於癌症風險預測的 QCancer 回歸工具。然而,對於這些工具,如果要讓人們信賴,就需要補充說明。我們研究了說明的內容和格式如何影響使用者在解釋 QCancer 預測時的理解和信任。關於內容,我們部署了 SHAP 和 Occlusion-1。關於格式,我們以圖表 (SC) 的形式呈現 SHAP 說明,以圖表 (OC) 和文字 (OT) 的形式呈現 Occlusion-1 說明,因為它們的性質較為簡單。我們對兩組利害關係人進行了實驗:一般民眾(代表患者)和醫學生(代表醫療從業者)。我們的實驗結果顯示,基於內容,Occlusion-1 比 SHAP 說明具有更高的主觀理解和信任。然而,在控制格式時,只有 OT 優於 SC,這表明這種趨勢是由對文字的偏好所驅動的。其他發現證實了說明格式,而不是內容,通常是關鍵因素。 + +##### **A Survey for Large Language Models in Biomedicine** +2409.00133v1 by Chong Wang, Mengyao Li, Junjun He, Zhongruo Wang, Erfan Darzi, Zan Chen, Jin Ye, Tianbin Li, Yanzhou Su, Jing Ke, Kaili Qu, Shuxin Li, Yi Yu, Pietro Liò, Tianyun Wang, Yu Guang Wang, Yiqing Shen + +Recent breakthroughs in large language models (LLMs) offer unprecedented +natural language understanding and generation capabilities. However, existing +surveys on LLMs in biomedicine often focus on specific applications or model +architectures, lacking a comprehensive analysis that integrates the latest +advancements across various biomedical domains. This review, based on an +analysis of 484 publications sourced from databases including PubMed, Web of +Science, and arXiv, provides an in-depth examination of the current landscape, +applications, challenges, and prospects of LLMs in biomedicine, distinguishing +itself by focusing on the practical implications of these models in real-world +biomedical contexts. Firstly, we explore the capabilities of LLMs in zero-shot +learning across a broad spectrum of biomedical tasks, including diagnostic +assistance, drug discovery, and personalized medicine, among others, with +insights drawn from 137 key studies. Then, we discuss adaptation strategies of +LLMs, including fine-tuning methods for both uni-modal and multi-modal LLMs to +enhance their performance in specialized biomedical contexts where zero-shot +fails to achieve, such as medical question answering and efficient processing +of biomedical literature. 
Finally, we discuss the challenges that LLMs face in +the biomedicine domain including data privacy concerns, limited model +interpretability, issues with dataset quality, and ethics due to the sensitive +nature of biomedical data, the need for highly reliable model outputs, and the +ethical implications of deploying AI in healthcare. To address these +challenges, we also identify future research directions of LLM in biomedicine +including federated learning methods to preserve data privacy and integrating +explainable AI methodologies to enhance the transparency of LLMs. + +摘要:大型語言模型 (LLM) 的最新突破提供了前所未有的自然語言理解和生成能力。然而,現有關於生物醫學中 LLM 的調查通常專注於特定應用或模型架構,缺乏整合各種生物醫學領域最新進展的全面分析。本綜述基於對來自 PubMed、Web of Science 和 arXiv 等數據庫的 484 篇出版物的分析,深入探討了生物醫學中 LLM 的當前現況、應用、挑戰和前景,其特點是關注這些模型在現實世界生物醫學背景中的實際應用。首先,我們探討了 LLM 在廣泛的生物醫學任務中的零次學習能力,包括診斷輔助、藥物發現和個性化醫療等,並從 137 項關鍵研究中汲取見解。然後,我們討論了 LLM 的適應策略,包括單模態和多模態 LLM 的微調方法,以增強它們在零次學習無法實現的專業生物醫學背景中的性能,例如醫療問題解答和生物醫學文獻的有效處理。最後,我們討論了 LLM 在生物醫學領域面臨的挑戰,包括數據隱私問題、模型可解釋性有限、數據集質量問題以及由於生物醫學數據的敏感性、對高度可靠模型輸出的需求以及在醫療保健中部署 AI 的倫理影響而產生的倫理問題。為了應對這些挑戰,我們還確定了生物醫學中 LLM 未來的研究方向,包括用於保護數據隱私的聯合學習方法以及整合可解釋 AI 方法以增強 LLM 的透明度。 + +##### **Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis** +2408.15121v1 by Francesco Sovrano, Michael Lognoul, Giulia Vilone + +Significant investment and development have gone into integrating Artificial +Intelligence (AI) in medical and healthcare applications, leading to advanced +control systems in medical technology. However, the opacity of AI systems +raises concerns about essential characteristics needed in such sensitive +applications, like transparency and trustworthiness. Our study addresses these +concerns by investigating a process for selecting the most adequate Explainable +AI (XAI) methods to comply with the explanation requirements of key EU +regulations in the context of smart bioelectronics for medical devices. 
The +adopted methodology starts with categorising smart devices by their control +mechanisms (open-loop, closed-loop, and semi-closed-loop systems) and delving +into their technology. Then, we analyse these regulations to define their +explainability requirements for the various devices and related goals. +Simultaneously, we classify XAI methods by their explanatory objectives. This +allows for matching legal explainability requirements with XAI explanatory +goals and determining the suitable XAI algorithms for achieving them. Our +findings provide a nuanced understanding of which XAI algorithms align better +with EU regulations for different types of medical devices. We demonstrate this +through practical case studies on different neural implants, from chronic +disease management to advanced prosthetics. This study fills a crucial gap in +aligning XAI applications in bioelectronics with stringent provisions of EU +regulations. It provides a practical framework for developers and researchers, +ensuring their AI innovations advance healthcare technology and adhere to legal +and ethical standards. + +摘要:人工智慧(AI)在醫療和保健應用中投入了大量的投資和開發,進而導致醫療技術中的先進控制系統。然而,AI 系統的不透明性引發了對此類敏感應用中所需基本特性的擔憂,例如透明度和可信度。我們的研究透過調查一個程序來解決這些問題,用於選擇最充分的可解釋 AI(XAI)方法,以符合歐盟法規在醫療器材的智慧型生物電子學中的說明要求。採用的方法從透過其控制機制(開迴路、閉迴路和半閉迴路系統)對智慧型裝置進行分類,並深入探討其技術開始。然後,我們分析這些法規以定義其對各種裝置和相關目標的可解釋性要求。同時,我們透過其說明目標對 XAI 方法進行分類。這允許將法律可解釋性要求與 XAI 說明目標相匹配,並確定適當的 XAI 演算法來達成它們。我們的研究結果提供了對哪些 XAI 演算法更符合歐盟法規以適用於不同類型的醫療器材的細緻理解。我們透過不同神經植入物的實際案例研究來證明這一點,從慢性疾病管理到先進的義肢。這項研究填補了將生物電子學中的 XAI 應用與歐盟法規的嚴格規定相符的重要空白。它為開發人員和研究人員提供了一個實用的架構,確保其 AI 創新能促進醫療技術並遵守法律和道德標準。 + +##### **Towards Case-based Interpretability for Medical Federated Learning** +2408.13626v1 by Laura Latorre, Liliana Petrychenko, Regina Beets-Tan, Taisiya Kopytova, Wilson Silva + +We explore deep generative models to generate case-based explanations in a +medical federated learning setting. 
Explaining AI model decisions through +case-based interpretability is paramount to increasing trust and allowing +widespread adoption of AI in clinical practice. However, medical AI training +paradigms are shifting towards federated learning settings in order to comply +with data protection regulations. In a federated scenario, past data is +inaccessible to the current user. Thus, we use a deep generative model to +generate synthetic examples that protect privacy and explain decisions. Our +proof-of-concept focuses on pleural effusion diagnosis and uses publicly +available Chest X-ray data. + +摘要:我們探索深度生成模型,在醫療聯邦學習設置中生成基於案例的說明。透過基於案例的可解釋性來解釋 AI 模型決策,對於增加信任並允許 AI 在臨床實務中廣泛採用至關重要。然而,醫療 AI 訓練範例正轉向聯邦學習設置,以符合資料保護法規。在聯邦情境中,過去的資料對目前的使用者而言是無法取得的。因此,我們使用深度生成模型來產生保護隱私和解釋決策的合成範例。我們的概念驗證著重於胸腔積液診斷,並使用公開可取得的胸部 X 光資料。 + +##### **AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines** +2408.12491v1 by Douwe J. Spaanderman, Matthew Marzetti, Xinyi Wan, Andrew F. Scarsbrook, Philip Robinson, Edwin H. G. Oei, Jacob J. Visser, Robert Hemke, Kirsten van Langevelde, David F. Hanff, Geert J. L. H. van Leenders, Cornelis Verhoef, Dirk J. Gruühagen, Wiro J. Niessen, Stefan Klein, Martijn P. A. Starmans + +Soft-tissue and bone tumours (STBT) are rare, diagnostically challenging +lesions with variable clinical behaviours and treatment approaches. This +systematic review provides an overview of Artificial Intelligence (AI) methods +using radiological imaging for diagnosis and prognosis of these tumours, +highlighting challenges in clinical translation, and evaluating study alignment +with the Checklist for AI in Medical Imaging (CLAIM) and the FUTURE-AI +international consensus guidelines for trustworthy and deployable AI to promote +the clinical translation of AI methods. The review covered literature from +several bibliographic databases, including papers published before 17/07/2024. 
+Original research in peer-reviewed journals focused on radiology-based AI for +diagnosing or prognosing primary STBT was included. Exclusion criteria were +animal, cadaveric, or laboratory studies, and non-English papers. Abstracts +were screened by two of three independent reviewers for eligibility. Eligible +papers were assessed against guidelines by one of three independent reviewers. +The search identified 15,015 abstracts, from which 325 articles were included +for evaluation. Most studies performed moderately on CLAIM, averaging a score +of 28.9$\pm$7.5 out of 53, but poorly on FUTURE-AI, averaging 5.1$\pm$2.1 out +of 30. Imaging-AI tools for STBT remain at the proof-of-concept stage, +indicating significant room for improvement. Future efforts by AI developers +should focus on design (e.g. define unmet clinical need, intended clinical +setting and how AI would be integrated in clinical workflow), development (e.g. +build on previous work, explainability), evaluation (e.g. evaluating and +addressing biases, evaluating AI against best practices), and data +reproducibility and availability (making documented code and data publicly +available). Following these recommendations could improve clinical translation +of AI methods. 
+ +摘要:軟組織和骨骼腫瘤(STBT)是罕見、診斷具有挑戰性的病灶,其臨床行為和治療方法各不相同。這篇系統性回顧提供了使用放射影像進行診斷和預後的人工智慧 (AI) 方法的概觀,重點說明了臨床轉譯的挑戰,並評估研究與醫療影像 AI 核查表 (CLAIM) 和 FUTURE-AI 可信賴且可部署 AI 的國際共識準則的一致性,以促進 AI 方法的臨床轉譯。這篇回顧涵蓋了幾個書目資料庫中的文獻,包括在 2024 年 7 月 17 日之前發表的論文。納入了以放射為基礎的 AI 診斷或預後原發性 STBT 的同行評審期刊中的原始研究。排除標準是動物、屍體或實驗室研究,以及非英文論文。摘要由三位獨立審查員中的兩位篩選資格。合格的論文由三位獨立審查員中的一位根據準則進行評估。搜索識別出 15,015 篇摘要,其中 325 篇文章被納入評估。大多數研究在 CLAIM 中表現中等,平均得分為 53 分中的 28.9±7.5 分,但在 FUTURE-AI 中表現不佳,平均得分為 30 分中的 5.1±2.1 分。STBT 的影像 AI 工具仍處於概念驗證階段,表明有顯著的改進空間。AI 開發人員未來的努力應集中在設計(例如定義未滿足的臨床需求、預期的臨床環境以及 AI 如何整合到臨床工作流程中)、開發(例如建立在先前的工作、可解釋性)、評估(例如評估和解決偏差、評估 AI 與最佳實務)、以及數據可複製性和可用性(公開提供文件化的代碼和數據)。遵循這些建議可以改善 AI 方法的臨床轉譯。 + +##### **Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy** +2409.00001v1 by Kimji N. Pellano, Inga Strümke, Daniel Groos, Lars Adde, Espen Alexander F. Ihlen + +Early detection of Cerebral Palsy (CP) is crucial for effective intervention +and monitoring. This paper tests the reliability and applicability of +Explainable AI (XAI) methods using a deep learning method that predicts CP by +analyzing skeletal data extracted from video recordings of infant movements. +Specifically, we use XAI evaluation metrics -- namely faithfulness and +stability -- to quantitatively assess the reliability of Class Activation +Mapping (CAM) and Gradient-weighted Class Activation Mapping (Grad-CAM) in this +specific medical application. We utilize a unique dataset of infant movements +and apply skeleton data perturbations without distorting the original dynamics +of the infant movements. Our CP prediction model utilizes an ensemble approach, +so we evaluate the XAI metrics performances for both the overall ensemble and +the individual models. Our findings indicate that both XAI methods effectively +identify key body points influencing CP predictions and that the explanations +are robust against minor data perturbations. 
Grad-CAM significantly outperforms +CAM in the RISv metric, which measures stability in terms of velocity. In +contrast, CAM performs better in the RISb metric, which relates to bone +stability, and the RRS metric, which assesses internal representation +robustness. Individual models within the ensemble show varied results, and +neither CAM nor Grad-CAM consistently outperform the other, with the ensemble +approach providing a representation of outcomes from its constituent models. + +摘要:腦性麻痺 (CP) 的早期偵測對於有效的介入和監測至關重要。本文測試了可解釋 AI (XAI) 方法的可靠性和適用性,使用深度學習方法,透過分析從嬰兒動作影片記錄中提取的骨骼資料來預測 CP。具體來說,我們使用 XAI 評估指標(即忠實度和穩定性)來量化評估類別激活映射 (CAM) 和梯度加權類別激活映射 (Grad-CAM) 在這個特定醫療應用中的可靠性。我們利用一個獨特的嬰兒動作資料集,並應用骨骼資料擾動,而不會扭曲嬰兒動作的原始動力。我們的 CP 預測模型利用整體方法,因此我們評估了整體整體和個別模型的 XAI 指標表現。我們的研究結果表明,兩種 XAI 方法都能有效識別影響 CP 預測的關鍵身體部位,並且這些解釋對於微小的資料擾動具有魯棒性。Grad-CAM 在 RISv 指標中顯著優於 CAM,該指標衡量速度方面的穩定性。相比之下,CAM 在 RISb 指標中表現得更好,該指標與骨骼穩定性有關,而 RRS 指標則評估內部表示的魯棒性。整體中的個別模型顯示出不同的結果,CAM 和 Grad-CAM 都不一致地優於另一種,整體方法提供了其組成模型結果的表示。 + +##### **MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy** +2408.11837v1 by Hanchen David Wang, Nibraas Khan, Anna Chen, Nilanjan Sarkar, Pamela Wisniewski, Meiyi Ma + +Recent global estimates suggest that as many as 2.41 billion individuals have +health conditions that would benefit from rehabilitation services. Home-based +Physical Therapy (PT) faces significant challenges in providing interactive +feedback and meaningful observation for therapists and patients. To fill this +gap, we present MicroXercise, which integrates micro-motion analysis with +wearable sensors, providing therapists and patients with a comprehensive +feedback interface, including video, text, and scores. Crucially, it employs +multi-dimensional Dynamic Time Warping (DTW) and attribution-based explainable +methods to analyze the existing deep learning neural networks in monitoring +exercises, focusing on a high granularity of exercise. 
This synergistic +approach is pivotal, providing output matching the input size to precisely +highlight critical subtleties and movements in PT, thus transforming complex AI +analysis into clear, actionable feedback. By highlighting these micro-motions +in different metrics, such as stability and range of motion, MicroXercise +significantly enhances the understanding and relevance of feedback for +end-users. Comparative performance metrics underscore its effectiveness over +traditional methods, such as a 39% and 42% improvement in Feature Mutual +Information (FMI) and Continuity. MicroXercise is a step ahead in home-based +physical therapy, providing a technologically advanced and intuitively helpful +solution to enhance patient care and outcomes. + +摘要:最近的全球估計表明,多達 24.1 億人有 +健康狀況可從復健服務中受益。居家 +物理治療 (PT) 在提供互動式 +回饋和有意義的觀察方面面臨重大挑戰,供治療師和患者使用。為了填補這 +個缺口,我們提出 MicroXercise,它將微動作分析與 +可穿戴式感測器整合在一起,為治療師和患者提供一個全面的 +回饋介面,包括影片、文字和分數。至關重要的是,它採用 +多維動態時間規整 (DTW) 和基於歸因的可解釋 +方法來分析監控運動中現有的深度學習神經網路,專注於運動的高粒度。這種協同 +方法至關重要,提供與輸入大小匹配的輸出,以精確地 +突出 PT 中關鍵的細微差別和動作,從而將複雜的 AI +分析轉換為清晰、可操作的回饋。透過在不同指標中突顯這些微動作,例如穩定性和動作範圍,MicroXercise +顯著提升最終使用者對回饋的理解和相關性。比較效能指標強調其優於 +傳統方法的有效性,例如特徵互惠資訊 (FMI) 和連續性分別提升了 39% 和 42%。MicroXercise 在居家 +物理治療方面更進一步,提供技術先進且直覺有用的 +解決方案,以提升患者照護和結果。 + +##### **The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development** +2408.05239v1 by Joshua Morriss, Tod Brindle, Jessica Bah Rösman, Daniel Reibsamen, Andreas Enz + +Systematic literature reviews are the highest quality of evidence in +research. However, the review process is hindered by significant resource and +data constraints. The Literature Review Network (LRN) is the first of its kind +explainable AI platform adhering to PRISMA 2020 standards, designed to automate +the entire literature review process. LRN was evaluated in the domain of +surgical glove practices using 3 search strings developed by experts to query +PubMed. 
A non-expert trained all LRN models. Performance was benchmarked +against an expert manual review. Explainability and performance metrics +assessed LRN's ability to replicate the experts' review. Concordance was +measured with the Jaccard index and confusion matrices. Researchers were +blinded to the other's results until study completion. Overlapping studies were +integrated into an LRN-generated systematic review. LRN models demonstrated +superior classification accuracy without expert training, achieving 84.78% and +85.71% accuracy. The highest performance model achieved high interrater +reliability (k = 0.4953) and explainability metrics, linking 'reduce', +'accident', and 'sharp' with 'double-gloving'. Another LRN model covered 91.51% +of the relevant literature despite diverging from the non-expert's judgments (k += 0.2174), with the terms 'latex', 'double' (gloves), and 'indication'. LRN +outperformed the manual review (19,920 minutes over 11 months), reducing the +entire process to 288.6 minutes over 5 days. This study demonstrates that +explainable AI does not require expert training to successfully conduct +PRISMA-compliant systematic literature reviews like an expert. LRN summarized +the results of surgical glove studies and identified themes that were nearly +identical to the clinical researchers' findings. Explainable AI can accurately +expedite our understanding of clinical practices, potentially revolutionizing +healthcare research. 
+ +摘要:系統性文獻回顧是研究中證據品質最高的。然而,回顧過程受到顯著資源和資料限制的阻礙。文獻回顧網路 (LRN) 是第一個遵循 PRISMA 2020 標準的可解釋 AI 平台,旨在自動化整個文獻回顧過程。LRN 在外科手套實務領域中進行評估,使用專家開發的 3 個搜尋字串來查詢 PubMed。非專家訓練所有 LRN 模型。效能以專家手動回顧作為基準。可解釋性和效能指標評估 LRN 複製專家回顧的能力。一致性以 Jaccard 指數和混淆矩陣測量。研究人員在研究完成前對彼此的結果保密。重疊的研究整合到 LRN 生成的系統性回顧中。LRN 模型在沒有專家訓練的情況下展現出優異的分類準確率,達到 84.78% 和 85.71% 的準確率。效能最高的模型達到了高評分者間信賴度 (k = 0.4953) 和可解釋性指標,將「減少」、「意外」和「銳利」與「雙重戴手套」連結在一起。另一個 LRN 模型涵蓋了 91.51% 的相關文獻,儘管與非專家的判斷不同 (k = 0.2174),但包含了「乳膠」、「雙重」(手套)和「適應症」等詞彙。LRN 優於手動回顧(11 個月超過 19,920 分鐘),將整個過程縮短為 5 天超過 288.6 分鐘。這項研究顯示,可解釋的 AI 不需要專家訓練即可成功進行專家等級的 PRISMA 相容系統性文獻回顧。LRN 總結了外科手套研究的結果,並找出與臨床研究人員發現幾乎相同的主题。可解釋的 AI 可以準確地加快我們對臨床實務的理解,有潛力革新醫療保健研究。 + +##### **Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns** +2408.02709v1 by Chi Him Ng + +This study analyzes hybrid AI systems' design patterns and their +effectiveness in clinical decision-making using the boxology framework. It +categorizes and compares various architectures combining machine learning and +rule-based reasoning to provide insights into their structural foundations and +healthcare applications. Addressing two main questions, how to categorize these +systems against established design patterns and how to extract insights through +comparative analysis, the study uses design patterns from software engineering +to understand and optimize healthcare AI systems. Boxology helps identify +commonalities and create reusable solutions, enhancing these systems' +scalability, reliability, and performance. Five primary architectures are +examined: REML, MLRB, RBML, RMLT, and PERML. Each has unique strengths and +weaknesses, highlighting the need for tailored approaches in clinical tasks. 
+REML excels in high-accuracy prediction for datasets with limited data; MLRB in +handling large datasets and complex data integration; RBML in explainability +and trustworthiness; RMLT in managing high-dimensional data; and PERML, though +limited in analysis, shows promise in urgent care scenarios. The study +introduces four new patterns, creates five abstract categorization patterns, +and refines those five further to specific systems. These contributions enhance +Boxology's taxonomical organization and offer novel approaches to integrating +expert knowledge with machine learning. Boxology's structured, modular approach +offers significant advantages in developing and analyzing hybrid AI systems, +revealing commonalities, and promoting reusable solutions. In conclusion, this +study underscores hybrid AI systems' crucial role in advancing healthcare and +Boxology's potential to drive further innovation in AI integration, ultimately +improving clinical decision support and patient outcomes. + +摘要:本研究使用盒子學框架分析混合人工智慧系統的設計模式及其在臨床決策中的有效性。它分類並比較結合機器學習和基於規則的推理的各種架構,以深入了解其結構基礎和醫療保健應用。針對兩個主要問題,如何根據既定的設計模式對這些系統進行分類,以及如何通過比較分析提取見解,本研究使用軟體工程中的設計模式來了解和優化醫療保健人工智慧系統。盒子學有助於識別共性並建立可重複使用的解決方案,從而增強這些系統的可擴充性、可靠性和效能。檢查了五種主要的架構:REML、MLRB、RBML、RMLT 和 PERML。每種架構都有獨特的優缺點,強調了在臨床任務中需要量身打造的方法。REML 在資料有限的資料集中表現出高精度的預測;MLRB 在處理大型資料集和複雜資料整合方面表現出色;RBML 在可解釋性和可信度方面表現出色;RMLT 在管理高維資料方面表現出色;而 PERML 儘管在分析方面有限,但在緊急照護場景中表現出潛力。本研究引入了四種新模式,建立了五種抽象分類模式,並進一步將這五種模式細化為具體的系統。這些貢獻增強了盒子學的分類組織,並提供了將專家知識與機器學習整合的新方法。盒子學的結構化、模組化方法在開發和分析混合人工智慧系統、揭示共性以及推廣可重複使用的解決方案方面具有顯著優勢。總之,本研究強調了混合人工智慧系統在推進醫療保健中的關鍵作用,以及盒子學在推動人工智慧整合進一步創新方面的潛力,最終改善臨床決策支援和患者的治療成果。 + +##### **Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability** +2408.02706v1 by Masoud Muhammed Hassan + +Because of its strong predictive skills, deep learning has emerged as an +essential tool in many industries, including healthcare. 
Traditional deep +learning models, on the other hand, frequently lack interpretability and fail +to take prediction uncertainty into account, two crucial components of clinical +decision making. In order to produce explainable and uncertainty aware +predictions, this study presents a novel framework called Bayesian Kolmogorov +Arnold Networks (BKANs), which combines the expressive capacity of Kolmogorov +Arnold Networks with Bayesian inference. We employ BKANs on two medical +datasets, which are widely used benchmarks for assessing machine learning +models in medical diagnostics: the Pima Indians Diabetes dataset and the +Cleveland Heart Disease dataset. Our method provides useful insights into +prediction confidence and decision boundaries and outperforms traditional deep +learning models in terms of prediction accuracy. Moreover, BKANs' capacity to +represent aleatoric and epistemic uncertainty guarantees doctors receive more +solid and trustworthy decision support. Our Bayesian strategy improves the +interpretability of the model and considerably minimises overfitting, which is +important for tiny and imbalanced medical datasets, according to experimental +results. We present possible expansions to further use BKANs in more +complicated multimodal datasets and address the significance of these +discoveries for future research in building reliable AI systems for healthcare. +This work paves the way for a new paradigm in deep learning model deployment in +vital sectors where transparency and reliability are crucial. 
+ +摘要:由於其強大的預測能力,深度學習已成為許多產業中不可或缺的工具,包括醫療保健。然而,傳統的深度學習模型通常缺乏可解釋性,並且忽略了將預測不確定性納入考量,而這兩個因素是臨床決策制定的關鍵組成部分。為了產生可解釋且具有不確定性意識的預測,本研究提出了一個名為貝氏柯爾莫哥洛夫阿諾德網路 (BKAN) 的新架構,它結合了柯爾莫哥洛夫阿諾德網路的表達能力與貝氏推論。我們在兩個醫學資料集上使用 BKAN,這些資料集是評估機器學習模型在醫學診斷中的廣泛使用基準:皮馬印第安人糖尿病資料集和克里夫蘭心臟病資料集。我們的模型提供了對預測信心和決策邊界的有益見解,並且在預測準確度方面優於傳統的深度學習模型。此外,BKAN 表現隨機和認識不確定性的能力,可確保醫生獲得更可靠且值得信賴的決策支援。根據實驗結果,我們的貝氏策略提高了模型的可解釋性,並大幅減少了過度擬合,這對於小型且不平衡的醫學資料集非常重要。我們提出了可能的擴充功能,以進一步將 BKAN 用於更複雜的多模式資料集,並探討這些發現對於未來建立可靠的醫療保健 AI 系統研究的重要性。這項工作為深度學習模型部署在透明度和可靠性至關重要的重要領域中開啟了一個新的典範。 + +##### **MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI** +2407.20284v1 by Shyam Dongre, Ritesh Chandra, Sonali Agarwal + +In modern healthcare, addressing the complexities of accurate disease +prediction and personalized recommendations is both crucial and challenging. +This research introduces MLtoGAI, which integrates Semantic Web technology with +Machine Learning (ML) to enhance disease prediction and offer user-friendly +explanations through ChatGPT. The system comprises three key components: a +reusable disease ontology that incorporates detailed knowledge about various +diseases, a diagnostic classification model that uses patient symptoms to +detect specific diseases accurately, and the integration of Semantic Web Rule +Language (SWRL) with ontology and ChatGPT to generate clear, personalized +health advice. This approach significantly improves prediction accuracy and +ensures results that are easy to understand, addressing the complexity of +diseases and diverse symptoms. The MLtoGAI system demonstrates substantial +advancements in accuracy and user satisfaction, contributing to developing more +intelligent and accessible healthcare solutions. 
This innovative approach +combines the strengths of ML algorithms with the ability to provide +transparent, human-understandable explanations through ChatGPT, achieving +significant improvements in prediction accuracy and user comprehension. By +leveraging semantic technology and explainable AI, the system enhances the +accuracy of disease prediction and ensures that the recommendations are +relevant and easily understood by individual patients. Our research highlights +the potential of integrating advanced technologies to overcome existing +challenges in medical diagnostics, paving the way for future developments in +intelligent healthcare systems. Additionally, the system is validated using 200 +synthetic patient data records, ensuring robust performance and reliability. + +摘要:在現代醫療保健中,解決準確疾病預測和個性化建議的複雜性既至關重要又具有挑戰性。本研究引入了 MLtoGAI,它將語義網路技術與機器學習 (ML) 相結合,以增強疾病預測並透過 ChatGPT 提供使用者友善的說明。該系統包含三個關鍵組成部分:一個可重複使用的疾病本体,其中包含有關各種疾病的詳細知識;一個診斷分類模型,它使用患者症狀來準確檢測特定疾病;以及語義網路規則語言 (SWRL) 與本体和 ChatGPT 的整合,以產生清晰、個性化的健康建議。這種方法顯著提高了預測準確性,並確保了易於理解的結果,解決了疾病和不同症狀的複雜性。MLtoGAI 系統展示了準確性和使用者滿意度的實質性進步,有助於開發更智慧且更易於取得的醫療保健解決方案。這種創新的方法結合了 ML 演算法的優點,以及透過 ChatGPT 提供透明且人類可以理解的說明的能力,在預測準確性和使用者理解方面取得了顯著的進步。透過利用語義技術和可解釋的 AI,該系統提高了疾病預測的準確性,並確保了建議與個別患者相關且易於理解。我們的研究強調了整合先進技術以克服醫療診斷中現有挑戰的潛力,為智慧醫療保健系統的未來發展鋪路。此外,該系統使用 200 個合成患者資料記錄進行驗證,確保了穩健的效能和可靠性。 + +##### **Introducing δ-XAI: a novel sensitivity-based method for local AI explanations** +2407.18343v2 by Alessandro De Carlo, Enea Parimbelli, Nicola Melillo, Giovanna Nicora + +Explainable Artificial Intelligence (XAI) is central to the debate on +integrating Artificial Intelligence (AI) and Machine Learning (ML) algorithms +into clinical practice. High-performing AI/ML models, such as ensemble learners +and deep neural networks, often lack interpretability, hampering clinicians' +trust in their predictions. To address this, XAI techniques are being developed +to describe AI/ML predictions in human-understandable terms. 
One promising +direction is the adaptation of sensitivity analysis (SA) and global sensitivity +analysis (GSA), which inherently rank model inputs by their impact on +predictions. Here, we introduce a novel delta-XAI method that provides local +explanations of ML model predictions by extending the delta index, a GSA +metric. The delta-XAI index assesses the impact of each feature's value on the +predicted output for individual instances in both regression and classification +problems. We formalize the delta-XAI index and provide code for its +implementation. The delta-XAI method was evaluated on simulated scenarios using +linear regression models, with Shapley values serving as a benchmark. Results +showed that the delta-XAI index is generally consistent with Shapley values, +with notable discrepancies in models with highly impactful or extreme feature +values. The delta-XAI index demonstrated higher sensitivity in detecting +dominant features and handling extreme feature values. Qualitatively, the +delta-XAI provides intuitive explanations by leveraging probability density +functions, making feature rankings clearer and more explainable for +practitioners. Overall, the delta-XAI method appears promising for robustly +obtaining local explanations of ML model predictions. Further investigations in +real-world clinical settings will be conducted to evaluate its impact on +AI-assisted clinical workflows. 
+ +摘要:可解釋人工智慧 (XAI) 是將人工智慧 (AI) 和機器學習 (ML) 演算法整合到臨床實務中的辯論核心。高執行效能的 AI/ML 模型,例如整體學習器和深度神經網路,通常缺乏可解釋性,阻礙臨床醫生對其預測的信任。為了解決這個問題,正在開發 XAI 技術,以人類可以理解的術語描述 AI/ML 預測。一個有希望的方向是採用敏感度分析 (SA) 和全球敏感度分析 (GSA),它們本質上會依據模型輸入對預測的影響來對其進行排名。在此,我們介紹一種新的 delta-XAI 方法,透過擴充 GSA 指標 delta 指數來提供 ML 模型預測的局部解釋。delta-XAI 指數評估每個特徵值對回歸和分類問題中個別例項的預測輸出之影響。我們將 delta-XAI 指數形式化,並提供其實作的程式碼。使用線性回歸模型對模擬情境評估 delta-XAI 方法,並以 Shapley 值作為基準。結果顯示 delta-XAI 指數通常與 Shapley 值一致,但在具有高度影響力或極端特徵值的模型中存在顯著差異。delta-XAI 指數在偵測主要特徵和處理極端特徵值方面表現出更高的敏感度。定性地來說,delta-XAI 透過利用機率密度函數提供直觀的解釋,使特徵排名更清晰且對從業人員來說更具可解釋性。總體而言,delta-XAI 方法對於穩健地取得 ML 模型預測的局部解釋似乎很有希望。將在真實世界的臨床環境中進行進一步調查,以評估其對 AI 輔助臨床工作流程的影響。 + +##### **Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population** +2407.17324v2 by Nikolaos Ntampakis, Konstantinos Diamantaras, Ioanna Chouvarda, Vasileios Argyriou, Panagiotis Sarigianndis + +Dementia, a debilitating neurological condition affecting millions worldwide, +presents significant diagnostic challenges. In this work, we introduce a novel +methodology for the classification of demented and non-demented elderly +patients using 3D brain Magnetic Resonance Imaging (MRI) scans. Our approach +features a unique technique for selectively processing MRI slices, focusing on +the most relevant brain regions and excluding less informative sections. This +methodology is complemented by a confidence-based classification committee +composed of three custom deep learning models: Dem3D ResNet, Dem3D CNN, and +Dem3D EfficientNet. These models work synergistically to enhance +decision-making accuracy, leveraging their collective strengths. Tested on the +Open Access Series of Imaging Studies(OASIS) dataset, our method achieved an +impressive accuracy of 94.12%, surpassing existing methodologies. Furthermore, +validation on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset +confirmed the robustness and generalizability of our approach. 
The use of +explainable AI (XAI) techniques and comprehensive ablation studies further +substantiate the effectiveness of our techniques, providing insights into the +decision-making process and the importance of our methodology. This research +offers a significant advancement in dementia diagnosis, providing a highly +accurate and efficient tool for clinical applications. + +摘要:失智症是一種影響全球數百萬人的衰弱性神經疾病,在診斷上具有重大挑戰。在這項工作中,我們提出了一種新的方法,用於對失智和非失智老年患者進行分類,使用 3D 大腦磁振造影 (MRI) 掃描。我們的做法採用了一種獨特技術,用於選擇性處理 MRI 切片,重點關注最相關的大腦區域,並排除信息量較少的部分。這種方法由一個基於信心的分類委員會補充,該委員會由三個自定義深度學習模型組成:Dem3D ResNet、Dem3D CNN 和 Dem3D EfficientNet。這些模型協同工作以增強決策的準確性,利用它們的集體優勢。在影像研究開放存取系列 (OASIS) 資料集上進行測試,我們的模型達到了 94.12% 的驚人準確度,超過了現有方法。此外,在阿茲海默症神經影像倡議 (ADNI) 資料集上的驗證證實了我們方法的穩健性和普遍性。可解釋 AI (XAI) 技術和全面的消融研究進一步證實了我們技術的有效性,提供了對決策過程和我們方法重要性的見解。這項研究為失智症診斷提供了重大進展,為臨床應用提供了一個高度準確且高效的工具。 + +##### **Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition** +2408.06352v1 by Michele Fiori, Gabriele Civitarese, Claudio Bettini + +Recognizing daily activities with unobtrusive sensors in smart environments +enables various healthcare applications. Monitoring how subjects perform +activities at home and their changes over time can reveal early symptoms of +health issues, such as cognitive decline. Most approaches in this field use +deep learning models, which are often seen as black boxes mapping sensor data +to activities. However, non-expert users like clinicians need to trust and +understand these models' outputs. Thus, eXplainable AI (XAI) methods for Human +Activity Recognition have emerged to provide intuitive natural language +explanations from these models. Different XAI methods generate different +explanations, and their effectiveness is typically evaluated through user +surveys, that are often challenging in terms of costs and fairness. 
This paper +proposes an automatic evaluation method using Large Language Models (LLMs) to +identify, in a pool of candidates, the best XAI approach for non-expert users. +Our preliminary results suggest that LLM evaluation aligns with user surveys. + +摘要:藉由智慧環境中不引人注目的感測器辨識日常活動,能啟用各種醫療保健應用。監控受試者在家中如何執行活動,以及其隨著時間的變化,可以揭示健康問題的早期症狀,例如認知能力下降。此領域中的大多數方法都使用深度學習模型,這些模型通常被視為將感測器資料對應至活動的黑盒子。然而,非專家使用者(例如臨床醫師)需要信任並了解這些模型的輸出。因此,人類活動辨識的可解釋 AI (XAI) 方法應運而生,以提供來自這些模型的直覺自然語言說明。不同的 XAI 方法會產生不同的說明,而其有效性通常透過使用者調查來評估,這在成本和公平性方面通常具有挑戰性。本文提出使用大型語言模型 (LLM) 的自動評估方法,以在候選者中找出最適合非專家使用者的 XAI 方法。我們的初步結果表明,LLM 評估與使用者調查一致。 + +##### **Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions** +2408.03335v1 by Naseem Khan, Kashif Ahmad, Aref Al Tamimi, Mohammed M. Alani, Amine Bermak, Issa Khalil + +Industry 5.0, which focuses on human and Artificial Intelligence (AI) +collaboration for performing different tasks in manufacturing, involves a +higher number of robots, Internet of Things (IoTs) devices and +interconnections, Augmented/Virtual Reality (AR), and other smart devices. The +huge involvement of these devices and interconnection in various critical +areas, such as economy, health, education and defense systems, poses several +types of potential security flaws. AI itself has been proven a very effective +and powerful tool in different areas of cybersecurity, such as intrusion +detection, malware detection, and phishing detection, among others. Just as in +many application areas, cybersecurity professionals were reluctant to accept +black-box ML solutions for cybersecurity applications. This reluctance pushed +forward the adoption of eXplainable Artificial Intelligence (XAI) as a tool +that helps explain how decisions are made in ML-based systems. 
In this survey, +we present a comprehensive study of different XAI-based intrusion detection +systems for industry 5.0, and we also examine the impact of explainability and +interpretability on Cybersecurity practices through the lens of Adversarial +XIDS (Adv-XIDS) approaches. Furthermore, we analyze the possible opportunities +and challenges in XAI cybersecurity systems for industry 5.0 that elicit future +research toward XAI-based solutions to be adopted by high-stakes industry 5.0 +applications. We believe this rigorous analysis will establish a foundational +framework for subsequent research endeavors within the specified domain. + +摘要:工業 5.0 著重於人類與人工智慧 (AI) 合作執行製造中的不同任務,涉及更多機器人、物聯網 (IoT) 裝置和互連、擴增/虛擬實境 (AR) 和其他智慧裝置。這些裝置和互連在經濟、醫療保健、教育和國防系統等各種關鍵領域的廣泛參與,引發了多種類型的潛在安全漏洞。AI 本身已被證明是網路安全不同領域中非常有效且強大的工具,例如入侵偵測、惡意軟體偵測和網路釣魚偵測等。就像在許多應用領域一樣,網路安全專業人員不願意接受黑盒 ML 解決方案來應用於網路安全。這種不願意促使可解釋人工智慧 (XAI) 作為一種工具被採用,有助於說明在基於 ML 的系統中如何做出決策。在這項調查中,我們對工業 5.0 的不同基於 XAI 的入侵偵測系統進行了全面的研究,並且我們也透過對抗式 XIDS (Adv-XIDS) 方法的觀點來探討可解釋性和可詮釋性對網路安全實務的影響。此外,我們分析了工業 5.0 的 XAI 網路安全系統中可能存在的機會和挑戰,引發了未來針對 XAI 基礎解決方案的研究,以供高風險的工業 5.0 應用採用。我們相信這項嚴謹的分析將為指定領域內的後續研究工作建立基礎架構。 + +##### **A Comparative Study on Automatic Coding of Medical Letters with Explainability** +2407.13638v1 by Jamie Glen, Lifeng Han, Paul Rayson, Goran Nenadic + +This study aims to explore the implementation of Natural Language Processing +(NLP) and machine learning (ML) techniques to automate the coding of medical +letters with visualised explainability and light-weighted local computer +settings. Currently in clinical settings, coding is a manual process that +involves assigning codes to each condition, procedure, and medication in a +patient's paperwork (e.g., 56265001 heart disease using SNOMED CT code). There +is preliminary research on automatic coding in this field using +state-of-the-art ML models; however, due to the complexity and size of the +models, real-world deployment has not been achieved. 
To further facilitate the +possibility of automatic coding practice, we explore some solutions in a local +computer setting; in addition, we explore the function of explainability for +transparency of AI models. We used the publicly available MIMIC-III database +and the HAN/HLAN network models for ICD code prediction purposes. We also +experimented with the mapping between ICD and SNOMED CT knowledge bases. In our +experiments, the models provided useful information for 97.98% of codes. The +result of this investigation can shed some light on implementing automatic +clinical coding in practice, such as in hospital settings, on the local +computers used by clinicians. Project page: +https://github.com/Glenj01/Medical-Coding + +摘要:本研究旨在探討將自然語言處理 (NLP) 和機器學習 (ML) 技術實作於醫療信函編碼自動化,並具備視覺化說明能力和輕量化的本地電腦設定。目前在臨床環境中,編碼是一種手動流程,涉及為病患文件中的每項病症、程序和藥物指派代碼 (例如,使用 SNOMED CT 代碼 56265001 表示心臟病)。此領域有使用最新 ML 模型進行自動編碼的初步研究;然而,由於模型的複雜性和大小,並未實現實際部署。為了進一步促進自動編碼實務的可能性,我們在本地電腦設定中探討了一些解決方案;此外,我們探討了說明功能在 AI 模型透明度中的功能。我們使用公開的 MIMIC-III 資料庫和 HAN/HLAN 網路模型進行 ICD 代碼預測。我們還試驗了 ICD 和 SNOMED CT 知識庫之間的對應。在我們的實驗中,這些模型提供了 97.98% 代碼的有用資訊。這項調查結果可以為實務中的自動臨床編碼實作提供一些見解,例如在醫院環境中,由臨床醫生使用的本地電腦。專案頁面:https://github.com/Glenj01/Medical-Coding + +##### **Explainable AI for Enhancing Efficiency of DL-based Channel Estimation** +2407.07009v1 by Abdul Karim Gizzini, Yahia Medjahdi, Ali J. Ghandour, Laurent Clavier + +The support of artificial intelligence (AI) based decision-making is a key +element in future 6G networks, where the concept of native AI will be +introduced. Moreover, AI is widely employed in different critical applications +such as autonomous driving and medical diagnosis. In such applications, using +AI as black-box models is risky and challenging. Hence, it is crucial to +understand and trust the decisions taken by these models. 
Tackling this issue +can be achieved by developing explainable AI (XAI) schemes that aim to explain +the logic behind the black-box model behavior, and thus, ensure its efficient +and safe deployment. Recently, we proposed a novel perturbation-based XAI-CHEST +framework that is oriented toward channel estimation in wireless +communications. The core idea of the XAI-CHEST framework is to identify the +relevant model inputs by inducing high noise on the irrelevant ones. This +manuscript provides the detailed theoretical foundations of the XAI-CHEST +framework. In particular, we derive the analytical expressions of the XAI-CHEST +loss functions and the noise threshold fine-tuning optimization problem. Hence +the designed XAI-CHEST delivers a smart input feature selection methodology +that can further improve the overall performance while optimizing the +architecture of the employed model. Simulation results show that the XAI-CHEST +framework provides valid interpretations, where it offers an improved bit error +rate performance while reducing the required computational complexity in +comparison to the classical DL-based channel estimation. + +摘要:人工智能 (AI) 支持的決策制定是未來 6G 網路中的關鍵元素,其中將引入原生 AI 的概念。此外,AI 廣泛用於不同的關鍵應用中,例如自動駕駛和醫療診斷。在這些應用中,使用 AI 作為黑盒模型是有風險且具有挑戰性的。因此,理解和信任這些模型做出的決策至關重要。解決此問題的方法是開發可解釋 AI (XAI) 架構,旨在解釋黑盒模型行為背後的邏輯,從而確保其有效且安全的部署。最近,我們提出了一個新的基於擾動的 XAI-CHEST 框架,該框架面向無線通信中的信道估計。XAI-CHEST 框架的核心思想是通過在無關輸入上引入高噪聲來識別相關模型輸入。這份手稿提供了 XAI-CHEST 框架的詳細理論基礎。特別是,我們推導了 XAI-CHEST 損失函數和噪聲閾值微調優化問題的解析表達式。因此,設計的 XAI-CHEST 提供了一種智能輸入特徵選擇方法,可以在優化所用模型的架構的同時進一步提高整體性能。模擬結果表明,XAI-CHEST 框架提供了有效的解釋,在降低所需的計算複雜度的同時,提供了改進的比特錯誤率性能,而這與基於傳統 DL 的信道估計相比。 + +##### **Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification** +2407.05440v2 by P. N. Karthikayan, Yoga Sri Varshan V, Hitesh Gupta Kattamuri, Umarani Jayaraman + +This paper presents dilated Residual Network (ResNet) models for disease +classification from retinal fundus images. 
Dilated convolution filters are used +to replace normal convolution filters in the higher layers of the ResNet model +(dilated ResNet) in order to improve the receptive field compared to the normal +ResNet model for disease classification. This study introduces +computer-assisted diagnostic tools that employ deep learning, enhanced with +explainable AI techniques. These techniques aim to make the tool's +decision-making process transparent, thereby enabling medical professionals to +understand and trust the AI's diagnostic decision. They are particularly +relevant in today's healthcare landscape, where there is a growing demand for +transparency in AI applications to ensure their reliability and ethical use. +The dilated ResNet is used as a replacement for the normal ResNet to enhance +the classification accuracy of retinal eye diseases and reduce the required +computing time. The dataset used in this work is the Ocular Disease Intelligent +Recognition (ODIR) dataset which is a structured ophthalmic database with eight +classes covering most of the common retinal eye diseases. The evaluation +metrics used in this work include precision, recall, accuracy, and F1 score. In +this work, a comparative study has been made between normal ResNet models and +dilated ResNet models on five variants namely ResNet-18, ResNet-34, ResNet-50, +ResNet-101, and ResNet-152. The dilated ResNet model shows promising results as +compared to normal ResNet with an average F1 score of 0.71, 0.70, 0.69, 0.67, +and 0.70 respectively for the above respective variants in ODIR multiclass +disease classification. 
+ +摘要:这篇论文提出了用于从视网膜眼底图像进行疾病分类的扩张残差网络 (ResNet) 模型。扩张卷积滤波器用于替换 ResNet 模型较高层中的正常卷积滤波器(扩张 ResNet),以改善感知场,从而针对疾病分类对正常 ResNet 模型进行改进。本研究引入了采用深度学习的计算机辅助诊断工具,并通过可解释的 AI 技术进行了增强。这些技术旨在使该工具的决策过程透明化,从而使医学专业人士能够理解和信任 AI 的诊断决策。它们与当今的医疗保健领域尤为相关,在该领域,对 AI 应用的透明度需求不断增长,以确保其可靠性和合乎道德的使用。扩张 ResNet 用作正常 ResNet 的替代品,以提高视网膜眼部疾病的分类准确性并减少所需的计算时间。本工作中使用的数据集是眼科疾病智能识别 (ODIR) 数据集,这是一个结构化的眼科数据库,包含八类涵盖大多数常见视网膜眼部疾病。本工作中使用的评估指标包括精确度、召回率、准确度和 F1 得分。在这项工作中,对 ResNet-18、ResNet-34、ResNet-50、ResNet-101 和 ResNet-152 五个变体的正常 ResNet 模型和扩张 ResNet 模型进行了比较研究。与正常 ResNet 相比,扩张 ResNet 模型显示出有希望的结果,在 ODIR 多类疾病分类中,上述各个变体的平均 F1 得分为 0.71、0.70、0.69、0.67 和 0.70。 + +##### **A Survey on Trustworthiness in Foundation Models for Medical Image Analysis** +2407.15851v2 by Congzhen Shi, Ryan Rezai, Jiaxi Yang, Qi Dou, Xiaoxiao Li + +The rapid advancement of foundation models in medical imaging represents a +significant leap toward enhancing diagnostic accuracy and personalized +treatment. However, the deployment of foundation models in healthcare +necessitates a rigorous examination of their trustworthiness, encompassing +privacy, robustness, reliability, explainability, and fairness. The current +body of survey literature on foundation models in medical imaging reveals +considerable gaps, particularly in the area of trustworthiness. Additionally, +existing surveys on the trustworthiness of foundation models do not adequately +address their specific variations and applications within the medical imaging +domain. This survey aims to fill that gap by presenting a novel taxonomy of +foundation models used in medical imaging and analyzing the key motivations for +ensuring their trustworthiness. We review current research on foundation models +in major medical imaging applications, focusing on segmentation, medical report +generation, medical question and answering (Q\&A), and disease diagnosis. 
These +areas are highlighted because they have seen a relatively mature and +substantial number of foundation models compared to other applications. We +focus on literature that discusses trustworthiness in medical image analysis +manuscripts. We explore the complex challenges of building trustworthy +foundation models for each application, summarizing current concerns and +strategies for enhancing trustworthiness. Furthermore, we examine the potential +of these models to revolutionize patient care. Our analysis underscores the +imperative for advancing towards trustworthy AI in medical image analysis, +advocating for a balanced approach that fosters innovation while ensuring +ethical and equitable healthcare delivery. + +摘要:基礎模型在醫學影像方面的快速進展,代表著在加強診斷準確性和個人化治療方面邁出一大步。然而,基礎模型在醫療保健中的部署需要對其可信度進行嚴格的審查,包括隱私、穩健性、可靠性、可解釋性和公平性。目前關於醫學影像中基礎模型的調查文獻中顯示出相當大的差距,特別是在可信度方面。此外,現有關於基礎模型可信度的調查並未充分解決其在醫學影像領域中的特定變化和應用。本調查旨在通過提出醫學影像中使用的基礎模型的新分類法並分析確保其可信度的關鍵動機,來填補這一空白。我們回顧了基礎模型在主要醫學影像應用中的當前研究,重點關注分割、醫療報告生成、醫療問題和回答 (Q&A) 以及疾病診斷。這些領域之所以被強調,是因為與其他應用相比,它們已經看到相對成熟且大量的基礎模型。我們專注於探討醫學影像分析手稿中可信度的文獻。我們探討了為每個應用構建可信基礎模型的複雜挑戰,總結了當前關注點和增強可信度的策略。此外,我們探討了這些模型在革新患者護理方面的潛力。我們的分析強調了在醫學影像分析中朝著可信賴的人工智慧邁進的必要性,並倡導一種平衡的方法,既能促進創新,又能確保道德和公平的醫療保健服務。 + +##### **The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data** +2407.06206v1 by Ximing Wen, Rosina O. Weber, Anik Sen, Darryl Hannan, Steven C. Nesbit, Vincent Chan, Alberto Goffi, Michael Morris, John C. Hunninghake, Nicholas E. Villalobos, Edward Kim, Christopher J. MacLellan + +Point-of-Care Ultrasound (POCUS) is the practice of clinicians conducting and +interpreting ultrasound scans right at the patient's bedside. However, the +expertise needed to interpret these images is considerable and may not always +be present in emergency situations. This reality makes algorithms such as +machine learning classifiers extremely valuable to augment human decisions. 
+POCUS devices are becoming available at a reasonable cost in the size of a +mobile phone. The challenge of turning POCUS devices into life-saving tools is +that interpretation of ultrasound images requires specialist training and +experience. Unfortunately, the difficulty to obtain positive training images +represents an important obstacle to building efficient and accurate +classifiers. Hence, the problem we try to investigate is how to explore +strategies to increase accuracy of classifiers trained with scarce data. We +hypothesize that training with a few data instances may not suffice for +classifiers to generalize causing them to overfit. Our approach uses an +Explainable AI-Augmented approach to help the algorithm learn more from less +and potentially help the classifier better generalize. + +摘要:床邊超音波 (POCUS) 是臨床醫師在患者床邊進行和解讀超音波掃描的實務。然而,解讀這些影像所需的專業知識相當可觀,而且在緊急情況下可能並非隨時具備。這種現實情況使得機器學習分類器等演算法對於加強人類決策變得極為有價值。POCUS 裝置正以合理成本推出,尺寸為手機大小。將 POCUS 裝置轉變為救生工具的挑戰在於,解讀超音波影像需要專門訓練和經驗。不幸的是,取得正向訓練影像的困難度代表著建置有效率且準確的分類器的一大障礙。因此,我們嘗試探討的問題是如何探索策略,以提高使用稀疏資料訓練的分類器的準確度。我們假設使用少數資料實例進行訓練可能不足以讓分類器概括,導致它們過度擬合。我們的做法使用可解釋 AI 增強方法,以協助演算法從較少的資料中學習更多,並潛在協助分類器更好地概括。 + +##### **Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach** +2407.00167v1 by Sai Krishna Revanth Vuruma, Dezhi Wu, Saborny Sen Gupta, Lucas Aust, Valerie Lookingbill, Wyatt Bellamy, Yang Ren, Erin Kasson, Li-Shiun Chen, Patricia Cavazos-Rehg, Dian Hu, Ming Huang + +In recent years, the United States has witnessed a significant surge in the +popularity of vaping or e-cigarette use, leading to a notable rise in cases of +e-cigarette and vaping use-associated lung injury (EVALI) that caused +hospitalizations and fatalities during the EVALI outbreak in 2019, highlighting +the urgency to comprehend vaping behaviors and develop effective strategies for +cessation. 
Due to the ubiquity of social media platforms, over 4.7 billion +users worldwide use them for connectivity, communications, news, and +entertainment with a significant portion of the discourse related to health, +thereby establishing social media data as an invaluable organic data resource +for public health research. In this study, we extracted a sample dataset from +one vaping sub-community on Reddit to analyze users' quit-vaping intentions. +Leveraging OpenAI's latest large language model GPT-4 for sentence-level quit +vaping intention detection, this study compares the outcomes of this model +against layman and clinical expert annotations. Using different prompting +strategies such as zero-shot, one-shot, few-shot and chain-of-thought +prompting, we developed 8 prompts with varying levels of detail to explain the +task to GPT-4 and also evaluated the performance of the strategies against each +other. These preliminary findings emphasize the potential of GPT-4 in social +media data analysis, especially in identifying users' subtle intentions that +may elude human detection. + +摘要:近年來,美國見證了電子煙或電子香菸使用率大幅激增,導致電子煙和電子煙使用相關肺損傷 (EVALI) 病例顯著增加,在 2019 年 EVALI 爆發期間造成住院和死亡,凸顯了理解電子煙行為和制定有效戒菸策略的迫切性。由於社群媒體平台的普及,全球超過 47 億使用者使用它們進行連結、溝通、新聞和娛樂,其中很大一部分與健康相關,因此將社群媒體資料建立為公共衛生研究中無價的有機資料資源。在本研究中,我們從 Reddit 上一個電子煙子社群中提取一個範例資料集,以分析使用者的戒電子煙意圖。利用 OpenAI 最新的大型語言模型 GPT-4 進行句子層級的戒電子煙意圖偵測,本研究比較了此模型的結果與外行人和臨床專家註解。使用不同的提示策略,例如零次學習、一次學習、少次學習和思考鏈提示,我們開發了 8 個提示,詳細程度不同,向 GPT-4 解釋任務,並評估這些策略彼此之間的效能。這些初步發現強調了 GPT-4 在社群媒體資料分析中的潛力,特別是在識別人類偵測可能無法察覺的使用者微妙意圖方面。 + +##### **Towards Compositional Interpretability for XAI** +2406.17583v1 by Sean Tull, Robin Lorenz, Stephen Clark, Ilyas Khan, Bob Coecke + +Artificial intelligence (AI) is currently based largely on black-box machine +learning models which lack interpretability. The field of eXplainable AI (XAI) +strives to address this major concern, being critical in high-stakes areas such +as the finance, legal and health sectors. 
+ We present an approach to defining AI models and their interpretability based +on category theory. For this we employ the notion of a compositional model, +which sees a model in terms of formal string diagrams which capture its +abstract structure together with its concrete implementation. This +comprehensive view incorporates deterministic, probabilistic and quantum +models. We compare a wide range of AI models as compositional models, including +linear and rule-based models, (recurrent) neural networks, transformers, VAEs, +and causal and DisCoCirc models. + Next we give a definition of interpretation of a model in terms of its +compositional structure, demonstrating how to analyse the interpretability of a +model, and using this to clarify common themes in XAI. We find that what makes +the standard 'intrinsically interpretable' models so transparent is brought out +most clearly diagrammatically. This leads us to the more general notion of +compositionally-interpretable (CI) models, which additionally include, for +instance, causal, conceptual space, and DisCoCirc models. + We next demonstrate the explainability benefits of CI models. Firstly, their +compositional structure may allow the computation of other quantities of +interest, and may facilitate inference from the model to the modelled +phenomenon by matching its structure. Secondly, they allow for diagrammatic +explanations for their behaviour, based on influence constraints, diagram +surgery and rewrite explanations. Finally, we discuss many future directions +for the approach, raising the question of how to learn such meaningfully +structured models in practice. 
+ +摘要:人工智慧(AI)目前在很大程度上依賴於缺乏可解釋性的黑盒機器學習模型。可解釋性人工智慧(XAI)領域致力於解決這個主要問題,這在金融、法律和健康等高風險領域至關重要。 +我們提出了一種基於範疇論定義 AI 模型及其可解釋性的方法。為此,我們採用組合模型的概念,它以形式弦圖的形式看待模型,這些弦圖捕獲了模型的抽象結構及其具體實現。這種綜合觀點包含了確定性、概率性和量子模型。我們將各種 AI 模型作為組合模型進行比較,包括線性和基於規則的模型、(遞迴)神經網路、Transformer、VAE,以及因果和 DisCoCirc 模型。 +接下來,我們根據模型的組合結構給出模型解釋的定義,展示如何分析模型的可解釋性,並使用它來澄清 XAI 中的常見主題。我們發現,讓標準的「內在可解釋」模型如此透明的原因在圖表中表現得最為清楚。這引導我們得出更一般的組合可解釋(CI)模型概念,它另外還包括因果、概念空間和 DisCoCirc 模型。 +接下來,我們展示了 CI 模型的可解釋性優勢。首先,它們的組合結構允許計算其他感興趣的量,並可能通過匹配模型的結構來促進從模型到被建模現象的推理。其次,它們允許對其行為進行圖解說明,這些說明基於影響約束、圖解手術和重寫說明。最後,我們討論了這種方法的許多未來方向,提出了如何在實踐中學習這種有意義的結構化模型的問題。 + +##### **Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods** +2406.12142v2 by Vincent Olesen, Nina Weng, Aasa Feragen, Eike Petersen + +Machine learning models have achieved high overall accuracy in medical image +analysis. However, performance disparities on specific patient groups pose +challenges to their clinical utility, safety, and fairness. This can affect +known patient groups - such as those based on sex, age, or disease subtype - as +well as previously unknown and unlabeled groups. Furthermore, the root cause of +such observed performance disparities is often challenging to uncover, +hindering mitigation efforts. In this paper, to address these issues, we +leverage Slice Discovery Methods (SDMs) to identify interpretable +underperforming subsets of data and formulate hypotheses regarding the cause of +observed performance disparities. We introduce a novel SDM and apply it in a +case study on the classification of pneumothorax and atelectasis from chest +x-rays. Our study demonstrates the effectiveness of SDMs in hypothesis +formulation and yields an explanation of previously observed but unexplained +performance disparities between male and female patients in widely used chest +X-ray datasets and models. 
Our findings indicate shortcut learning in both +classification tasks, through the presence of chest drains and ECG wires, +respectively. Sex-based differences in the prevalence of these shortcut +features appear to cause the observed classification performance gap, +representing a previously underappreciated interaction between shortcut +learning and model fairness analyses. + +摘要:機器學習模型在醫學影像分析中已達到整體高準確度。然而,特定患者群體的效能差異對其臨床效用、安全性與公平性構成挑戰。這可能會影響已知的患者群體(例如基於性別、年齡或疾病亞型)以及先前未知且未標籤的群體。此外,此類觀察到的效能差異的根本原因通常難以發現,阻礙了緩解措施。在本文中,為了解決這些問題,我們利用切片發現方法 (SDM) 來識別可解釋的資料效能不佳子集,並針對觀察到的效能差異原因制定假設。我們引入一種新的 SDM,並在胸部 X 光片中氣胸和肺不張分類的案例研究中應用它。我們的研究證明了 SDM 在假設制定中的有效性,並對廣泛使用的胸部 X 光片資料集和模型中先前觀察到但無法解釋的男性和女性患者之間的效能差異提供了解釋。我們的發現表明,兩項分類任務中都存在捷徑學習,分別來自胸腔引流管和心電圖導線的存在。這些捷徑特徵的盛行率存在基於性別的差異,似乎會導致觀察到的分類效能差距,這代表捷徑學習和模型公平性分析之間先前未受到重視的交互作用。 + +##### **Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health** +2406.07114v2 by Fatemeh Ebrahimzadeh, Ramin Safa + +The concept of Metaverse has attracted a lot of attention in various fields +and one of its important applications is health and treatment. The Metaverse +has enormous potential to transform healthcare by changing patient care, +medical education, and the way teaching/learning and research are done. The +purpose of this research is to provide an introduction to the basic concepts +and fundamental technologies of the Metaverse. This paper examines the pros and +cons of the Metaverse in the healthcare context and analyzes its potential from the +technology and AI perspective. In particular, the role of machine learning +methods is discussed; we will explain how machine learning algorithms can be +applied to Metaverse-generated data to gain better insights in healthcare +applications. Additionally, we examine the future visions of the Metaverse in +health delivery, by examining emerging technologies such as blockchain and also +addressing privacy concerns. 
The findings of this study contribute to a deeper +understanding of the applications of Metaverse in healthcare and its potential +to revolutionize the delivery of medical services. + +摘要:元宇宙的概念在各個領域都備受關注,其重要應用之一便是醫療保健。元宇宙有巨大的潛力透過改變病患照護、醫學教育,以及教學/學習和研究的方式來轉型醫療保健。本研究的目的是提供元宇宙基本概念和基礎技術的介紹。本文探討了元宇宙在醫療保健背景下的優缺點,並從技術和 AI 的角度分析其潛力。特別是,討論了機器學習方法的角色;我們將說明如何將機器學習演算法應用於元宇宙產生的資料,以獲得醫療保健應用方面的更佳見解。此外,我們透過探討區塊鏈等新興技術,並解決隱私問題,來探討元宇宙在醫療保健方面的未來願景。本研究的發現有助於更深入地了解元宇宙在醫療保健中的應用,以及其在醫療服務提供方面發揮革命性變革的潛力。 + +##### **AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI** +2406.06728v2 by K M Tawsik Jawad, Anusha Verma, Fathi Amsaad, Lamia Ashraf + +Chronic Kidney Disease (CKD) is one of the widespread Chronic diseases with +no known ultimo cure and high morbidity. Research demonstrates that progressive +Chronic Kidney Disease (CKD) is a heterogeneous disorder that significantly +impacts kidney structure and functions, eventually leading to kidney failure. +With the progression of time, chronic kidney disease has moved from a +life-threatening disease affecting few people to a common disorder of varying +severity. The goal of this research is to visualize dominating features, +feature scores, and values exhibited for early prognosis and detection of CKD +using ensemble learning and explainable AI. For that, an AI-driven predictive +analytics approach is proposed to aid clinical practitioners in prescribing +lifestyle modifications for individual patients to reduce the rate of +progression of this disease. Our dataset is collected on body vitals from +individuals with CKD and healthy subjects to develop our proposed AI-driven +solution accurately. In this regard, blood and urine test results are provided, +and ensemble tree-based machine-learning models are applied to predict unseen +cases of CKD. Our research findings are validated after lengthy consultations +with nephrologists. 
Our experiments and interpretation results are compared +with existing explainable AI applications in various healthcare domains, +including CKD. The comparison shows that our developed AI models, particularly +the Random Forest model, have identified more features as significant +contributors than XgBoost. Interpretability (I), which measures the ratio of +important to masked features, indicates that our XgBoost model achieved a +higher score, specifically a Fidelity of 98\%, in this metric and naturally in +the FII index compared to competing models. + +摘要:慢性腎臟病 (CKD) 是一種廣泛的慢性疾病,目前尚未找到最終的治療方法,且發病率很高。研究表明,進行性慢性腎臟病 (CKD) 是一種異質性疾病,會顯著影響腎臟結構和功能,最終導致腎衰竭。隨著時間的推移,慢性腎臟病已從影響少數人的致命疾病演變成一種嚴重程度不一的常見疾病。本研究的目標是使用整體學習和可解釋的 AI 來視覺化支配性特徵、特徵分數和值,以進行 CKD 的早期預後和檢測。為此,提出了一種 AI 驅動的預測分析方法,以幫助臨床醫生為個別患者開具生活方式的修改建議,以降低此疾病的進展速度。我們的數據集是從 CKD 患者和健康受試者的身體生命徵象中收集的,以準確開發我們提出的 AI 驅動的解決方案。在這方面,提供了血液和尿液檢測結果,並應用基於集成樹的機器學習模型來預測未見的 CKD 病例。我們的研究結果在與腎臟科醫師進行長時間諮詢後得到驗證。我們的實驗和解釋結果與各種醫療保健領域中現有的可解釋 AI 應用進行了比較,包括 CKD。比較表明,我們開發的 AI 模型,特別是隨機森林模型,已經確定了比 XgBoost 更多的特徵作為顯著的貢獻者。可解釋性 (I) 衡量重要特徵與被遮蔽特徵的比率,表明我們的 XgBoost 模型在此指標中取得了更高的分數,特別是 98% 的保真度,並且在 FII 指數中自然高於競爭模型。 + +##### **Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook** +2406.05984v1 by Yusif Ibrahimov, Tarique Anwar, Tommy Yuan + +Mental health constitutes a complex and pervasive global challenge, affecting +millions of lives and often leading to severe consequences. In this paper, we +conduct a thorough survey to explore the intersection of data science, +artificial intelligence, and mental healthcare, focusing on the recent +developments of mental disorder detection through online social media (OSM). A +significant portion of the population actively engages in OSM platforms, +creating a vast repository of personal data that holds immense potential for +mental health analytics. 
The paper navigates through traditional diagnostic +methods, state-of-the-art data- and AI-driven research studies, and the +emergence of explainable AI (XAI) models for mental healthcare. We review +state-of-the-art machine learning methods, particularly those based on modern +deep learning, while emphasising the need for explainability in healthcare AI +models. The experimental design section provides insights into prevalent +practices, including available datasets and evaluation approaches. We also +identify key issues and challenges in the field and propose promising future +research directions. As mental health decisions demand transparency, +interpretability, and ethical considerations, this paper contributes to the +ongoing discourse on advancing XAI in mental healthcare through social media. +The comprehensive overview presented here aims to guide researchers, +practitioners, and policymakers in developing the area of mental disorder +detection. + +摘要:心理健康構成了一項複雜且普遍的全球挑戰,影響了數百萬人的生活,並經常導致嚴重的後果。在本文中,我們進行了一項徹底的調查,以探索數據科學、人工智慧和心理保健的交集,重點關注通過線上社交媒體 (OSM) 進行心理疾病檢測的最新發展。很大一部分人口積極參與 OSM 平台,創造了一個龐大的人員資料庫,對心理健康分析具有巨大的潛力。本文探討了傳統的診斷方法、最先進的資料和 AI 驅動的研究,以及心理保健中可解釋 AI (XAI) 模型的出現。我們回顧了最先進的機器學習方法,特別是那些基於現代深度學習的方法,同時強調了醫療保健 AI 模型中可解釋性的必要性。實驗設計部分提供了對普遍做法的見解,包括可用的資料集和評估方法。我們還找出該領域的主要問題和挑戰,並提出了有希望的未來研究方向。由於心理健康決策需要透明度、可解釋性和道德考量,本文有助於推進心理保健中透過社交媒體推進 XAI 的持續討論。這裡提出的全面概述旨在引導研究人員、從業人員和政策制定者發展心理疾病檢測領域。 + +##### **Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance** +2406.05746v1 by Zhan Zhang, Qin Zhang, Yang Jiao, Lin Lu, Lin Ma, Aihua Liu, Xiao Liu, Juan Zhao, Yajun Xue, Bing Wei, Mingxia Zhang, Ru Gao, Hong Zhao, Jie Lu, Fan Li, Yang Zhang, Yiming Wang, Lei Zhang, Fengwei Tian, Jie Hu, Xin Gou + +AI-aided clinical diagnosis is desired in medical care. Existing deep +learning models lack explainability and mainly focus on image analysis. 
The +recently developed Dynamic Uncertain Causality Graph (DUCG) approach is +causality-driven, explainable, and invariant across different application +scenarios, without problems of data collection, labeling, fitting, privacy, +bias, generalization, high cost and high energy consumption. Through close +collaboration between clinical experts and DUCG technicians, 46 DUCG models +covering 54 chief complaints were constructed. Over 1,000 diseases can be +diagnosed without triage. Before being applied in real-world, the 46 DUCG +models were retrospectively verified by third-party hospitals. The verified +diagnostic precisions were no less than 95%, in which the diagnostic precision +for every disease including uncommon ones was no less than 80%. After +verifications, the 46 DUCG models were applied in the real-world in China. Over +one million real diagnosis cases have been performed, with only 17 incorrect +diagnoses identified. Due to DUCG's transparency, the mistakes causing the +incorrect diagnoses were found and corrected. The diagnostic abilities of the +clinicians who applied DUCG frequently were improved significantly. Following +the introduction to the earlier presented DUCG methodology, the recommendation +algorithm for potential medical checks is presented and the key idea of DUCG is +extracted. 
+ +摘要:醫療照護中需要 AI 輔助的臨床診斷。現有的深度學習模型缺乏可解釋性,並且主要專注於影像分析。最近開發的動態不確定因果關係圖 (DUCG) 方法是因果驅動的、可解釋的,並且在不同的應用場景中是不變的,沒有資料收集、標記、擬合、隱私、偏見、概化、高成本和高能耗的問題。通過臨床專家和 DUCG 技術人員之間的密切合作,構建了涵蓋 54 個主訴的 46 個 DUCG 模型。可以在沒有分流的情況下診斷出 1,000 多種疾病。在應用於實際世界之前,46 個 DUCG 模型已由第三方醫院回溯性驗證。驗證的診斷精度不低於 95%,其中包括罕見疾病在內的每種疾病的診斷精度不低於 80%。驗證後,46 個 DUCG 模型已在中國實際應用。已經執行了超過一百萬個真實診斷案例,僅發現 17 個不正確的診斷。由於 DUCG 的透明性,發現並糾正了導致不正確診斷的錯誤。頻繁應用 DUCG 的臨床醫生的診斷能力得到了顯著提高。在介紹了前面提出的 DUCG 方法論之後,提出了潛在健康檢查的推薦演算法,並提取了 DUCG 的關鍵思想。 + +##### **Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability** +2406.12897v1 by Faseela Abdullakutty, Younes Akbari, Somaya Al-Maadeed, Ahmed Bouridane, Rifat Hamoudi + +It is imperative that breast cancer is detected precisely and timely to +improve patient outcomes. Diagnostic methodologies have traditionally relied on +unimodal approaches; however, medical data analytics is integrating diverse +data sources beyond conventional imaging. Using multi-modal techniques, +integrating both image and non-image data, marks a transformative advancement +in breast cancer diagnosis. The purpose of this review is to explore the +burgeoning field of multimodal techniques, particularly the fusion of +histopathology images with non-image data. Further, Explainable AI (XAI) will +be used to elucidate the decision-making processes of complex algorithms, +emphasizing the necessity of explainability in diagnostic processes. This +review utilizes multi-modal data and emphasizes explainability to enhance +diagnostic accuracy, clinician confidence, and patient engagement, ultimately +fostering more personalized treatment strategies for breast cancer, while also +identifying research gaps in multi-modality and explainability, guiding future +studies, and contributing to the strategic direction of the field. 
+ +摘要:精確且及時地偵測乳癌對於改善患者預後至關重要。診斷方法傳統上依賴於單一模式方法;然而,醫療資料分析正在整合超越傳統影像的各種資料來源。使用整合影像和非影像資料的多模式技術,標誌著乳癌診斷的變革性進展。本篇綜述的目的是探討多模式技術的新興領域,特別是將組織病理學影像與非影像資料融合。此外,可解釋人工智慧 (XAI) 將用於闡明複雜演算法的決策過程,強調診斷過程中可解釋性的必要性。本綜述利用多模式資料並強調可解釋性,以提高診斷準確性、臨床醫師的信心和患者參與度,最終促進乳癌更個人化的治療策略,同時也找出多模式和可解釋性的研究差距,引導未來的研究,並為該領域的策略方向做出貢獻。 + +##### **Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection** +2406.16908v3 by Dinuka Sandun Udayantha, Kavindu Weerasinghe, Nima Wickramasinghe, Akila Abeyratne, Kithmin Wickremasinghe, Jithangi Wanigasinghe, Anjula De Silva, Chamira U. S. Edussooriya + +The neonatal period is the most vulnerable time for the development of +seizures. Seizures in the immature brain lead to detrimental consequences +and therefore require early diagnosis. The gold standard for neonatal seizure +detection currently relies on continuous video-EEG monitoring, which involves +recording multi-channel electroencephalogram (EEG) alongside real-time video +monitoring within a neonatal intensive care unit (NICU). However, video-EEG +monitoring technology requires clinical expertise and is often limited to +technologically advanced and resourceful settings. Cost-effective new +techniques could help the medical fraternity make an accurate diagnosis and +advocate treatment without delay. In this work, a novel explainable deep +learning model to automate the neonatal seizure detection process with a +reduced EEG montage is proposed, which employs convolutional nets, graph +attention layers, and fully connected layers. Beyond its ability to detect +seizures in real-time with a reduced montage, this model offers the unique +advantage of real-time interpretability. By evaluating the performance on the +Zenodo dataset with 10-fold cross-validation, the presented model achieves an +absolute improvement of 8.31% and 42.86% in area under curve (AUC) and recall, +respectively. 
+ +摘要:新生兒期是大腦發育最脆弱的時期,容易出現癲癇發作。大腦發育不成熟時出現癲癇發作會造成不良後果,因此需要及早診斷。目前新生兒癲癇發作的黃金標準依賴於連續的視訊腦電圖 (EEG) 監測;其中包括在新生兒加護病房 (NICU) 內同時進行多頻道腦電圖 (EEG) 記錄和即時視訊監控。然而,視訊腦電圖監控技術需要臨床專業知識,而且通常僅限於技術先進且資源豐富的環境。具成本效益的新技術可以幫助醫療界準確診斷並立即提倡治療。在這項工作中,提出了一個新穎的可解釋深度學習模型,以自動化新生兒癲癇發作偵測過程,並採用減少的腦電圖裝置,其中採用了卷積神經網路、圖形注意力層和全連接層。除了能夠使用減少的裝置即時偵測癲癇發作外,此模型還提供了即時可解釋性的獨特優勢。透過在 Zenodo 資料集上使用 10 倍交叉驗證評估效能,所提出的模型在曲線下面積 (AUC) 和召回率方面分別達到了 8.31% 和 42.86% 的絕對改善。 + +##### **Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques** +2406.00532v1 by Samita Bai, Sidra Nasir, Rizwan Ahmed Khan, Sheeraz Arif, Alexandre Meyer, Hubert Konik + +Breast cancer (BC) stands as one of the most common malignancies affecting +women worldwide, necessitating advancements in diagnostic methodologies for +better clinical outcomes. This article provides a comprehensive exploration of +the application of Explainable Artificial Intelligence (XAI) techniques in the +detection and diagnosis of breast cancer. As Artificial Intelligence (AI) +technologies continue to permeate the healthcare sector, particularly in +oncology, the need for transparent and interpretable models becomes imperative +to enhance clinical decision-making and patient care. This review discusses the +integration of various XAI approaches, such as SHAP, LIME, Grad-CAM, and +others, with machine learning and deep learning models utilized in breast +cancer detection and classification. By investigating the modalities of breast +cancer datasets, including mammograms, ultrasounds and their processing with +AI, the paper highlights how XAI can lead to more accurate diagnoses and +personalized treatment plans. It also examines the challenges in implementing +these techniques and the importance of developing standardized metrics for +evaluating XAI's effectiveness in clinical settings. 
Through detailed analysis +and discussion, this article aims to highlight the potential of XAI in bridging +the gap between complex AI models and practical healthcare applications, +thereby fostering trust and understanding among medical professionals and +improving patient outcomes. + +摘要:乳癌 (BC) 是影響全球女性最常見的惡性腫瘤之一,因此需要進步的診斷方法,以改善臨床結果。本文全面探討了可解釋人工智慧 (XAI) 技術在乳癌偵測和診斷中的應用。隨著人工智慧 (AI) 技術持續滲透醫療保健領域,特別是在腫瘤學中,透明且可解釋的模型需求變得勢在必行,以增強臨床決策制定和患者照護。此篇評論探討了各種 XAI 方法的整合,例如 SHAP、LIME、Grad-CAM 等,以及用於乳癌偵測和分類的機器學習和深度學習模型。透過探討乳癌資料集的模式,包括乳房攝影、超音波及其在 AI 中的處理,本文重點說明 XAI 如何能導致更準確的診斷和個人化治療計畫。它也探討了實施這些技術的挑戰,以及制定標準化評量指標以評估 XAI 在臨床環境中的有效性的重要性。透過詳細的分析和討論,本文旨在強調 XAI 在縮小複雜 AI 模型與實務醫療保健應用之間差距的潛力,進而促進醫療專業人員之間的信任與理解,並改善患者的結果。 + +##### **Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition** +2406.01624v2 by Alaa Nfissi, Wassim Bouachir, Nizar Bouguila, Brian Mishara + +Speech emotion recognition (SER) has gained significant attention due to its +several application fields, such as mental health, education, and +human-computer interaction. However, the accuracy of SER systems is hindered by +high-dimensional feature sets that may contain irrelevant and redundant +information. To overcome this challenge, this study proposes an iterative +feature boosting approach for SER that emphasizes feature relevance and +explainability to enhance machine learning model performance. Our approach +involves meticulous feature selection and analysis to build efficient SER +systems. In addressing our main problem through model explainability, we employ +a feature evaluation loop with Shapley values to iteratively refine feature +sets. This process strikes a balance between model performance and +transparency, which enables a comprehensive understanding of the model's +predictions. The proposed approach offers several advantages, including the +identification and removal of irrelevant and redundant features, leading to a +more effective model. 
Additionally, it promotes explainability, facilitating +comprehension of the model's predictions and the identification of crucial +features for emotion determination. The effectiveness of the proposed method is +validated on the SER benchmarks of the Toronto emotional speech set (TESS), +Berlin Database of Emotional Speech (EMO-DB), Ryerson Audio-Visual Database of +Emotional Speech and Song (RAVDESS), and Surrey Audio-Visual Expressed Emotion +(SAVEE) datasets, outperforming state-of-the-art methods. To the best of our +knowledge, this is the first work to incorporate model explainability into an +SER framework. The source code of this paper is publicly available via this +https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition. + +摘要:語音情緒辨識 (SER) 由於其在心理健康、教育和人機互動等多個應用領域而備受關注。然而,SER 系統的準確性受到高維特徵集的阻礙,這些特徵集可能包含不相關和冗餘的資訊。為了克服這個挑戰,本研究提出了一種用於 SER 的迭代特徵提升方法,該方法強調特徵相關性和可解釋性,以增強機器學習模型的效能。我們的做法涉及仔細的特徵選擇和分析,以建立高效的 SER 系統。為了透過模型可解釋性解決我們的核心問題,我們採用了具有 Shapley 值的特徵評估迴圈,以反覆改善特徵集。這個過程在模型效能和透明度之間取得平衡,這使得我們能夠全面了解模型的預測。所提出的方法提供了多項優點,包括識別和移除不相關和冗餘的特徵,從而建立更有效的模型。此外,它促進了可解釋性,有助於理解模型的預測以及識別情緒決定的關鍵特徵。所提出的方法的有效性已在多倫多情緒語音集 (TESS)、柏林情緒語音資料庫 (EMO-DB)、賴爾森音訊視覺情緒語音和歌曲資料庫 (RAVDESS) 和薩里音訊視覺表達情緒 (SAVEE) 資料集的 SER 基準上得到驗證,其效能優於現有方法。據我們所知,這是第一個將模型可解釋性納入 SER 架構的研究。本文的原始碼可透過此連結公開取得:https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition。 + +##### **The Explanation Necessity for Healthcare AI** +2406.00216v1 by Michail Mamalakis, Héloïse de Vareilles, Graham Murray, Pietro Lio, John Suckling + +Explainability is often critical to the acceptable implementation of +artificial intelligence (AI). Nowhere is this more important than healthcare +where decision-making directly impacts patients and trust in AI systems is +essential. This trust is often built on the explanations and interpretations +the AI provides. 
Despite significant advancements in AI interpretability, there +remains the need for clear guidelines on when and to what extent explanations +are necessary in the medical context. We propose a novel categorization system +with four distinct classes of explanation necessity, guiding the level of +explanation required: patient or sample (local) level, cohort or dataset +(global) level, or both levels. We introduce a mathematical formulation that +distinguishes these categories and offers a practical framework for researchers +to determine the necessity and depth of explanations required in medical AI +applications. Three key factors are considered: the robustness of the +evaluation protocol, the variability of expert observations, and the +representation dimensionality of the application. In this perspective, we +address the question: When does an AI medical application need to be explained, +and at what level of detail? + +摘要:可解释性通常对于人工智能 (AI) 的可接受实施至关重要。在医疗保健领域,这一点尤为重要,因为决策直接影响患者,并且对 AI 系统的信任至关重要。这种信任通常建立在 AI 提供的解释和诠释之上。尽管 AI 可解释性取得了重大进展,但仍然需要明确的指导方针,说明在医疗环境中何时以及在多大程度上需要解释。我们提出了一种新颖的分类系统,该系统具有四种不同的解释必要性类别,指导所需的解释级别:患者或样本(局部)级别、队列或数据集(全局)级别,或两个级别。我们引入了一个数学公式,该公式区分了这些类别,并为研究人员提供了一个实用框架,以确定医疗 AI 应用中所需的解释的必要性和深度。考虑了三个关键因素:评估协议的稳健性、专家观察的可变性以及应用程序的表示维数。从这个角度来看,我们解决了这个问题:AI 医疗应用何时需要解释,以及需要解释到何种程度? + +##### **Interdisciplinary Expertise to Advance Equitable Explainable AI** +2406.18563v1 by Chloe R. Bennett, Heather Cole-Lewis, Stephanie Farquhar, Naama Haamel, Boris Babenko, Oran Lang, Mat Fleck, Ilana Traynis, Charles Lau, Ivor Horn, Courtney Lyles + +The field of artificial intelligence (AI) is rapidly influencing health and +healthcare, but bias and poor performance persists for populations who face +widespread structural oppression. Previous work has clearly outlined the need +for more rigorous attention to data representativeness and model performance to +advance equity and reduce bias. 
However, there is an opportunity to also +improve the explainability of AI by leveraging best practices of social +epidemiology and health equity to help us develop hypotheses for associations +found. In this paper, we focus on explainable AI (XAI) and describe a framework +for interdisciplinary expert panel review to discuss and critically assess AI +model explanations from multiple perspectives and identify areas of bias and +directions for future research. We emphasize the importance of the +interdisciplinary expert panel to produce more accurate, equitable +interpretations that are historically and contextually informed. +Interdisciplinary panel discussions can help reduce bias, identify potential +confounders, and identify opportunities for additional research where there are +gaps in the literature. In turn, these insights can suggest opportunities for +AI model improvement. + +摘要:人工智慧 (AI) 領域正快速影響著健康與醫療保健,但對於面臨廣泛結構性壓迫的人群來說,偏見和不良表現依然存在。先前的研究已清楚說明,需要更嚴格地注意資料代表性和模型效能,以促進公平性並減少偏見。然而,我們有機會透過運用社會流行病學和健康公平的最佳實務,來改善 AI 的可解釋性,以幫助我們針對發現的關聯性,發展假設。在本文中,我們專注於可解釋 AI (XAI),並描述一個跨領域專家小組審查架構,以從多重觀點討論和批判性評估 AI 模型的解釋,並找出偏見領域和未來研究的方向。我們強調跨領域專家小組對於產生更準確、公平的詮釋至關重要,而這些詮釋是根據歷史和脈絡而來的。跨領域小組討論有助於減少偏見、找出潛在的混淆因素,並在文獻中有缺口時找出額外研究的機會。反過來,這些見解可以建議 AI 模型改進的機會。 + +##### **"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts** +2407.11978v1 by Hubert D. Zając, Jorge M. N. Ribeiro, Silvia Ingala, Simona Gentile, Ruth Wanjohi, Samuel N. Gitau, Jonathan F. Carlsen, Michael B. Nielsen, Tariq O. Andersen + +Artificial Intelligence (AI) systems repeatedly match or outperform radiologists in +lab experiments. However, real-world implementations of radiological AI-based +systems are found to provide little to no clinical value. This paper explores +how to design AI for clinical usefulness in different contexts. 
We conducted 19 +design sessions and design interventions with 13 radiologists from 7 clinical +sites in Denmark and Kenya, based on three iterations of a functional AI-based +prototype. Ten sociotechnical dependencies were identified as crucial for the +design of AI in radiology. We conceptualised four technical dimensions that +must be configured to the intended clinical context of use: AI functionality, +AI medical focus, AI decision threshold, and AI Explainability. We present four +design recommendations on how to address dependencies pertaining to the medical +knowledge, clinic type, user expertise level, patient context, and user +situation that condition the configuration of these technical dimensions. + +摘要:人工智慧(AI)在實驗室實驗中不斷地與放射科醫師匹敵或表現得更出色。然而,發現放射科 AI 為基礎系統的實際執行幾乎沒有提供臨床價值。本文探討如何為 AI 設計在不同情境中臨床上的效用。我們根據功能性 AI 為基礎原型的三次迭代,在丹麥和肯亞的 7 個臨床場域與 13 位放射科醫師進行了 19 次設計會議和設計介入。十個社會技術依賴關係被認為對於放射科中 AI 的設計至關重要。我們概念化了四個技術面向,必須根據預期的臨床使用情境進行設定:AI 功能、AI 醫療重點、AI 決策門檻,以及 AI 可解釋性。我們提出四項設計建議,說明如何處理與醫療知識、診所類型、使用者專業知識等級、患者情境,以及影響這些技術面向設定的使用者情境相關的依賴關係。 + +##### **Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making** +2405.16424v1 by Min Hun Lee, Silvana Xin Yi Choo, Shamala D/O Thilarajah + +With advanced AI/ML, there has been growing research on explainable AI (XAI) +and studies on how humans interact with AI and XAI for effective human-AI +collaborative decision-making. However, we still have a lack of understanding +of how AI systems and XAI should be first presented to users without technical +backgrounds. In this paper, we present the findings of semi-structured +interviews with health professionals (n=12) and students (n=4) majoring in +medicine and health to study how to improve onboarding with AI and XAI. For the +interviews, we built upon human-AI interaction guidelines to create onboarding +materials of an AI system for stroke rehabilitation assessment and AI +explanations and introduce them to the participants. 
Our findings reveal that +beyond presenting traditional performance metrics on AI, participants desired +benchmark information, the practical benefits of AI, and interaction trials to +better contextualize AI performance, and refine the objectives and performance +of AI. Based on these findings, we highlight directions for improving +onboarding with AI and XAI and human-AI collaborative decision-making. + +摘要:隨著先進的 AI/ML,對可解釋 AI (XAI) 的研究不斷增加,以及關於人類如何與 AI 和 XAI 互動以進行有效的人工智慧協作決策制定。然而,我們仍然缺乏對 AI 系統和 XAI 應如何首先呈現給沒有技術背景的用戶的了解。在本文中,我們展示了與醫療專業人員 (n=12) 和主修醫學和健康的學生 (n=4) 進行半結構化訪談的結果,以研究如何改善 AI 和 XAI 的入門。對於訪談,我們建立在人機互動準則之上,為中風康復評估和 AI 解釋的 AI 系統創建入門材料,並將它們介紹給參與者。我們的研究結果表明,除了呈現傳統的 AI 性能指標外,參與者還希望基准信息、AI 的實際好處以及交互試驗,以更好地將 AI 性能情境化,並完善 AI 的目標和性能。根據這些發現,我們強調了改進 AI 和 XAI 以及人機協作決策制定的入門方向。 + +##### **Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach** +2405.17502v1 by Ziming Liu, Longjian Liu, Robert E. Heidel, Xiaopeng Zhao + +This article uses machine learning (ML) and explainable artificial +intelligence (XAI) techniques to investigate the relationship between +nutritional status and mortality rates associated with Alzheimer's disease (AD). +The Third National Health and Nutrition Examination Survey (NHANES III) +database is employed for analysis. The random forest model is selected as the +base model for XAI analysis, and the Shapley Additive Explanations (SHAP) +method is used to assess feature importance. The results highlight significant +nutritional factors such as serum vitamin B12 and glycated hemoglobin. The +study demonstrates the effectiveness of random forests in predicting AD +mortality compared to other diseases. This research provides insights into the +impact of nutrition on AD and contributes to a deeper understanding of disease +progression. 
+ +摘要:本文使用機器學習 (ML) 和可解釋人工智慧 (XAI) 技術來探討營養狀況與阿茲海默症 (AD) 相關的死亡率之間的關係。採用第三次全國健康與營養檢查調查 (NHANES III) 資料庫進行分析。選擇隨機森林模型作為 XAI 分析的基礎模型,並使用 Shapley Additive Explanations (SHAP) 方法來評估特徵重要性。結果突顯了重要的營養因素,例如血清維生素 B12 和糖化血紅蛋白。該研究證明了隨機森林在預測 AD 死亡率方面相較於其他疾病的有效性。本研究提供了營養對 AD 的影響的見解,並有助於更深入地了解疾病的進展。 + +##### **Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone** +2407.11974v1 by Catalina Gomez, Ruolin Wang, Katharina Breininger, Corinne Casey, Chris Bradley, Mitchell Pavlak, Alex Pham, Jithin Yohannan, Mathias Unberath + +Primary care providers are vital for initial triage and referrals to +specialty care. In glaucoma, asymptomatic and fast progression can lead to +vision loss, necessitating timely referrals to specialists. However, primary +eye care providers may not identify urgent cases, potentially delaying care. +Artificial Intelligence (AI) offering explanations could enhance their referral +decisions. We investigate how various AI explanations help providers +distinguish between patients needing immediate or non-urgent specialist +referrals. We built explainable AI algorithms to predict glaucoma surgery needs +from routine eyecare data as a proxy for identifying high-risk patients. We +incorporated intrinsic and post-hoc explainability and conducted an online +study with optometrists to assess human-AI team performance, measuring referral +accuracy and analyzing interactions with AI, including agreement rates, task +time, and user experience perceptions. AI support enhanced referral accuracy +among 87 participants (59.9%/50.8% with/without AI), though Human-AI teams +underperformed compared to AI alone. Participants believed they included AI +advice more when using the intrinsic model, and perceived it more useful and +promising. Without explanations, deviations from AI recommendations increased. +AI support did not increase workload, confidence, and trust, but reduced +challenges. 
On a separate test set, our black-box and intrinsic models achieved +an accuracy of 77% and 71%, respectively, in predicting surgical outcomes. We +identify opportunities of human-AI teaming for glaucoma management in primary +eye care, noting that while AI enhances referral accuracy, it also shows a +performance gap compared to AI alone, even with explanations. Human involvement +remains essential in medical decision making, underscoring the need for future +research to optimize collaboration, ensuring positive experiences and safe AI +use. + +摘要:初級保健提供者對於最初的分流和轉診到專科照護至關重要。在青光眼的情況下,無症狀且快速惡化可能導致視力喪失,因此需要及時轉診給專家。然而,初級眼科保健提供者可能無法識別緊急情況,可能會延誤照護。提供解釋的人工智慧 (AI) 可以加強他們的轉診決策。我們研究各種 AI 解釋如何幫助提供者區分需要立即或非緊急專科轉診的患者。我們建立了解釋性 AI 演算法,以從例行眼科護理資料預測青光眼手術需求,作為識別高風險患者的代理。我們納入了內在和事後解釋性,並與驗光師進行了一項線上研究,以評估人機團隊的表現,衡量轉診準確度並分析與 AI 的互動,包括同意率、任務時間和使用者體驗感知。在 87 名參與者中,AI 支援提高了轉診準確度(使用 AI/未使用的比例為 59.9%/50.8%),儘管人機團隊的表現不如單獨使用 AI。參與者認為他們在使用內在模型時更多地納入了 AI 建議,並認為它更有用且更有希望。沒有解釋,AI 建議的偏差會增加。AI 支援並未增加工作量、信心和信任,但減少了挑戰。在一個單獨的測試集中,我們的黑盒子和內在模型在預測手術結果方面分別達到了 77% 和 71% 的準確度。我們找出在初級眼科保健中,人機團隊合作管理青光眼的機會,並注意到雖然 AI 提高了轉診準確度,但即使有解釋,它也顯示出與單獨使用 AI 相比的效能差距。人類參與在醫療決策中仍然至關重要,這強調了未來研究優化協作、確保正面經驗和安全使用 AI 的必要性。 + +##### **Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery** +2406.18552v1 by Yingying Fang, Zihao Jin, Xiaodan Xing, Simon Walsh, Guang Yang + +In medical imaging, particularly in early disease detection and prognosis +tasks, discerning the rationale behind an AI model's predictions is crucial for +evaluating the reliability of its decisions. Conventional explanation methods +face challenges in identifying discernible decisive features in medical image +classifications, where discriminative features are subtle or not immediately +apparent. To bridge this gap, we propose an explainable model that is equipped +with both decision reasoning and feature identification capabilities. 
Our +approach not only detects influential image patterns but also uncovers the +decisive features that drive the model's final predictions. By implementing our +method, we can efficiently identify and visualise class-specific features +leveraged by the data-driven model, providing insights into the decision-making +processes of deep learning models. We validated our model in the demanding +realm of medical prognosis task, demonstrating its efficacy and potential in +enhancing the reliability of AI in healthcare and in discovering new knowledge +in diseases where prognostic understanding is limited. + +摘要:在醫學影像中,特別是在早期疾病檢測和預後任務中,辨別 AI 模型預測背後的原理對於評估其決策的可靠性至關重要。傳統的解釋方法在識別醫學影像分類中可識別的決定性特徵時面臨挑戰,其中區別性特徵很微妙或並不明顯。為了彌合這一差距,我們提出了一個可解釋的模型,該模型具備決策推理和特徵識別能力。我們的做法不僅檢測有影響力的影像模式,還揭示了推動模型最終預測的決定性特徵。通過實施我們的模型,我們可以有效識別和視覺化由數據驅動模型利用的類特定特徵,從而深入了解深度學習模型的決策過程。我們在要求嚴格的醫學預後任務領域驗證了我們的模型,展示了其在提高 AI 在醫療保健中的可靠性和發現預後理解受限疾病的新知識方面的功效和潛力。 + +##### **The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach** +2405.13099v1 by Mohsen Jozani, Jason A. Williams, Ahmed Aleroud, Sarbottam Bhagat + +This study explores the relationship between informational support seeking +questions, responses, and helpfulness ratings in online health communities. We +created a labeled data set of question-response pairs and developed multimodal +machine learning and deep learning models to reliably predict informational +support questions and responses. We employed explainable AI to reveal the +emotions embedded in informational support exchanges, demonstrating the +importance of emotion in providing informational support. This complex +interplay between emotional and informational support has not been previously +researched. The study refines social support theory and lays the groundwork for +the development of user decision aids. Further implications are discussed. 
+ +摘要:本研究探討線上健康社群中尋求資訊支持的問題、回應,以及有幫助的評分之間的關係。我們建立了一組標記的問答配對資料集,並開發了多模態機器學習和深度學習模型,以可靠地預測資訊支持問題和回應。我們採用可解釋的 AI 來揭示資訊支持交流中蘊含的情緒,證明情緒在提供資訊支持中的重要性。這種情緒支持和資訊支持之間的複雜交互作用以前並未被研究過。本研究改進了社會支持理論,並為使用者決策輔助工具的開發奠定了基礎。討論了進一步的影響。 + +##### **ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education** +2405.10645v1 by Harris Bin Munawar, Nikolaos Misirlis + +In the era of exponential technology growth, one unexpected guest has claimed +a seat in classrooms worldwide, Artificial Intelligence. Generative AI, such as +ChatGPT, promises a revolution in education, yet it arrives with a double-edged +sword. Its potential for personalized learning is offset by issues of cheating, +inaccuracies, and educators struggling to incorporate it effectively into their +lesson design. We are standing on the brink of this educational frontier, and +it is clear that we need to navigate this terrain with a lot of care. This is a +major challenge that could undermine the integrity and value of our educational +process. So, how can we turn these challenges into opportunities? When used +inappropriately, AI tools can become the perfect tool for the cut copy paste +mentality, and quickly begin to corrode critical thinking, creativity, and deep +understanding, the most important skills in our rapidly changing world. +Teachers feel that they are not equipped to leverage this technology, widening +the digital divide among educators and institutions. Addressing these concerns +calls for an in depth research approach. We will employ empirical research, +drawing on the Technology Acceptance Model, to assess the attitudes toward +generative AI among educators and students. Understanding their perceptions, +usage patterns, and hurdles is the first crucial step in creating an effective +solution. 
The present study will be used as a process manual for future +researchers to apply, running their own data, based on the steps explained here. + +摘要:在科技飛速發展的時代,一位意外的訪客已在全球教室中佔有一席之地,那就是人工智慧。生成式 AI,例如 ChatGPT,承諾在教育領域掀起一場革命,但它卻是一把雙面刃。它在個人化學習方面的潛力,卻因作弊、不準確以及教育工作者難以將其有效融入教學設計等問題而抵銷。我們正站在這教育前沿的邊緣,顯然我們需要非常小心地探索這片領域。這是一個重大的挑戰,可能會損害我們教育過程的完整性和價值。那麼,我們如何將這些挑戰轉化為機遇?當不適當地使用時,AI 工具可能會成為複製貼上心態的完美工具,並迅速腐蝕批判性思維、創造力和深入理解,這些都是我們快速變化的世界中最重要的技能。教師們覺得他們沒有能力利用這項技術,這擴大了教育工作者和機構之間的數位鴻溝。解決這些問題需要深入的研究方法。我們將採用實證研究,借鑑技術接受模型,來評估教育工作者和學生對生成式 AI 的態度。了解他們的看法、使用模式和障礙是創造有效解決方案的第一個關鍵步驟。本研究將作為未來研究人員應用的流程手冊,根據此處說明的步驟運行他們自己的數據。 + +##### **Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data** +2405.07590v1 by Camelia Oprea, Mike Grüne, Mateusz Buglowski, Lena Olivier, Thorsten Orlikowsky, Stefan Kowalewski, Mark Schoberer, André Stollenwerk + +With the digitalization of health care systems, artificial intelligence +becomes more present in medicine. Machine learning in particular shows great +potential for complex tasks such as time series classification, usually at the +cost of transparency and comprehensibility. This leads to a lack of trust by +humans and thus hinders its active usage. Explainable artificial intelligence +tries to close this gap by providing insight into the decision-making process; +however, the actual usefulness of its different methods is unclear. This paper +proposes a user study based evaluation of the explanation method Grad-CAM with +application to a neural network for the classification of breaths in time +series neonatal ventilation data. We present the perceived usefulness of the +explainability method by different stakeholders, exposing the difficulty of +achieving actual transparency and the wish for more in-depth explanations by many +of the participants. 
+ +摘要:隨著醫療保健系統的數位化,人工智慧在醫學領域中變得更加普及。特別是機器學習在時間序列分類等複雜任務中展現出極大的潛力,但通常是以透明度和可理解性為代價。這導致人類缺乏信任,從而阻礙了其積極使用。可解釋的人工智慧試圖通過提供對決策過程的洞察來彌補這一差距,但其不同方法的實際效用尚不清楚。本文提出了一個基於使用者研究的評估,其中包含了 Grad-CAM 解釋方法,並將其應用於神經網路以分類時間序列新生兒呼吸數據中的呼吸。我們展示了不同利益相關者對可解釋性方法的感知效用,揭示了實現實際透明度的難度,以及許多參與者希望獲得更深入的解釋。 + +##### **XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare** +2405.06270v3 by Fatemeh Nazary, Yashar Deldjoo, Tommaso Di Noia, Eugenio di Sciascio + +The integration of Large Language Models (LLMs) into healthcare diagnostics +offers a promising avenue for clinical decision-making. This study outlines the +development of a novel method for zero-shot/few-shot in-context learning (ICL) +by integrating medical domain knowledge using a multi-layered structured +prompt. We also explore the efficacy of two communication styles between the +user and LLMs: the Numerical Conversational (NC) style, which processes data +incrementally, and the Natural Language Single-Turn (NL-ST) style, which +employs long narrative prompts. + Our study systematically evaluates the diagnostic accuracy and risk factors, +including gender bias and false negative rates, using a dataset of 920 patient +records in various few-shot scenarios. Results indicate that traditional +clinical machine learning (ML) models generally outperform LLMs in zero-shot +and few-shot settings. However, the performance gap narrows significantly when +employing few-shot examples alongside effective explainable AI (XAI) methods as +sources of domain knowledge. Moreover, with sufficient time and an increased +number of examples, the conversational style (NC) nearly matches the +performance of ML models. Most notably, LLMs demonstrate comparable or superior +cost-sensitive accuracy relative to ML models. + This research confirms that, with appropriate domain knowledge and tailored +communication strategies, LLMs can significantly enhance diagnostic processes. 
+The findings highlight the importance of optimizing the number of training +examples and communication styles to improve accuracy and reduce biases in LLM +applications. + +摘要:大型語言模型 (LLM) 與醫療診斷整合 +為臨床決策提供了一個有前景的途徑。本研究概述了一種新穎方法的開發,用於零次學習/少量學習情境學習 (ICL),方法是使用多層結構化提示整合醫療領域知識。我們還探討了使用者與 LLM 之間兩種溝通方式的功效:數值對話 (NC) 方式,它會逐步處理資料,以及自然語言單回合 (NL-ST) 方式,它會使用長篇敘事提示。 +我們的研究系統性地評估了診斷準確性和風險因子,包括性別偏見和假陰性率,使用了一個包含 920 個患者記錄的資料集,採用各種少量學習情境。結果表明,傳統的臨床機器學習 (ML) 模型通常在零次學習和少量學習設定中表現優於 LLM。然而,當使用少量學習範例以及有效的可解釋 AI (XAI) 方法作為領域知識來源時,效能差距會顯著縮小。此外,隨著時間充足和範例數量增加,對話方式 (NC) 幾乎可以媲美 ML 模型的效能。最值得注意的是,LLM 相對於 ML 模型展現出相當或更佳的成本敏感準確度。 +本研究證實,透過適當的領域知識和量身打造的溝通策略,LLM 可以顯著增強診斷程序。這些發現突顯了最佳化訓練範例數量和溝通方式的重要性,以提高準確度並減少 LLM 應用中的偏差。 + +##### **To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems** +2405.05766v1 by Miquel Miró-Nicolau, Gabriel Moyà-Alcover, Antoni Jaume-i-Capó, Manuel González-Hidalgo, Maria Gemma Sempere Campello, Juan Antonio Palmer Sancho + +The increasing reliance on Deep Learning models, combined with their inherent +lack of transparency, has spurred the development of a novel field of study +known as eXplainable AI (XAI) methods. These methods seek to enhance the trust +of end-users in automated systems by providing insights into the rationale +behind their decisions. This paper presents a novel approach for measuring user +trust in XAI systems, allowing their refinement. Our proposed metric combines +both performance metrics and trust indicators from an objective perspective. To +validate this novel methodology, we conducted a case study in a realistic +medical scenario: the usage of XAI system for the detection of pneumonia from +x-ray images. 
+ +摘要:隨著對深度學習模型依賴性的增加,加上其固有的透明度不足,促使一個新的研究領域發展,稱為可解釋 AI (XAI) 方法。這些方法旨在透過深入了解決策背後的原理,來提升最終使用者對自動化系統的信賴。本文提出了一種衡量使用者對 XAI 系統信賴度的新穎方法,允許對其進行改進。我們提出的指標結合了客觀觀點下的效能指標和信賴指標。為了驗證這個新穎的方法,我們在一個真實的醫療場景中進行了一個案例研究:使用 XAI 系統從 X 光影像中偵測肺炎。 + +##### **Region-specific Risk Quantification for Interpretable Prognosis of COVID-19** +2405.02815v1 by Zhusi Zhong, Jie Li, Zhuoqi Ma, Scott Collins, Harrison Bai, Paul Zhang, Terrance Healey, Xinbo Gao, Michael K. Atalay, Zhicheng Jiao + +The COVID-19 pandemic has strained global public health, necessitating +accurate diagnosis and intervention to control disease spread and reduce +mortality rates. This paper introduces an interpretable deep survival +prediction model designed specifically for improved understanding and trust in +COVID-19 prognosis using chest X-ray (CXR) images. By integrating a large-scale +pretrained image encoder, Risk-specific Grad-CAM, and anatomical region +detection techniques, our approach produces regional interpretable outcomes +that effectively capture essential disease features while focusing on rare but +critical abnormal regions. Our model's predictive results provide enhanced +clarity and transparency through risk area localization, enabling clinicians to +make informed decisions regarding COVID-19 diagnosis with better understanding +of prognostic insights. We evaluate the proposed method on a multi-center +survival dataset and demonstrate its effectiveness via quantitative and +qualitative assessments, achieving superior C-indexes (0.764 and 0.727) and +time-dependent AUCs (0.799 and 0.691). These results suggest that our +explainable deep survival prediction model surpasses traditional survival +analysis methods in risk prediction, improving interpretability for clinical +decision making and enhancing AI system trustworthiness. 
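For readers unfamiliar with the C-index reported above, the following is a minimal sketch of Harrell's concordance index in plain Python. It is the generic textbook definition, not necessarily the exact variant the authors computed; the function name `concordance_index` and the toy data are hypothetical, and pairs with tied event times are simply skipped for brevity.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable pairs whose predicted risks are ordered correctly.

    A pair (i, j) is comparable when subject i has an observed event
    (events[i] is truthy) strictly before subject j's recorded time.
    The pair is concordant when subject i also has the higher predicted risk;
    ties in predicted risk count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


# Toy data (hypothetical): follow-up times, event flags (0 = censored), risk scores.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(times, events, risks)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives some context for figures such as 0.764 and 0.727.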
+ +摘要:COVID-19 疫情對全球公共衛生造成壓力,必須進行準確的診斷和干預,以控制疾病傳播並降低死亡率。本文介紹了一個可解釋的深度生存預測模型,專門設計用於透過胸部 X 光 (CXR) 影像改善對 COVID-19 預後的理解和信賴。透過整合大規模預訓練影像編碼器、風險特定 Grad-CAM 和解剖區域偵測技術,我們的做法產生區域可解釋的結果,有效捕捉必要的疾病特徵,同時專注於罕見但關鍵的異常區域。我們的模型預測結果透過風險區域定位提供增強的清晰度和透明度,讓臨床醫生能夠在更了解預後見解的情況下,就 COVID-19 診斷做出明智的決策。我們在多中心生存資料集上評估所提出的方法,並透過量化和質化評估證明其有效性,達到優異的 C 指數(0.764 和 0.727)和時間相關 AUC(0.799 和 0.691)。這些結果表明,我們可解釋的深度生存預測模型在風險預測方面超越傳統的生存分析方法,提升臨床決策的解釋性,並增強 AI 系統的信賴度。 + +##### **Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics** +2405.02334v2 by Francesco Prinzi, Carmelo Militello, Calogero Zarcaro, Tommaso Vincenzo Bartolotta, Salvatore Gaglio, Salvatore Vitabile + +In recent years, machine learning-based clinical decision support systems +(CDSS) have played a key role in the analysis of several medical conditions. +Despite their promising capabilities, the lack of transparency in AI models +poses significant challenges, particularly in medical contexts where +reliability is a mandatory aspect. However, it appears that explainability is +inversely proportional to accuracy. For this reason, achieving transparency +without compromising predictive accuracy remains a key challenge. This paper +presents a novel method, namely Rad4XCNN, to enhance the predictive power of +CNN-derived features with the inherent interpretability of radiomic features. +Rad4XCNN diverges from conventional methods based on saliency maps, by +associating intelligible meaning to CNN-derived features by means of Radiomics, +offering new perspectives on explanation methods beyond visualization maps. +Using a breast cancer classification task as a case study, we evaluated +Rad4XCNN on ultrasound imaging datasets, including an online dataset and two +in-house datasets for internal and external validation. 
Some key results are: +i) CNN-derived features guarantee more robust accuracy when compared against +ViT-derived and radiomic features; ii) conventional visualization map methods +for explanation present several pitfalls; iii) Rad4XCNN does not sacrifice +model accuracy for explainability; iv) Rad4XCNN provides a global +explanation enabling the physician to extract global insights and findings. Our +method can mitigate some concerns related to the explainability-accuracy +trade-off. This study highlights the importance of proposing new methods for +model explanation that do not affect model accuracy. + +摘要:近年来,基于机器学习的临床决策支持系统 (CDSS) 在多种疾病的分析中扮演了关键角色。尽管它们具有广阔的前景,但 AI 模型缺乏透明度,尤其在医疗领域,可靠性是强制性方面,这带来了重大挑战。然而,解释性似乎与准确性成反比。因此,在不影响预测准确性的情况下实现透明度仍然是一个关键挑战。本文提出了一种新方法,即 Rad4XCNN,以通过放射组学的内在可解释性来增强 CNN 衍生特征的预测能力。Rad4XCNN 通过放射组学将可理解的含义与 CNN 衍生特征关联起来,从而偏离了基于显着性图的传统方法,为超越可视化图的解释方法提供了新的视角。使用乳腺癌分类任务作为案例研究,我们在超声成像数据集上评估了 Rad4XCNN,包括一个在线数据集和两个用于内部和外部验证的内部数据集。一些关键结果是:i) 与 ViT 衍生和放射组学特征相比,CNN 衍生特征保证了更稳健的准确性;ii) 用于解释的传统可视化图方法存在一些缺陷;iii) Rad4XCNN 不会为了可解释性而牺牲模型准确性;iv) Rad4XCNN 提供全局解释,使医生能够提取全局见解和发现。我们的方法可以减轻一些与可解释性-准确性权衡相关的担忧。本研究强调了提出新方法来解释模型而不影响其准确性的重要性。 + +##### **Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability** +2404.16957v1 by Yunfei Ge, Quanyan Zhu + +The pervasive integration of Artificial Intelligence (AI) has introduced +complex challenges for responsibility and accountability in the event of +incidents involving AI-enabled systems. The interconnectivity of these systems +and the ethical concerns of AI-induced incidents, coupled with uncertainties in AI +technology and the absence of corresponding regulations, have made traditional +responsibility attribution challenging. To this end, this work proposes a +Computational Reflective Equilibrium (CRE) approach to establish a coherent and +ethically acceptable responsibility attribution framework for all stakeholders. 
+The computational approach provides a structured analysis that overcomes the +limitations of conceptual approaches in dealing with dynamic and multifaceted +scenarios, showcasing the framework's explainability, coherence, and adaptivity +properties in the responsibility attribution process. We examine the pivotal +role of the initial activation level associated with claims in equilibrium +computation. Using an AI-assisted medical decision-support system as a case +study, we illustrate how different initializations lead to diverse +responsibility distributions. The framework offers valuable insights into +accountability in AI-induced incidents, facilitating the development of a +sustainable and resilient system through continuous monitoring, revision, and +reflection. + +摘要:隨著人工智慧 (AI) 的普及整合,在涉及 AI 驅動系統的事故中,責任和義務歸屬產生了複雜的挑戰。這些系統的互連性、AI 引發事故的倫理問題,加上 AI 技術的不確定性和缺乏相應法規,使得傳統責任歸屬面臨挑戰。為此,本研究提出了一種計算反思均衡 (CRE) 方法,以建立一個連貫且在倫理上可接受的責任歸屬架構,適用於所有利害關係人。計算方法提供了結構化的分析,克服了概念方法在處理動態且多面向情境時的限制,展示了該架構在責任歸屬過程中具備的可解釋性、連貫性和適應性。我們探討了與均衡計算中索賠相關的初始啟動層級的關鍵作用。我們以 AI 輔助醫療決策支援系統為案例研究,說明不同的初始化如何導致不同的責任分配。該架構提供了對 AI 引發事故中問責制的寶貴見解,透過持續監控、修訂和反思,促進了永續且有韌性的系統發展。 + +##### **Explainable AI for Fair Sepsis Mortality Predictive Model** +2404.13139v1 by Chia-Hsuan Chang, Xiaoyang Wang, Christopher C. Yang + +Artificial intelligence supports healthcare professionals with predictive +modeling, greatly transforming clinical decision-making. This study addresses +the crucial need for fairness and explainability in AI applications within +healthcare to ensure equitable outcomes across diverse patient demographics. By +focusing on the predictive modeling of sepsis-related mortality, we propose a +method that learns a performance-optimized predictive model and then employs +the transfer learning process to produce a model with better fairness. 
Our +method also introduces a novel permutation-based feature importance algorithm +aiming at elucidating the contribution of each feature in enhancing fairness on +predictions. Unlike existing explainability methods concentrating on explaining +feature contribution to predictive performance, our proposed method uniquely +bridges the gap in understanding how each feature contributes to fairness. This +advancement is pivotal, given sepsis's significant mortality rate and its role +in one-third of hospital deaths. Our method not only aids in identifying and +mitigating biases within the predictive model but also fosters trust among +healthcare stakeholders by improving the transparency and fairness of model +predictions, thereby contributing to more equitable and trustworthy healthcare +delivery. + +摘要:人工智慧透過預測模型協助醫療專業人員,大幅轉變了臨床決策制定。本研究探討了在醫療保健中使用人工智慧應用程式時公平性和可解釋性的關鍵需求,以確保在不同的患者人口統計資料中獲得公平的結果。透過專注於敗血症相關死亡率的預測模型,我們提出了一種方法,該方法會學習一個效能最佳化的預測模型,然後採用轉移學習過程來產生一個具有更好公平性的模型。我們的模型還引入了一種新穎的基於排列的特徵重要性演算法,旨在闡明每個特徵在增強預測公平性方面的貢獻。與現有的可解釋性方法專注於解釋特徵對預測效能的貢獻不同,我們提出的方法獨特地彌補了理解每個特徵如何有助於公平性的差距。這項進展至關重要,因為敗血症的死亡率很高,且在三分之一的醫院死亡中扮演著角色。我們的模型不僅有助於識別和減輕預測模型中的偏差,還能透過提高模型預測的透明度和公平性來培養醫療保健利益相關者之間的信任,進而有助於提供更公平且值得信賴的醫療保健服務。 + +##### **Multi Class Depression Detection Through Tweets using Artificial Intelligence** +2404.13104v1 by Muhammad Osama Nusrat, Waseem Shahzad, Saad Ahmed Jamal + +Depression is a significant issue nowadays. As per the World Health +Organization (WHO), in 2023, over 280 million individuals are grappling with +depression. This is a huge number; if not taken seriously, these numbers will +increase rapidly. About 4.89 billion individuals are social media users. People +express their feelings and emotions on platforms like Twitter, Facebook, +Reddit, Instagram, etc. These platforms contain valuable information which can +be used for research purposes. Considerable research has been conducted across +various social media platforms. 
However, certain limitations persist in these +endeavors. In particular, previous studies focused only on detecting +depression and the intensity of depression in tweets. Dataset labeling also +contained inaccuracies. In this research work, five types of +depression (bipolar, major, psychotic, atypical, and postpartum) were predicted +using tweets from the Twitter database based on lexicon labeling. Explainable +AI was used to provide reasoning by highlighting the parts of tweets that +represent the type of depression. Bidirectional Encoder Representations from +Transformers (BERT) was used for feature extraction and training. Machine +learning and deep learning methodologies were used to train the model. The BERT +model presented the most promising results, achieving an overall accuracy of +0.96. + +摘要:現今,憂鬱症是一個重要的議題。根據世界衛生組織 (WHO) 的資料,在 2023 年,超過 2.8 億人正在與憂鬱症搏鬥。這是一個龐大的數字;如果不認真看待,這些數字將會快速增加。大約有 48.9 億人是社群媒體使用者。人們在 Twitter、Facebook、Reddit、Instagram 等平台上表達自己的感受和情緒。這些平台包含有價值的資訊,可用於研究目的。已經在各種社群媒體平台上進行了大量的研究。然而,這些努力仍存在某些限制。特別是,先前的研究僅專注於偵測推文中的憂鬱症和憂鬱症的強度。此外,資料集標籤中存在不準確的情況。在這項研究工作中,使用基於詞彙標籤的 Twitter 資料庫中的推文預測了五種類型的憂鬱症(雙極型、重度、精神病型、非典型和產後)。可解釋的 AI 用於透過強調代表憂鬱症類型的推文部分來提供推理。來自 Transformers 的雙向編碼器表示 (BERT) 用於特徵提取和訓練。機器學習和深度學習方法用於訓練模型。BERT 模型呈現出最有希望的結果,達到 0.96 的整體準確度。 + +##### **COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images** +2404.12832v2 by Dmytro Shvetsov, Joonas Ariva, Marharyta Domnich, Raul Vicente, Dmytro Fishman + +Deep learning is dramatically transforming the field of medical imaging and +radiology, enabling the identification of pathologies in medical images, +including computed tomography (CT) and X-ray scans. However, the performance of +deep learning models, particularly in segmentation tasks, is often limited by +the need for extensive annotated datasets. 
To address this challenge, the +capabilities of weakly supervised semantic segmentation are explored through +the lens of Explainable AI and the generation of counterfactual explanations. +The scope of this research is the development of a novel counterfactual inpainting +approach (COIN) that flips the predicted classification label from abnormal to +normal by using a generative model. For instance, if the classifier deems an +input medical image X abnormal, indicating the presence of a pathology, the +generative model aims to inpaint the abnormal region, thus reversing the +classifier's original prediction label. The approach enables us to produce +precise segmentations for pathologies without depending on pre-existing +segmentation masks. Crucially, image-level labels are utilized, which are +substantially easier to acquire than detailed segmentation masks. The +effectiveness of the method is demonstrated by segmenting synthetic targets and +actual kidney tumors from CT images acquired from Tartu University Hospital in +Estonia. The findings indicate that COIN greatly surpasses established +attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an +alternative counterfactual explanation method introduced by Singla et al. This +evidence suggests that COIN is a promising approach for semantic segmentation +of tumors in CT images, and represents a step forward in making deep learning +applications more accessible and effective in healthcare, where annotated data +is scarce. 
+ +摘要:深度學習正大幅轉變醫學影像和放射線學領域,能辨識醫學影像中的病理,包括電腦斷層掃描 (CT) 和 X 光掃描。然而,深度學習模型的效能,特別是在分割任務中,常常受到廣泛註解資料集需求的限制。為了應對此挑戰,透過可解釋 AI 和反事實解釋的產生,探索弱監督語意分割的能力。本研究的範圍是開發一種新的反事實修補方法 (COIN),該方法使用生成模型將預測的分類標籤從異常翻轉為正常。例如,如果分類器將輸入的醫學影像 X 視為異常,表示存在病理,則生成模型旨在修補異常區域,從而逆轉分類器的原始預測標籤。此方法使我們能夠產生病理的精確分割,而無需依賴於預先存在的分割遮罩。至關重要的是,利用影像層級標籤,這比建立詳細的分割遮罩容易取得。該方法的有效性透過分割合成目標和從愛沙尼亞塔爾圖大學醫院取得的 CT 影像中的實際腎臟腫瘤來證明。研究結果表明,COIN 遠遠超過已建立的歸因方法,例如 RISE、ScoreCAM 和 LayerCAM,以及 Singla 等人提出的另一種反事實解釋方法。此證據表明,COIN 是一種很有前途的 CT 影像中腫瘤語意分割方法,並在醫療保健中讓深度學習應用更易於取得和更有效率邁進一步,其中註解資料很稀少。 + +##### **Hybrid Intelligence for Digital Humanities** +2406.15374v1 by Victor de Boer, Lise Stork + +In this paper, we explore the synergies between Digital Humanities (DH) as a +discipline and Hybrid Intelligence (HI) as a research paradigm. In DH research, +the use of digital methods and specifically that of Artificial Intelligence is +subject to a set of requirements and constraints. We argue that these are +well-supported by the capabilities and goals of HI. Our contribution includes +the identification of five such DH requirements: Successful AI systems need to +be able to 1) collaborate with the (human) scholar; 2) support data criticism; +3) support tool criticism; 4) be aware of and cater to various perspectives and +5) support distant and close reading. We take the CARE principles of Hybrid +Intelligence (collaborative, adaptive, responsible and explainable) as a +theoretical framework and map these to the DH requirements. In this mapping, we +include example research projects. We finally address how insights from DH can +be applied to HI and discuss open challenges for the combination of the two +disciplines. 
+ +摘要:在本文中,我們探討數位人文學科 (DH) 作為一門學科與混合智能 (HI) 作為一個研究典範之間的協同作用。在 DH 研究中,數位方法的使用,特別是人工智慧的使用,受到一系列要求和限制。我們認為這些要求和限制獲得 HI 的能力和目標的充分支持。我們的貢獻包括找出五個這樣的 DH 要求:成功的 AI 系統需要能夠 1) 與(人類)學者合作;2) 支援資料批評;3) 支援工具批評;4) 察覺並迎合各種觀點;5) 支援遠距和近距離閱讀。我們將混合智能的 CARE 原則(協作、適應、負責和可解釋)作為理論架構,並將這些原則對應到 DH 要求。在此對應中,我們納入範例研究專案。最後,我們探討如何將 DH 的見解應用於 HI,並討論結合這兩個學科的開放挑戰。 + +##### **Ethical Framework for Responsible Foundational Models in Medical Imaging** +2406.11868v1 by Abhijit Das, Debesh Jha, Jasmer Sanjotra, Onkar Susladkar, Suramyaa Sarkar, Ashish Rauniyar, Nikhil Tomar, Vanshali Sharma, Ulas Bagci + +Foundational models (FMs) have tremendous potential to revolutionize medical +imaging. However, their deployment in real-world clinical settings demands +extensive ethical considerations. This paper aims to highlight the ethical +concerns related to FMs and propose a framework to guide their responsible +development and implementation within medicine. We meticulously examine ethical +issues such as privacy of patient data, bias mitigation, algorithmic +transparency, explainability and accountability. The proposed framework is +designed to prioritize patient welfare, mitigate potential risks, and foster +trust in AI-assisted healthcare. + +摘要:基礎模型 (FM) 具有徹底改變醫學影像的巨大潛力。然而,它們在現實世界臨床環境中的部署需要廣泛的倫理考量。本文旨在強調與 FM 相關的倫理問題,並提出一個框架來指導它們在醫學中的負責任開發和實施。我們仔細審查了倫理問題,例如患者數據隱私、偏差緩解、演算法透明度、可解釋性和問責制。所提出的框架旨在優先考慮患者福利、減輕潛在風險,並培養對 AI 輔助醫療保健的信任。 + +##### **Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis** +2404.07239v1 by Milad Yousefi, Shadi Farabi Maleki, Ali Jafarizadeh, Mahya Ahmadpour Youshanlui, Aida Jafari, Siamak Pedrammehr, Roohallah Alizadehsani, Ryszard Tadeusiewicz, Pawel Plawiak + +Thyroid cancer is an increasing global health concern that requires advanced +diagnostic methods. The application of AI and radiomics to thyroid cancer +diagnosis is examined in this review. A review of multiple databases was +conducted in compliance with PRISMA guidelines until October 2023. 
A +combination of keywords was used to identify English-language academic publications +on thyroid cancer and related subjects. The original search returned 267 papers +after 109 duplicates were removed. Relevant studies were +selected according to predetermined criteria after 124 articles were eliminated +based on an examination of their abstracts and titles. After the comprehensive +analysis, an additional six studies were excluded. Among the 28 included +studies, radiomics analysis, which incorporates ultrasound (US) images, +demonstrated its effectiveness in diagnosing thyroid cancer. Results varied, +with some of the studies presenting new strategies that outperformed the +status quo. The literature has emphasized various challenges faced by AI +models, including interpretability issues, dataset constraints, and operator +dependence. The synthesized findings of the 28 included studies point to the +need for standardization efforts and prospective multicenter studies to address +these concerns. Furthermore, approaches to overcome these obstacles were +identified, such as advances in explainable AI technology and personalized +medicine techniques. The review focuses on how AI and radiomics could transform +the diagnosis and treatment of thyroid cancer. Despite challenges, future +research on multidisciplinary cooperation, clinical applicability validation, +and algorithm improvement holds the potential to improve patient outcomes and +diagnostic precision in the treatment of thyroid cancer. 
+ +摘要:甲狀腺癌是一種日益嚴重的全球健康問題,需要先進的診斷方法。本篇評論探討了人工智能與放射特徵分析在甲狀腺癌診斷中的應用。在符合 PRISMA 指南的情況下,對多個資料庫進行了回顧,直到 2023 年 10 月。通過結合關鍵字,發現了一篇關於甲狀腺癌和相關主題的英文學術出版物。在移除 109 篇重複文獻後,原始搜尋共回傳 267 篇論文。在根據預先確定的標準,淘汰了 124 篇文章的摘要和標題後,選出了相關研究。在進行全面分析後,額外排除了六項研究。在納入的 28 項研究中,結合超音波 (US) 影像的放射特徵分析,證明了其在診斷甲狀腺癌方面的有效性。研究結果不一,有些研究提出了優於現狀的新策略。文獻強調了人工智能模型面臨的各種挑戰,包括可解釋性問題、資料集限制和操作員依賴性。28 項納入研究的綜合發現提到,需要標準化工作和前瞻性多中心研究來解決這些問題。此外,還確定了克服這些障礙的方法,例如可解釋人工智能技術和個人化醫療技術的進步。本篇評論重點探討了人工智能和放射特徵分析如何轉變甲狀腺癌的診斷和治療。儘管存在挑戰,但未來對多學科合作、臨床適用性驗證和演算法改進的研究,仍有潛力改善甲狀腺癌治療中的患者預後和診斷精準度。 + +##### **Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI** +2404.04686v1 by Taminul Islam, Md. Alif Sheakh, Mst. Sazia Tahosin, Most. Hasna Hena, Shopnil Akash, Yousef A. Bin Jardan, Gezahign Fentahun Wondmie, Hiba-Allah Nafidi, Mohammed Bourhia + +Breast cancer has rapidly increased in prevalence in recent years, making it +one of the leading causes of mortality worldwide. Among all cancers, it is by +far the most common. Diagnosing this illness manually requires significant time +and expertise. Since detecting breast cancer is a time-consuming process, +preventing its further spread can be aided by creating machine-based forecasts. +Machine learning and Explainable AI are crucial in classification as they not +only provide accurate predictions but also offer insights into how the model +arrives at its decisions, aiding in the understanding and trustworthiness of +the classification results. In this study, we evaluate and compare the +classification accuracy, precision, recall, and F-1 scores of five different +machine learning methods using a primary dataset (500 patients from Dhaka +Medical College Hospital). Five different supervised machine learning +techniques, including decision tree, random forest, logistic regression, naive +bayes, and XGBoost, have been used to achieve optimal results on our dataset. 
+Additionally, this study applied SHAP analysis to the XGBoost model to +interpret the model's predictions and understand the impact of each feature on +the model's output. We compared the accuracy with which several algorithms +classified the data and contrasted the results with other literature in this field. +After final evaluation, this study found that XGBoost achieved the best model +accuracy, which is 97%. + +摘要:近年來,乳癌的盛行率迅速增加,使其成為全球主要的死亡原因之一。在所有癌症中,乳癌迄今為止是最常見的。手動診斷此疾病需要大量的時間和專業知識。由於乳癌的檢測過程耗時,因此透過建立機器學習模型來預測,有助於防止其進一步擴散。機器學習和可解釋 AI 在分類中至關重要,因為它們不僅可以提供準確的預測,還可以深入了解模型如何做出決策,有助於理解和信賴分類結果。在此研究中,我們評估並比較了五種不同的機器學習方法的分類準確度、精確度、召回率和 F1 分數,使用了一個主要的資料集(達卡醫學院醫院的 500 名患者)。五種不同的監督式機器學習技術,包括決策樹、隨機森林、邏輯迴歸、朴素貝氏和 XGBoost,已用於在我們的資料集上取得最佳結果。此外,本研究將 SHAP 分析應用於 XGBoost 模型,以解釋模型的預測並了解每個特徵對模型輸出的影響。我們比較了幾種演算法對資料進行分類的準確度,並與該領域的其他文獻進行對比。在最後評估後,本研究發現 XGBoost 達到了最佳的模型準確度,為 97%。 + +##### **Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI** +2404.03892v3 by Maryam Ahmed, Tooba Bibi, Rizwan Ahmed Khan, Sidra Nasir + +Deep learning (DL) models for diagnosing breast cancer from mammographic +images often operate as "black boxes", making it difficult for healthcare +professionals to trust and understand their decision-making processes. The +study presents an integrated framework combining Convolutional Neural Networks +(CNNs) and Explainable Artificial Intelligence (XAI) for the enhanced diagnosis +of breast cancer using the CBIS-DDSM dataset. The methodology encompasses an +elaborate data preprocessing pipeline and advanced data augmentation techniques +to counteract dataset limitations; transfer learning using pre-trained +networks such as VGG-16, Inception-V3 and ResNet was also employed. 
A focal point of +our study is the evaluation of XAI's effectiveness in interpreting model +predictions, using the Hausdorff measure to quantitatively assess the +alignment between AI-generated explanations and expert annotations. +This approach is critical for XAI in promoting trustworthiness +and ethical fairness in AI-assisted diagnostics. The findings from our research +illustrate the effective collaboration between CNNs and XAI in advancing +diagnostic methods for breast cancer, thereby facilitating a more seamless +integration of advanced AI technologies within clinical settings. By enhancing +the interpretability of AI-driven decisions, this work lays the groundwork for +improved collaboration between AI systems and medical practitioners, ultimately +enriching patient care. Furthermore, the implications of our research extend +well beyond the current methodologies. They encourage further research into how +to combine multimodal data and improve AI explanations to meet the needs of +clinical practice. + +摘要:深度學習 (DL) 用於從乳房攝影術影像診斷乳癌的模型通常以「黑盒子」方式運作,這使得醫療保健專業人員難以信任和理解其決策過程。本研究提出一個整合架構,結合卷積神經網路 (CNN) 和可解釋人工智慧 (XAI),以使用 CBIS-DDSM 資料集增強乳癌的診斷。方法包含一個精細的資料前處理管線和進階資料擴充技術,以對抗資料集限制,並採用預先訓練的網路(例如 VGG-16、Inception-V3 和 ResNet)進行遷移學習。我們研究的重點是評估 XAI 在解釋模型預測中的有效性,重點利用豪斯多夫測度量化評估 AI 生成的解釋和專家註解之間的一致性。這種方法對於 XAI 在促進 AI 輔助診斷中的可信度和倫理公平性至關重要。我們研究的發現說明了 CNN 和 XAI 在推進乳癌診斷方法中的有效協作,從而促進了先進 AI 技術在臨床環境中的更順暢整合。透過增強 AI 驅動決策的可解釋性,這項工作為 AI 系統和醫療從業人員之間的改善協作奠定了基礎,最終豐富了患者照護。此外,我們研究的影響遠遠超出了目前的技術。它鼓勵進一步研究如何結合多模式資料並改善 AI 解釋,以滿足臨床實務的需求。 + +##### **Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives** +2404.00320v2 by Xingrui Gu, Zhixuan Wang, Irisa Jin, Zekun Wu + +This research presents a novel multimodal data fusion methodology for pain +behavior recognition, integrating statistical correlation analysis with +human-centered insights. 
Our approach introduces two key innovations: 1) +integrating data-driven statistical relevance weights into the fusion strategy +to effectively utilize complementary information from heterogeneous modalities, +and 2) incorporating human-centric movement characteristics into multimodal +representation learning for detailed modeling of pain behaviors. Validated +across various deep learning architectures, our method demonstrates superior +performance and broad applicability. We propose a customizable framework that +aligns each modality with a suitable classifier based on statistical +significance, advancing personalized and effective multimodal fusion. +Furthermore, our methodology provides explainable analysis of multimodal data, +contributing to interpretable and explainable AI in healthcare. By highlighting +the importance of data diversity and modality-specific representations, we +enhance traditional fusion techniques and set new standards for recognizing +complex pain behaviors. Our findings have significant implications for +promoting patient-centered healthcare interventions and supporting explainable +clinical decision-making. + +摘要:本研究提出了一種創新的多模態數據融合方法,用於疼痛行為識別,將統計相關分析與以人為中心的見解相結合。我們的做法引入了兩項關鍵創新:1) 將數據驅動的統計相關權重整合到融合策略中,以有效利用來自異質模態的補充信息,以及 2) 將以人為中心的運動特徵納入多模態表示學習中,以詳細建模疼痛行為。我們的模型在各種深度學習架構中得到驗證,展示了卓越的性能和廣泛的適用性。我們提出了一個可自定義的框架,根據統計顯著性將每個模態與合適的分類器對齊,推進個性化和有效的多模態融合。此外,我們的模型提供對多模態數據的可解釋分析,有助於醫療保健中的可解釋和可解釋 AI。通過強調數據多樣性和模態特定表示的重要性,我們增強了傳統的融合技術,並為識別複雜的疼痛行為設定了新的標準。我們的發現對促進以患者為中心的醫療保健干預和支持可解釋的臨床決策制定具有重要意義。 + +##### **Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach** +2403.17873v1 by Andrea Ferrario, Alberto Termine, Alessandro Facchini + +Human-centered explainable AI (HCXAI) advocates for the integration of social +aspects into AI explanations. Central to the HCXAI discourse is the Social +Transparency (ST) framework, which aims to make the socio-organizational +context of AI systems accessible to their users. 
In this work, we suggest +extending the ST framework to address the risks of social misattributions in +Large Language Models (LLMs), particularly in sensitive areas like mental +health. In fact, LLMs, which are remarkably capable of simulating roles and +personas, may lead to mismatches between designers' intentions and users' +perceptions of social attributes, risking the promotion of emotional manipulation and +dangerous behaviors, cases of epistemic injustice, and unwarranted trust. To +address these issues, we propose enhancing the ST framework with a fifth +'W-question' to clarify the specific social attributions assigned to LLMs by +their designers and users. This addition aims to bridge the gap between LLM +capabilities and user perceptions, promoting the ethically responsible +development and use of LLM-based technology. + +摘要:以人为本的可解释 AI (HCXAI) 倡导将社会层面整合到 AI 解释中。HCXAI 话语的核心是社会透明度 (ST) 框架,其目标是让 AI 系统的社会组织背景对用户来说是可理解的。在这项工作中,我们建议扩展 ST 框架以解决大型语言模型 (LLM) 中社会错误归因的风险,尤其是在心理健康等敏感领域。事实上,LLM 能够出色地模拟角色和人格,这可能导致设计者的意图和用户对社会属性的认知之间出现错配,从而有风险促进情绪操纵和危险行为、认知不公正和不合理的信任。为了解决这些问题,我们建议用第五个“W 问题”来增强 ST 框架,以明确设计者和用户赋予 LLM 的具体社会属性。此补充旨在弥合 LLM 能力和用户认知之间的差距,促进基于 LLM 的技术在道德上负责任地开发和使用。 + +##### **Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification** +2403.18871v1 by Han Yuan, Chuan Hong, Pengtao Jiang, Gangming Zhao, Nguyen Tuan Anh Tran, Xinxing Xu, Yet Yen Yan, Nan Liu + +Background: Pneumothorax is an acute thoracic disease caused by abnormal air +collection between the lungs and chest wall. To address the opaqueness often +associated with deep learning (DL) models, explainable artificial intelligence +(XAI) methods have been introduced to outline regions related to pneumothorax +diagnoses made by DL models. However, these explanations sometimes diverge from +actual lesion areas, highlighting the need for further improvement. 
Method: We +propose a template-guided approach to incorporate the clinical knowledge of +pneumothorax into model explanations generated by XAI methods, thereby +enhancing the quality of these explanations. Utilizing one lesion delineation +created by radiologists, our approach first generates a template that +represents potential areas of pneumothorax occurrence. This template is then +superimposed on model explanations to filter out extraneous explanations that +fall outside the template's boundaries. To validate its efficacy, we carried +out a comparative analysis of three XAI methods with and without our template +guidance when explaining two DL models in two real-world datasets. Results: The +proposed approach consistently improved baseline XAI methods across twelve +benchmark scenarios built on three XAI methods, two DL models, and two +datasets. The average incremental percentages, calculated by the performance +improvements over the baseline performance, were 97.8% in Intersection over +Union (IoU) and 94.1% in Dice Similarity Coefficient (DSC) when comparing model +explanations and ground-truth lesion areas. Conclusions: In the context of +pneumothorax diagnoses, we proposed a template-guided approach for improving AI +explanations. We anticipate that our template guidance will forge a fresh +approach to elucidating AI models by integrating clinical domain expertise. 
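The template filtering step and the two overlap metrics named in this abstract (IoU and DSC) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: masks are treated as sets of pixel coordinates, and the helper names (`apply_template`, `iou`, `dice`) and the toy masks are hypothetical rather than taken from the paper, whose actual template construction may differ.

```python
def apply_template(explanation, template):
    """Drop explanation pixels that fall outside the template region."""
    return explanation & template


def iou(pred, truth):
    """Intersection over Union of two pixel sets."""
    union = pred | truth
    return len(pred & truth) / len(union) if union else 1.0


def dice(pred, truth):
    """Dice Similarity Coefficient of two pixel sets."""
    total = len(pred) + len(truth)
    return 2 * len(pred & truth) / total if total else 1.0


# Toy example (hypothetical): a 2x2 lesion, an explanation with two stray pixels,
# and a template covering the top-left 4x4 area of the image.
truth = {(1, 1), (1, 2), (2, 1), (2, 2)}
explanation = {(1, 1), (1, 2), (5, 5), (6, 6)}
template = {(r, c) for r in range(4) for c in range(4)}

filtered = apply_template(explanation, template)
iou_before, iou_after = iou(explanation, truth), iou(filtered, truth)
dice_before, dice_after = dice(explanation, truth), dice(filtered, truth)
print(iou_before, iou_after, dice_before, dice_after)
```

In this toy case, removing the two stray pixels raises both scores, which mirrors the kind of improvement the abstract describes, though the magnitudes here are arbitrary.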
+ +摘要:背景:氣胸是一種因肺部與胸壁之間異常集氣所引起的急性胸腔疾病。為了解決深度學習(DL)模型經常伴隨的不透明性,可解釋人工智慧(XAI)方法已被引入,用於概述與 DL 模型做出的氣胸診斷相關的區域。然而,這些解釋有時會與實際病灶區域有所出入,突顯出進一步改進的必要性。方法:我們提出了一種模板引導式方法,將氣胸的臨床知識納入 XAI 方法產生的模型解釋中,從而提升這些解釋的品質。利用放射科醫師建立的病灶描繪,我們的做法首先產生一個模板,用於表示氣胸可能發生的區域。然後將此模板疊加在模型解釋上,以篩選出超出模板邊界的無關解釋。為了驗證其效力,我們對三種 XAI 方法進行了比較分析,在兩個真實世界資料集中解釋兩個 DL 模型時,分別採用和不採用我們的模板引導。結果:所提出的方法在建立於三種 XAI 方法、兩個 DL 模型和兩個資料集的十二種基準情境中,始終改善了基準 XAI 方法。在比較模型解釋和真實病灶區域時,透過基準效能的效能改進計算出的平均增量百分比為交集比(IoU)的 97.8% 和骰子相似性係數(DSC)的 94.1%。結論:在氣胸診斷的背景下,我們提出了一種模板引導式方法,用於改善 AI 解釋。我們預期我們的模板引導將透過整合臨床領域專業知識,為闡明 AI 模型建立一種新方法。 + +##### **Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures** +2403.01580v1 by Séamus Lankford + +In the current machine translation (MT) landscape, the Transformer +architecture stands out as the gold standard, especially for high-resource +language pairs. This research delves into its efficacy for low-resource +language pairs including both the English$\leftrightarrow$Irish and +English$\leftrightarrow$Marathi language pairs. Notably, the study identifies +the optimal hyperparameters and subword model type to significantly improve the +translation quality of Transformer models for low-resource language pairs. + The scarcity of parallel datasets for low-resource languages can hinder MT +development. To address this, gaHealth was developed, the first bilingual +corpus of health data for the Irish language. Focusing on the health domain, +models developed using this in-domain dataset exhibited very significant +improvements in BLEU score when compared with models from the LoResMT2021 +Shared Task. A subsequent human evaluation using the multidimensional quality +metrics error taxonomy showcased the superior performance of the Transformer +system in reducing both accuracy and fluency errors compared to an RNN-based +counterpart. 
+ Furthermore, this thesis introduces adaptNMT and adaptMLLM, two open-source +applications streamlined for the development, fine-tuning, and deployment of +neural machine translation models. These tools considerably simplify the setup +and evaluation process, making MT more accessible to both developers and +translators. Notably, adaptNMT, grounded in the OpenNMT ecosystem, promotes +eco-friendly natural language processing research by highlighting the +environmental footprint of model development. Fine-tuning of MLLMs by adaptMLLM +demonstrated advancements in translation performance for two low-resource +language pairs: English$\leftrightarrow$Irish and +English$\leftrightarrow$Marathi, compared to baselines from the LoResMT2021 +Shared Task. + +摘要:在當前機器翻譯 (MT) 領域中,Transformer 架構脫穎而出,成為黃金標準,特別是對於高資源語言對。本研究探討其對低資源語言對的效能,包括英語↔愛爾蘭語和英語↔馬拉地語語言對。值得注意的是,本研究識別出最佳超參數和子詞模型類型,以顯著提高 Transformer 模型對低資源語言對的翻譯品質。 +低資源語言的平行資料集的稀缺會阻礙 MT 的發展。為了解決這個問題,開發了 gaHealth,這是愛爾蘭語的第一個雙語健康資料語料庫。專注於健康領域,使用此域內資料集開發的模型在 BLEU 得分方面表現出非常顯著的進步,與 LoResMT2021 共享任務中的模型相比。隨後使用多維品質指標錯誤分類法進行的人工評估顯示,與基於 RNN 的對應模型相比,Transformer 系統在減少準確性和流暢性錯誤方面表現出優異的性能。 +此外,本論文介紹了 adaptNMT 和 adaptMLLM,這兩個開源應用程式簡化了神經機器翻譯模型的開發、微調和部署。這些工具大幅簡化了設定和評估流程,讓 MT 更容易讓開發人員和翻譯人員使用。值得注意的是,adaptNMT 以 OpenNMT 生態系統為基礎,通過強調模型開發的環境足跡來促進生態友好的自然語言處理研究。與 LoResMT2021 共享任務中的基準相比,adaptMLLM 對 MLLM 的微調證明了英語↔愛爾蘭語和英語↔馬拉地語這兩個低資源語言對的翻譯性能進步。 + +##### **Cause and Effect: Can Large Language Models Truly Understand Causality?** +2402.18139v3 by Swagata Ashwani, Kshiteesh Hegde, Nishith Reddy Mannuru, Mayank Jindal, Dushyant Singh Sengar, Krishna Chaitanya Rao Kathala, Dishant Banga, Vinija Jain, Aman Chadha + +With the rise of Large Language Models(LLMs), it has become crucial to +understand their capabilities and limitations in deciphering and explaining the +complex web of causal relationships that language entails. 
Current methods use +either explicit or implicit causal reasoning, yet there is a strong need for a +unified approach combining both to tackle a wide array of causal relationships +more effectively. This research proposes a novel architecture called the Context +Aware Reasoning Enhancement with Counterfactual Analysis (CARE CA) framework to +enhance causal reasoning and explainability. The proposed framework +incorporates an explicit causal detection module with ConceptNet and +counterfactual statements, as well as implicit causal detection through LLMs. +Our framework goes one step further with a layer of counterfactual explanations +to accentuate LLMs' understanding of causality. The knowledge from ConceptNet +enhances the performance of multiple causal reasoning tasks such as causal +discovery, causal identification and counterfactual reasoning. The +counterfactual sentences add explicit knowledge of the "not caused by" scenarios. +By combining these powerful modules, our model aims to provide a deeper +understanding of causal relationships, enabling enhanced interpretability. +Evaluation on benchmark datasets shows improved performance across all metrics, +such as accuracy, precision, recall, and F1 scores. We also introduce +CausalNet, a new dataset accompanied by our code, to facilitate further +research in this domain. 
+ +摘要:隨著大型語言模型 (LLM) 的興起,了解它們在解碼和解釋語言所蘊含的複雜因果關係網路中的能力和限制變得至關重要。目前的技術使用明確或隱含的因果推理,但強烈需要一種統一的方法,結合兩者以更有效地處理廣泛的因果關係。本研究提出了一種稱為情境感知推理增強與反事實分析 (CARE CA) 框架的新架構,以增強因果推理和可解釋性。提出的框架結合了使用 ConceptNet 和反事實陳述的明確因果檢測模組,以及透過 LLM 進行的隱含因果檢測。我們的框架更進一步,加入一層反事實解釋,以強調 LLM 對因果關係的理解。來自 ConceptNet 的知識增強了多項因果推理任務的執行,例如因果發現、因果識別和反事實推理。反事實句加入了未由情境造成的明確知識。透過結合這些強大的模組,我們的模型旨在提供對因果關係更深入的理解,實現增強的可解釋性。基準資料集的評估顯示在所有指標(例如準確度、精確度、召回率和 F1 分數)上都有所提升。我們還引入了 CausalNet,一個新的資料集,並附上了我們的程式碼,以促進在這個領域的進一步研究。 + +##### **Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina** +2402.18600v1 by Yasin Sadeghi Bazargani, Majid Mirzaei, Navid Sobhi, Mirsaeed Abdollahi, Ali Jafarizadeh, Siamak Pedrammehr, Roohallah Alizadehsani, Ru San Tan, Sheikh Mohammed Shariful Islam, U. Rajendra Acharya + +Diabetes mellitus (DM) predisposes patients to vascular complications. +Retinal images and vasculature reflect the body's micro- and macrovascular +health. They can be used to diagnose DM complications, including diabetic +retinopathy (DR), neuropathy, nephropathy, and atherosclerotic cardiovascular +disease, as well as forecast the risk of cardiovascular events. Artificial +intelligence (AI)-enabled systems developed for high-throughput detection of DR +using digitized retinal images have become clinically adopted. Beyond DR +screening, AI integration also holds immense potential to address challenges +associated with the holistic care of the patient with DM. In this work, we aim +to comprehensively review the literature for studies on AI applications based +on retinal images related to DM diagnosis, prognostication, and management. We +will describe the findings of holistic AI-assisted diabetes care, including but +not limited to DR screening, and discuss barriers to implementing such systems, +including issues concerning ethics, data privacy, equitable access, and +explainability. 
With the ability to evaluate the patient's health status vis a +vis DM complication as well as risk prognostication of future cardiovascular +complications, AI-assisted retinal image analysis has the potential to become a +central tool for modern personalized medicine in patients with DM. + +摘要:糖尿病(DM)使患者容易出現血管併發症。 +視網膜影像和血管反映身體的微血管和巨血管健康狀況。它們可用於診斷糖尿病併發症,包括糖尿病視網膜病變(DR)、神經病變、腎病和動脈粥樣硬化性心血管疾病,以及預測心血管事件的風險。為使用數位化視網膜影像進行高通量 DR 檢測而開發的人工智慧(AI)啟用系統已在臨床採用。除了 DR 篩檢外,AI 整合也具有巨大的潛力來應對與糖尿病患者整體照護相關的挑戰。在這項工作中,我們旨在全面回顧基於視網膜影像的 AI 應用相關研究的文獻,這些研究與糖尿病的診斷、預後和管理有關。我們將描述整體 AI 輔助糖尿病照護的發現,包括但不限於 DR 篩檢,並討論實施此類系統的障礙,包括與倫理、資料隱私、公平存取和可解釋性有關的問題。透過評估患者的健康狀況,同時考量糖尿病併發症以及未來心血管併發症的風險預後,AI 輔助視網膜影像分析有潛力成為糖尿病患者現代化個人化醫療的中心工具。 + + +### Medical +|Publish Date|Title|Authors|Homepage|Code| +| :---: | :---: | :---: | :---: | :---: | +|**2025-02-20**|**FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis**|Fadillah Maani et.al.|[2502.14807v1](http://arxiv.org/abs/2502.14807v1)|null| +|**2025-02-20**|**Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning**|Juraj Vladika et.al.|[2502.14765v1](http://arxiv.org/abs/2502.14765v1)|null| +|**2025-02-20**|**MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders**|Maya Varma et.al.|[2502.14753v1](http://arxiv.org/abs/2502.14753v1)|null| +|**2025-02-20**|**Data-Constrained Synthesis of Training Data for De-Identification**|Thomas Vakili et.al.|[2502.14677v1](http://arxiv.org/abs/2502.14677v1)|null| +|**2025-02-20**|**ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation**|Angxiao Yue et.al.|[2502.14637v1](http://arxiv.org/abs/2502.14637v1)|[link](https://github.com/AngxiaoYue/ReQFlow)| +|**2025-02-20**|**MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models**|Shrey Pandit et.al.|[2502.14302v1](http://arxiv.org/abs/2502.14302v1)|null| +|**2025-02-20**|**EyeBench: A Call for 
More Rigorous Evaluation of Retinal Image Enhancement**|Wenhui Zhu et.al.|[2502.14260v1](http://arxiv.org/abs/2502.14260v1)|null| +|**2025-02-19**|**Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning**|Cole Gawin et.al.|[2502.14086v1](http://arxiv.org/abs/2502.14086v1)|null| +|**2025-02-19**|**Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging**|Shansong Wang et.al.|[2502.14064v1](http://arxiv.org/abs/2502.14064v1)|null| +|**2025-02-19**|**VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare**|Anudeex Shetty et.al.|[2502.13775v1](http://arxiv.org/abs/2502.13775v1)|null| +|**2025-02-19**|**PeerQA: A Scientific Question Answering Dataset from Peer Reviews**|Tim Baumgärtner et.al.|[2502.13668v1](http://arxiv.org/abs/2502.13668v1)|[link](https://github.com/ukplab/peerqa)| +|**2025-02-19**|**Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs**|Yushi Feng et.al.|[2502.13555v1](http://arxiv.org/abs/2502.13555v1)|[link](https://github.com/ys-feng/DemoGraph)| +|**2025-02-19**|**MobileViM: A Light-weight and Dimension-independent Vision Mamba for 3D Medical Image Analysis**|Wei Dai et.al.|[2502.13524v1](http://arxiv.org/abs/2502.13524v1)|[link](https://github.com/anthonyweidai/MobileViM_3D)| +|**2025-02-19**|**Unlocking Multimodal Integration in EHRs: A Prompt Learning Framework for Language and Time Series Fusion**|Shuai Niu et.al.|[2502.13509v1](http://arxiv.org/abs/2502.13509v1)|null| +|**2025-02-19**|**Towards a perturbation-based explanation for medical AI as differentiable programs**|Takeshi Abe et.al.|[2502.14001v1](http://arxiv.org/abs/2502.14001v1)|null| +|**2025-02-19**|**RGAR: Recurrence Generation-augmented Retrieval for Factual-aware Medical Question Answering**|Sichu Liang et.al.|[2502.13361v1](http://arxiv.org/abs/2502.13361v1)|null| +|**2025-02-18**|**Adjust for Trust: Mitigating Trust-Induced Inappropriate Reliance on AI 
Assistance**|Tejas Srinivasan et.al.|[2502.13321v1](http://arxiv.org/abs/2502.13321v1)|null| +|**2025-02-18**|**Prediction of Clinical Complication Onset using Neural Point Processes**|Sachini Weerasekara et.al.|[2502.13290v1](http://arxiv.org/abs/2502.13290v1)|null| +|**2025-02-18**|**SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?**|Yucheng Shi et.al.|[2502.13233v1](http://arxiv.org/abs/2502.13233v1)|null| +|**2025-02-18**|**Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions**|Taedong Yun et.al.|[2502.13135v1](http://arxiv.org/abs/2502.13135v1)|null| +|**2025-02-18**|**Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization**|Priyaranjan Pattnayak et.al.|[2502.13108v1](http://arxiv.org/abs/2502.13108v1)|null| +|**2025-02-18**|**Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection**|Athira J Jacob et.al.|[2502.12948v1](http://arxiv.org/abs/2502.12948v1)|null| +|**2025-02-18**|**Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models**|Rubing Li et.al.|[2502.12825v2](http://arxiv.org/abs/2502.12825v2)|null| +|**2025-02-18**|**LLM Safety for Children**|Prasanjit Rath et.al.|[2502.12552v1](http://arxiv.org/abs/2502.12552v1)|[link](https://github.com/Avenge-PRC777/LLM-Safety-For-Children-Code)| +|**2025-02-17**|**Classifiers of Data Sharing Statements in Clinical Trial Records**|Saber Jelodari Mamaghani et.al.|[2502.12362v1](http://arxiv.org/abs/2502.12362v1)|null| +|**2025-02-17**|**Relational Norms for Human-AI Cooperation**|Brian D. 
Earp et.al.|[2502.12102v1](http://arxiv.org/abs/2502.12102v1)|null| +|**2025-02-17**|**Deep Spatio-Temporal Neural Network for Air Quality Reanalysis**|Ammar Kheder et.al.|[2502.11941v1](http://arxiv.org/abs/2502.11941v1)|[link](https://github.com/AmmarKheder/AQ-Net)| +|**2025-02-17**|**Proactive Depot Discovery: A Generative Framework for Flexible Location-Routing**|Site Qu et.al.|[2502.11715v1](http://arxiv.org/abs/2502.11715v1)|null| +|**2025-02-17**|**LLM Agents Making Agent Tools**|Georg Wölflein et.al.|[2502.11705v1](http://arxiv.org/abs/2502.11705v1)|null| +|**2025-02-17**|**MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression**|Linjie Mu et.al.|[2502.11651v1](http://arxiv.org/abs/2502.11651v1)|[link](https://github.com/linjiemu/mmxu)| +|**2025-02-17**|**A Survey of Personalized Large Language Models: Progress and Future Directions**|Jiahong Liu et.al.|[2502.11528v1](http://arxiv.org/abs/2502.11528v1)|null| +|**2025-02-17**|**Variable-frame CNNLSTM for Breast Nodule Classification using Ultrasound Videos**|Xiangxiang Cui et.al.|[2502.11481v1](http://arxiv.org/abs/2502.11481v1)|null| +|**2025-02-17**|**Leveraging Labelled Data Knowledge: A Cooperative Rectification Learning Network for Semi-supervised 3D Medical Image Segmentation**|Yanyan Wang et.al.|[2502.11456v1](http://arxiv.org/abs/2502.11456v1)|[link](https://github.com/Yaan-Wang/CRLN)| +|**2025-02-16**|**A Survey of LLM-based Agents in Medicine: How far are we from Baymax?**|Wenxuan Wang et.al.|[2502.11211v1](http://arxiv.org/abs/2502.11211v1)|null| +|**2025-02-16**|**RT-DEMT: A hybrid real-time acupoint detection model combining mamba and transformer**|Shilong Yang et.al.|[2502.11179v1](http://arxiv.org/abs/2502.11179v1)|[link](https://github.com/sohyu1/rt-demt)| +|**2025-02-16**|**Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications**|Alexandru Lecu 
et.al.|[2502.11108v1](http://arxiv.org/abs/2502.11108v1)|[link](https://github.com/alexlecu/llmkgraph)| +|**2025-02-16**|**Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration**|Xianbing Zhao et.al.|[2502.12204v1](http://arxiv.org/abs/2502.12204v1)|null| +|**2025-02-16**|**CL-MFAP: A Contrastive Learning-Based Multimodal Foundation Model for Molecular Property Prediction and Antibiotic Screening**|Gen Zhou et.al.|[2502.11001v1](http://arxiv.org/abs/2502.11001v1)|[link](https://github.com/clmfap/clmfap)| +|**2025-02-15**|**Automatic Quality Assessment of First Trimester Crown-Rump-Length Ultrasound Images**|Sevim Cengiz et.al.|[2502.10908v1](http://arxiv.org/abs/2502.10908v1)|null| +|**2025-02-15**|**Breaking Down the Hierarchy: A New Approach to Leukemia Classification**|Ibraheem Hamdi et.al.|[2502.10899v1](http://arxiv.org/abs/2502.10899v1)|null| +|**2025-02-15**|**An Empirical Analysis of Uncertainty in Large Language Model Evaluations**|Qiujie Xie et.al.|[2502.10709v1](http://arxiv.org/abs/2502.10709v1)|[link](https://github.com/hasakixie123/llm-evaluator-uncertainty)| +|**2025-02-15**|**Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model**|Jiarui Jin et.al.|[2502.10707v1](http://arxiv.org/abs/2502.10707v1)|[link](https://github.com/pkudigitalhealth/heartlang)| +|**2025-02-15**|**Self-Explaining Hypergraph Neural Networks for Diagnosis Prediction**|Leisheng Yu et.al.|[2502.10689v1](http://arxiv.org/abs/2502.10689v1)|null| +|**2025-02-15**|**ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis**|Xueshen Li et.al.|[2502.10620v1](http://arxiv.org/abs/2502.10620v1)|null| +|**2025-02-15**|**Optimizing CNN Architectures for Advanced Thoracic Disease Classification**|Tejas Mirthipati et.al.|[2502.10614v1](http://arxiv.org/abs/2502.10614v1)|null| +|**2025-02-14**|**PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology 
Report Generation**|Faruk Ahmed et.al.|[2502.10536v1](http://arxiv.org/abs/2502.10536v1)|null| +|**2025-02-14**|**Tempo: Helping Data Scientists and Domain Experts Collaboratively Specify Predictive Modeling Tasks**|Venkatesh Sivaraman et.al.|[2502.10526v2](http://arxiv.org/abs/2502.10526v2)|null| +|**2025-02-14**|**A Robust Attack: Displacement Backdoor Attack**|Yong Li et.al.|[2502.10490v1](http://arxiv.org/abs/2502.10490v1)|null| +|**2025-02-14**|**3D ReX: Causal Explanations in 3D Neuroimaging Classification**|Melane Navaratnarajah et.al.|[2502.12181v1](http://arxiv.org/abs/2502.12181v1)|null| +|**2025-02-14**|**Analyzing Patient Daily Movement Behavior Dynamics Using Two-Stage Encoding Model**|Jin Cui et.al.|[2502.09947v1](http://arxiv.org/abs/2502.09947v1)|null| +|**2025-02-14**|**TransGUNet: Transformer Meets Graph-based Skip Connection for Medical Image Segmentation**|Ju-Hyeon Nam et.al.|[2502.09931v1](http://arxiv.org/abs/2502.09931v1)|null| +|**2025-02-14**|**Video2Policy: Scaling up Manipulation Tasks in Simulation through Internet Videos**|Weirui Ye et.al.|[2502.09886v1](http://arxiv.org/abs/2502.09886v1)|null| +|**2025-02-14**|**HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation**|Tianwei Lin et.al.|[2502.09838v2](http://arxiv.org/abs/2502.09838v2)|[link](https://github.com/dcdmllm/healthgpt)| +|**2025-02-13**|**Incentivize without Bonus: Provably Efficient Model-based Online Multi-agent RL for Markov Games**|Tong Yang et.al.|[2502.09780v1](http://arxiv.org/abs/2502.09780v1)|null| +|**2025-02-13**|**The AI-Therapist Duo: Exploring the Potential of Human-AI Collaboration in Personalized Art Therapy for PICS Intervention**|Bereket A. Yilma et.al.|[2502.09757v1](http://arxiv.org/abs/2502.09757v1)|null| +|**2025-02-13**|**A CNN Approach to Automated Detection and Classification of Brain Tumors**|Md. 
Zahid Hasan et.al.|[2502.09731v1](http://arxiv.org/abs/2502.09731v1)|null| +|**2025-02-13**|**Evaluating GPT's Capability in Identifying Stages of Cognitive Impairment from Electronic Health Data**|Yu Leng et.al.|[2502.09715v1](http://arxiv.org/abs/2502.09715v1)|null| +|**2025-02-13**|**Metamorphic Testing for Pose Estimation Systems**|Matias Duran et.al.|[2502.09460v1](http://arxiv.org/abs/2502.09460v1)|null| +|**2025-02-13**|**Towards Virtual Clinical Trials of Radiology AI with Conditional Generative Modeling**|Benjamin D. Killeen et.al.|[2502.09688v1](http://arxiv.org/abs/2502.09688v1)|null| +|**2025-02-13**|**Mind What You Ask For: Emotional and Rational Faces of Persuasion by Large Language Models**|Wiktoria Mieleszczenko-Kowszewicz et.al.|[2502.09687v1](http://arxiv.org/abs/2502.09687v1)|null| +|**2025-02-13**|**The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**|Danni Feng et.al.|[2502.09247v1](http://arxiv.org/abs/2502.09247v1)|null| +|**2025-02-13**|**From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**|Lukas Buess et.al.|[2502.09242v1](http://arxiv.org/abs/2502.09242v1)|null| +|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null| +|**2025-02-13**|**Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**|Sanskar Sehgal et.al.|[2502.09204v1](http://arxiv.org/abs/2502.09204v1)|null| +|**2025-02-13**|**Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**|Jin Cui et.al.|[2502.09173v1](http://arxiv.org/abs/2502.09173v1)|null| +|**2025-02-13**|**TastepepAI, An artificial intelligence platform for taste peptide de novo design**|Jianda Yue et.al.|[2502.12167v1](http://arxiv.org/abs/2502.12167v1)|null| 
+|**2025-02-12**|**HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**|Valentina Vadori et.al.|[2502.08754v1](http://arxiv.org/abs/2502.08754v1)|[link](https://github.com/Vadori/CytoArk)| +|**2025-02-12**|**Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion**|Lemuel Puglisi et.al.|[2502.08560v1](http://arxiv.org/abs/2502.08560v1)|[link](https://github.com/lemuelpuglisi/brlp)| +|**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null| +|**2025-02-12**|**EEG Artifact Detection and Correction with Deep Autoencoders**|David Aquilué-Llorens et.al.|[2502.08686v1](http://arxiv.org/abs/2502.08686v1)|null| +|**2025-02-12**|**SycEval: Evaluating LLM Sycophancy**|Aaron Fanous et.al.|[2502.08177v1](http://arxiv.org/abs/2502.08177v1)|null| +|**2025-02-12**|**Cancer Vaccine Adjuvant Name Recognition from Biomedical Literature using Large Language Models**|Hasin Rehana et.al.|[2502.09659v1](http://arxiv.org/abs/2502.09659v1)|null| +|**2025-02-11**|**Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?**|Hye Sun Yun et.al.|[2502.07963v1](http://arxiv.org/abs/2502.07963v1)|null| +|**2025-02-11**|**An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating**|Mohammad Ali Labbaf Khaniki et.al.|[2502.07755v1](http://arxiv.org/abs/2502.07755v1)|null| +|**2025-02-11**|**Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension**|Wenbo Gong et.al.|[2502.07752v2](http://arxiv.org/abs/2502.07752v2)|null| +|**2025-02-11**|**The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation**|Raman Dutt 
et.al.|[2502.07516v2](http://arxiv.org/abs/2502.07516v2)|[link](https://github.com/Raman1121/diffusion_memorization)| +|**2025-02-11**|**KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level**|Ruining Deng et.al.|[2502.07288v1](http://arxiv.org/abs/2502.07288v1)|[link](https://github.com/agaldran/kpis)| +|**2025-02-11**|**Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**|Jiaying Lu et.al.|[2502.07158v2](http://arxiv.org/abs/2502.07158v2)|null| +|**2025-02-11**|**Explaining 3D Computed Tomography Classifiers with Counterfactuals**|Joseph Paul Cohen et.al.|[2502.07156v1](http://arxiv.org/abs/2502.07156v1)|[link](https://github.com/ieee8023/ct-counterfactuals)| +|**2025-02-10**|**Interactive Data Harmonization with LLM Agents**|Aécio Santos et.al.|[2502.07132v1](http://arxiv.org/abs/2502.07132v1)|null| +|**2025-02-10**|**Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML**|Mohammad Amir Salari et.al.|[2502.07026v1](http://arxiv.org/abs/2502.07026v1)|null| +|**2025-02-10**|**AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements**|Adriana Eufrosiana Bora et.al.|[2502.07022v1](http://arxiv.org/abs/2502.07022v1)|null| +|**2025-02-10**|**Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium**|Amin Adibi et.al.|[2502.06693v1](http://arxiv.org/abs/2502.06693v1)|null| +|**2025-02-10**|**Automatic Evaluation of Healthcare LLMs Beyond Question-Answering**|Anna Arias-Duart et.al.|[2502.06666v1](http://arxiv.org/abs/2502.06666v1)|null| +|**2025-02-10**|**Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging**|Mohammed Abdul Hafeez Khan et.al.|[2502.06632v1](http://arxiv.org/abs/2502.06632v1)|null| +|**2025-02-10**|**Illegal Waste Detection in Remote Sensing Images: A Case Study**|Federico Gibellini 
et.al.|[2502.06607v2](http://arxiv.org/abs/2502.06607v2)|null| +|**2025-02-10**|**FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model**|Anna Tegon et.al.|[2502.06438v1](http://arxiv.org/abs/2502.06438v1)|null| +|**2025-02-10**|**Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?**|Qingshan Hou et.al.|[2502.06289v1](http://arxiv.org/abs/2502.06289v1)|null| +|**2025-02-10**|**Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning**|Liuqing Chen et.al.|[2502.06134v1](http://arxiv.org/abs/2502.06134v1)|null| +|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null| +|**2025-02-10**|**Can ChatGPT Diagnose Alzheimer's Disease?**|Quoc-Toan Nguyen et.al.|[2502.06907v1](http://arxiv.org/abs/2502.06907v1)|null| +|**2025-02-09**|**Protecting Intellectual Property of EEG-based Neural Networks with Watermarking**|Ahmed Abdelaziz et.al.|[2502.05931v1](http://arxiv.org/abs/2502.05931v1)|[link](https://github.com/Prog-Jacob/watermarking-eeg-models)| +|**2025-02-09**|**Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models**|Shiyu Teng et.al.|[2502.05879v1](http://arxiv.org/abs/2502.05879v1)|null| +|**2025-02-09**|**LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison**|Gabriele De Vito et.al.|[2502.06890v1](http://arxiv.org/abs/2502.06890v1)|null| +|**2025-02-09**|**Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)**|Lokesh Koli et.al.|[2502.07815v1](http://arxiv.org/abs/2502.07815v1)|null| +|**2025-02-09**|**WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch**|Ying Lei 
et.al.|[2502.05783v1](http://arxiv.org/abs/2502.05783v1)|null| +|**2025-02-09**|**RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care**|Ziqi Yang et.al.|[2502.05740v1](http://arxiv.org/abs/2502.05740v1)|null| +|**2025-02-08**|**4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis**|An Zhao et.al.|[2502.05713v1](http://arxiv.org/abs/2502.05713v1)|null| +|**2025-02-08**|**KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy**|Hyunjong Kim et.al.|[2502.05651v1](http://arxiv.org/abs/2502.05651v1)|null| +|**2025-02-08**|**ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports**|Aynur Guluzade et.al.|[2502.05638v1](http://arxiv.org/abs/2502.05638v1)|[link](https://gitlab.cc-asp.fraunhofer.de/health-open/elmtex)| + +#### Abstracts +##### **FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis** +2502.14807v1 by Fadillah Maani, Numan Saeed, Tausifa Saleem, Zaid Farooq, Hussain Alasmawi, Werner Diehl, Ameera Mohammad, Gareth Waring, Saudabi Valappi, Leanne Bricker, Mohammad Yaqub + +Foundation models are becoming increasingly effective in the medical domain, +offering pre-trained models on large datasets that can be readily adapted for +downstream tasks. Despite progress, fetal ultrasound images remain a +challenging domain for foundation models due to their inherent complexity, +often requiring substantial additional training and facing limitations due to +the scarcity of paired multimodal data. To overcome these challenges, here we +introduce FetalCLIP, a vision-language foundation model capable of generating +universal representation of fetal ultrasound images. 
FetalCLIP was pre-trained +using a multimodal learning approach on a diverse dataset of 210,035 fetal +ultrasound images paired with text. This represents the largest paired dataset +of its kind used for foundation model development to date. This unique training +approach allows FetalCLIP to effectively learn the intricate anatomical +features present in fetal ultrasound images, resulting in robust +representations that can be used for a variety of downstream applications. In +extensive benchmarking across a range of key fetal ultrasound applications, +including classification, gestational age estimation, congenital heart defect +(CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all +baselines while demonstrating remarkable generalizability and strong +performance even with limited labeled data. We plan to release the FetalCLIP +model publicly for the benefit of the broader scientific community. + +摘要:基礎模型在醫療領域正變得越來越有效, +提供在大型資料集上預先訓練的模型,可輕鬆適應 +下游任務。儘管有進展,但胎兒超音波影像仍然是 +基礎模型的挑戰領域,因為它們固有的複雜性, +通常需要大量的額外訓練,並且由於配對多模態數據的稀缺而面臨限制。為了克服這些挑戰,我們在此 +介紹 FetalCLIP,一種能夠產生 +胎兒超音波影像通用表示的視覺語言基礎模型。FetalCLIP 使用多模態學習方法在包含 210,035 張胎兒 +超音波影像與文字配對的多樣化資料集上進行預訓練。這代表迄今為止用於基礎模型開發的最大配對資料集。這種獨特的訓練 +方法使 FetalCLIP 能夠有效地學習胎兒超音波影像中存在的複雜解剖特徵,從而產生強大的 +表示,可應用於各種下游應用。在涵蓋一系列關鍵胎兒超音波應用(包括分類、胎齡估算、先天性心臟缺陷 +(CHD) 偵測和胎兒結構分割)的廣泛基準測試中,FetalCLIP 在展現出卓越的泛化能力和強勁的 +效能,即使標記資料有限,也優於所有基準。我們計畫公開發布 FetalCLIP 模型,造福廣大的科學界。 + +##### **Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning** +2502.14765v1 by Juraj Vladika, Ivana Hacajová, Florian Matthes + +Fact verification (FV) aims to assess the veracity of a claim based on +relevant evidence. The traditional approach for automated FV includes a +three-part pipeline relying on short evidence snippets and encoder-only +inference models. 
More recent approaches leverage the multi-turn nature of LLMs +to address FV as a step-by-step problem where questions inquiring additional +context are generated and answered until there is enough information to make a +decision. This iterative method makes the verification process rational and +explainable. While these methods have been tested for encyclopedic claims, +exploration on domain-specific and realistic claims is missing. In this work, +we apply an iterative FV system on three medical fact-checking datasets and +evaluate it with multiple settings, including different LLMs, external web +search, and structured reasoning using logic predicates. We demonstrate +improvements in the final performance over traditional approaches and the high +potential of step-by-step FV systems for domain-specific claims. + +摘要:事實驗證 (FV) 旨在根據相關證據評估主張的真實性。自動化 FV 的傳統方法包括依賴於短證據片段和僅編碼器推論模型的三部分管道。最近的方法利用 LLM 的多輪特性,將 FV 視為一個逐步問題,其中會產生問題來詢問額外背景並回答,直到有足夠的資訊可以做出決定。這種迭代方法使驗證過程合理且可解釋。雖然這些方法已針對百科全書式主張進行測試,但缺乏對特定領域和現實主張的探討。在這項工作中,我們在三個醫學事實查核資料集上應用了一個迭代 FV 系統,並使用多種設定對其進行評估,包括不同的 LLM、外部網路搜尋和使用邏輯謂詞的結構化推理。我們展示了傳統方法的最終效能改進,以及逐步 FV 系統對特定領域主張的高潛力。 + +##### **MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders** +2502.14753v1 by Maya Varma, Ashwin Kumar, Rogier van der Sluijs, Sophie Ostmeier, Louis Blankemeier, Pierre Chambon, Christian Bluethgen, Jip Prince, Curtis Langlotz, Akshay Chaudhari + +Medical images are acquired at high resolutions with large fields of view in +order to capture fine-grained features necessary for clinical decision-making. +Consequently, training deep learning models on medical images can incur large +computational costs. In this work, we address the challenge of downsizing +medical images in order to improve downstream computational efficiency while +preserving clinically-relevant features. 
We introduce MedVAE, a family of six +large-scale 2D and 3D autoencoders capable of encoding medical images as +downsized latent representations and decoding latent representations back to +high-resolution images. We train MedVAE autoencoders using a novel two-stage +training approach with 1,052,730 medical images. Across diverse tasks obtained +from 20 medical image datasets, we demonstrate that (1) utilizing MedVAE latent +representations in place of high-resolution images when training downstream +models can lead to efficiency benefits (up to 70x improvement in throughput) +while simultaneously preserving clinically-relevant features and (2) MedVAE can +decode latent representations back to high-resolution images with high +fidelity. Our work demonstrates that large-scale, generalizable autoencoders +can help address critical efficiency challenges in the medical domain. Our code +is available at https://github.com/StanfordMIMI/MedVAE. + +摘要:医学影像以高解析度和广阔的视野获取,以便捕捉临床决策所需的细微特征。因此,在医学影像上训练深度学习模型可能会产生巨大的计算成本。在这项工作中,我们解决了缩小医学影像以提高下游计算效率同时保留临床相关特征的挑战。我们介绍了 MedVAE,这是一个由六个大型 2D 和 3D 自动编码器组成的系列,能够将医学影像编码为缩小的潜在表示,并将潜在表示解码回高分辨率影像。我们使用一种新颖的两阶段训练方法,利用 1,052,730 张医学影像来训练 MedVAE 自动编码器。在从 20 个医学影像数据集获得的不同任务中,我们证明了 (1) 在训练下游模型时,利用 MedVAE 潜在表示代替高分辨率影像可以带来效率优势(吞吐量提高高达 70 倍),同时保留临床相关特征;(2) MedVAE 可以将潜在表示解码回高分辨率影像,且保真度高。我们的工作表明,大规模、可推广的自动编码器可以帮助解决医学领域的重大效率挑战。我们的代码可在 https://github.com/StanfordMIMI/MedVAE 获得。 + +##### **Data-Constrained Synthesis of Training Data for De-Identification** +2502.14677v1 by Thomas Vakili, Aron Henriksson, Hercules Dalianis + +Many sensitive domains -- such as the clinical domain -- lack widely +available datasets due to privacy risks. The increasing generative capabilities +of large language models (LLMs) have made synthetic datasets a viable path +forward. 
In this study, we domain-adapt LLMs to the clinical domain and +generate synthetic clinical texts that are machine-annotated with tags for +personally identifiable information using capable encoder-based NER models. The +synthetic corpora are then used to train synthetic NER models. The results show +that training NER models using synthetic corpora incurs only a small drop in +predictive performance. The limits of this process are investigated in a +systematic ablation study -- using both Swedish and Spanish data. Our analysis +shows that smaller datasets can be sufficient for domain-adapting LLMs for data +synthesis. Instead, the effectiveness of this process is almost entirely +contingent on the performance of the machine-annotating NER models trained +using the original data. + +摘要:許多敏感領域(例如臨床領域)由於隱私風險而缺乏廣泛可用的資料集。大型語言模型 (LLM) 不斷增強的生成能力已使合成資料集成為可行的途徑。在這項研究中,我們將領域適應 LLM 應用於臨床領域,並生成使用具備編碼器功能的 NER 模型以個人可識別資訊標籤進行機器標註的合成臨床文本。然後使用合成語料庫來訓練合成 NER 模型。結果顯示,使用合成語料庫訓練 NER 模型僅會導致預測效能略微下降。在系統消融研究中調查此程序的限制,同時使用瑞典語和西班牙語資料。我們的分析顯示,較小的資料集足以用於領域適應 LLM 以進行資料合成。相反地,此程序的有效性幾乎完全取決於使用原始資料訓練的機器標註 NER 模型的效能。 + +##### **ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation** +2502.14637v1 by Angxiao Yue, Zichong Wang, Hongteng Xu + +Protein backbone generation plays a central role in de novo protein design +and is significant for many biological and medical applications. Although +diffusion and flow-based generative models provide potential solutions to this +challenging task, they often generate proteins with undesired designability and +suffer computational inefficiency. In this study, we propose a novel rectified +quaternion flow (ReQFlow) matching method for fast and high-quality protein +backbone generation. 
In particular, our method generates a local translation +and a 3D rotation from random noise for each residue in a protein chain, which +represents each 3D rotation as a unit quaternion and constructs its flow by +spherical linear interpolation (SLERP) in an exponential format. We train the +model by quaternion flow (QFlow) matching with guaranteed numerical stability +and rectify the QFlow model to accelerate its inference and improve the +designability of generated protein backbones, leading to the proposed ReQFlow +model. Experiments show that ReQFlow achieves state-of-the-art performance in +protein backbone generation while requiring much fewer sampling steps and +significantly less inference time (e.g., being 37x faster than RFDiffusion and +62x faster than Genie2 when generating a backbone of length 300), demonstrating +its effectiveness and efficiency. The code is available at +https://github.com/AngxiaoYue/ReQFlow. + +摘要:蛋白骨架生成在從頭蛋白質設計中扮演核心角色,且對於許多生物和醫學應用來說意義重大。儘管擴散和基於流的生成模型提供了解決此項挑戰性任務的潛在方案,但它們經常生成具有不受歡迎的可設計性的蛋白質,且遭受運算效率不彰之苦。在本研究中,我們提出了一種新穎的修正四元數流 (ReQFlow) 匹配方法,用於快速且高品質的蛋白質骨架生成。特別是,我們的模型會為蛋白質鏈中的每個殘基從隨機雜訊中生成一個局部平移和一個 3D 旋轉,將每個 3D 旋轉表示為單位四元數,並以指數格式透過球面線性插值 (SLERP) 建構其流。我們透過四元數流 (QFlow) 匹配訓練模型,並保證數值穩定性,並修正 QFlow 模型以加速其推論並改善生成蛋白質骨架的可設計性,進而提出建議的 ReQFlow 模型。實驗顯示,ReQFlow 在蛋白質骨架生成中達成最先進的效能,同時所需採樣步驟少得多,且推論時間大幅減少(例如,在生成長度為 300 的骨架時比 RFDiffusion 快 37 倍,比 Genie2 快 62 倍),證明其有效性和效率。程式碼可在 https://github.com/AngxiaoYue/ReQFlow 取得。 + +##### **MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models** +2502.14302v1 by Shrey Pandit, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, Ying Ding + +Advancements in Large Language Models (LLMs) and their increasing use in +medical question-answering necessitate rigorous evaluation of their +reliability. A critical challenge lies in hallucination, where models generate +plausible yet factually incorrect outputs. 
In the medical domain, this poses +serious risks to patient safety and clinical decision-making. To address this, +we introduce MedHallu, the first benchmark specifically designed for medical +hallucination detection. MedHallu comprises 10,000 high-quality question-answer +pairs derived from PubMedQA, with hallucinated answers systematically generated +through a controlled pipeline. Our experiments show that state-of-the-art LLMs, +including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, +struggle with this binary hallucination detection task, with the best model +achieving an F1 score as low as 0.625 for detecting "hard" category +hallucinations. Using bidirectional entailment clustering, we show that +harder-to-detect hallucinations are semantically closer to ground truth. +Through experiments, we also show that incorporating domain-specific knowledge and +introducing a "not sure" category as one of the answer categories improves the +precision and F1 scores by up to 38% relative to baselines. + +摘要:大型語言模型 (LLM) 的進步及其在醫療問答中的使用日益增加,因此需要嚴格評估其可靠性。一個關鍵的挑戰在於幻覺,模型會產生看似合理但事實上不正確的輸出。在醫療領域,這對患者安全和臨床決策構成嚴重風險。為了解決此問題,我們推出了 MedHallu,這是第一個專門設計用於檢測醫療幻覺的基準。MedHallu 包含 10,000 個從 PubMedQA 衍生的高品質問答對,並透過受控管道系統性地產生幻覺答案。我們的實驗顯示,包括 GPT-4o、Llama-3.1 和經過醫學微調的 UltraMedical 在內的最新 LLM 難以執行這個二元幻覺檢測任務,最佳模型在檢測「困難」類別幻覺時達到的 F1 分數低至 0.625。使用雙向蘊涵聚類,我們表明較難檢測的幻覺在語義上更接近真實。透過實驗,我們還表明,納入特定領域的知識並將「不確定」類別作為其中一個答案類別,可以將精確度和 F1 分數相對於基線提高多達 38%。 + +##### **EyeBench: A Call for More Rigorous Evaluation of Retinal Image Enhancement** +2502.14260v1 by Wenhui Zhu, Xuanzhao Dong, Xin Li, Yujian Xiong, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Zhangsihao Yang, Yi Su, Oana Dumitrascu, Yalin Wang + +Over the past decade, generative models have achieved significant success in +enhancing fundus images. However, the evaluation of these models still +presents a considerable challenge.
A comprehensive evaluation benchmark for +fundus image enhancement is indispensable for three main reasons: 1) The +existing denoising metrics (e.g., PSNR, SSIM) are hard to extend to +downstream real-world clinical research (e.g., vessel morphology consistency). +2) There is a lack of comprehensive evaluation for both paired and unpaired +enhancement methods, along with the need for expert protocols to accurately +assess clinical value. 3) An ideal evaluation system should provide insights to +inform future developments of fundus image enhancement. To this end, we propose +a novel comprehensive benchmark, EyeBench, to provide insights that align +enhancement models with clinical needs, offering a foundation for future work +to improve the clinical relevance and applicability of generative models for +fundus image enhancement. EyeBench has three appealing properties: 1) +multi-dimensional clinical alignment downstream evaluation: In addition to +evaluating the enhancement task, we provide several clinically significant +downstream tasks for fundus images, including vessel segmentation, DR grading, +denoising generalization, and lesion segmentation. 2) Medical expert-guided +evaluation design: We introduce a novel dataset that promotes comprehensive and +fair comparisons between paired and unpaired methods and includes a manual +evaluation protocol by medical experts. 3) Valuable insights: Our benchmark +study provides a comprehensive and rigorous evaluation of existing methods +across different downstream tasks, assisting medical experts in making informed +choices. Additionally, we offer further analysis of the challenges faced by +existing methods.
The code is available at +https://github.com/Retinal-Research/EyeBench + +摘要:在過去的十年中,生成模型在增強眼底影像方面取得了顯著的成功。然而,這些模型的評估仍然是一個相當大的挑戰。一個全面的眼底影像增強評估基準對於三個主要原因是不可或缺的:1) 現有的去噪指標(例如 PSNR、SSIM)很難擴展到下游的真實世界臨床研究(例如血管形態一致性)。2) 缺乏對配對和非配對增強方法的全面評估,以及需要專家協議來準確評估臨床價值。3) 一個理想的評估系統應該提供見解,以告知眼底影像增強的未來發展。為此,我們提出了一個新的綜合基準 EyeBench,以提供見解,將增強模型與臨床需求相結合,為未來的研究奠定基礎,以提高生成模型在眼底影像增強方面的臨床相關性和適用性。EyeBench 有三個吸引人的特性:1) 多維臨床對齊下游評估:除了評估增強任務外,我們還為眼底影像提供了幾個臨床上重要的下游任務,包括血管分割、DR 分級、去噪泛化和病灶分割。2) 醫學專家指導的評估設計:我們引入了一個新的數據集,以促進對配對和非配對方法的全面和公平比較,並包括由醫學專家進行的手動評估協議。3) 有價值的見解:我們的基準研究提供了對現有方法在不同下游任務中的全面且嚴格的評估,協助醫學專家做出明智的選擇。此外,我們還進一步分析了現有方法面臨的挑戰。程式碼可在 https://github.com/Retinal-Research/EyeBench 獲得 + +##### **Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning** +2502.14086v1 by Cole Gawin, Yidan Sun, Mayank Kejriwal + +Large language models (LLMs) have achieved remarkable performance in +generating human-like text and solving reasoning tasks of moderate complexity, +such as question-answering and mathematical problem-solving. However, their +capabilities in tasks requiring deeper cognitive skills, such as common-sense +understanding and abstract reasoning, remain under-explored. In this paper, we +systematically evaluate abstract common-sense reasoning in LLMs using the +ConceptNet knowledge graph. We propose two prompting approaches: instruct +prompting, where models predict plausible semantic relationships based on +provided definitions, and few-shot prompting, where models identify relations +using examples as guidance. Our experiments with the gpt-4o-mini model show +that in instruct prompting, consistent performance is obtained when ranking +multiple relations but with substantial decline when the model is restricted to +predicting only one relation. In few-shot prompting, the model's accuracy +improves significantly when selecting from five relations rather than the full +set, although with notable bias toward certain relations.
These results suggest +significant gaps still, even in commercially used LLMs' abstract common-sense +reasoning abilities, compared to human-level understanding. However, the +findings also highlight the promise of careful prompt engineering, based on +selective retrieval, for obtaining better performance. + +摘要:大型語言模型 (LLM) 在生成類人文本和解決中等複雜度推理任務方面取得了顯著的成果,例如問答和數學問題解決。然而,它們在需要更深層認知技能的任務中的能力,例如常識理解和抽象推理,仍然處於探索不足的階段。在本文中,我們使用 ConceptNet 知識圖系統地評估了 LLM 中的抽象常識推理。我們提出了兩種提示方法:指導提示,其中模型根據提供的定義預測合理的語義關係,以及少次提示,其中模型使用示例作為指導來識別關係。我們使用 gpt-4o-mini 模型進行的實驗表明,在指導提示中,在對多個關係進行排名時獲得了一致的性能,但在模型僅限於預測一個關係時大幅下降。在少次提示中,模型在從五個關係中選擇而不是從完整集合中選擇時,其準確性顯著提高,儘管對某些關係存在顯著偏差。這些結果表明,與人類層面的理解相比,即使在商業使用的 LLM 中,抽象常識推理能力仍然存在顯著差距。然而,這些發現也強調了基於選擇性檢索的仔細提示工程的希望,以獲得更好的性能。 + +##### **Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging** +2502.14064v1 by Shansong Wang, Mojtaba Safari, Qiang Li, Chih-Wei Chang, Richard LJ Qiu, Justin Roper, David S. Yu, Xiaofeng Yang + +Vision foundation models (VFMs) are pre-trained on extensive image datasets +to learn general representations for diverse types of data. These models can +subsequently be fine-tuned for specific downstream tasks, significantly +boosting performance across a broad range of applications. However, existing +vision foundation models that claim to be applicable to various radiology tasks +are mostly pre-trained on 3D computed tomography (CT), which benefits from the +availability of extensive 3D CT databases. Significant differences between CT +and magnetic resonance imaging (MRI) in imaging principles, signal +characteristics, and data distribution may hinder their practical performance +and versatility in MRI-specific applications. Here, we propose Triad, a vision +foundation model for 3D MRI. Triad adopts a widely used autoencoder +architecture to learn robust representations from 131,170 3D MRI volumes and +uses organ-independent imaging descriptions to constrain the semantic +distribution of the visual modality. 
The above pre-training dataset is called +Triad-131K, which is currently the largest 3D MRI pre-training dataset. We +evaluate Triad across three tasks, namely, organ/tumor segmentation, +organ/cancer classification, and medical image registration, in two data +modalities (within-domain and out-of-domain) settings using 25 downstream +datasets. By initializing models with Triad's pre-trained weights, nnUNet-Triad +improves segmentation performance by 6.88% compared to nnUNet-Scratch across 17 +datasets. Swin-B-Triad achieves a 3.97% improvement over Swin-B-Scratch in +classification tasks across five datasets. SwinUNETR-Triad improves by 4.00% +compared to SwinUNETR-Scratch in registration tasks across two datasets. Our +study demonstrates that pre-training can maximize performance when the data +modalities and organs of upstream and downstream tasks are consistent. + +摘要:視覺基礎模型 (VFM) 在廣泛的影像資料集上進行預訓練,以學習各種資料類型的通用表示。這些模型隨後可以針對特定的下游任務進行微調,大幅提升各種應用程式的效能。然而,現有的視覺基礎模型聲稱適用於各種放射學任務,但大多是針對 3D 電腦斷層攝影 (CT) 進行預訓練,這得利於廣泛的 3D CT 資料庫。CT 和磁振造影 (MRI) 在影像原理、訊號特性和資料分佈上的顯著差異,可能會阻礙其在 MRI 特定應用中的實際效能和多功能性。在此,我們提出 Triad,一個適用於 3D MRI 的視覺基礎模型。Triad 採用廣泛使用的自動編碼器架構,從 131,170 個 3D MRI 體積中學習穩健的表示,並使用與器官無關的影像描述來約束視覺模式的語義分佈。上述預訓練資料集稱為 Triad-131K,目前是最大的 3D MRI 預訓練資料集。我們在三個任務中評估 Triad,即器官/腫瘤分割、器官/癌症分類和醫學影像配準,在兩個資料模式(域內和域外)設定中使用 25 個下游資料集。透過使用 Triad 的預訓練權重初始化模型,nnUNet-Triad 在 17 個資料集中的分割效能比 nnUNet-Scratch 提升了 6.88%。Swin-B-Triad 在五個資料集的分類任務中,比 Swin-B-Scratch 提升了 3.97%。SwinUNETR-Triad 在兩個資料集的配準任務中,比 SwinUNETR-Scratch 提升了 4.00%。我們的研究證明,當上游和下游任務的資料模式和器官一致時,預訓練可以最大化效能。 + +##### **VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare** +2502.13775v1 by Anudeex Shetty, Amin Beheshti, Mark Dras, Usman Naseem + +Alignment techniques have become central to ensuring that Large Language +Models (LLMs) generate outputs consistent with human values. 
However, existing +alignment paradigms often model an averaged or monolithic preference, failing +to account for the diversity of perspectives across cultures, demographics, and +communities. This limitation is particularly critical in health-related +scenarios, where plurality is essential due to the influence of culture, +religion, personal values, and conflicting opinions. Despite progress in +pluralistic alignment, no prior work has focused on health, likely due to the +unavailability of publicly available datasets. To address this gap, we +introduce VITAL, a new benchmark dataset comprising 13.1K value-laden +situations and 5.4K multiple-choice questions focused on health, designed to +assess and benchmark pluralistic alignment methodologies. Through extensive +evaluation of eight LLMs of varying sizes, we demonstrate that existing +pluralistic alignment techniques fall short in effectively accommodating +diverse healthcare beliefs, underscoring the need for tailored AI alignment in +specific domains. This work highlights the limitations of current approaches +and lays the groundwork for developing health-specific alignment solutions. + +摘要:對齊技術已成為確保大型語言模型 (LLM) 產生與人類價值觀一致的輸出的核心。然而,現有的對齊範例通常會建模平均或單一的偏好,無法考量跨文化、人口統計和社群的不同觀點。此限制在與健康相關的場景中特別重要,因為在這種場景中,由於文化、宗教、個人價值觀和相互衝突的意見的影響,多元性是必要的。儘管多元對齊已取得進展,但沒有任何先前的工作專注於健康,這可能是因為缺乏公開可用的資料集。為了解決此差距,我們引入了 VITAL,這是一個新的基準資料集,包含 13.1K 個價值觀念的情境和 5.4K 個選擇題,專注於健康,旨在評估和基準多元對齊方法。透過對八個不同規模的 LLM 進行廣泛評估,我們證明現有的多元對齊技術無法有效適應不同的醫療保健信念,這強調了在特定領域中需要量身打造的 AI 對齊。這項工作突顯了當前方法的限制,並為開發特定於健康的對齊解決方案奠定了基礎。 + +##### **PeerQA: A Scientific Question Answering Dataset from Peer Reviews** +2502.13668v1 by Tim Baumgärtner, Ted Briscoe, Iryna Gurevych + +We present PeerQA, a real-world, scientific, document-level Question +Answering (QA) dataset. PeerQA questions have been sourced from peer reviews, +which contain questions that reviewers raised while thoroughly examining the +scientific article. 
Answers have been annotated by the original authors of each +paper. The dataset contains 579 QA pairs from 208 academic articles, with a +majority from ML and NLP, as well as a subset of other scientific communities +like Geoscience and Public Health. PeerQA supports three critical tasks for +developing practical QA systems: Evidence retrieval, unanswerable question +classification, and answer generation. We provide a detailed analysis of the +collected dataset and conduct experiments establishing baseline systems for all +three tasks. Our experiments and analyses reveal the need for +decontextualization in document-level retrieval, where we find that even simple +decontextualization approaches consistently improve retrieval performance +across architectures. On answer generation, PeerQA serves as a challenging +benchmark for long-context modeling, as the papers have an average size of 12k +tokens. Our code and data are available at https://github.com/UKPLab/peerqa. + +摘要:我們提出 PeerQA,一個真實世界、科學的、文件層級的問答 (QA) 資料集。PeerQA 問題來自於同行評審,其中包含審查者在徹底審查科學文章時提出的問題。答案是由每篇論文的原始作者註解的。此資料集包含來自 208 篇學術文章的 579 個 QA 對,其中大部分來自 ML 和 NLP,以及其他科學社群(例如地球科學和公共衛生)的子集。PeerQA 支援開發實用 QA 系統的三項重要任務:證據檢索、無解答問題分類和答案產生。我們提供收集到的資料集的詳細分析,並進行實驗,為所有三項任務建立基準系統。我們的實驗和分析揭示了在文件層級檢索中去脈絡化的必要性,我們發現即使是簡單的去脈絡化方法也能持續改善跨架構的檢索效能。在答案產生方面,PeerQA 是一個用於長脈絡建模的具挑戰性基準,因為論文的平均大小為 12k 個符號。我們的程式碼和資料可於 https://github.com/UKPLab/peerqa 取得。 + +##### **Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs** +2502.13555v1 by Yushi Feng, Tsai Hor Chan, Guosheng Yin, Lequan Yu + +Data augmentation is necessary for graph representation learning due to the +scarcity and noise present in graph data. Most of the existing augmentation +methods overlook the context information inherited from the dataset as they +rely solely on the graph structure for augmentation.
Despite the success of +some large language model-based (LLM) graph learning methods, they are mostly +white-box which require access to the weights or latent features from the +open-access LLMs, making them difficult to be democratized for everyone as +existing LLMs are mostly closed-source for commercial considerations. To +overcome these limitations, we propose a black-box context-driven graph data +augmentation approach, with the guidance of LLMs -- DemoGraph. Leveraging the +text prompt as context-related information, we task the LLM with generating +knowledge graphs (KGs), which allow us to capture the structural interactions +from the text outputs. We then design a dynamic merging schema to +stochastically integrate the LLM-generated KGs into the original graph during +training. To control the sparsity of the augmented graph, we further devise a +granularity-aware prompting strategy and an instruction fine-tuning module, +which seamlessly generates text prompts according to different granularity +levels of the dataset. Extensive experiments on various graph learning tasks +validate the effectiveness of our method over existing graph data augmentation +methods. Notably, our approach excels in scenarios involving electronic health +records (EHRs), which validates its maximal utilization of contextual +knowledge, leading to enhanced predictive performance and interpretability. 
+ +摘要:由於圖表資料的稀少性和雜訊,資料擴充對於圖表表示學習來說是必要的。現有的擴充方法大多忽略了從資料集中繼承的背景資訊,因為它們僅依賴於圖表的結構進行擴充。儘管一些大型語言模型 (LLM) 基於圖表學習方法獲得成功,但它們大多是白盒,需要存取開放式 LLM 的權重或潛在特徵,由於現有的 LLM 主要基於商業考量而封閉原始碼,因此難以讓所有人都能使用。為了克服這些限制,我們提出了一個黑盒背景驅動圖表資料擴充方法,在 LLM 的指導下——DemoGraph。利用文字提示作為與背景相關的資訊,我們讓 LLM 產生知識圖譜 (KG),這讓我們能夠從文字輸出中擷取結構化互動。然後,我們設計了一個動態合併模式,在訓練期間將 LLM 產生的 KG 隨機整合到原始圖表中。為了控制擴充圖表的稀疏性,我們進一步設計了一個粒度感知提示策略和一個指令微調模組,它可以根據資料集的不同粒度層級無縫產生文字提示。在各種圖表學習任務上的大量實驗驗證了我們的方法比現有的圖表資料擴充方法更有效。值得注意的是,我們的做法在涉及電子健康記錄 (EHR) 的場景中表現出色,這驗證了它對上下文知識的最大利用,從而提高了預測效能和可解釋性。 + +##### **MobileViM: A Light-weight and Dimension-independent Vision Mamba for 3D Medical Image Analysis** +2502.13524v1 by Wei Dai, Steven Wang, Jun Liu + +Efficient evaluation of three-dimensional (3D) medical images is crucial for +diagnostic and therapeutic practices in healthcare. Recent years have seen a +substantial uptake in applying deep learning and computer vision to analyse and +interpret medical images. Traditional approaches, such as convolutional neural +networks (CNNs) and vision transformers (ViTs), face significant computational +challenges, prompting the need for architectural advancements. Recent efforts +have led to the introduction of novel architectures like the ``Mamba'' model as +alternative solutions to traditional CNNs or ViTs. The Mamba model excels in +the linear processing of one-dimensional data with low computational demands. +However, Mamba's potential for 3D medical image analysis remains underexplored +and could face significant computational challenges as the dimension increases. +This manuscript presents MobileViM, a streamlined architecture for efficient +segmentation of 3D medical images. In the MobileViM network, we invent a new +dimension-independent mechanism and a dual-direction traversing approach to +incorporate with a vision-Mamba-based framework. MobileViM also features a +cross-scale bridging technique to improve efficiency and accuracy across +various medical imaging modalities. 
With these enhancements, MobileViM achieves +segmentation speeds exceeding 90 frames per second (FPS) on a single graphics +processing unit (i.e., NVIDIA RTX 4090). This performance is over 24 FPS faster +than the state-of-the-art deep learning models for processing 3D images with +the same computational resources. In addition, experimental evaluations +demonstrate that MobileViM delivers superior performance, with Dice similarity +scores reaching 92.72%, 86.69%, 80.46%, and 77.43% for PENGWIN, BraTS2024, +ATLAS, and Toothfairy2 datasets, respectively, which significantly surpasses +existing models. + +摘要:有效評估三維 (3D) 醫學影像對於醫療保健中的診斷和治療實務至關重要。近年來,將深度學習和電腦視覺應用於分析和詮釋醫學影像的應用大幅增加。傳統方法,例如卷積神經網路 (CNN) 和視覺Transformer (ViT),面臨重大的運算挑戰,促使需要架構上的進步。最近的努力已導致引進創新的架構,例如「Mamba」模型,作為傳統 CNN 或 ViT 的替代解決方案。Mamba 模型擅長以低運算需求進行一維資料的線性處理。然而,Mamba 在 3D 醫學影像分析方面的潛力仍未被充分探索,並且隨著維度的增加可能會面臨重大的運算挑戰。本手稿提出 MobileViM,這是一種簡化的架構,可有效分割 3D 醫學影像。在 MobileViM 網路中,我們發明了一種新的與維度無關的機制和雙向遍歷方法,以與基於視覺 Mamba 的架構結合。MobileViM 還具備跨尺度橋接技術,以提高各種醫學影像模式的效率和準確性。透過這些增強功能,MobileViM 在單一顯示卡 (即 NVIDIA RTX 4090) 上達到了每秒超過 90 幀 (FPS) 的分割速度。此效能比現有最先進的深度學習模型快了超過 24 FPS,這些模型使用相同的運算資源處理 3D 影像。此外,實驗評估證明 MobileViM 提供了卓越的效能,Dice 相似性評分對於 PENGWIN、BraTS2024、ATLAS 和 Toothfairy2 資料集分別達到 92.72%、86.69%、80.46% 和 77.43%,顯著超越現有模型。 + +##### **Unlocking Multimodal Integration in EHRs: A Prompt Learning Framework for Language and Time Series Fusion** +2502.13509v1 by Shuai Niu, Jing Ma, Hongzhan Lin, Liang Bai, Zhihua Wang, Wei Bi, Yida Xu, Guo Li, Xian Yang + +Large language models (LLMs) have shown remarkable performance in +vision-language tasks, but their application in the medical field remains +underexplored, particularly for integrating structured time series data with +unstructured clinical notes. In clinical practice, dynamic time series data +such as lab test results capture critical temporal patterns, while clinical +notes provide rich semantic context. 
Merging these modalities is challenging +due to the inherent differences between continuous signals and discrete text. +To bridge this gap, we introduce ProMedTS, a novel self-supervised multimodal +framework that employs prompt-guided learning to unify these heterogeneous data +types. Our approach leverages lightweight anomaly detection to generate anomaly +captions that serve as prompts, guiding the encoding of raw time series data +into informative embeddings. These embeddings are aligned with textual +representations in a shared latent space, preserving fine-grained temporal +nuances alongside semantic insights. Furthermore, our framework incorporates +tailored self-supervised objectives to enhance both intra- and inter-modal +alignment. We evaluate ProMedTS on disease diagnosis tasks using real-world +datasets, and the results demonstrate that our method consistently outperforms +state-of-the-art approaches. + +摘要:大型語言模型(LLM)在視覺語言任務中表現出色,但其在醫療領域的應用仍未得到充分探索,特別是在將結構化時間序列數據與非結構化臨床筆記整合方面。在臨床實務中,動態時間序列數據(例如實驗室檢驗結果)會擷取關鍵的時間模式,而臨床筆記則提供豐富的語意脈絡。由於連續訊號與離散文字之間的固有差異,合併這些方式具有挑戰性。為了彌補這個差距,我們引入了 ProMedTS,這是一個新穎的自監督多模態框架,採用提示引導學習來統一這些異質化的數據類型。我們的做法利用輕量級異常偵測來產生異常標題,作為提示,引導將原始時間序列數據編碼成資訊性的嵌入。這些嵌入與共享潛在空間中的文字表示對齊,同時保留細微的時間差異和語意見解。此外,我們的框架納入了客製化的自監督目標,以增強模態內和模態間對齊。我們在疾病診斷任務中使用真實世界的數據集評估 ProMedTS,結果表明,我們的模型始終優於最先進的方法。 + +##### **Towards a perturbation-based explanation for medical AI as differentiable programs** +2502.14001v1 by Takeshi Abe, Yoshiyuki Asai + +Recent advancement in machine learning algorithms reaches a point where +medical devices can be equipped with artificial intelligence (AI) models for +diagnostic support and routine automation in clinical settings. In medicine and +healthcare, there is a particular demand for sufficient and objective +explainability of the outcome generated by AI models. However, AI models are +generally considered as black boxes due to their complexity, and the +computational process leading to their response is often opaque. 
Although +several methods have been proposed to explain the behavior of models by +evaluating the importance of each feature in discrimination and prediction, +they may suffer from biases and opacities arising from the scale and sampling +protocol of the dataset used for training or testing. To overcome the +shortcomings of existing methods, we explore an alternative approach to provide +an objective explanation of AI models that can be defined independently of the +learning process and does not require additional data. As a preliminary study +for this direction of research, this work examines the numerical availability of +the Jacobian matrix of deep learning models that measures how stably a model +responds to small perturbations added to the input. The indicator, if +available, is calculated from a trained AI model for a given target input. +This is a first step towards a perturbation-based explanation, which will +assist medical practitioners in understanding and interpreting the response of +the AI model in its clinical application. + +摘要:機器學習演算法的最新進展已達到一個階段,醫療裝置可以配備人工智慧 (AI) 模型,以在臨床環境中提供診斷支援和例行自動化。在醫學和保健領域,對於 AI 模型產生的結果有足夠且客觀的可解釋性有特別的需求。然而,由於 AI 模型的複雜性,它們通常被視為黑盒子,而導致其反應的運算過程通常是不透明的。儘管已經提出多種方法來解釋模型的行為,方法是評估每個特徵在判別和預測中的重要性,但它們可能會受到訓練或測試所用資料集的規模和抽樣協定的偏差和不透明性的影響。為了克服現有方法的缺點,我們探索一種替代方法,以提供 AI 模型的客觀解釋,這種方法可以獨立於學習過程定義,而且不需要額外的資料。作為這個研究方向的初步研究,這項工作探討了深度學習模型的雅可比矩陣的數值可用性,它衡量了模型對輸入中新增的小擾動的穩定反應程度。如果可用,指標會從訓練好的 AI 模型計算得出,以取得給定的目標輸入。這是基於擾動的解釋的第一步,它將協助醫療從業人員了解和詮釋 AI 模型在其臨床應用中的反應。 + +##### **RGAR: Recurrence Generation-augmented Retrieval for Factual-aware Medical Question Answering** +2502.13361v1 by Sichu Liang, Linhai Zhang, Hongyu Zhu, Wenwen Wang, Yulan He, Deyu Zhou + +Medical question answering requires extensive access to specialized +conceptual knowledge.
The current paradigm, Retrieval-Augmented Generation +(RAG), acquires expert medical knowledge through large-scale corpus +retrieval and uses this knowledge to guide a general-purpose large language +model (LLM) for generating answers. However, existing retrieval approaches +often overlook the importance of factual knowledge, which limits the relevance +of retrieved conceptual knowledge and restricts its applicability in real-world +scenarios, such as clinical decision-making based on Electronic Health Records +(EHRs). This paper introduces RGAR, a recurrence generation-augmented retrieval +framework that retrieves both relevant factual and conceptual knowledge from +dual sources (i.e., EHRs and the corpus), allowing them to interact and refine +each other. Through extensive evaluation across three factual-aware medical +question answering benchmarks, RGAR establishes a new state-of-the-art +performance among medical RAG systems. Notably, the Llama-3.1-8B-Instruct model +with RGAR surpasses the considerably larger, RAG-enhanced GPT-3.5. Our findings +demonstrate the benefit of extracting factual knowledge for retrieval, which +consistently yields improved generation quality. + +摘要:醫療問題解答需要大量取得專業概念知識。目前的典範,檢索增強生成(RAG),透過大規模語料庫檢索取得專業醫療知識,並使用此知識引導通用大型語言模型(LLM)來產生答案。然而,現有的檢索方法經常忽略事實知識的重要性,這會限制檢索到的概念知識的相關性,並限制其在現實世界情境中的適用性,例如基於電子健康記錄(EHR)的臨床決策制定。本文介紹 RGAR,一個遞迴生成增強檢索架構,從雙重來源(即 EHR 和語料庫)檢索相關的事實和概念知識,讓它們互動並互相精煉。透過在三個事實感知醫療問題解答基準上進行廣泛評估,RGAR 在醫療 RAG 系統中建立了新的最先進效能。值得注意的是,採用 RGAR 的 Llama-3.1-8B-Instruct 模型超越了規模大得多的 RAG 增強型 GPT-3.5。我們的研究結果證明了提取事實知識以進行檢索的好處,這會持續產生改善的生成品質。 + +##### **Adjust for Trust: Mitigating Trust-Induced Inappropriate Reliance on AI Assistance** +2502.13321v1 by Tejas Srinivasan, Jesse Thomason + +Trust biases how users rely on AI recommendations in AI-assisted +decision-making tasks, with low and high levels of trust resulting in increased +under- and over-reliance, respectively.
We propose that AI assistants should +adapt their behavior through trust-adaptive interventions to mitigate such +inappropriate reliance. For instance, when user trust is low, providing an +explanation can elicit more careful consideration of the assistant's advice by +the user. In two decision-making scenarios -- laypeople answering science +questions and doctors making medical diagnoses -- we find that providing +supporting and counter-explanations during moments of low and high trust, +respectively, yields up to 38% reduction in inappropriate reliance and 20% +improvement in decision accuracy. We are similarly able to reduce over-reliance +by adaptively inserting forced pauses to promote deliberation. Our results +highlight how AI adaptation to user trust facilitates appropriate reliance, +presenting exciting avenues for improving human-AI collaboration. + +摘要:信任偏見影響使用者在 AI 輔助決策任務中如何依賴 AI 建議,信任程度低和高分別導致依賴不足和過度依賴。我們建議 AI 助理應透過信任適應式干預調整其行為,以減輕這種不適當的依賴。例如,當使用者信任度低時,提供解釋可以引發使用者更仔細地考慮助理的建議。在兩種決策情境中——外行人回答科學問題和醫生進行醫療診斷——我們發現,分別在信任度低和高的時刻提供支持性和反向解釋,可以將不適當的依賴降低多達 38%,並將決策準確性提高 20%。我們同樣能夠透過適應性地插入強制暫停來促進審議,以減少過度依賴。我們的結果強調 AI 如何適應使用者信任以促進適當的依賴,為改善人機協作提供了令人興奮的途徑。 + +##### **Prediction of Clinical Complication Onset using Neural Point Processes** +2502.13290v1 by Sachini Weerasekara, Sagar Kamarthi, Jacqueline Isaacs + +Predicting medical events in advance within critical care settings is +paramount for patient outcomes and resource management. Utilizing predictive +models, healthcare providers can anticipate issues such as cardiac arrest, +sepsis, or respiratory failure before they manifest. Recently, there has been a +surge in research focusing on forecasting adverse medical event onsets prior to +clinical manifestation using machine learning. However, while these models +provide temporal prognostic predictions for the occurrence of a specific +adverse event of interest within defined time intervals, their interpretability +often remains a challenge. 
In this work, we explore the applicability of neural +temporal point processes in the context of adverse event onset prediction, with +the aim of explaining clinical pathways and providing interpretable insights. +Our experiments span six state-of-the-art neural point processes and six +critical care datasets, each focusing on the onset of distinct adverse events. +This work represents a novel application class of neural temporal point +processes in event prediction. + +摘要:在重症監護環境中預先預測醫療事件對於患者的預後和資源管理至關重要。利用預測模型,醫療保健提供者可以在心臟驟停、敗血症或呼吸衰竭等問題發生之前預測到這些問題。最近,專注於在臨床表現之前使用機器學習預測不良醫療事件發生的研究激增。然而,儘管這些模型為特定不良事件在定義的時間間隔內發生提供了時間預後預測,但它們的可解釋性仍然是一個挑戰。在這項工作中,我們探討了神經時間點過程在不良事件發作預測中的適用性,目的是解釋臨床途徑並提供可解釋的見解。我們的實驗涵蓋了六種最先進的神經點過程和六個重症監護資料集,每個資料集都專注於不同不良事件的發作。這項工作代表了神經時間點過程在事件預測中的一種新的應用類別。 + +##### **SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?** +2502.13233v1 by Yucheng Shi, Tianze Yang, Canyu Chen, Quanzheng Li, Tianming Liu, Xiang Li, Ninghao Liu + +Large Language Models (LLMs) have shown remarkable capabilities in general +domains but often struggle with tasks requiring specialized knowledge. +Conventional Retrieval-Augmented Generation (RAG) techniques typically retrieve +external information from static knowledge bases, which can be outdated or +incomplete, missing fine-grained clinical details essential for accurate +medical question answering. In this work, we propose SearchRAG, a novel +framework that overcomes these limitations by leveraging real-time search +engines. Our method employs synthetic query generation to convert complex +medical questions into search-engine-friendly queries and utilizes +uncertainty-based knowledge selection to filter and incorporate the most +relevant and informative medical knowledge into the LLM's input. Experimental +results demonstrate that our method significantly improves response accuracy in +medical question answering tasks, particularly for complex questions requiring +detailed and up-to-date knowledge. 
+ +摘要:大型語言模型 (LLM) 在一般領域展現出驚人的能力,但經常在需要專業知識的任務中掙扎。 +傳統的檢索增強生成 (RAG) 技術通常從靜態知識庫中檢索外部資訊,這些資訊可能過時或不完整,缺少準確回答醫療問題所需的細微臨床細節。在這項工作中,我們提出 SearchRAG,這是一種新穎的架構,透過利用即時搜尋引擎克服這些限制。我們的模型採用合成查詢生成,將複雜的醫療問題轉換成搜尋引擎友善的查詢,並利用基於不確定性的知識選擇來過濾和納入 LLM 輸入中最相關且最有資訊的醫療知識。實驗結果證明,我們的模型顯著改善了醫療問題回答任務中的回應準確度,特別是需要詳細且最新的知識的複雜問題。 + +##### **Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions** +2502.13135v1 by Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić + +We present an end-to-end framework for generating synthetic users for +evaluating interactive agents designed to encourage positive behavior changes, +such as in health and lifestyle coaching. The synthetic users are grounded in +health and lifestyle conditions, specifically sleep and diabetes management in +this study, to ensure realistic interactions with the health coaching agent. +Synthetic users are created in two stages: first, structured data are generated +grounded in real-world health and lifestyle factors in addition to basic +demographics and behavioral attributes; second, full profiles of the synthetic +users are developed conditioned on the structured data. Interactions between +synthetic users and the coaching agent are simulated using generative +agent-based models such as Concordia, or directly by prompting a language +model. Using two independently-developed agents for sleep and diabetes coaching +as case studies, the validity of this framework is demonstrated by analyzing +the coaching agent's understanding of the synthetic users' needs and +challenges. 
Finally, through multiple blinded evaluations of user-coach +interactions by human experts, we demonstrate that our synthetic users with +health and behavioral attributes more accurately portray real human users with +the same attributes, compared to generic synthetic users not grounded in such +attributes. The proposed framework lays the foundation for efficient +development of conversational agents through extensive, realistic, and grounded +simulated interactions. + +摘要:我們提供了一個端到端的架構,用於為評估互動式代理生成合成使用者,這些代理旨在鼓勵正向行為改變,例如健康和生活方式指導。合成使用者以健康和生活方式狀況為基礎,特別是本研究中的睡眠和糖尿病管理,以確保與健康指導代理的互動具有真實性。合成使用者分兩個階段建立:首先,除了基本人口統計資料和行為屬性外,還會產生以現實世界的健康和生活方式因素為基礎的結構化資料;其次,會根據結構化資料開發合成使用者的完整個人資料。合成使用者和指導代理之間的互動是使用生成式基於代理的模型(例如 Concordia)模擬的,或者直接通過提示語言模型來模擬。使用兩個獨立開發的睡眠和糖尿病指導代理作為案例研究,通過分析指導代理對合成使用者需求和挑戰的理解,證明了此架構的有效性。最後,通過人類專家對使用者指導互動進行多重盲測評估,我們證明了與未以這些屬性為基礎的通用合成使用者相比,具有健康和行為屬性的合成使用者更準確地描繪了具有相同屬性的真實人類使用者。所提出的架構為通過廣泛、真實且有根據的模擬互動,為對話代理的有效開發奠定了基礎。 + +##### **Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization** +2502.13108v1 by Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Amit Agarwal, Bhargava Kumar, Srikant Panda, Tejaswini Kumar + +Clinical Question Answering (CQA) plays a crucial role in medical +decision-making, enabling physicians to extract relevant information from +Electronic Medical Records (EMRs). While transformer-based models such as BERT, +BioBERT, and ClinicalBERT have demonstrated state-of-the-art performance in +CQA, existing models lack the ability to categorize extracted answers, which is +critical for structured retrieval, content filtering, and medical decision +support. + To address this limitation, we introduce a Multi-Task Learning (MTL) +framework that jointly trains CQA models for both answer extraction and medical +categorization. 
In addition to predicting answer spans, our model classifies +responses into five standardized medical categories: Diagnosis, Medication, +Symptoms, Procedure, and Lab Reports. This categorization enables more +structured and interpretable outputs, making clinical QA models more useful in +real-world healthcare settings. + We evaluate our approach on emrQA, a large-scale dataset for medical question +answering. Results show that MTL improves F1-score by 2.2% compared to standard +fine-tuning, while achieving 90.7% accuracy in answer categorization. These +findings suggest that MTL not only enhances CQA performance but also introduces +an effective mechanism for categorization and structured medical information +retrieval. + +摘要:臨床問答 (CQA) 在醫療決策中扮演著至關重要的角色,讓醫師能夠從電子病歷 (EMR) 中擷取相關資訊。儘管 BERT、BioBERT 和 ClinicalBERT 等基於轉換器的模型已在 CQA 中展現出最先進的效能,但現有的模型缺乏分類擷取答案的能力,這對於結構化檢索、內容過濾和醫療決策支援至關重要。 + 為了解決這個限制,我們引進了一個多任務學習 (MTL) 架構,它同時訓練 CQA 模型用於答案擷取和醫療分類。除了預測答案範圍,我們的模型將回應分類為五個標準化醫療類別:診斷、藥物、症狀、程序和實驗室報告。這種分類能產生更結構化且易於理解的輸出,讓臨床問答模型在真實世界的醫療保健環境中更實用。 + 我們在 emrQA 上評估我們的做法,emrQA 是用於醫療問題解答的大規模資料集。結果顯示,與標準微調相比,MTL 將 F1 分數提高了 2.2%,同時在答案分類中達到 90.7% 的準確度。這些發現表明,MTL 不僅增強了 CQA 的效能,還引入了一種分類和結構化醫療資訊檢索的有效機制。 + +##### **Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection** +2502.12948v1 by Athira J Jacob, Puneet Sharma, Daniel Rueckert + +Detection of hyperenhancement from cardiac LGE MRI images is a complex task +requiring significant clinical expertise. Although deep learning-based models +have shown promising results for the task, they require large amounts of data +with fine-grained annotations. Clinical reports generated for cardiac MR +studies contain rich, clinically relevant information, including the location, +extent and etiology of any scars present. 
Although recently developed +CLIP-based training enables pretraining models with image-text pairs, it +requires large amounts of data and further finetuning strategies on downstream +tasks. In this study, we use various strategies rooted in domain knowledge to +train a model for LGE detection solely using text from clinical reports, on a +relatively small clinical cohort of 965 patients. We improve performance +through the use of synthetic data augmentation, by systematically creating scar +images and associated text. In addition, we standardize the orientation of the +images in an anatomy-informed way to enable better alignment of spatial and +text features. We also use a captioning loss to enable fine-grained supervision +and explore the effect of pretraining of the vision encoder on performance. +Finally, ablation studies are carried out to elucidate the contributions of +each design component to the overall performance of the model. + +摘要:從心臟 LGE MRI 影像偵測出過度增強是一項複雜的任務,需要顯著的臨床專業知識。儘管基於深度學習的模型已顯示出對這項任務有前景的結果,但它們需要大量具有細緻註解的資料。為心臟 MR 研究產生的臨床報告包含豐富且臨床上相關的資訊,包括任何疤痕的位置、範圍和病因。儘管最近開發的基於 CLIP 的訓練能使用影像文字對預訓練模型,但它需要大量資料和進一步微調下游任務的策略。在這項研究中,我們使用植基於領域知識的各種策略,僅使用來自臨床報告的文字,在一個相對較小的 965 名患者臨床群體中訓練一個 LGE 偵測模型。我們透過使用合成資料擴充來改善效能,系統性地建立疤痕影像和相關文字。此外,我們以解剖學告知的方式標準化影像方向,以使空間和文字特徵能更好地對齊。我們也使用標題損失來啟用細緻的監督,並探討視覺編碼器的預訓練對效能的影響。最後,進行消融研究以闡明每個設計元件對模型整體效能的貢獻。 + +##### **Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models** +2502.12825v2 by Rubing Li, João Sedoc, Arun Sundararajan + +When encountering increasingly frequent performance improvements or cost +reductions from a new large language model (LLM), developers of applications +leveraging LLMs must decide whether to take advantage of these improvements or +stay with older tried-and-tested models. Low perceived switching frictions can +lead to choices that do not consider more subtle behavior changes that the +transition may induce. 
Our experiments use a popular game-theoretic behavioral +economics model of trust to show stark differences in the trusting behavior of +OpenAI's and DeepSeek's models. We highlight a collapse in the economic trust +behavior of the o1-mini and o3-mini models as they reconcile profit-maximizing +and risk-seeking with future returns from trust, and contrast it with +DeepSeek's more sophisticated and profitable trusting behavior that stems from +an ability to incorporate deeper concepts like forward planning and +theory-of-mind. As LLMs form the basis for high-stakes commercial systems, our +results highlight the perils of relying on LLM performance benchmarks that are +too narrowly defined and suggest that careful analysis of their hidden fault +lines should be part of any organization's AI strategy. + +摘要:在遇到大型語言模型 (LLM) 頻頻帶來的效能提升或成本降低時,利用 LLM 的應用程式開發人員必須決定是否要利用這些提升,或繼續使用較舊且經過驗證的模型。低感知切換摩擦可能會導致選擇,而沒有考慮轉換可能引發的更細微行為變更。我們的實驗使用流行的博弈論行為經濟信任模型,以顯示 OpenAI 和 DeepSeek 模型在信任行為上的顯著差異。我們強調 o1-mini 和 o3-mini 模型的經濟信任行為崩潰,因為它們調和了利潤最大化和冒險,以及來自信任的未來回報,並將其與 DeepSeek 更複雜且有利可圖的信任行為進行對比,這種行為源於整合更深入的概念,例如前瞻性規劃和心智理論。由於 LLM 構成高風險商業系統的基礎,我們的結果突顯了依賴定義過於狹窄的 LLM 效能基準的危險,並建議仔細分析其隱藏的斷層線應該是任何組織 AI 策略的一部分。 + +##### **LLM Safety for Children** +2502.12552v1 by Prasanjit Rath, Hari Shrawgi, Parag Agrawal, Sandipan Dandapat + +This paper analyzes the safety of Large Language Models (LLMs) in +interactions with children below age of 18 years. Despite the transformative +applications of LLMs in various aspects of children's lives such as education +and therapy, there remains a significant gap in understanding and mitigating +potential content harms specific to this demographic. The study acknowledges +the diverse nature of children often overlooked by standard safety evaluations +and proposes a comprehensive approach to evaluating LLM safety specifically for +children. We list down potential risks that children may encounter when using +LLM powered applications. 
Additionally we develop Child User Models that +reflect the varied personalities and interests of children informed by +literature in child care and psychology. These user models aim to bridge the +existing gap in child safety literature across various fields. We utilize Child +User Models to evaluate the safety of six state of the art LLMs. Our +observations reveal significant safety gaps in LLMs particularly in categories +harmful to children but not adults + +摘要:本文分析了大型語言模型 (LLM) 在與 18 歲以下兒童互動時的安全性。儘管 LLM 在兒童生活的各個方面(例如教育和治療)都有轉變性的應用,但在了解和減輕對這個群體具體的潛在內容危害方面仍然存在顯著差距。研究承認兒童的多樣性,而標準安全評估通常會忽略這些多樣性,並提出了一種針對兒童評估 LLM 安全性的綜合方法。我們列出了兒童在使用由 LLM 提供動力的應用程式時可能遇到的潛在風險。此外,我們開發了兒童使用者模型,這些模型反映了兒童不同的個性特質和興趣,並參考了兒童照護和心理學的文獻。這些使用者模型旨在彌合不同領域兒童安全文獻中現有的差距。我們利用兒童使用者模型來評估六個最先進的 LLM 的安全性。我們的觀察結果揭示了 LLM 中的重大安全漏洞,特別是在對兒童有害但對成年人無害的類別中 + +##### **Classifiers of Data Sharing Statements in Clinical Trial Records** +2502.12362v1 by Saber Jelodari Mamaghani, Cosima Strantz, Dennis Toddenroth + +Digital individual participant data (IPD) from clinical trials are +increasingly distributed for potential scientific reuse. The identification of +available IPD, however, requires interpretations of textual data-sharing +statements (DSS) in large databases. Recent advancements in computational +linguistics include pre-trained language models that promise to simplify the +implementation of effective classifiers based on textual inputs. In a subset of +5,000 textual DSS from ClinicalTrials.gov, we evaluate how well classifiers +based on domain-specific pre-trained language models reproduce original +availability categories as well as manually annotated labels. Typical metrics +indicate that classifiers that predicted manual annotations outperformed those +that learned to output the original availability categories. 
This suggests that +the textual DSS descriptions contain applicable information that the +availability categories do not, and that such classifiers could thus aid the +automatic identification of available IPD in large trial databases. + +摘要:臨床試驗的數位個人參與者資料 (IPD) 愈來愈廣泛地用於潛在的科學再利用。然而,要找出可用的 IPD,需要對大型資料庫中的文字資料共享聲明 (DSS) 進行詮釋。計算語言學最近的進展包括預先訓練的語言模型,有望簡化根據文字輸入實作有效分類器的過程。在 ClinicalTrials.gov 中的 5,000 個文字 DSS 子集中,我們評估了基於特定領域預先訓練語言模型的分類器,在重現原始可用性類別以及手動註解標籤方面的表現。典型的指標顯示,預測手動註解的分類器優於學會輸出原始可用性類別的分類器。這表示文字 DSS 說明包含可用性類別所沒有的適用資訊,而且此類分類器因此有助於在大型試驗資料庫中自動找出可用的 IPD。 + +##### **Relational Norms for Human-AI Cooperation** +2502.12102v1 by Brian D. Earp, Sebastian Porsdam Mann, Mateo Aboy, Edmond Awad, Monika Betzler, Marietjie Botes, Rachel Calcott, Mina Caraccio, Nick Chater, Mark Coeckelbergh, Mihaela Constantinescu, Hossein Dabbagh, Kate Devlin, Xiaojun Ding, Vilius Dranseika, Jim A. C. Everett, Ruiping Fan, Faisal Feroz, Kathryn B. Francis, Cindy Friedman, Orsolya Friedrich, Iason Gabriel, Ivar Hannikainen, Julie Hellmann, Arasj Khodadade Jahrome, Niranjan S. Janardhanan, Paul Jurcys, Andreas Kappes, Maryam Ali Khan, Gordon Kraft-Todd, Maximilian Kroner Dale, Simon M. Laham, Benjamin Lange, Muriel Leuenberger, Jonathan Lewis, Peng Liu, David M. Lyreskog, Matthijs Maas, John McMillan, Emilian Mihailov, Timo Minssen, Joshua Teperowski Monrad, Kathryn Muyskens, Simon Myers, Sven Nyholm, Alexa M. Owen, Anna Puzio, Christopher Register, Madeline G. Reinecke, Adam Safron, Henry Shevlin, Hayate Shimizu, Peter V. Treit, Cristina Voinea, Karen Yan, Anda Zahiu, Renwen Zhang, Hazem Zohny, Walter Sinnott-Armstrong, Ilina Singh, Julian Savulescu, Margaret S. Clark + +How we should design and interact with social artificial intelligence depends +on the socio-relational role the AI is meant to emulate or occupy. 
In human +society, relationships such as teacher-student, parent-child, neighbors, +siblings, or employer-employee are governed by specific norms that prescribe or +proscribe cooperative functions including hierarchy, care, transaction, and +mating. These norms shape our judgments of what is appropriate for each +partner. For example, workplace norms may allow a boss to give orders to an +employee, but not vice versa, reflecting hierarchical and transactional +expectations. As AI agents and chatbots powered by large language models are +increasingly designed to serve roles analogous to human positions - such as +assistant, mental health provider, tutor, or romantic partner - it is +imperative to examine whether and how human relational norms should extend to +human-AI interactions. Our analysis explores how differences between AI systems +and humans, such as the absence of conscious experience and immunity to +fatigue, may affect an AI's capacity to fulfill relationship-specific functions +and adhere to corresponding norms. This analysis, which is a collaborative +effort by philosophers, psychologists, relationship scientists, ethicists, +legal experts, and AI researchers, carries important implications for AI +systems design, user behavior, and regulation. While we accept that AI systems +can offer significant benefits such as increased availability and consistency +in certain socio-relational roles, they also risk fostering unhealthy +dependencies or unrealistic expectations that could spill over into human-human +relationships. We propose that understanding and thoughtfully shaping (or +implementing) suitable human-AI relational norms will be crucial for ensuring +that human-AI interactions are ethical, trustworthy, and favorable to human +well-being. 
+ +摘要:我們應如何設計和與社交人工智慧互動,取決於人工智慧預期要模仿或扮演的社會關係角色。在人類社會中,師生、父母子女、鄰居、兄弟姐妹或雇主員工等關係受特定規範所支配,這些規範規定或禁止包括等級、照顧、交易和交配在內的合作功能。這些規範形塑我們對每個夥伴適當行為的判斷。例如,職場規範可能允許老闆對員工發號施令,但反之則不行,這反映了等級和交易的期望。隨著由大型語言模型驅動的人工智慧代理程式和聊天機器人日益被設計為服務類似於人類職位的角色,例如助理、心理健康提供者、導師或浪漫伴侶,審查人類關係規範是否以及如何延伸至人類與人工智慧的互動至關重要。我們的分析探討了人工智慧系統和人類之間的差異,例如缺乏意識體驗和對疲勞的免疫力,如何影響人工智慧履行特定關係功能和遵守相應規範的能力。這項分析是由哲學家、心理學家、關係科學家、倫理學家、法律專家和人工智慧研究人員共同合作的成果,對人工智慧系統設計、使用者行為和法規具有重要的意義。雖然我們接受人工智慧系統可以在某些社會關係角色中提供顯著的好處,例如增加可用性和一致性,但它們也可能助長不健康的依賴關係或不切實際的期望,這些期望可能會蔓延到人際關係中。我們提出,理解和深思熟慮地塑造(或實施)適當的人類與人工智慧關係規範,對於確保人類與人工智慧的互動具有倫理性、可信賴性和有利於人類福祉至關重要。 + +##### **Deep Spatio-Temporal Neural Network for Air Quality Reanalysis** +2502.11941v1 by Ammar Kheder, Benjamin Foreback, Lili Wang, Zhi-Song Liu, Michael Boy + +Air quality prediction is key to mitigating health impacts and guiding +decisions, yet existing models tend to focus on temporal trends while +overlooking spatial generalization. We propose AQ-Net, a spatiotemporal +reanalysis model for both observed and unobserved stations in the near future. +AQ-Net utilizes the LSTM and multi-head attention for the temporal regression. +We also propose a cyclic encoding technique to ensure continuous time +representation. To learn fine-grained spatial air quality estimation, we +incorporate AQ-Net with the neural kNN to explore feature-based interpolation, +such that we can fill the spatial gaps given coarse observation stations. To +demonstrate the efficiency of our model for spatiotemporal reanalysis, we use +data from 2013-2017 collected in northern China for PM2.5 analysis. Extensive +experiments show that AQ-Net excels in air quality reanalysis, highlighting the +potential of hybrid spatio-temporal models to better capture environmental +dynamics, especially in urban areas where both spatial and temporal variability +are critical. 
+ +摘要:空气品质预测是减轻健康影响和指导决策的关键,但现有的模型倾向于关注时间趋势,而忽略空间概化。我们提出了 AQ-Net,这是一种时空再分析模型,适用于近期内已观测和未观测到的站点。AQ-Net 利用 LSTM 和多头注意力进行时间回归。我们还提出了一种循环编码技术来确保时间表示的连续性。为了学习细粒度的空间空气质量估计,我们将 AQ-Net 与神经 kNN 结合起来,以探索基于特征的插值,以便我们能够填充给定粗略观测站的空间空白。为了展示我们的模型在时空再分析中的效率,我们使用了 2013-2017 年在中国北部收集的 PM2.5 分析数据。大量的实验表明,AQ-Net 在空气质量再分析中表现出色,突出了混合时空模型在更好地捕捉环境动态方面的潜力,尤其是在空间和时间变异性都很关键的城市地区。 + +##### **Proactive Depot Discovery: A Generative Framework for Flexible Location-Routing** +2502.11715v1 by Site Qu, Guoqiang Hu + +The Location-Routing Problem (LRP), which combines the challenges of facility +(depot) locating and vehicle route planning, is critically constrained by the +reliance on predefined depot candidates, limiting the solution space and +potentially leading to suboptimal outcomes. Previous research on LRP without +predefined depots is scant and predominantly relies on heuristic algorithms +that iteratively attempt depot placements across a planar area. Such approaches +lack the ability to proactively generate depot locations that meet specific +geographic requirements, revealing a notable gap in current research landscape. +To bridge this gap, we propose a data-driven generative DRL framework, designed +to proactively generate depots for LRP without predefined depot candidates, +solely based on customer requests data which include geographic and demand +information. It can operate in two distinct modes: direct generation of exact +depot locations, and the creation of a multivariate Gaussian distribution for +flexible depots sampling. By extracting depots' geographic pattern from +customer requests data, our approach can dynamically respond to logistical +needs, identifying high-quality depot locations that further reduce total +routing costs compared to traditional methods. 
Extensive experiments +demonstrate that, for a same group of customer requests, compared with those +depots identified through random attempts, our framework can proactively +generate depots that lead to superior solution routes with lower routing cost. +The implications of our framework potentially extend into real-world +applications, particularly in emergency medical rescue and disaster relief +logistics, where rapid establishment and adjustment of depot locations are +paramount, showcasing its potential in addressing LRP for dynamic and +unpredictable environments. + +摘要:地點路線問題(LRP)結合了設施(倉庫)定位和車輛路線規劃的挑戰,嚴重受到預先定義的倉庫候選限制,限制了解決方案空間,並可能導致次優結果。先前關於沒有預先定義倉庫的 LRP 研究很少,而且主要依賴於啟發式演算法,在平面區域中反覆嘗試倉庫配置。這種方法無法主動產生符合特定地理需求的倉庫位置,顯示了當前研究領域的顯著差距。為了彌補這個差距,我們提出一個資料驅動的生成式 DRL 架構,旨在主動為 LRP 產生倉庫,而無需預先定義的倉庫候選,僅根據包含地理和需求資訊的客戶要求資料。它可以在兩種不同的模式下運作:直接產生確切的倉庫位置,以及建立多元高斯分布以進行彈性倉庫抽樣。透過從客戶要求資料中提取倉庫的地理模式,我們的方法可以動態回應後勤需求,找出高品質的倉庫位置,進一步降低與傳統方法相比的總路線成本。廣泛的實驗證明,對於同一組客戶要求,與透過隨機嘗試識別的那些倉庫相比,我們的架構可以主動產生倉庫,並產生路線成本較低的優質解決方案路線。我們的架構的影響潛在地擴展到實際應用,特別是在緊急醫療救援和災害救災後勤方面,其中倉庫位置的快速建立和調整至關重要,展示了其在解決動態和不可預測環境的 LRP 中的潛力。 + +##### **LLM Agents Making Agent Tools** +2502.11705v1 by Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelović, Jakob Nikolas Kather + +Tool use has turned large language models (LLMs) into powerful agents that +can perform complex multi-step tasks by dynamically utilising external software +components. However, these tools must be implemented in advance by human +developers, hindering the applicability of LLM agents in domains which demand +large numbers of highly specialised tools, like in life sciences and medicine. +Motivated by the growing trend of scientific studies accompanied by public code +repositories, we propose ToolMaker, a novel agentic framework that autonomously +transforms papers with code into LLM-compatible tools. 
Given a short task +description and a repository URL, ToolMaker autonomously installs required +dependencies and generates code to perform the task, using a closed-loop +self-correction mechanism to iteratively diagnose and rectify errors. To +evaluate our approach, we introduce a benchmark comprising 15 diverse and +complex computational tasks spanning both medical and non-medical domains with +over 100 unit tests to objectively assess tool correctness and robustness. +ToolMaker correctly implements 80% of the tasks, substantially outperforming +current state-of-the-art software engineering agents. ToolMaker therefore is a +step towards fully autonomous agent-based scientific workflows. + +摘要:工具使用已將大型語言模型 (LLM) 轉變為強大的代理,可透過動態使用外部軟體元件來執行複雜的多步驟任務。然而,這些工具必須事先由人類開發人員實作,這會阻礙 LLM 代理在需要大量高度專業化工具的領域(例如生命科學和醫學)中的應用性。受到伴隨公開程式碼儲存庫的科學研究趨勢所啟發,我們提出 ToolMaker,一個創新的代理架構,可自主地將帶有程式碼的論文轉換為相容於 LLM 的工具。給定簡短的任務描述和儲存庫網址,ToolMaker 會自主安裝所需的依賴項,並產生程式碼來執行任務,使用閉環自我修正機制來反覆診斷和糾正錯誤。為了評估我們的做法,我們引進一個包含 15 個不同且複雜的運算任務的基準,涵蓋醫療和非醫療領域,並包含超過 100 個單元測試,以客觀評估工具的正確性和穩健性。ToolMaker 正確實作了 80% 的任務,大幅優於目前的最新軟體工程代理。因此,ToolMaker 是邁向完全自主的基於代理的科學工作流程的一步。 + +##### **MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression** +2502.11651v1 by Linjie Mu, Zhongzhen Huang, Shengqian Qin, Yakun Zhu, Shaoting Zhang, Xiaofan Zhang + +Large vision-language models (LVLMs) have shown great promise in medical +applications, particularly in visual question answering (MedVQA) and diagnosis +from medical images. However, existing datasets and models often fail to +consider critical aspects of medical diagnostics, such as the integration of +historical records and the analysis of disease progression over time. In this +paper, we introduce MMXU (Multimodal and MultiX-ray Understanding), a novel +dataset for MedVQA that focuses on identifying changes in specific regions +between two patient visits. 
Unlike previous datasets that primarily address +single-image questions, MMXU enables multi-image questions, incorporating both +current and historical patient data. We demonstrate the limitations of current +LVLMs in identifying disease progression on MMXU-*test*, even those that +perform well on traditional benchmarks. To address this, we propose a +MedRecord-Augmented Generation (MAG) approach, incorporating both global and +regional historical records. Our experiments show that integrating historical +records significantly enhances diagnostic accuracy by at least 20%, bridging +the gap between current LVLMs and human expert performance. Additionally, we +fine-tune models with MAG on MMXU-*dev*, which demonstrates notable +improvements. We hope this work could illuminate the avenue of advancing the +use of LVLMs in medical diagnostics by emphasizing the importance of historical +context in interpreting medical images. Our dataset is released at +https://github.com/linjiemu/MMXU. 
+ +摘要:大型視覺語言模型 (LVLMs) 已在醫療應用中展現出極大的潛力,特別是在視覺問答 (MedVQA) 和醫學影像診斷方面。然而,現有的資料集和模型常常無法考量醫療診斷的關鍵層面,例如病歷整合以及隨著時間推移對疾病進程的分析。在本文中,我們介紹 MMXU(多模態多 X 光理解),一個專注於識別兩次患者就診之間特定區域變化的 MedVQA 新資料集。與主要處理單一影像問題的先前資料集不同,MMXU 支援多影像問題,同時納入當前和病史患者資料。我們展示了現有 LVLMs 在 MMXU-*test* 中識別疾病進程的限制,即使是在傳統基準測試中表現良好的 LVLMs 也是如此。為了解決這個問題,我們提出了一個病歷增強生成 (MAG) 方法,結合了全域和區域病史。我們的實驗顯示,整合病歷可顯著提升至少 20% 的診斷準確度,縮小了現有 LVLMs 和人類專家表現之間的差距。此外,我們在 MMXU-*dev* 上微調帶有 MAG 的模型,這展示了顯著的進步。我們希望這項工作能透過強調病史脈絡在解讀醫學影像中的重要性,為推進 LVLMs 在醫療診斷中的應用開闢道路。我們的資料集已於 https://github.com/linjiemu/MMXU 發布。 + +##### **A Survey of Personalized Large Language Models: Progress and Future Directions** +2502.11528v1 by Jiahong Liu, Zexuan Qiu, Zhongyang Li, Quanyu Dai, Jieming Zhu, Minda Hu, Menglin Yang, Irwin King + +Large Language Models (LLMs) excel in handling general knowledge tasks, yet +they struggle with user-specific personalization, such as understanding +individual emotions, writing styles, and preferences. Personalized Large +Language Models (PLLMs) tackle these challenges by leveraging individual user +data, such as user profiles, historical dialogues, content, and interactions, +to deliver responses that are contextually relevant and tailored to each user's +specific needs. This is a highly valuable research topic, as PLLMs can +significantly enhance user satisfaction and have broad applications in +conversational agents, recommendation systems, emotion recognition, medical +assistants, and more. This survey reviews recent advancements in PLLMs from +three technical perspectives: prompting for personalized context (input level), +finetuning for personalized adapters (model level), and alignment for +personalized preferences (objective level). To provide deeper insights, we also +discuss current limitations and outline several promising directions for future +research. 
Updated information about this survey can be found at the +https://github.com/JiahongLiu21/Awesome-Personalized-Large-Language-Models. + +摘要:大型語言模型 (LLM) 在處理一般知識任務方面表現出色,但 +它們在使用者特定的個人化方面有困難,例如理解 +個別的情緒、寫作風格和偏好。個人化大型 +語言模型 (PLLM) 透過利用個別使用者的 +資料來解決這些挑戰,例如使用者個人資料、歷史對話、內容和互動, +提供在脈絡上相關且針對每個使用者的特定需求量身打造的回應。這是一個非常有價值的研究主題,因為 PLLM 可以 +顯著提升使用者滿意度,並在對話代理、推薦系統、情緒辨識、醫療 +助理等方面有廣泛的應用。這項調查從三個技術觀點回顧 PLLM 的最新進展:提示個人化脈絡(輸入層級)、微調個人化適配器(模型層級),以及對齊個人化偏好(目標層級)。為了提供更深入的見解,我們也 +討論目前的限制,並概述未來研究的幾個有希望的方向。這項調查的最新資訊可以在 +https://github.com/JiahongLiu21/Awesome-Personalized-Large-Language-Models 找到。 + +##### **Variable-frame CNNLSTM for Breast Nodule Classification using Ultrasound Videos** +2502.11481v1 by Xiangxiang Cui, Zhongyu Li, Xiayue Fan, Peng Huang, Ying Wang, Meng Yang, Shi Chang, Jihua Zhu + +The intersection of medical imaging and artificial intelligence has become an +important research direction in intelligent medical treatment, particularly in +the analysis of medical images using deep learning for clinical diagnosis. +Despite the advances, existing keyframe classification methods lack extraction +of time series features, while ultrasonic video classification based on +three-dimensional convolution requires uniform frame numbers across patients, +resulting in poor feature extraction efficiency and model classification +performance. This study proposes a novel video classification method based on +CNN and LSTM, introducing NLP's long and short sentence processing scheme into +video classification for the first time. The method reduces CNN-extracted image +features to 1x512 dimension, followed by sorting and compressing feature +vectors for LSTM training. Specifically, feature vectors are sorted by patient +video frame numbers and populated with padding value 0 to form variable +batches, with invalid padding values compressed before LSTM training to +conserve computing resources. 
Experimental results demonstrate that our +variable-frame CNNLSTM method outperforms other approaches across all metrics, +showing improvements of 3-6% in F1 score and 1.5% in specificity compared to +keyframe methods. The variable-frame CNNLSTM also achieves better accuracy and +precision than equal-frame CNNLSTM. These findings validate the effectiveness +of our approach in classifying variable-frame ultrasound videos and suggest +potential applications in other medical imaging modalities. + +摘要:醫學影像與人工智慧的交叉領域已成為智慧醫療的重要研究方向,特別是在臨床診斷中使用深度學習分析醫學影像。儘管有進展,現有的關鍵影格分類方法缺乏時間序列特徵的提取,而基於三維卷積的超音波影片分類需要患者之間的均勻影格數,導致特徵提取效率差和模型分類效能不佳。本研究提出了一種基於 CNN 和 LSTM 的新影片分類方法,首次將 NLP 的長短句處理機制引入影片分類中。該方法將 CNN 提取的影像特徵縮減為 1x512 維度,然後對特徵向量進行排序和壓縮以進行 LSTM 訓練。具體來說,特徵向量按患者影片影格數排序,並填充 0 補齊值以形成可變批次,在 LSTM 訓練前壓縮無效的補齊值以節省運算資源。實驗結果表明,我們的可變影格 CNNLSTM 方法在所有指標上都優於其他方法,與關鍵影格方法相比,F1 分數提高了 3-6%,特異性提高了 1.5%。可變影格 CNNLSTM 也比等影格 CNNLSTM 達到了更好的準確度和精確度。這些發現驗證了我們的方法在分類可變影格超音波影片中的有效性,並表明在其他醫學影像模式中具有潛在的應用。 + +##### **Leveraging Labelled Data Knowledge: A Cooperative Rectification Learning Network for Semi-supervised 3D Medical Image Segmentation** +2502.11456v1 by Yanyan Wang, Kechen Song, Yuyuan Liu, Shuai Ma, Yunhui Yan, Gustavo Carneiro + +Semi-supervised 3D medical image segmentation aims to achieve accurate +segmentation using few labelled data and numerous unlabelled data. The main +challenge in the design of semi-supervised learning methods consists in the +effective use of the unlabelled data for training. A promising solution +consists of ensuring consistent predictions across different views of the data, +where the efficacy of this strategy depends on the accuracy of the +pseudo-labels generated by the model for this consistency learning strategy. In +this paper, we introduce a new methodology to produce high-quality +pseudo-labels for a consistency learning strategy to address semi-supervised 3D +medical image segmentation. The methodology has three important contributions. 
+The first contribution is the Cooperative Rectification Learning Network (CRLN) +that learns multiple prototypes per class to be used as external knowledge +priors to adaptively rectify pseudo-labels at the voxel level. The second +contribution consists of the Dynamic Interaction Module (DIM) to facilitate +pairwise and cross-class interactions between prototypes and multi-resolution +image features, enabling the production of accurate voxel-level clues for +pseudo-label rectification. The third contribution is the Cooperative Positive +Supervision (CPS), which optimises uncertain representations to align with +unassertive representations of their class distributions, improving the model's +accuracy in classifying uncertain regions. Extensive experiments on three +public 3D medical segmentation datasets demonstrate the effectiveness and +superiority of our semi-supervised learning method. + +摘要:半监督 3D 医学影像分割旨在使用少量标记数据和大量未标记数据实现精确分割。半监督学习方法设计中的主要挑战在于有效使用未标记数据进行训练。一个有前景的解决方案是确保数据不同视图之间预测的一致性,其中此策略的有效性取决于模型为这种一致性学习策略生成的伪标签的准确性。在本文中,我们引入了一种新的方法来为一致性学习策略生成高质量的伪标签,以解决半监督 3D 医学图像分割问题。该方法有三个重要的贡献。第一个贡献是协作修正学习网络 (CRLN),它为每个类别学习多个原型,用作外部知识先验,以在体素级别自适应地修正伪标签。第二个贡献包括动态交互模块 (DIM),以促进原型和多分辨率图像特征之间的成对和跨类交互,从而能够生成用于伪标签修正的准确体素级线索。第三个贡献是协作正监督 (CPS),它优化不确定的表示以与其类分布的不确定表示保持一致,从而提高模型对不确定区域进行分类的准确性。在三个公共 3D 医学分割数据集上进行的大量实验表明了我们半监督学习方法的有效性和优越性。 + +##### **A Survey of LLM-based Agents in Medicine: How far are we from Baymax?** +2502.11211v1 by Wenxuan Wang, Zizhan Ma, Zheng Wang, Chenghan Wu, Wenting Chen, Xiang Li, Yixuan Yuan + +Large Language Models (LLMs) are transforming healthcare through the +development of LLM-based agents that can understand, reason about, and assist +with medical tasks. This survey provides a comprehensive review of LLM-based +agents in medicine, examining their architectures, applications, and +challenges. 
We analyze the key components of medical agent systems, including +system profiles, clinical planning mechanisms, medical reasoning frameworks, +and external capacity enhancement. The survey covers major application +scenarios such as clinical decision support, medical documentation, training +simulations, and healthcare service optimization. We discuss evaluation +frameworks and metrics used to assess these agents' performance in healthcare +settings. While LLM-based agents show promise in enhancing healthcare delivery, +several challenges remain, including hallucination management, multimodal +integration, implementation barriers, and ethical considerations. The survey +concludes by highlighting future research directions, including advances in +medical reasoning inspired by recent developments in LLM architectures, +integration with physical systems, and improvements in training simulations. +This work provides researchers and practitioners with a structured overview of +the current state and future prospects of LLM-based agents in medicine. + +摘要:大型語言模型 (LLM) 透過開發可理解、推理並協助醫療任務的 LLM 基礎代理人,轉變了醫療保健。本調查提供了 LLM 基礎代理人在醫學中的全面回顧,探討其架構、應用和挑戰。我們分析了醫療代理系統的主要組成部分,包括系統概況、臨床規劃機制、醫療推理架構和外部能力提升。本調查涵蓋了主要的應用場景,例如臨床決策支援、醫療文件、訓練模擬和醫療保健服務最佳化。我們討論了用於評估這些代理人在醫療保健環境中表現的評估架構和指標。雖然 LLM 基礎代理人顯示出在增強醫療保健提供方面的潛力,但仍有許多挑戰,包括幻覺管理、多模態整合、實施障礙和倫理考量。本調查最後強調了未來的研究方向,包括受 LLM 架構近期發展啟發的醫療推理進展、與物理系統的整合和訓練模擬的改進。這項工作為研究人員和從業人員提供了 LLM 基礎代理人在醫學中當前狀態和未來前景的結構化概觀。 + +##### **RT-DEMT: A hybrid real-time acupoint detection model combining mamba and transformer** +2502.11179v1 by Shilong Yang, Qi Zang, Chulong Zhang, Lingfeng Huang, Yaoqin Xie + +Traditional Chinese acupuncture methods often face controversy in clinical +practice due to their high subjectivity. Additionally, current +intelligent-assisted acupuncture systems have two major limitations: slow +acupoint localization speed and low accuracy. 
To address these limitations, a +new method leverages the excellent inference efficiency of the state-space +model Mamba, while retaining the advantages of the attention mechanism in the +traditional DETR architecture, to achieve efficient global information +integration and provide high-quality feature information for acupoint +localization tasks. Furthermore, by employing the concept of residual +likelihood estimation, it eliminates the need for complex upsampling processes, +thereby accelerating the acupoint localization task. Our method achieved +state-of-the-art (SOTA) accuracy on a private dataset of acupoints on the human +back, with an average Euclidean distance pixel error (EPE) of 7.792 and an +average time consumption of 10.05 milliseconds per localization task. Compared +to the second-best algorithm, our method improved both accuracy and speed by +approximately 14%. This significant advancement not only enhances the efficacy +of acupuncture treatment but also demonstrates the commercial potential of +automated acupuncture robot systems. Access to our method is available at +https://github.com/Sohyu1/RT-DEMT + +摘要:傳統的中醫針灸方法由於其高度主觀性,在臨床實務中經常面臨爭議。此外,現有的智慧輔助針灸系統有兩大限制:取穴速度慢以及準確度低。為了解決這些限制,一種新的方法利用了狀態空間模型 Mamba 優異的推理效率,同時保留了傳統 DETR 架構中注意力機制的優點,以實現高效的全局資訊整合,並為取穴任務提供高品質的特徵資訊。此外,透過採用殘差似然估計的概念,它消除了對複雜上採樣程序的需求,從而加速了取穴任務。我們的模型在人體背部穴位私人資料集上達到了最先進 (SOTA) 的準確度,平均歐幾里得距離像素誤差 (EPE) 為 7.792,平均每個取穴任務耗時 10.05 毫秒。與第二好的演算法相比,我們的模型在準確度和速度上都提高了大約 14%。這項重大進展不僅提高了針灸治療的療效,也證明了自動化針灸機器人系統的商業潛力。我們的模型可以在 https://github.com/Sohyu1/RT-DEMT 取得 + +##### **Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications** +2502.11108v1 by Alexandru Lecu, Adrian Groza, Lezan Hawizy + +Large language models (LLMs) have significantly advanced the field of natural +language generation. However, they frequently generate unverified outputs, +which compromises their reliability in critical applications. 
In this study, we +propose an innovative framework that combines structured biomedical knowledge +with LLMs through a retrieval-augmented generation technique. Our system +develops a thorough knowledge graph by identifying and refining causal +relationships and named entities from medical abstracts related to age-related +macular degeneration (AMD). Using a vector-based retrieval process and a +locally deployed language model, our framework produces responses that are both +contextually relevant and verifiable, with direct references to clinical +evidence. Experimental results show that this method notably decreases +hallucinations, enhances factual precision, and improves the clarity of +generated responses, providing a robust solution for advanced biomedical +chatbot applications. + +摘要:大型語言模型 (LLM) 已大幅推動自然語言生成的領域。然而,它們經常產生未經驗證的輸出,這會損害它們在關鍵應用中的可靠性。在本研究中,我們提出了一個創新的框架,透過檢索增強生成技術,將結構化的生物醫學知識與 LLM 結合。我們的系統透過識別和精煉與年齡相關性黃斑部病變 (AMD) 相關的醫學摘要中的因果關係和命名實體,開發一個徹底的知識圖譜。我們的框架使用基於向量的檢索流程和本地部署的語言模型,產生在脈絡上相關且可驗證的回應,並直接參考臨床證據。實驗結果顯示,此方法顯著減少了幻覺、增強了事實準確性,並改善了生成回應的清晰度,為先進的生物醫學聊天機器人應用程式提供了穩健的解決方案。 + +##### **Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration** +2502.12204v1 by Xianbing Zhao, Yiqing Lyu, Di Wang, Buzhou Tang + +Automatic depression detection provides cues for early clinical intervention +by clinicians. Clinical interviews for depression detection involve dialogues +centered around multiple themes. Existing studies primarily design end-to-end +neural network models to capture the hierarchical structure of clinical +interview dialogues. However, these methods exhibit defects in modeling the +thematic content of clinical interviews: 1) they fail to capture intra-theme +and inter-theme correlation explicitly, and 2) they do not allow clinicians to +intervene and focus on themes of interest. To address these issues, this paper +introduces an interactive depression detection framework. 
This framework +leverages in-context learning techniques to identify themes in clinical +interviews and then models both intra-theme and inter-theme correlation. +Additionally, it employs AI-driven feedback to simulate the interests of +clinicians, enabling interactive adjustment of theme importance. PDIMC achieves +absolute improvements of 35\% and 12\% compared to the state-of-the-art on the +depression detection dataset DAIC-WOZ, which demonstrates the effectiveness of +modeling theme correlation and incorporating interactive external feedback. + +摘要:自動憂鬱症偵測提供臨床醫師早期臨床介入的線索。憂鬱症偵測的臨床訪談涉及以多個主題為中心的對話。現有研究主要設計端對端的類神經網路模型來捕捉臨床訪談對話的階層結構。然而,這些方法在建模臨床訪談的主題內容時表現出缺陷:1)它們無法明確捕捉主題內和主題間的關聯性,以及 2)它們不允許臨床醫師介入並專注於感興趣的主題。為了解決這些問題,本文介紹了一個互動式憂鬱症偵測框架。此框架利用情境學習技術來識別臨床訪談中的主題,然後對主題內和主題間的關聯性進行建模。此外,它採用 AI 驅動的回饋來模擬臨床醫師的興趣,實現主題重要性的互動式調整。與 DAIC-WOZ 憂鬱症偵測資料集上的最新技術相比,PDIMC 的絕對改進率分別為 35% 和 12%,這證明了對主題關聯性建模和納入互動式外部回饋的有效性。 + +##### **CL-MFAP: A Contrastive Learning-Based Multimodal Foundation Model for Molecular Property Prediction and Antibiotic Screening** +2502.11001v1 by Gen Zhou, Sugitha Janarthanan, Yutong Lu, Pingzhao Hu + +Due to the rise in antimicrobial resistance, identifying novel compounds with +antibiotic potential is crucial for combatting this global health issue. +However, traditional drug development methods are costly and inefficient. +Recognizing the pressing need for more effective solutions, researchers have +turned to machine learning techniques to streamline the prediction and +development of novel antibiotic compounds. While foundation models have shown +promise in antibiotic discovery, current mainstream efforts still fall short of +fully leveraging the potential of multimodal molecular data. Recent studies +suggest that contrastive learning frameworks utilizing multimodal data exhibit +excellent performance in representation learning across various domains. 
+Building upon this, we introduce CL-MFAP, an unsupervised contrastive learning +(CL)-based multimodal foundation (MF) model specifically tailored for +discovering small molecules with potential antibiotic properties (AP) using +three types of molecular data. This model employs 1.6 million bioactive +molecules with drug-like properties from the ChEMBL dataset to jointly pretrain +three encoders: (1) a transformer-based encoder with rotary position embedding +for processing SMILES strings; (2) another transformer-based encoder, +incorporating a novel bi-level routing attention mechanism to handle molecular +graph representations; and (3) a Morgan fingerprint encoder using a multilayer +perceptron, to achieve the contrastive learning purpose. The CL-MFAP +outperforms baseline models in antibiotic property prediction by effectively +utilizing different molecular modalities and demonstrates superior +domain-specific performance when fine-tuned for antibiotic-related property +prediction tasks. + +摘要:由於抗菌藥物抗性上升,找出具有抗生素潛力的新型化合物對於對抗此項全球性健康議題至關重要。不過,傳統的藥物開發方法成本高昂且效率不彰。研究人員體認到對於更有效解決方案的迫切需求,因此轉向機器學習技術來簡化新型抗生素化合物的預測和開發。儘管基礎模型在抗生素發現方面展現潛力,目前的普遍做法仍未充分利用多模態分子資料的潛力。最近的研究顯示,利用多模態資料的對比學習架構在各種領域的表徵學習中展現出優異的效能。有鑑於此,我們引進 CL-MFAP,一種無監督對比學習 (CL) 為基礎的多模態基礎 (MF) 模型,專門用於使用三種類型的分子資料發現具有潛在抗生素特性的低分子。此模型採用 ChEMBL 資料集中的 160 萬個具有類藥物特性的生物活性分子,以聯合預訓練三個編碼器:(1) 一個具有旋轉位置嵌入的基於Transformer的編碼器,用於處理 SMILES 字串;(2) 另一個基於Transformer的編碼器,結合一種新穎的雙層路由注意機制來處理分子圖表表徵;以及 (3) 一個使用多層感知器的 Morgan 指紋編碼器,以達成對比學習的目的。CL-MFAP 透過有效利用不同的分子模式在抗生素特性預測方面優於基準模型,並且在針對抗生素相關特性預測任務進行微調時展現出優異的特定領域效能。 + +##### **Automatic Quality Assessment of First Trimester Crown-Rump-Length Ultrasound Images** +2502.10908v1 by Sevim Cengiz, Ibraheem Hamdi, Mohammad Yaqub + +Fetal gestational age (GA) is vital clinical information that is estimated +during pregnancy in order to assess fetal growth. 
This is usually performed by +measuring the crown-rump-length (CRL) on an ultrasound image in the Dating scan +which is then correlated with fetal age and growth trajectory. A major issue +when performing the CRL measurement is ensuring that the image is acquired at +the correct view, otherwise it could be misleading. Although clinical +guidelines specify the criteria for the correct CRL view, sonographers may not +regularly adhere to such rules. In this paper, we propose a new deep +learning-based solution that is able to verify the adherence of a CRL image to +clinical guidelines in order to assess image quality and facilitate accurate +estimation of GA. We first segment out important fetal structures then use the +localized structures to perform a clinically-guided mapping that verifies the +adherence of criteria. The segmentation method combines the benefits of +Convolutional Neural Network (CNN) and the Vision Transformer (ViT) to segment +fetal structures in ultrasound images and localize important fetal landmarks. +For segmentation purposes, we compare our proposed work with UNet and show that +our CNN/ViT-based method outperforms an optimized version of UNet. Furthermore, +we compare the output of the mapping with classification CNNs when assessing +the clinical criteria and the overall acceptability of CRL images. We show that +the proposed mapping is not only explainable but also more accurate than the +best performing classification CNNs. 
+ +摘要:胎兒妊娠年齡 (GA) 是重要的臨床資訊,會在懷孕期間估計,以評估胎兒生長。這通常是透過在約會掃描中測量超音波影像中的頭臀長度 (CRL) 來執行,然後與胎兒年齡和生長軌跡相關聯。執行 CRL 測量時的一個主要問題是確保影像是在正確的視角下取得,否則可能會產生誤導。儘管臨床指南規定了正確 CRL 視角的標準,但超音波檢查員可能不會定期遵守這些規則。在本文中,我們提出了一個新的深度學習解決方案,能夠驗證 CRL 影像是否符合臨床指南,以評估影像品質並促進對 GA 的準確估計。我們首先分割出重要的胎兒結構,然後使用局部結構來執行臨床指導的對應,以驗證標準的遵守情況。分割方法結合了卷積神經網路 (CNN) 和視覺轉換器 (ViT) 的優點,以分割超音波影像中的胎兒結構並定位重要的胎兒標誌。為了分割目的,我們將我們提出的工作與 UNet 進行比較,並顯示我們基於 CNN/ViT 的方法優於 UNet 的最佳化版本。此外,我們在評估臨床標準和 CRL 影像的整體可接受性時,將對應的輸出與分類 CNN 進行比較。我們表明,所提出的對應不僅可以解釋,而且比效能最佳的分類 CNN 更準確。 + +##### **Breaking Down the Hierarchy: A New Approach to Leukemia Classification** +2502.10899v1 by Ibraheem Hamdi, Hosam El-Gendy, Ahmed Sharshar, Mohamed Saeed, Muhammad Ridzuan, Shahrukh K. Hashmi, Naveed Syed, Imran Mirza, Shakir Hussain, Amira Mahmoud Abdalla, Mohammad Yaqub + +The complexities inherent to leukemia, a multifaceted cancer affecting white +blood cells, pose considerable diagnostic and treatment challenges, primarily +due to reliance on laborious morphological analyses and expert judgment that +are susceptible to errors. Addressing these challenges, this study presents a +refined, comprehensive strategy leveraging advanced deep-learning techniques +for the classification of leukemia subtypes. We commence by developing a +hierarchical label taxonomy, paving the way for differentiating between various +subtypes of leukemia. The research further introduces a novel hierarchical +approach, inspired by clinical procedures, that is capable of accurately classifying +diverse types of leukemia alongside reactive and healthy cells. An integral +part of this study involves a meticulous examination of the performance of +Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) as +classifiers. The proposed method exhibits an impressive success rate, achieving +approximately 90\% accuracy across all leukemia subtypes, as substantiated by +our experimental results. 
A visual representation of the experimental findings +is provided to enhance the model's explainability and aid in understanding the +classification process. + +摘要:白血病的复杂性源于它是一种影响白血球的多面性癌症,主要由于依赖费力的形态分析和容易出错的专家判断,因此带来了相当大的诊断和治疗挑战。为了应对这些挑战,本研究提出了一种精细且全面的策略,利用先进的深度学习技术对白血病亚型进行分类。我们首先开发了一个分层的标签分类法,为区分白血病的各种亚型铺平了道路。该研究进一步引入了一种新颖的分层方法,该方法受临床程序的启发,能够准确地对各种类型的白血病以及反应性和健康细胞进行分类。本研究的一个组成部分涉及对卷积神经网络 (CNN) 和视觉变压器 (ViT) 作为分类器的性能进行细致检查。所提出的方法展示了令人印象深刻的成功率,在所有白血病亚型中实现了大约 90% 的准确率,我们的实验结果证实了这一点。提供了实验结果的可视化表示,以增强模型的可解释性并帮助理解分类过程。 + +##### **An Empirical Analysis of Uncertainty in Large Language Model Evaluations** +2502.10709v1 by Qiujie Xie, Qingqiu Li, Zhuohao Yu, Yuejie Zhang, Yue Zhang, Linyi Yang + +As LLM-as-a-Judge emerges as a new paradigm for assessing large language +models (LLMs), concerns have been raised regarding the alignment, bias, and +stability of LLM evaluators. While substantial work has focused on alignment +and bias, little research has concentrated on the stability of LLM evaluators. +In this paper, we conduct extensive experiments involving 9 widely used LLM +evaluators across 2 different evaluation settings to investigate the +uncertainty in model-based LLM evaluations. We pinpoint that LLM evaluators +exhibit varying uncertainty based on model families and sizes. With careful +comparative analyses, we find that employing special prompting strategies, +whether during inference or post-training, can alleviate evaluation uncertainty +to some extent. By utilizing uncertainty to enhance LLM's reliability and +detection capability in Out-Of-Distribution (OOD) data, we further fine-tune an +uncertainty-aware LLM evaluator named ConfiLM using a human-annotated +fine-tuning set and assess ConfiLM's OOD evaluation ability on a manually +designed test set sourced from the 2024 Olympics. 
Experimental results +demonstrate that incorporating uncertainty as additional information during the +fine-tuning phase can largely improve the model's evaluation performance in OOD +scenarios. The code and data are released at: +https://github.com/hasakiXie123/LLM-Evaluator-Uncertainty. + +摘要:隨著 LLM 作為法官的新典範出現,用於評估大型語言模型 (LLM) 的 LLM 評估器在對齊、偏差和穩定性方面引發了關注。儘管大量工作集中在對齊和偏差上,但很少有研究集中在 LLM 評估器的穩定性上。在本文中,我們進行了廣泛的實驗,涉及 9 個廣泛使用的 LLM 評估器,跨越 2 個不同的評估設定,以調查基於模型的 LLM 評估中的不確定性。我們精確指出 LLM 評估器根據模型系列和大小表現出不同的不確定性。通過仔細的比較分析,我們發現採用特殊的提示策略(無論是在推理過程中還是訓練後)可以在一定程度上緩解評估不確定性。通過利用不確定性來增強 LLM 在 Out-Of-Distribution (OOD) 數據中的可靠性和檢測能力,我們進一步微調了一個名為 ConfiLM 的不確定性感知 LLM 評估器,使用人工註釋的微調設置,並評估 ConfiLM 在手動設計的、來自 2024 年奧運會的測試集上的 OOD 評估能力。實驗結果表明,在微調階段將不確定性作為附加信息納入其中可以在很大程度上提高模型在 OOD 場景中的評估性能。代碼和數據發布於: +https://github.com/hasakiXie123/LLM-Evaluator-Uncertainty。 + +##### **Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model** +2502.10707v1 by Jiarui Jin, Haoyu Wang, Hongyan Li, Jun Li, Jiahui Pan, Shenda Hong + +Electrocardiogram (ECG) is essential for the clinical diagnosis of +arrhythmias and other heart diseases, but deep learning methods based on ECG +often face limitations due to the need for high-quality annotations. Although +previous ECG self-supervised learning (eSSL) methods have made significant +progress in representation learning from unannotated ECG data, they typically +treat ECG signals as ordinary time-series data, segmenting the signals using +fixed-size and fixed-step time windows, which often ignore the form and rhythm +characteristics and latent semantic relationships in ECG signals. In this work, +we introduce a novel perspective on ECG signals, treating heartbeats as words +and rhythms as sentences. Based on this perspective, we first designed the +QRS-Tokenizer, which generates semantically meaningful ECG sentences from the +raw ECG signals. 
Building on these, we then propose HeartLang, a novel +self-supervised learning framework for ECG language processing, learning +general representations at form and rhythm levels. Additionally, we construct +the largest heartbeat-based ECG vocabulary to date, which will further advance +the development of ECG language processing. We evaluated HeartLang across six +public ECG datasets, where it demonstrated robust competitiveness against other +eSSL methods. Our data and code are publicly available at +https://github.com/PKUDigitalHealth/HeartLang. + +摘要:心電圖 (ECG) 對於心律不整和其他心臟疾病的臨床診斷至關重要,但基於心電圖的深度學習方法通常會因需要高品質註解而面臨限制。儘管先前的 ECG 自我監督學習 (eSSL) 方法在從未註解的 ECG 資料中學習表徵方面取得顯著進展,但它們通常將 ECG 訊號視為普通的時間序列資料,使用固定大小和固定步長的時窗對訊號進行分段,這通常會忽略 ECG 訊號中的形式和節律特徵以及潛在的語義關係。在這項工作中,我們對 ECG 訊號引入了新的觀點,將心跳視為單字,將節律視為句子。基於此觀點,我們首先設計了 QRS-Tokenizer,它從原始 ECG 訊號中產生語義有意義的 ECG 句子。在此基礎上,我們提出了 HeartLang,一種用於 ECG 語言處理的新型自我監督學習框架,在形式和節律層面上學習一般表徵。此外,我們構建了迄今為止最大的基於心跳的 ECG 詞彙表,這將進一步促進 ECG 語言處理的發展。我們在六個公開的 ECG 資料集上評估了 HeartLang,它展示了與其他 eSSL 方法相比的強大競爭力。我們的資料和程式碼可在 https://github.com/PKUDigitalHealth/HeartLang 公開取得。 + +##### **Self-Explaining Hypergraph Neural Networks for Diagnosis Prediction** +2502.10689v1 by Leisheng Yu, Yanxiao Cai, Minxing Zhang, Xia Hu + +The burgeoning volume of electronic health records (EHRs) has enabled deep +learning models to excel in predictive healthcare. However, for high-stakes +applications such as diagnosis prediction, model interpretability remains +paramount. Existing deep learning diagnosis prediction models with intrinsic +interpretability often assign attention weights to every past diagnosis or +hospital visit, providing explanations lacking flexibility and succinctness. In +this paper, we introduce SHy, a self-explaining hypergraph neural network +model, designed to offer personalized, concise and faithful explanations that +allow for interventions from clinical experts. 
By modeling each patient as a +unique hypergraph and employing a message-passing mechanism, SHy captures +higher-order disease interactions and extracts distinct temporal phenotypes as +personalized explanations. It also addresses the incompleteness of the EHR data +by accounting for essential false negatives in the original diagnosis record. A +qualitative case study and extensive quantitative evaluations on two real-world +EHR datasets demonstrate the superior predictive performance and +interpretability of SHy over existing state-of-the-art models. + +摘要:隨著電子健康紀錄 (EHR) 數量的激增,深度學習模型在預測保健方面表現出色。然而,對於診斷預測等高風險應用,模型的可解釋性仍然至關重要。現有的具有內在可解釋性的深度學習診斷預測模型通常會為每個過去的診斷或醫院就診分配注意力權重,提供的解釋缺乏靈活性且簡潔性。在本文中,我們介紹了 SHy,這是一個自解釋的超圖神經網路模型,旨在提供個性化、簡潔且忠實的解釋,讓臨床專家可以進行干預。通過將每個患者建模為一個獨特的超圖並採用訊息傳遞機制,SHy 捕捉到了高階疾病交互作用,並提取出不同的時間表型作為個性化解釋。它還通過考慮原始診斷記錄中的基本假陰性來解決電子健康紀錄資料的不完整性。對兩個真實世界電子健康紀錄資料集進行的定性案例研究和廣泛的定量評估表明,SHy 在預測效能和可解釋性方面優於現有的最先進模型。 + +##### **ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis** +2502.10620v1 by Xueshen Li, Xinlong Hou, Ziyi Huang, Yu Gan + +Recent advancements in large language models (LLMs) have demonstrated +extraordinary comprehension capabilities with remarkable breakthroughs on +various vision-language tasks. However, the application of LLMs in generating +reliable medical diagnostic reports remains in the early stages. Currently, +medical LLMs typically feature a passive interaction model where doctors +respond to patient queries with little or no involvement in analyzing medical +images. In contrast, some ChatBots simply respond to predefined queries based +on visual inputs, lacking interactive dialogue or consideration of medical +history. As such, there is a gap between LLM-generated patient-ChatBot +interactions and those occurring in actual patient-doctor consultations. 
To +bridge this gap, we develop an LLM-based dialogue system, namely proactive +multi-round vision-language interactions for computer-aided diagnosis +(ProMRVL-CAD), to generate patient-friendly disease diagnostic reports. The +proposed ProMRVL-CAD system allows proactive dialogue to provide patients with +constant and reliable medical access by integrating a knowledge graph into +a recommendation system. Specifically, we devise two generators: a Proactive +Question Generator (Pro-Q Gen) to generate proactive questions that guide the +diagnostic procedure and a Multi-Vision Patient-Text Diagnostic Report +Generator (MVP-DR Gen) to produce high-quality diagnostic reports. Evaluated on +two real-world publicly available datasets, MIMIC-CXR and IU-Xray, our model +generates medical reports of better quality. We further demonstrate that +ProMRVL remains robust in scenarios with low image +quality. Moreover, we have created a synthetic medical dialogue dataset that +simulates proactive diagnostic interactions between patients and doctors, +serving as a valuable resource for training LLMs. 
+ +摘要:大型語言模型 (LLM) 最近的進展已展現出非凡的理解能力,在各種視覺語言任務中取得了顯著的突破。然而,LLM 在產生可靠的醫療診斷報告中的應用仍處於早期階段。目前,醫療 LLM 通常採用被動互動模式,醫生對患者的疑問做出回應,但很少或根本不參與分析醫療影像。相比之下,有些聊天機器人僅根據視覺輸入回應預先定義的查詢,缺乏互動對話或對病史的考量。因此,LLM 產生的患者聊天機器人互動與實際患者醫生諮詢之間存在差距。為了彌合這一差距,我們開發了一個基於 LLM 的對話系統,即主動多輪視覺語言互動,用於電腦輔助診斷 (ProMRVL-CAD),以產生對患者友善的疾病診斷報告。建議的 ProMRVL-CAD 系統允許主動對話,透過將知識圖譜整合到推薦系統中,為患者提供持續且可靠的醫療管道。具體來說,我們設計了兩個產生器:主動問題產生器 (Pro-Q Gen),用於產生引導診斷程序的主動問題,以及多視覺患者文字診斷報告產生器 (MVP-DR Gen),用於產生高品質的診斷報告。評估兩個真實世界公開可用的資料集,MIMIC-CXR 和 IU-Xray,我們的模型在產生醫療報告方面品質較佳。我們進一步證明 ProMRVL 的效能,在影像品質低的情況下仍能穩健運行。此外,我們建立了一個模擬患者和醫生之間主動診斷互動的合成醫療對話資料集,作為訓練 LLM 的寶貴資源。 + +##### **Optimizing CNN Architectures for Advanced Thoracic Disease Classification** +2502.10614v1 by Tejas Mirthipati + +Machine learning, particularly convolutional neural networks (CNNs), has +shown promise in medical image analysis, especially for thoracic disease +detection using chest X-ray images. In this study, we evaluate various CNN +architectures, including binary classification, multi-label classification, and +ResNet50 models, to address challenges like dataset imbalance, variations in +image quality, and hidden biases. We introduce advanced preprocessing +techniques such as principal component analysis (PCA) for image compression and +propose a novel class-weighted loss function to mitigate imbalance issues. Our +results highlight the potential of CNNs in medical imaging but emphasize that +issues like unbalanced datasets and variations in image acquisition methods +must be addressed for optimal model performance. 
+ +摘要:機器學習,特別是卷積神經網路 (CNN) 已在醫學影像分析中展現出潛力,特別是使用胸部 X 光影像進行胸腔疾病偵測。在此研究中,我們評估各種 CNN 架構,包括二元分類、多標籤分類和 ResNet50 模型,以解決資料集不平衡、影像品質差異和隱藏偏差等挑戰。我們導入進階前處理技術,例如主成分分析 (PCA) 以進行影像壓縮,並提出一個新穎的類別加權損失函數來緩解不平衡問題。我們的結果突顯了 CNN 在醫學影像中的潛力,但強調必須解決資料集不平衡和影像擷取方法差異等問題,才能獲得最佳模型效能。 + +##### **PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation** +2502.10536v1 by Faruk Ahmed, Lin Yang, Tiam Jaroensri, Andrew Sellergren, Yossi Matias, Avinatan Hassidim, Greg S. Corrado, Dale R. Webster, Shravya Shetty, Shruthi Prabhakara, Yun Liu, Daniel Golden, Ellery Wulczyn, David F. Steiner + +The interpretation of histopathology cases underlies many important +diagnostic and treatment decisions in medicine. Notably, this process typically +requires pathologists to integrate and summarize findings across multiple +slides per case. Existing vision-language capabilities in computational +pathology have so far been largely limited to small regions of interest, larger +regions at low magnification, or single whole-slide images (WSIs). This limits +interpretation of findings that span multiple high-magnification regions across +multiple WSIs. By making use of Gemini 1.5 Flash, a large multimodal model +(LMM) with a 1-million token context window, we demonstrate the ability to +generate bottom-line diagnoses from up to 40,000 768x768 pixel image patches +from multiple WSIs at 10X magnification. This is the equivalent of up to 11 +hours of video at 1 fps. Expert pathologist evaluations demonstrate that the +generated report text is clinically accurate and equivalent to or preferred +over the original reporting for 68% (95% CI: [60%, 76%]) of multi-slide +examples with up to 5 slides. While performance decreased for examples with 6 +or more slides, this study demonstrates the promise of leveraging the +long-context capabilities of modern LMMs for the uniquely challenging task of +medical report generation where each case can contain thousands of image +patches. 
+ +摘要:組織病理學病例的解讀是許多重要的醫學診斷和治療決策的基礎。值得注意的是,這個過程通常需要病理學家整合和總結每個病例的許多玻片中的發現。迄今為止,計算機病理學中現有的視覺語言功能在很大程度上僅限於小範圍的感興趣區域、低倍率下的較大區域或單一的全玻片影像 (WSI)。這限制了跨多個 WSI 中多個高倍率區域的發現的解讀。通過使用 Gemini 1.5 Flash,一個具有 100 萬個令牌上下文視窗的大型多模態模型 (LMM),我們展示了從多個 WSI 中多達 40,000 個 768x768 像素圖像貼片(10 倍放大)生成底線診斷的能力。這相當於 1 fps 下長達 11 小時的影片。專家病理學家評估表明,生成的報告文字在臨床上是準確的,並且等同於或優於 68%(95% CI:[60%,76%])的多玻片範例(最多 5 個玻片)的原始報告。儘管對於有 6 個或更多玻片的範例,其性能下降,但這項研究證明了利用現代 LMM 的長上下文功能來應對獨特挑戰性的醫療報告生成任務,其中每個病例可能包含數千個影像貼片,這項任務的前景。 + +##### **Tempo: Helping Data Scientists and Domain Experts Collaboratively Specify Predictive Modeling Tasks** +2502.10526v2 by Venkatesh Sivaraman, Anika Vaishampayan, Xiaotong Li, Brian R Buck, Ziyong Ma, Richard D Boyce, Adam Perer + +Temporal predictive models have the potential to improve decisions in health +care, public services, and other domains, yet they often fail to effectively +support decision-makers. Prior literature shows that many misalignments between +model behavior and decision-makers' expectations stem from issues of model +specification, namely how, when, and for whom predictions are made. However, +model specifications for predictive tasks are highly technical and difficult +for non-data-scientist stakeholders to interpret and critique. To address this +challenge we developed Tempo, an interactive system that helps data scientists +and domain experts collaboratively iterate on model specifications. Using +Tempo's simple yet precise temporal query language, data scientists can quickly +prototype specifications with greater transparency about pre-processing +choices. Moreover, domain experts can assess performance within data subgroups +to validate that models behave as expected. Through three case studies, we +demonstrate how Tempo helps multidisciplinary teams quickly prune infeasible +specifications and identify more promising directions to explore. 
+ +摘要:時序預測模型有潛力改善醫療保健、公共服務和其他領域的決策,但它們經常無法有效支援決策者。先前的文獻顯示,模型行為與決策者期望之間的許多不一致源自於模型規範問題,也就是如何、何時以及針對誰進行預測。然而,預測任務的模型規範非常技術化,非數據科學家利害關係人難以解讀和批評。為了應對此挑戰,我們開發了 Tempo,一個互動式系統,可協助數據科學家和領域專家協同反覆運算模型規範。透過使用 Tempo 簡單但精確的時序查詢語言,數據科學家可以快速建構規範原型,並更透明地了解前處理的選擇。此外,領域專家可以評估資料子群組內的效能,以驗證模型是否如預期般運作。透過三個案例研究,我們展示 Tempo 如何協助跨領域團隊快速刪減不可行的規範,並找出更有希望探索的方向。 + +##### **A Robust Attack: Displacement Backdoor Attack** +2502.10490v1 by Yong Li, Han Gao + +As artificial intelligence becomes more prevalent in our lives, people are +enjoying the convenience it brings, but they are also facing hidden threats, +such as data poisoning and adversarial attacks. These threats can have +disastrous consequences for the application of artificial intelligence, +especially for some applications that take effect immediately, such as +autonomous driving and medical fields. Among these threats, backdoor attacks +have left a deep impression on people with their concealment and simple +deployment, making them a threat that cannot be ignored. However, once a +backdoor model is deployed, the attack often proves unsatisfactory in +real-world applications because of factors such as jitter +and brightness changes. Based on this, we propose a highly robust backdoor +attack that shifts the target sample and combines it with itself to form a +backdoor sample, the Displacement Backdoor Attack (DBA). Experimental results +show that the DBA attack can resist data augmentation that simulates real-world +differences, such as rotation and cropping. + +摘要:随着人工智能在我们的生活中变得越来越普遍,人们正在享受它带来的便利,但也面临着隐藏的威胁,例如数据中毒和对抗性攻击。这些威胁可能对人工智能的应用产生灾难性后果,特别是对于一些立即生效的应用,例如自动驾驶和医疗领域。在这些威胁中,后门攻击以其隐蔽性和简单的部署给人们留下了深刻的印象,使其成为不可忽视的威胁,然而,在部署后门模型的过程中,后门攻击往往存在一些使其在实际应用中不尽如人意的原因,例如抖动和亮度变化。基于此,我们提出了一种高度鲁棒的后门攻击,该攻击对目标样本进行平移并将其与自身结合以形成后门样本,即置换后门攻击 (DBA)。实验结果表明,DBA 攻击可以抵抗模拟真实世界差异的数据增强,例如旋转和裁剪。 + +##### **3D ReX: Causal Explanations in 3D Neuroimaging Classification** +2502.12181v1 by Melane Navaratnarajah, Sophie A. Martin, David A. 
Kelly, Nathan Blake, Hana Chocker + +Explainability remains a significant problem for AI models in medical +imaging, making it challenging for clinicians to trust AI-driven predictions. +We introduce 3D ReX, the first causality-based post-hoc explainability tool for +3D models. 3D ReX uses the theory of actual causality to generate +responsibility maps which highlight the regions most crucial to the model's +decision. We test 3D ReX on a stroke detection model, providing insight into +the spatial distribution of features relevant to stroke. + +摘要:解釋性仍然是醫療影像中 AI 模型的一大問題,這使得臨床醫生難以信任 AI 驅動的預測。 +我們引入了 3D ReX,這是第一個用於 3D 模型的基於因果關係的事後解釋性工具。3D ReX 使用實際因果關係理論來生成責任圖,該圖突出了對模型決策至關重要的區域。我們在中風檢測模型上測試了 3D ReX,提供了與中風相關特徵的空間分佈的見解。 + +##### **Analyzing Patient Daily Movement Behavior Dynamics Using Two-Stage Encoding Model** +2502.09947v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott + +In the analysis of remote healthcare monitoring data, time series +representation learning offers substantial value in uncovering deeper patterns +of patient behavior, especially given the fine temporal granularity of the +data. In this study, we focus on a dataset of home activity records from people +living with dementia. We propose a two-stage self-supervised learning approach. +The first stage involves converting time-series activities into text strings, +which are then encoded by a fine-tuned language model. In the second stage, +these time-series vectors are bi-dimensionalized so that the PageRank method can be applied +to analyze latent state transitions, quantitatively assess participants' +behavioral patterns, and identify activity biases. These insights, combined with +diagnostic data, aim to support personalized care interventions. 
+ +摘要:在遠程醫療監控數據分析中,時序表示學習在揭示患者行為的更深層模式方面提供了實質性的價值,特別是考慮到數據的精細時間粒度。在本研究中,我們專注於痴呆症患者居家活動記錄的數據集。我們提出了一種兩階段的自我監督學習方法。第一階段涉及將時序活動轉換為文本串,然後由微調語言模型編碼。在第二階段,這些時序向量被雙維化以應用 PageRank 方法,分析潛在狀態轉換以定量評估參與者的行為模式並識別活動偏差。這些見解與診斷數據相結合,旨在支持個性化護理干預。 + +##### **TransGUNet: Transformer Meets Graph-based Skip Connection for Medical Image Segmentation** +2502.09931v1 by Ju-Hyeon Nam, Nur Suriza Syazwany, Sang-Chul Lee + +Skip connection engineering is primarily employed to address the semantic gap +between the encoder and decoder, while also integrating global dependencies to +understand the relationships among complex anatomical structures in medical +image segmentation. Although several models have proposed transformer-based +approaches to incorporate global dependencies within skip connections, they +often face limitations in capturing detailed local features with high +computational complexity. In contrast, graph neural networks (GNNs) exploit +graph structures to effectively capture local and global features. Leveraging +these properties, we introduce an attentional cross-scale graph neural network +(ACS-GNN), which enhances the skip connection framework by converting +cross-scale feature maps into a graph structure and capturing complex +anatomical structures through node attention. Additionally, we observed that +deep learning models often produce uninformative feature maps, which degrades +the quality of spatial attention maps. To address this problem, we integrated +entropy-driven feature selection (EFS) with spatial attention, calculating an +entropy score for each channel and filtering out high-entropy feature maps. Our +innovative framework, TransGUNet, comprises ACS-GNN and EFS-based spatial +attention to effectively enhance domain generalizability across various +modalities by leveraging GNNs alongside a reliable spatial attention map, +ensuring more robust features within the skip connection. 
Through comprehensive +experiments and analysis, TransGUNet achieved superior segmentation performance +on six seen and eight unseen datasets, demonstrating significantly higher +efficiency compared to previous methods. + +摘要:跳躍連接工程主要用於解決編碼器和解碼器之間的語義鴻溝,同時還整合全局依賴關係以了解醫學影像分割中複雜解剖結構之間的關係。儘管有幾個模型提出了基於Transformer的架構來整合跳躍連接中的全局依賴關係,但它們在以高計算複雜度擷取詳細的局部特徵時常常面臨限制。相比之下,圖神經網路 (GNN) 利用圖結構有效擷取局部和全局特徵。利用這些屬性,我們引入了注意力跨尺度圖神經網路 (ACS-GNN),它通過將跨尺度特徵圖轉換為圖結構並通過節點注意力擷取複雜的解剖結構來增強跳躍連接框架。此外,我們觀察到深度學習模型通常會產生無意義的特徵圖,這會降低空間注意力圖的品質。為了解決這個問題,我們將熵驅動特徵選擇 (EFS) 與空間注意力整合在一起,為每個通道計算熵分數並濾出高熵特徵圖。我們創新的框架 TransGUNet 包含 ACS-GNN 和基於 EFS 的空間注意力,通過利用 GNN 以及可靠的空間注意力圖有效增強跨各種模態的域泛化能力,確保跳躍連接中更強大的特徵。透過全面的實驗和分析,TransGUNet 在六個已見和八個未見的資料集上實現了優異的分割效能,證明與先前的方法相比,效率顯著提高。 + +##### **Video2Policy: Scaling up Manipulation Tasks in Simulation through Internet Videos** +2502.09886v1 by Weirui Ye, Fangchen Liu, Zheng Ding, Yang Gao, Oleh Rybkin, Pieter Abbeel + +Simulation offers a promising approach for cheaply scaling training data for +generalist policies. To scalably generate data from diverse and realistic +tasks, existing algorithms either rely on large language models (LLMs) that may +hallucinate tasks not interesting for robotics; or digital twins, which require +careful real-to-sim alignment and are hard to scale. To address these +challenges, we introduce Video2Policy, a novel framework that leverages +internet RGB videos to reconstruct tasks based on everyday human behavior. Our +approach comprises two phases: (1) task generation in simulation from videos; +and (2) reinforcement learning utilizing in-context LLM-generated reward +functions iteratively. We demonstrate the efficacy of Video2Policy by +reconstructing over 100 videos from the Something-Something-v2 (SSv2) dataset, +which depicts diverse and complex human behaviors on 9 different tasks. Our +method can successfully train RL policies on such tasks, including complex and +challenging tasks such as throwing. 
Finally, we show that the generated +simulation data can be scaled up for training a general policy, and it can be +transferred back to the real robot in a Real2Sim2Real way. + +摘要:模擬提供了一種有前途的方法,可以用於擴展訓練資料,以制定通才政策。為了從多樣化且逼真的任務中可擴充地產生資料,現有演算法仰賴大型語言模型 (LLM),這些模型可能會產生對機器人技術不感興趣的任務;或者仰賴數位雙胞胎,這需要仔細地將真實環境與模擬環境對齊,而且很難擴充。為了應對這些挑戰,我們引入了 Video2Policy,這是一個新穎的架構,它利用網路上的 RGB 影片,根據日常人類行為來重建任務。我們的做法包含兩個階段:(1) 從影片中在模擬環境中產生任務;以及 (2) 利用在情境中由 LLM 產生的獎勵函數,反覆進行強化學習。我們透過重建 Something-Something-v2 (SSv2) 資料集中的 100 多個影片來展示 Video2Policy 的效能,這些影片描繪了 9 項不同任務中多樣化且複雜的人類行為。我們的做法可以在這些任務上成功訓練 RL 政策,包括複雜且具挑戰性的任務,例如投擲。最後,我們展示了產生的模擬資料可以擴充到訓練一般政策,而且可以透過 Real2Sim2Real 的方式轉移回真實機器人。 + +##### **HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation** +2502.09838v2 by Tianwei Lin, Wenqiao Zhang, Sijing Li, Yuqian Yuan, Binhe Yu, Haoyuan Li, Wanggui He, Hao Jiang, Mengze Li, Xiaohui Song, Siliang Tang, Jun Xiao, Hui Lin, Yueting Zhuang, Beng Chin Ooi + +We present HealthGPT, a powerful Medical Large Vision-Language Model +(Med-LVLM) that integrates medical visual comprehension and generation +capabilities within a unified autoregressive paradigm. Our bootstrapping +philosophy is to progressively adapt heterogeneous comprehension and generation +knowledge to pre-trained large language models (LLMs). This is achieved through +a novel heterogeneous low-rank adaptation (H-LoRA) technique, which is +complemented by a tailored hierarchical visual perception approach and a +three-stage learning strategy. To effectively learn the HealthGPT, we devise a +comprehensive medical domain-specific comprehension and generation dataset +called VL-Health. Experimental results demonstrate exceptional performance and +scalability of HealthGPT in medical visual unified tasks. Our project can be +accessed at https://github.com/DCDmllm/HealthGPT. 
+ +摘要:我們提出 HealthGPT,一種強大的醫學大型視覺語言模型 (Med-LVLM),它整合了醫學視覺理解和生成能力於一個統一的自動迴歸範例中。我們的引導哲學是逐步調整異質理解和生成知識以預先訓練大型語言模型 (LLM)。這是通過一種新穎的異質低秩適應 (H-LoRA) 技術實現的,該技術由量身定制的分層視覺感知方法和三階段學習策略補充。為了有效學習 HealthGPT,我們設計了一個全面的醫學領域特定理解和生成數據集,稱為 VL-Health。實驗結果證明了 HealthGPT 在醫學視覺統一任務中的卓越性能和可擴展性。我們的項目可以在 https://github.com/DCDmllm/HealthGPT 中訪問。 + +##### **Incentivize without Bonus: Provably Efficient Model-based Online Multi-agent RL for Markov Games** +2502.09780v1 by Tong Yang, Bo Dai, Lin Xiao, Yuejie Chi + +Multi-agent reinforcement learning (MARL) lies at the heart of a plethora of +applications involving the interaction of a group of agents in a shared unknown +environment. A prominent framework for studying MARL is Markov games, with the +goal of finding various notions of equilibria in a sample-efficient manner, +such as the Nash equilibrium (NE) and the coarse correlated equilibrium (CCE). +However, existing sample-efficient approaches either require tailored +uncertainty estimation under function approximation, or careful coordination of +the players. In this paper, we propose a novel model-based algorithm, called +VMG, that incentivizes exploration via biasing the empirical estimate of the +model parameters towards those with a higher collective best-response values of +all the players when fixing the other players' policies, thus encouraging the +policy to deviate from its current equilibrium for more exploration. VMG is +oblivious to different forms of function approximation, and permits +simultaneous and uncoupled policy updates of all players. Theoretically, we +also establish that VMG achieves a near-optimal regret for finding both the NEs +of two-player zero-sum Markov games and CCEs of multi-player general-sum Markov +games under linear function approximation in an online environment, which +nearly match their counterparts with sophisticated uncertainty quantification. 
+ +摘要:多智能體強化學習 (MARL) 是一系列應用程式的心臟,這些應用程式涉及一群智能體在一個共用未知環境中的互動。研究 MARL 的一個著名框架是馬可夫博弈,其目標是用樣本有效率的方式找出各種均衡概念,例如納許均衡 (NE) 和粗相關均衡 (CCE)。然而,現有的樣本有效率方法需要在函數逼近下進行量身打造的不確定性估計,或謹慎協調參與者。在本文中,我們提出了一種新的基於模型的演算法,稱為 VMG,它透過將模型參數的經驗估計值偏向於在固定其他參與者政策時所有參與者的集體最佳反應值,從而激勵探索,進而鼓勵政策偏離其當前均衡以進行更多探索。VMG 不會忽略函數逼近的不同形式,並允許所有參與者同時進行非耦合的政策更新。在理論上,我們也建立了 VMG 在線上環境中使用線性函數逼近來尋找雙人零和馬可夫博弈的 NE 和多人一般和馬可夫博弈的 CCE 時,會獲得接近最佳的後悔,這幾乎與其在不確定性量化方面更為複雜的對應物相匹配。 + +##### **The AI-Therapist Duo: Exploring the Potential of Human-AI Collaboration in Personalized Art Therapy for PICS Intervention** +2502.09757v1 by Bereket A. Yilma, Chan Mi Kim, Geke Ludden, Thomas van Rompay, Luis A. Leiva + +Post-intensive care syndrome (PICS) is a multifaceted condition that arises +from prolonged stays in an intensive care unit (ICU). While preventing PICS +among ICU patients is becoming increasingly important, interventions remain +limited. Building on evidence supporting the effectiveness of art exposure in +addressing the psychological aspects of PICS, we propose a novel art therapy +solution through a collaborative Human-AI approach that enhances personalized +therapeutic interventions using state-of-the-art Visual Art Recommendation +Systems. We developed two Human-in-the-Loop (HITL) personalization methods and +assessed their impact through a large-scale user study (N=150). Our findings +demonstrate that this Human-AI collaboration not only enhances the +personalization and effectiveness of art therapy but also supports therapists +by streamlining their workload. While our study centres on PICS intervention, +the results suggest that human-AI collaborative Art therapy could potentially +benefit other areas where emotional support is critical, such as cases of +anxiety and depression. 
+ +摘要:重症後症候群 (PICS) 是一種多面向的疾病,源自於在加護病房 (ICU) 長期住院。雖然預防重症後症候群在加護病房患者中正變得越來越重要,但介入措施仍然有限。建立在支持藝術接觸在解決重症後症候群心理層面的證據上,我們提出一個創新的藝術療法解決方案,透過協作式的人工智慧方法,使用最先進的視覺藝術推薦系統,增強個人化的治療介入。我們開發了兩種人機迴路 (HITL) 個人化方法,並透過大規模使用者研究 (N=150) 評估其影響。我們的發現證明,這種人機協作不僅增強了藝術治療的個人化和有效性,也透過簡化治療師的工作量來提供支援。雖然我們的研究中心在重症後症候群介入,但結果顯示,人機協作藝術療法有可能對其他需要情緒支持的領域有益,例如焦慮和憂鬱症。 + +##### **A CNN Approach to Automated Detection and Classification of Brain Tumors** +2502.09731v1 by Md. Zahid Hasan, Abdullah Tamim, D. M. Asadujjaman, Md. Mahfujur Rahman, Md. Abu Ahnaf Mollick, Nosin Anjum Dristi, Abdullah-Al-Noman + +Brain tumors require an assessment to ensure timely diagnosis and effective +patient treatment. Morphological factors such as size, location, texture, and +variable appearance complicate tumor inspection. Medical imaging presents +challenges, including noise and incomplete images. This research article +presents a methodology for processing Magnetic Resonance Imaging (MRI) data, +encompassing techniques for image classification and denoising. The effective +use of MRI images allows medical professionals to detect brain disorders, +including tumors. This research aims to categorize healthy brain tissue and +brain tumors by analyzing the provided MRI data. Unlike alternative methods +like Computed Tomography (CT), MRI technology offers a more detailed +representation of internal anatomical components, making it a suitable option +for studying data related to brain tumors. The MRI image is first subjected +to a denoising technique utilizing an anisotropic diffusion filter. The dataset +used to create the models is a publicly accessible and validated Brain +Tumour Classification (MRI) database, comprising 3,264 brain MRI scans. SMOTE +was employed for data augmentation and dataset balancing. Convolutional Neural +Networks (CNN) such as ResNet152V2, VGG, ViT, and EfficientNet were employed for +the classification procedure. EfficientNet attained an accuracy of 98%, the +highest recorded. 
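The abstract above names an anisotropic diffusion filter as its denoising step but gives no implementation details. The following is a minimal pure-Python sketch of the classic Perona-Malik scheme, offered only as an illustration: the function name `anisotropic_diffusion`, the exponential conductance function, and the parameters `iterations`, `kappa`, and `step` are all assumptions, not the authors' code or settings.

```python
import math


def anisotropic_diffusion(image, iterations=10, kappa=20.0, step=0.2):
    """Perona-Malik style smoothing of a 2D grid of intensities.

    Hypothetical sketch: `kappa` controls edge sensitivity and `step`
    (kept at or below 0.25 for this 4-neighbour scheme) is the update rate.
    """
    rows, cols = len(image), len(image[0])
    current = [[float(value) for value in row] for row in image]

    def conductance(gradient):
        # Small gradients (noise) diffuse freely; large ones (edges) do not.
        return math.exp(-((gradient / kappa) ** 2))

    neighbours = ((-1, 0), (1, 0), (0, -1), (0, 1))
    for _ in range(iterations):
        updated = [row[:] for row in current]
        for r in range(rows):
            for c in range(cols):
                center = current[r][c]
                flux = 0.0
                for dr, dc in neighbours:
                    nr, nc = r + dr, c + dc
                    # Pixels outside the grid contribute no flux.
                    if 0 <= nr < rows and 0 <= nc < cols:
                        diff = current[nr][nc] - center
                        flux += conductance(diff) * diff
                updated[r][c] = center + step * flux
        current = updated
    return current


if __name__ == "__main__":
    # Made-up 4x4 patch: a sharp vertical edge plus one noisy pixel.
    patch = [
        [10, 10, 200, 200],
        [10, 14, 200, 200],
        [10, 10, 200, 200],
        [10, 10, 200, 200],
    ]
    for row in anisotropic_diffusion(patch, iterations=5):
        print([round(value, 1) for value in row])
```

Because the flux between two neighbouring pixels is equal and opposite, this scheme preserves the total intensity of the patch while smoothing low-contrast noise more strongly than high-contrast edges.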
+ +摘要:腦腫瘤需要評估以確保及時診斷和有效的患者治療。大小、位置、質地和可變外觀等形態因素會使腫瘤檢查複雜化。醫學影像會呈現挑戰,包括雜訊和不完整的影像。本研究文章提出了一種處理磁共振影像 (MRI) 資料的方法,包含影像分類和去噪技術。有效使用 MRI 影像可讓醫護人員偵測腦部疾病,包括腫瘤。本研究旨在透過分析提供的 MRI 資料來分類健康的腦組織和腦瘤。與電腦斷層掃描 (CT) 等替代方法不同,MRI 技術提供了更詳細的內部解剖結構表示,使其成為研究與腦瘤相關資料的合適選擇。MRI 影像會先使用各向異性擴散濾波器進行去噪技術處理。用於建立模型的資料集是一個公開且經過驗證的腦腫瘤分類 (MRI) 資料庫,包含 3,264 個腦部 MRI 掃描。SMOTE 用於資料擴充和資料集平衡。卷積神經網路 (CNN),例如 ResNet152V2、VGG、ViT 和 EfficientNet,用於分類程序。EfficientNet 達到了 98% 的準確度,是記錄到的最高值。 + +##### **Evaluating GPT's Capability in Identifying Stages of Cognitive Impairment from Electronic Health Data** +2502.09715v1 by Yu Leng, Yingnan He, Colin Magdamo, Ana-Maria Vranceanu, Christine S. Ritchie, Shibani S. Mukerji, Lidia M. V. R. Moura, John R. Dickson, Deborah Blacker, Sudeshna Das + +Identifying cognitive impairment within electronic health records (EHRs) is +crucial not only for timely diagnoses but also for facilitating research. +Information about cognitive impairment often exists within unstructured +clinician notes in EHRs, but manual chart reviews are both time-consuming and +error-prone. To address this issue, our study evaluates an automated approach +using zero-shot GPT-4o to determine stage of cognitive impairment in two +different tasks. First, we evaluated the ability of GPT-4o to determine the +global Clinical Dementia Rating (CDR) on specialist notes from 769 patients who +visited the memory clinic at Massachusetts General Hospital (MGH), and achieved +a weighted kappa score of 0.83. Second, we assessed GPT-4o's ability to +differentiate between normal cognition, mild cognitive impairment (MCI), and +dementia on all notes in a 3-year window from 860 Medicare patients. GPT-4o +attained a weighted kappa score of 0.91 in comparison to specialist chart +reviews and 0.96 on cases that the clinical adjudicators rated with high +confidence. Our findings demonstrate GPT-4o's potential as a scalable chart +review tool for creating research datasets and assisting diagnosis in clinical +settings in the future. 
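The abstract above reports agreement as weighted kappa scores (0.83, 0.91, 0.96) without stating the weighting scheme. As a hedged illustration of how such a score is computed, here is a small pure-Python sketch of weighted Cohen's kappa; the function name `weighted_kappa`, the choice between linear and quadratic weights, and the example ratings are assumptions for illustration, not the study's data or code.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, n_categories, weighting="linear"):
    """Weighted Cohen's kappa for ordinal labels 0 .. n_categories - 1.

    Hypothetical sketch: `weighting` is "linear" or "quadratic"; the
    abstract does not say which scheme the study used.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if n_categories < 2:
        raise ValueError("need at least two categories")
    n = len(rater_a)
    power = 1 if weighting == "linear" else 2

    def disagreement(i, j):
        # 0 for identical labels, 1 for labels at opposite ends of the scale.
        return (abs(i - j) / (n_categories - 1)) ** power

    observed = Counter(zip(rater_a, rater_b))
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)

    # Average disagreement actually observed between the two raters.
    observed_cost = sum(
        disagreement(i, j) * count / n for (i, j), count in observed.items()
    )
    # Average disagreement expected if the raters were independent.
    expected_cost = sum(
        disagreement(i, j) * count_a[i] * count_b[j] / (n * n)
        for i in range(n_categories)
        for j in range(n_categories)
    )
    if expected_cost == 0:
        # Kappa is undefined when both raters use one identical category;
        # this sketch reports perfect agreement in that case.
        return 1.0
    return 1.0 - observed_cost / expected_cost


if __name__ == "__main__":
    # Made-up labels: 0 = normal cognition, 1 = MCI, 2 = dementia.
    specialist = [0, 0, 1, 1, 2, 2, 2, 1]
    model = [0, 0, 1, 2, 2, 2, 1, 1]
    print(weighted_kappa(specialist, model, n_categories=3))
```

Weighted kappa penalizes a normal-versus-dementia disagreement more than a normal-versus-MCI one, which is why it suits ordinal staging tasks better than plain accuracy.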
+ +摘要:在電子健康記錄 (EHR) 中識別認知障礙不僅對及時診斷至關重要,也有助於促進研究。有關認知障礙的資訊通常存在於 EHR 中非結構化的臨床記錄中,但手動圖表審查既耗時又容易出錯。為了解決這個問題,我們的研究評估了一種自動化方法,使用零次學習的 GPT-4o 來確定兩種不同任務中的認知障礙分期。首先,我們評估了 GPT-4o 確定來自麻薩諸塞州總醫院 (MGH) 記憶診所 769 名患者的專科記錄的全球臨床痴呆評分 (CDR) 的能力,並獲得了 0.83 的加權 kappa 分數。其次,我們評估了 GPT-4o 在 860 名 Medicare 患者 3 年視窗中的所有記錄中區分正常認知、輕度認知障礙 (MCI) 和痴呆的能力。與專科圖表審查相比,GPT-4o 獲得了 0.91 的加權 kappa 分數,而對於臨床評審員以高度信心評估的病例,其加權 kappa 分數為 0.96。我們的研究結果證明了 GPT-4o 作為可擴充圖表審查工具的潛力,可用於建立研究資料集並協助未來臨床環境中的診斷。 + +##### **Metamorphic Testing for Pose Estimation Systems** +2502.09460v1 by Matias Duran, Thomas Laurent, Ellen Rushe, Anthony Ventresque + +Pose estimation systems are used in a variety of fields, from sports +analytics to livestock care. Given their potential impact, it is paramount to +systematically test their behaviour and potential for failure. This is a +complex task due to the oracle problem and the high cost of manual labelling +necessary to build ground truth keypoints. This problem is exacerbated by the +fact that different applications require systems to focus on different subjects +(e.g., human versus animal) or landmarks (e.g., only extremities versus whole +body and face), which makes labelled test data rarely reusable. To combat these +problems we propose MET-POSE, a metamorphic testing framework for pose +estimation systems that bypasses the need for manual annotation while assessing +the performance of these systems under different circumstances. MET-POSE thus +allows users of pose estimation systems to assess the systems in conditions +that more closely relate to their application without having to label an ad-hoc +test dataset or rely only on available datasets, which may not be adapted to +their application domain. While we define MET-POSE in general terms, we also +present a non-exhaustive list of metamorphic rules that represent common +challenges in computer vision applications, as well as a specific way to +evaluate these rules. 
We then experimentally show the effectiveness of MET-POSE +by applying it to Mediapipe Holistic, a state of the art human pose estimation +system, with the FLIC and PHOENIX datasets. With these experiments, we outline +numerous ways in which the outputs of MET-POSE can uncover faults in pose +estimation systems at a similar or higher rate than classic testing using hand +labelled data, and show that users can tailor the rule set they use to the +faults and level of accuracy relevant to their application. + +摘要:姿勢估計系統應用於各種領域,從運動分析到牲畜照護。鑑於其潛在影響,系統性地測試其行為和故障潛力至關重要。由於預言機問題以及建立地面實況關鍵點所需的手動標記成本高,這是一項複雜的任務。這個問題因不同的應用需要系統專注於不同的主體(例如,人類對動物)或地標(例如,只有四肢對全身和臉部)而加劇,這使得標記的測試數據很少可以重複使用。為了解決這些問題,我們提出了 MET-POSE,這是一個姿勢估計系統的變形測試框架,在評估這些系統在不同情況下的性能時,可以繞過手動註解的需要。因此,MET-POSE 允許姿勢估計系統的使用者在更接近其應用程式的條件下評估系統,而無需標記臨時測試數據集或僅依賴可用數據集,這些數據集可能不適合其應用領域。雖然我們以一般術語定義 MET-POSE,但我們也提供了一個非詳盡的變形規則列表,這些規則代表了電腦視覺應用中的常見挑戰,以及評估這些規則的具體方法。然後,我們通過將 MET-POSE 應用於 Mediapipe Holistic(一種先進的人類姿勢估計系統),並使用 FLIC 和 PHOENIX 數據集,以實驗方式展示 MET-POSE 的有效性。通過這些實驗,我們概述了 MET-POSE 的輸出可以揭示姿勢估計系統中故障的許多方法,其速度與使用手動標記數據的傳統測試類似或更高,並表明使用者可以根據其應用程式相關的故障和準確度等級來調整他們使用的規則集。 + +##### **Towards Virtual Clinical Trials of Radiology AI with Conditional Generative Modeling** +2502.09688v1 by Benjamin D. Killeen, Bohua Wan, Aditya V. Kulkarni, Nathan Drenkow, Michael Oberst, Paul H. Yi, Mathias Unberath + +Artificial intelligence (AI) is poised to transform healthcare by enabling +personalized and efficient care through data-driven insights. Although +radiology is at the forefront of AI adoption, in practice, the potential of AI +models is often overshadowed by severe failures to generalize: AI models can +have performance degradation of up to 20% when transitioning from controlled +test environments to clinical use by radiologists. This mismatch raises +concerns that radiologists will be misled by incorrect AI predictions in +practice and/or grow to distrust AI, rendering these promising technologies +practically ineffectual. 
Exhaustive clinical trials of AI models on abundant +and diverse data are thus critical to anticipate AI model degradation when +encountering varied data samples. Achieving these goals, however, is +challenging due to the high costs of collecting diverse data samples and +corresponding annotations. To overcome these limitations, we introduce a novel +conditional generative AI model designed for virtual clinical trials (VCTs) of +radiology AI, capable of realistically synthesizing full-body CT images of +patients with specified attributes. By learning the joint distribution of +images and anatomical structures, our model enables precise replication of +real-world patient populations with unprecedented detail at this scale. We +demonstrate meaningful evaluation of radiology AI models through VCTs powered +by our synthetic CT study populations, revealing model degradation and +facilitating algorithmic auditing for bias-inducing data attributes. Our +generative AI approach to VCTs is a promising avenue towards a scalable +solution to assess model robustness, mitigate biases, and safeguard patient +care by enabling simpler testing and evaluation of AI models in any desired +range of diverse patient populations. 
+ +摘要:人工智慧 (AI) 準備透過資料驅動的見解,轉型醫療保健,並提供個人化且有效率的照護。儘管放射科處於 AI 採用的最前線,但在實務上,AI 模型的潛力往往會被嚴重的概化失敗所掩蓋:AI 模型在從受控測試環境轉移到放射科醫師的臨床使用時,效能可能會降低多達 20%。這種不匹配引發了疑慮,即放射科醫師在實務上會被不正確的 AI 預測誤導,和/或開始不信任 AI,讓這些有前景的技術在實務上形同失效。因此,在 AI 模型遭遇各種資料範例時,預期 AI 模型的衰退,對豐富且多樣化的資料進行 AI 模型的全面臨床試驗至關重要。然而,由於收集多樣化的資料範例和對應註解的成本很高,實現這些目標具有挑戰性。為了克服這些限制,我們引進一個創新的條件式生成式 AI 模型,專門用於放射科 AI 的虛擬臨床試驗 (VCT),能夠真實地合成具有特定屬性的病患全身電腦斷層 (CT) 影像。透過學習影像和解剖結構的聯合分佈,我們的模型能夠以空前的細節精確複製真實世界的病患族群。我們透過由我們合成的電腦斷層研究族群支援的 VCT,展示了放射科 AI 模型有意義的評估,揭露模型衰退,並促進演算法稽核,以找出導致偏差的資料屬性。我們對 VCT 的生成式 AI 方法,是一個有前景的途徑,可以評估模型的穩健性、減輕偏差,並透過在任何所需的各種病患族群中,進行更簡單的 AI 模型測試和評估,來保障病患照護。 + +##### **Mind What You Ask For: Emotional and Rational Faces of Persuasion by Large Language Models** +2502.09687v1 by Wiktoria Mieleszczenko-Kowszewicz, Beata Bajcar, Jolanta Babiak, Berenika Dyczek, Jakub Świstak, Przemysław Biecek + +Be careful what you ask for, you just might get it. This saying fits with the +way large language models (LLMs) are trained, which, instead of being rewarded +for correctness, are increasingly rewarded for pleasing the recipient. So, they +are increasingly effective at persuading us that their answers are valuable. +But what tricks do they use in this persuasion? In this study, we examine what +are the psycholinguistic features of the responses used by twelve different +language models. By grouping response content according to rational or +emotional prompts and exploring social influence principles employed by LLMs, +we ask whether and how we can mitigate the risks of LLM-driven mass +misinformation. We position this study within the broader discourse on +human-centred AI, emphasizing the need for interdisciplinary approaches to +mitigate cognitive and societal risks posed by persuasive AI responses. 
+ +摘要:小心你要求的,你可能真的會得到。這句話適用於大型語言模型 (LLM) 的訓練方式,它們不是因為正確性而獲得獎勵,而是因為取悅接收者而獲得越來越多的獎勵。因此,它們越來越有效地說服我們,它們的答案是有價值的。但是它們在這種說服中使用什麼技巧呢?在這項研究中,我們探討了十二種不同的語言模型使用的回應的心理語言特徵。通過根據理性和情緒提示對回應內容進行分組,並探討 LLM 使用的社會影響原則,我們探討是否以及如何減輕 LLM 驅動的大規模錯誤信息的風險。我們將這項研究定位在以人為中心的 AI 的更廣泛討論中,強調需要跨學科方法來減輕具有說服力的 AI 回應帶來的認知和社會風險。 + +##### **The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics** +2502.09247v1 by Danni Feng, Runzhi Li, Jing Wang, Siyu Yan, Lihong Ma, Yunli Xing + +Joint entity-relation extraction is a critical task in transforming +unstructured or semi-structured text into triplets, facilitating the +construction of large-scale knowledge graphs, and supporting various downstream +applications. Despite its importance, research on Chinese text, particularly +with complex semantics in specialized domains like medicine, remains limited. +To address this gap, we introduce the CH-DDI, a Chinese drug-drug interactions +dataset designed to capture the intricacies of medical text. Leveraging the +strengths of attention mechanisms in capturing long-range dependencies, we +propose the SEA module, which enhances the extraction of complex contextual +semantic information, thereby improving entity recognition and relation +extraction. Additionally, to address the inefficiencies of existing methods in +facilitating information exchange between entity recognition and relation +extraction, we present an interactive fusion representation module. This module +employs Cross Attention for bidirectional information exchange between the +tasks and further refines feature extraction through BiLSTM. Experimental +results on both our CH-DDI dataset and public CoNLL04 dataset demonstrate that +our model exhibits strong generalization capabilities. On the CH-DDI dataset, +our model achieves an F1-score of 96.73% for entity recognition and 78.43% for +relation extraction. 
On the CoNLL04 dataset, it attains an entity recognition +precision of 89.54% and a relation extraction accuracy of 71.64%. + +摘要:聯合實體關係抽取是將非結構化或半結構化文字轉換為三元組的重要任務,有助於建構大規模知識圖譜,並支援各種下游應用程式。儘管其重要性,但針對中文文本的研究,特別是醫學等專業領域中具有複雜語義的研究仍十分有限。為了解決這個差距,我們引入了 CH-DDI,一個中文藥物-藥物交互作用資料集,旨在擷取醫學文本的複雜性。利用注意力機制在擷取長程依賴關係方面的優勢,我們提出了 SEA 模組,增強了複雜脈絡語義資訊的抽取,從而改進了實體辨識和關係抽取。此外,為了解決現有方法在促進實體辨識和關係抽取之間資訊交換方面的低效率問題,我們提出了互動式融合表示模組。此模組採用交叉注意力,在任務之間進行雙向資訊交換,並透過 BiLSTM 進一步精煉特徵抽取。在我們的 CH-DDI 資料集和公開的 CoNLL04 資料集上的實驗結果表明,我們的模型展現出強大的泛化能力。在 CH-DDI 資料集上,我們的模型在實體辨識方面達到了 96.73% 的 F1 分數,在關係抽取方面達到了 78.43% 的 F1 分數。在 CoNLL04 資料集上,它在實體辨識方面達到了 89.54% 的準確度,在關係抽取方面達到了 71.64% 的準確度。 + +##### **From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine** +2502.09242v1 by Lukas Buess, Matthias Keicher, Nassir Navab, Andreas Maier, Soroosh Tayebi Arasteh + +Generative artificial intelligence (AI) models, such as diffusion models and +OpenAI's ChatGPT, are transforming medicine by enhancing diagnostic accuracy +and automating clinical workflows. The field has advanced rapidly, evolving +from text-only large language models for tasks such as clinical documentation +and decision support to multimodal AI systems capable of integrating diverse +data modalities, including imaging, text, and structured data, within a single +model. The diverse landscape of these technologies, along with rising interest, +highlights the need for a comprehensive review of their applications and +potential. This scoping review explores the evolution of multimodal AI, +highlighting its methods, applications, datasets, and evaluation in clinical +settings. Adhering to PRISMA-ScR guidelines, we systematically queried PubMed, +IEEE Xplore, and Web of Science, prioritizing recent studies published up to +the end of 2024. After rigorous screening, 144 papers were included, revealing +key trends and challenges in this dynamic field. 
Our findings underscore a +shift from unimodal to multimodal approaches, driving innovations in diagnostic +support, medical report generation, drug discovery, and conversational AI. +However, critical challenges remain, including the integration of heterogeneous +data types, improving model interpretability, addressing ethical concerns, and +validating AI systems in real-world clinical settings. This review summarizes +the current state of the art, identifies critical gaps, and provides insights +to guide the development of scalable, trustworthy, and clinically impactful +multimodal AI solutions in healthcare. + +摘要:生成式人工智能 (AI) 模型,例如扩散模型和 OpenAI 的 ChatGPT,通过提高诊断准确性和自动化临床工作流程,正在改变医学领域。该领域已迅速发展,从用于临床文件编制和决策支持等任务的纯文本大型语言模型,发展到能够在单个模型中整合包括影像、文本和结构化数据在内的多种数据方式的多模态 AI 系统。这些技术的多样化格局以及日益增长的兴趣,凸显了全面审查其应用和潜力的必要性。本范围审查探讨了多模态 AI 的演变,重点介绍了其方法、应用、数据集和在临床环境中的评估。遵循 PRISMA-ScR 指南,我们系统地查询了 PubMed、IEEE Xplore 和 Web of Science,优先考虑截至 2024 年底发表的最新研究。经过严格筛选,纳入了 144 篇论文,揭示了这一充满活力的领域的趋势和挑战。我们的研究结果强调了从单模态方法向多模态方法的转变,推动了诊断支持、医疗报告生成、药物发现和会话式 AI 的创新。然而,关键挑战仍然存在,包括异构数据类型的整合、提高模型可解释性、解决伦理问题以及在现实世界的临床环境中验证 AI 系统。本综述总结了当前的最新技术,确定了关键差距,并提供了见解,以指导在医疗保健领域开发可扩展、可信赖且具有临床影响力的多模态 AI 解决方案。 + +##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration** +2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano + +This paper presents a complete explainable system that interprets a set of +data, abstracts the underlying features and describes them in a natural +language of choice. The system relies on two crucial stages: (i) identifying +emerging properties from data and transforming them into abstract concepts, and +(ii) converting these concepts into natural language. 
Despite the impressive +natural language generation capabilities demonstrated by Large Language Models, +their statistical nature and the intricacy of their internal mechanism still +force us to employ these techniques as black boxes, forgoing trustworthiness. +Developing an explainable pipeline for data interpretation would facilitate +its use in safety-critical environments, such as processing medical +information, and would allow non-experts and visually impaired people to access +narrated information. To this end, we believe that the fields of knowledge +representation and automated reasoning research could present a valid +alternative. Expanding on prior research that tackled the first stage (i), we +focus on the second stage, named Concept2Text. Being explainable, data +translation is easily modeled through logic-based rules, once again emphasizing +the role of declarative programming in achieving AI explainability. This paper +explores a Prolog/CLP-based rewriting system to interpret concepts -- articulated +in terms of classes and relations, plus common knowledge -- derived from a generic +ontology, generating natural language text. Its main features include +hierarchical tree rewritings, modular multilingual generation, support for +equivalent variants across semantic, grammar, and lexical levels, and a +transparent rule-based system. We outline the architecture and demonstrate its +flexibility through some examples capable of generating numerous diverse and +equivalent rewritings based on the input concept. 
+ +摘要:這篇論文提出了一個完整的可解釋系統,它可以解釋一組資料,抽象出基礎特徵,並以選擇的自然語言描述它們。系統依賴兩個關鍵階段:(i) 從資料中識別新興屬性,並將它們轉換為抽象概念,以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力,但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子,放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它,例如處理醫療資訊,並允許非專家和視障人士存取敘述資訊。為此,我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上,我們專注於第二階段,稱為 Concept2Text。由於具有可解釋性,資料翻譯很容易透過基於邏輯的規則建模,再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統,以解釋概念,這些概念以類別和關係的形式表達,再加上從通用本体衍生的常識,產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體,以及一個透明的基於規則的系統。我們概述了架構,並透過一些範例展示了它的靈活性,這些範例能夠根據輸入概念生成許多不同的等效重寫。 + +##### **Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York** +2502.09204v1 by Sanskar Sehgal, Yanhong A. Liu + +Legal cases require careful logical reasoning following the laws, whereas +interactions with non-technical users must be in natural language. As an +application combining logical reasoning using Prolog and natural language +processing using large language models (LLMs), this paper presents a novel +approach and system, LogicLease, to automate the analysis of landlord-tenant +legal cases in the state of New York. LogicLease determines compliance with +relevant legal requirements by analyzing case descriptions and citing all +relevant laws. It leverages LLMs for information extraction and Prolog for +legal reasoning. By separating information extraction from legal reasoning, +LogicLease achieves greater transparency and control over the legal logic +applied to each case. We evaluate the accuracy, efficiency, and robustness of +LogicLease through a series of tests, achieving 100% accuracy and an average +processing time of 2.57 seconds. LogicLease presents advantages over +state-of-the-art LLM-based legal analysis systems by providing clear, +step-by-step reasoning, citing specific laws, and distinguishing itself by its +ability to avoid hallucinations -- a common issue in LLMs. 
+ +摘要:法律案件需要遵循法律进行谨慎的逻辑推理,而与非技术用户的互动必须使用自然语言。作为结合使用 Prolog 进行逻辑推理和使用大型语言模型 (LLM) 进行自然语言处理的应用程序,本文提出了一种新颖的方法和系统 LogicLease,以自动分析纽约州的房东与租户法律案件。LogicLease 通过分析案例描述并引用所有相关法律来确定是否符合相关法律要求。它利用 LLM 进行信息提取,并利用 Prolog 进行法律推理。通过将信息提取与法律推理分开,LogicLease 实现了对应用于每个案例的法律逻辑的更高透明度和控制力。我们通过一系列测试评估了 LogicLease 的准确性、效率和鲁棒性,实现了 100% 的准确性和 2.57 秒的平均处理时间。LogicLease 通过提供清晰、分步的推理,引用具体法律,并以其避免幻觉的能力而区别于最先进的基于 LLM 的法律分析系统,从而显示出优势——这是 LLM 中的常见问题。 + +##### **Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia** +2502.09173v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott + +In remote healthcare monitoring, time series representation learning reveals +critical patient behavior patterns from high-frequency data. This study +analyzes home activity data from individuals living with dementia by proposing +a two-stage, self-supervised learning approach tailored to uncover low-rank +structures. The first stage converts time-series activities into text sequences +encoded by a pre-trained language model, providing a rich, high-dimensional +latent state space using a PageRank-based method. This PageRank vector captures +latent state transitions, effectively compressing complex behaviour data into a +succinct form that enhances interpretability. This low-rank representation not +only enhances model interpretability but also facilitates clustering and +transition analysis, revealing key behavioral patterns correlated with +clinical metrics such as MMSE and ADAS-COG scores. Our findings demonstrate the +framework's potential in supporting cognitive status prediction, personalized +care interventions, and large-scale health monitoring. 
+ +摘要:在遠程醫療監控中,時間序列表示學習揭示了高頻率數據中的關鍵患者行為模式。本研究通過提出一個兩階段、自我監督的學習方法來分析痴呆症患者的家庭活動數據,該方法專門用於發現低秩結構。第一階段將時間序列活動轉換為由預訓練語言模型編碼的文本序列,使用基於 PageRank 的方法提供了一個豐富、高維的潛在狀態空間。此 PageRank 向量捕獲潛在狀態轉換,有效地將複雜的行為數據壓縮成簡潔的形式,從而增強了解力。此低秩表示不僅增強了模型的可解釋性,還促進了聚類和轉換分析,揭示了與臨床指標(例如 MMSE 和 ADAS-COG 分數)相關的關鍵行為模式。我們的研究結果證明了該框架在支持認知狀態預測、個性化護理干預和大型健康監控方面的潛力。 + +##### **TastepepAI, An artificial intelligence platform for taste peptide de novo design** +2502.12167v1 by Jianda Yue, Tingting Li, Jian Ouyang, Jiawei Xu, Hua Tan, Zihui Chen, Changsheng Han, Huanyu Li, Songping Liang, Zhonghua Liu, Zhonghua Liu, Ying Wang + +Taste peptides have emerged as promising natural flavoring agents attributed +to their unique organoleptic properties, high safety profile, and potential +health benefits. However, the de novo identification of taste peptides derived +from animal, plant, or microbial sources remains a time-consuming and +resource-intensive process, significantly impeding their widespread application +in the food industry. Here, we present TastePepAI, a comprehensive artificial +intelligence framework for customized taste peptide design and safety +assessment. As the key element of this framework, a loss-supervised adaptive +variational autoencoder (LA-VAE) is implemented to efficiently optimize the +latent representation of sequences during training and facilitate the +generation of target peptides with desired taste profiles. Notably, our model +incorporates a novel taste-avoidance mechanism, allowing for selective flavor +exclusion. Subsequently, our in-house developed toxicity prediction algorithm +(SpepToxPred) is integrated into the framework to perform rigorous safety +evaluation of generated peptides. Using this integrated platform, we +successfully identified 73 peptides exhibiting sweet, salty, and umami tastes, +significantly expanding the current repertoire of taste peptides. 
This work +demonstrates the potential of TastePepAI in accelerating taste peptide +discovery for food applications and provides a versatile framework adaptable to +broader peptide engineering challenges. + +摘要:味觉肽因其独特的感官特性、高安全性概况和潜在的健康益处而成为有前途的天然调味剂。然而,从动物、植物或微生物来源中从头鉴定味觉肽仍然是一个耗时且资源密集的过程,严重阻碍了它们在食品工业中的广泛应用。在此,我们提出了 TastePepAI,这是一个用于定制味觉肽设计和安全性评估的综合人工智能框架。作为该框架的关键元素,实现了损失监督自适应变分自动编码器 (LA-VAE),以在训练期间有效优化序列的潜在表示,并促进生成具有所需味觉特征的目标肽。值得注意的是,我们的模型包含了一种新颖的味觉回避机制,允许选择性排除风味。随后,我们内部开发的毒性预测算法 (SpepToxPred) 被集成到框架中,以对生成的肽进行严格的安全评估。使用这个集成平台,我们成功地鉴定了 73 种表现出甜味、咸味和鲜味的肽,极大地扩展了当前的味觉肽库。这项工作展示了 TastePepAI 在加速味觉肽发现以用于食品应用方面的潜力,并提供了一个适用于更广泛的肽工程挑战的多功能框架。 + +##### **HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification** +2502.08754v1 by Valentina Vadori, Jean-Marie Graïc, Antonella Peruffo, Livio Finos, Ujwala Kiran Chaudhari, Enrico Grisan + +Precise segmentation and classification of cell instances are vital for +analyzing the tissue microenvironment in histology images, supporting medical +diagnosis, prognosis, treatment planning, and studies of brain +cytoarchitecture. However, the creation of high-quality annotated datasets for +training remains a major challenge. This study introduces a novel single-stage +approach (HistoSmith) for generating image-label pairs to augment histology +datasets. Unlike state-of-the-art methods that utilize diffusion models with +separate components for label and image generation, our approach employs a +latent diffusion model to learn the joint distribution of cellular layouts, +classification masks, and histology images. This model enables tailored data +generation by conditioning on user-defined parameters such as cell types, +quantities, and tissue types. Trained on the Conic H&E histopathology dataset +and the Nissl-stained CytoDArk0 dataset, the model generates realistic and +diverse labeled samples. 
Experimental results demonstrate improvements in cell +instance segmentation and classification, particularly for underrepresented +cell types like neutrophils in the Conic dataset. These findings underscore the +potential of our approach to address data scarcity challenges. + +摘要:精確的細胞實例分割和分類對於分析組織學影像中的組織微環境、支援醫療診斷、預後、治療規劃和腦部細胞結構研究至關重要。然而,建立用於訓練的高品質標註資料集仍然是一項重大挑戰。本研究提出了一種新穎的單階段方法 (HistoSmith),用於產生影像標籤對,以擴充組織學資料集。與利用擴散模型並將標籤和影像產生分開的組成部分的現有技術不同,我們的做法採用潛在擴散模型來學習細胞佈局、分類遮罩和組織學影像的聯合分佈。此模型能透過調整使用者定義的參數(例如細胞類型、數量和組織類型)來進行客製化資料產生。在 Conic H&E 細胞病理學資料集和 Nissl 染色的 CytoDArk0 資料集上訓練後,此模型產生逼真且多樣化的標籤樣本。實驗結果顯示細胞實例分割和分類有顯著進步,特別是對於 Conic 資料集中代表性不足的細胞類型,例如中性球。這些發現強調了我們的方法在解決資料稀少性挑戰方面的潛力。 + +##### **Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion** +2502.08560v1 by Lemuel Puglisi, Daniel C. Alexander, Daniele Ravì + +The growing availability of longitudinal Magnetic Resonance Imaging (MRI) +datasets has facilitated Artificial Intelligence (AI)-driven modeling of +disease progression, making it possible to predict future medical scans for +individual patients. However, despite significant advancements in AI, current +methods continue to face challenges including achieving patient-specific +individualization, ensuring spatiotemporal consistency, efficiently utilizing +longitudinal data, and managing the substantial memory demands of 3D scans. To +address these challenges, we propose Brain Latent Progression (BrLP), a novel +spatiotemporal model designed to predict individual-level disease progression +in 3D brain MRIs. 
The key contributions in BrLP are fourfold: (i) it operates +in a small latent space, mitigating the computational challenges posed by +high-dimensional imaging data; (ii) it explicitly integrates subject metadata +to enhance the individualization of predictions; (iii) it incorporates prior +knowledge of disease dynamics through an auxiliary model, facilitating the +integration of longitudinal data; and (iv) it introduces the Latent Average +Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in +the predicted progression at inference time and (b) allows us to derive a +measure of the uncertainty for the prediction. We train and evaluate BrLP on +11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its +generalizability on an external test set comprising 2,257 MRIs from 962 +subjects. Our experiments compare BrLP-generated MRI scans with real follow-up +MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The +code is publicly available at: https://github.com/LemuelPuglisi/BrLP. 
+ +摘要:隨著縱向磁共振影像 (MRI) 資料集的日益普及,已促進人工智慧 (AI) 驅動的疾病進程建模,讓預測個別患者的未來醫學掃描成為可能。然而,儘管 AI 有顯著進展,目前的技術仍面臨挑戰,包括實現患者特定的個別化、確保時空一致性、有效利用縱向資料,以及管理 3D 掃描的大量記憶體需求。為了應對這些挑戰,我們提出腦潛在進程 (BrLP),這是一種新穎的時空模型,旨在預測 3D 腦部 MRI 中的個人層級疾病進程。BrLP 的主要貢獻有四個:(i) 它在一個小的潛在空間中運作,減輕了高維度影像資料帶來的計算挑戰;(ii) 它明確整合受試者的元資料,以增強預測的個別化;(iii) 它透過輔助模型納入疾病動態的先驗知識,促進縱向資料的整合;(iv) 它引入了潛在平均穩定化 (LAS) 演算法,該演算法 (a) 在推論時強制預測進程中的時空一致性,(b) 讓我們能夠推導預測的不確定性測量。我們對來自 2,805 名受試者的 11,730 個 T1 加權 (T1w) 腦部 MRI 進行 BrLP 訓練和評估,並在包含來自 962 名受試者的 2,257 個 MRI 的外部測試集上驗證其概括性。我們的實驗將 BrLP 生成的 MRI 掃描與實際追蹤 MRI 進行比較,與現有方法相比,展示了最先進的準確性。程式碼已公開於:https://github.com/LemuelPuglisi/BrLP。 + +##### **Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data** +2502.08547v1 by Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai + +The adoption of EHRs has expanded opportunities to leverage data-driven +algorithms in clinical care and research. A major bottleneck in effectively +conducting multi-institutional EHR studies is the data heterogeneity across +systems with numerous codes that either do not exist or represent different +clinical concepts across institutions. The need for data privacy further limits +the feasibility of including multi-institutional patient-level data required to +study similarities and differences across patient subgroups. To address these +challenges, we developed the GAME algorithm. 
Tested and validated across 7 +institutions and 2 languages, GAME integrates data in several levels: (1) at +the institutional level with knowledge graphs to establish relationships +between codes and existing knowledge sources, providing the medical context for +standard codes and their relationship to each other; (2) between institutions, +leveraging language models to determine the relationships between +institution-specific codes with established standard codes; and (3) quantifying +the strength of the relationships between codes using a graph attention +network. Jointly trained embeddings are created using transfer and federated +learning to preserve data privacy. In this study, we demonstrate the +applicability of GAME in selecting relevant features as inputs for AI-driven +algorithms in a range of conditions, e.g., heart failure, rheumatoid arthritis. +We then highlight the application of GAME harmonized multi-institutional EHR +data in a study of Alzheimer's disease outcomes and suicide risk among patients +with mental health disorders, without sharing patient-level data outside +individual institutions. + +摘要:電子健康紀錄的採用擴大了在臨床照護和研究中利用資料驅動演算法的機會。在有效進行多機構電子健康紀錄研究時,一個主要的瓶頸是系統間資料異質性,其中有許多代碼在機構間不存在或表示不同的臨床概念。資料隱私的需求進一步限制了納入多機構患者層級資料的可行性,而這些資料對於研究患者亞群之間的相似性和差異性是必要的。為了應對這些挑戰,我們開發了 GAME 演算法。GAME 已在 7 個機構和 2 種語言中進行測試和驗證,它整合了多個層級的資料:(1) 在機構層級,使用知識圖表來建立代碼和現有知識來源之間的關係,為標準代碼及其彼此之間的關係提供醫療背景;(2) 在機構之間,利用語言模型來確定機構特定代碼與已建立的標準代碼之間的關係;(3) 使用圖形注意網路量化代碼之間關係的強度。使用遷移和聯合學習建立聯合訓練的嵌入,以保護資料隱私。在本研究中,我們展示了 GAME 在選擇相關特徵作為 AI 驅動演算法輸入時的適用性,適用於各種情況,例如心臟衰竭、類風濕性關節炎。然後,我們重點介紹了 GAME 和諧化多機構電子健康紀錄資料在阿茲海默症疾病結果和精神疾病患者自殺風險研究中的應用,而無需在個別機構之外共享患者層級資料。 + +##### **EEG Artifact Detection and Correction with Deep Autoencoders** +2502.08686v1 by David Aquilué-Llorens, Aureli Soria-Frisch + +EEG signals convey important information about brain activity both in healthy +and pathological conditions. 
However, they are inherently noisy, which poses +significant challenges for accurate analysis and interpretation. Traditional +EEG artifact removal methods, while effective, often require extensive expert +intervention. This study presents LSTEEG, a novel LSTM-based autoencoder +designed for the detection and correction of artifacts in EEG signals. +Leveraging deep learning, particularly LSTM layers, LSTEEG captures non-linear +dependencies in sequential EEG data. LSTEEG demonstrates superior performance +in both artifact detection and correction tasks compared to other +state-of-the-art convolutional autoencoders. Our methodology enhances the +interpretability and utility of the autoencoder's latent space, enabling +data-driven automated artifact removal in EEG and its application in downstream +tasks. This research advances the field of efficient and accurate multi-channel +EEG preprocessing, and promotes the implementation and usage of automated EEG +analysis pipelines for brain health applications. + +摘要:腦電圖訊號傳達了關於大腦活動的重要資訊,無論是在健康或病理狀況下。然而,它們本質上是有雜訊的,這對準確的分析和解釋構成了重大的挑戰。傳統的腦電圖人工製品移除方法雖然有效,但通常需要大量的專家介入。本研究提出 LSTEEG,一種新穎的基於 LSTM 的自動編碼器,用於偵測和校正腦電圖訊號中的人工製品。利用深度學習,特別是 LSTM 層,LSTEEG 捕捉序列腦電圖資料中的非線性依賴性。與其他最先進的卷積自動編碼器相比,LSTEEG 在人工製品偵測和校正任務中都展現出優異的效能。我們的做法增強了自動編碼器潛在空間的可解釋性和實用性,讓資料驅動的自動人工製品移除得以應用於腦電圖的下游任務。這項研究推動了高效且準確的多通道腦電圖前處理領域,並促進了自動腦電圖分析管線在腦部健康應用中的實作和使用。 + +##### **SycEval: Evaluating LLM Sycophancy** +2502.08177v1 by Aaron Fanous, Jacob Goldberg, Ank A. Agarwal, Joanna Lin, Anson Zhou, Roxana Daneshjou, Sanmi Koyejo + +Large language models (LLMs) are increasingly applied in educational, +clinical, and professional settings, but their tendency for sycophancy -- +prioritizing user agreement over independent reasoning -- poses risks to +reliability. This study introduces a framework to evaluate sycophantic behavior +in ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro across AMPS (mathematics) and +MedQuad (medical advice) datasets. 
Sycophantic behavior was observed in 58.19% +of cases, with Gemini exhibiting the highest rate (62.47%) and ChatGPT the +lowest (56.71%). Progressive sycophancy, leading to correct answers, occurred +in 43.52% of cases, while regressive sycophancy, leading to incorrect answers, +was observed in 14.66%. Preemptive rebuttals demonstrated significantly higher +sycophancy rates than in-context rebuttals (61.75% vs. 56.52%, $Z=5.87$, +$p<0.001$), particularly in computational tasks, where regressive sycophancy +increased significantly (preemptive: 8.13%, in-context: 3.54%, $p<0.001$). +Simple rebuttals maximized progressive sycophancy ($Z=6.59$, $p<0.001$), while +citation-based rebuttals exhibited the highest regressive rates ($Z=6.59$, +$p<0.001$). Sycophantic behavior showed high persistence (78.5%, 95% CI: +[77.2%, 79.8%]) regardless of context or model. These findings emphasize the +risks and opportunities of deploying LLMs in structured and dynamic domains, +offering insights into prompt programming and model optimization for safer AI +applications. 
+ +摘要:大型語言模型(LLM)日益應用於教育、臨床和專業領域,但它們趨於趨炎附勢——優先考慮用戶同意而非獨立推理——對可靠性構成風險。本研究引入了一個框架來評估 ChatGPT-4o、Claude-Sonnet 和 Gemini-1.5-Pro 中的趨炎附勢行為,涉及 AMPS(數學)和 MedQuad(醫療建議)數據集。在 58.19% 的案例中觀察到了趨炎附勢行為,其中 Gemini 表現出最高比率(62.47%),而 ChatGPT 最低(56.71%)。導致正確答案的漸進式趨炎附勢發生在 43.52% 的案例中,而導致不正確答案的退步式趨炎附勢則在 14.66% 的案例中被觀察到。先發制人的反駁表現出顯著高於上下文反駁的趨炎附勢率(61.75% 對 56.52%,Z=5.87,p<0.001),特別是在計算任務中,其中退步式趨炎附勢顯著增加(先發制人:8.13%,上下文:3.54%,p<0.001)。簡單的反駁最大化了漸進式趨炎附勢(Z=6.59,p<0.001),而基於引用的反駁表現出最高的退步式比率(Z=6.59,p<0.001)。趨炎附勢行為表現出很高的持續性(78.5%,95% CI:[77.2%,79.8%]),無論上下文或模型如何。這些發現強調了在結構化和動態領域部署 LLM 的風險和機遇,為更安全的 AI 應用提供了提示編程和模型優化的見解。 + +##### **Cancer Vaccine Adjuvant Name Recognition from Biomedical Literature using Large Language Models** +2502.09659v1 by Hasin Rehana, Jie Zheng, Leo Yeh, Benu Bansal, Nur Bengisu Çam, Christianah Jemiyo, Brett McGregor, Arzucan Özgür, Yongqun He, Junguk Hur + +Motivation: An adjuvant is a chemical incorporated into vaccines that +enhances their efficacy by improving the immune response. Identifying adjuvant +names from cancer vaccine studies is essential for furthering research and +enhancing immunotherapies. However, the manual curation from the constantly +expanding biomedical literature poses significant challenges. This study +explores the automated recognition of vaccine adjuvant names using Large +Language Models (LLMs), specifically Generative Pretrained Transformers (GPT) +and Large Language Model Meta AI (Llama). Methods: We utilized two datasets: 97 +clinical trial records from AdjuvareDB and 290 abstracts annotated with the +Vaccine Adjuvant Compendium (VAC). GPT-4o and Llama 3.2 were employed in +zero-shot and few-shot learning paradigms with up to four examples per prompt. +Prompts explicitly targeted adjuvant names, testing the impact of contextual +information such as substances or interventions. Outputs underwent automated +and manual validation for accuracy and consistency. 
Results: GPT-4o attained +100% Precision across all situations while exhibiting notable improvement in Recall +and F1-scores, particularly with incorporating interventions. On the VAC +dataset, GPT-4o achieved a maximum F1-score of 77.32% with interventions, +surpassing Llama-3.2-3B by approximately 2%. On the AdjuvareDB dataset, GPT-4o +reached an F1-score of 81.67% for three-shot prompting with interventions, +surpassing Llama-3.2-3B's maximum F1-score of 65.62%. Conclusion: Our findings +demonstrate that LLMs excel at identifying adjuvant names, including rare +variations of naming representation. This study emphasizes the capability of +LLMs to enhance cancer vaccine development by efficiently extracting insights. +Future work aims to broaden the framework to encompass various biomedical +literature and enhance model generalizability across various vaccines and +adjuvants. + +摘要:動機:佐劑是一種加入疫苗的化學物質,能藉由改善免疫反應來提升疫苗的效力。從癌症疫苗研究中找出佐劑名稱對於推進研究和改善免疫療法至關重要。然而,從不斷擴展的生物醫學文獻中手動整理會造成重大挑戰。本研究探討使用大型語言模型 (LLM),特別是生成式預訓練Transformer (GPT) 和大型語言模型 Meta AI (Llama) 來自動辨識疫苗佐劑名稱。方法:我們使用兩個資料集:來自 AdjuvareDB 的 97 份臨床試驗記錄和 290 篇標註了疫苗佐劑彙編 (VAC) 的摘要。GPT-4o 和 Llama 3.2 被用於零次學習和少量學習範例,每個提示最多有四個範例。提示明確鎖定佐劑名稱,測試物質或介入措施等背景資訊的影響。輸出經過自動和手動驗證,以確保準確性和一致性。結果:GPT-4o 在所有情況下都達到 100% 的準確率,同時在召回率和 F1 分數上表現出顯著的進步,特別是在納入介入措施的情況下。在 VAC 資料集上,GPT-4o 在有介入措施的情況下達到 77.32% 的最高 F1 分數,比 Llama-3.2-3B 高出約 2%。在 AdjuvareDB 資料集上,GPT-4o 在有介入措施的三次提示中達到 81.67% 的 F1 分數,超過 Llama-3.2-3 B 的最高 F1 分數 65.62%。結論:我們的研究結果表明,LLM 在辨識佐劑名稱方面表現出色,包括命名表示的罕見變異。本研究強調了 LLM 在有效提取見解方面增強癌症疫苗開發的能力。未來的研究工作旨在擴大架構,涵蓋各種生物醫學文獻,並增強模型在各種疫苗和佐劑中的泛化能力。 + +##### **Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?** +2502.07963v1 by Hye Sun Yun, Karen Y. C. Zhang, Ramez Kouzy, Iain J. Marshall, Junyi Jessy Li, Byron C. Wallace + +Medical research faces well-documented challenges in translating novel +treatments into clinical practice. 
Publishing incentives encourage researchers +to present "positive" findings, even when empirical results are equivocal. +Consequently, it is well-documented that authors often spin study results, +especially in article abstracts. Such spin can influence clinician +interpretation of evidence and may affect patient care decisions. In this +study, we ask whether the interpretation of trial results offered by Large +Language Models (LLMs) is similarly affected by spin. This is important since +LLMs are increasingly being used to trawl through and synthesize published +medical evidence. We evaluated 22 LLMs and found that they are across the board +more susceptible to spin than humans. They might also propagate spin into their +outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into +plain language summaries that they generate. We also find, however, that LLMs +are generally capable of recognizing spin, and can be prompted in a way to +mitigate spin's impact on LLM outputs. + +摘要:醫學研究在將新穎療法轉化為臨床實務上,面臨著有據可查的挑戰。發表誘因鼓勵研究人員呈現「正向」的發現,即使經驗結果模稜兩可。因此,有據可查的是,作者經常扭曲研究結果,特別是在文章摘要中。此類扭曲可能會影響臨床醫師對證據的詮釋,並可能影響病患照護決策。在本研究中,我們探討大型語言模型 (LLM) 提供的試驗結果詮釋是否也受到扭曲影響。由於 LLM 正越來越常被用於爬梳和綜合已發表的醫學證據,因此這點非常重要。我們評估了 22 個 LLM,發現它們普遍比人類更容易受到扭曲影響。它們也可能將扭曲傳播到其輸出中:例如,我們發現 LLM 會將扭曲隱含納入其產生的白話文摘要中。然而,我們也發現 LLM 通常有能力辨認扭曲,而且可以透過提示的方式減輕扭曲對 LLM 輸出的影響。 + +##### **An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating** +2502.07755v1 by Mohammad Ali Labbaf Khaniki, Sahabeh Saadati, Mohammad Manthouri + +This paper presents a novel Natural Language Processing (NLP) framework for +enhancing medical diagnosis through the integration of advanced techniques in +data augmentation, feature extraction, and classification. The proposed +approach employs back-translation to generate diverse paraphrased datasets, +improving robustness and mitigating overfitting in classification tasks. 
+Leveraging Decoding-enhanced BERT with Disentangled Attention (DeBERTa) with +Dynamic Contextual Positional Gating (DCPG), the model captures fine-grained +contextual and positional relationships, dynamically adjusting the influence of +positional information based on semantic context to produce high-quality text +embeddings. For classification, an Attention-Based Feedforward Neural Network +(ABFNN) is utilized, effectively focusing on the most relevant features to +improve decision-making accuracy. Applied to the classification of symptoms, +clinical notes, and other medical texts, this architecture demonstrates its +ability to address the complexities of medical data. The combination of data +augmentation, contextual embedding generation, and advanced classification +mechanisms offers a robust and accurate diagnostic tool, with potential +applications in automated medical diagnosis and clinical decision support. This +method demonstrates the effectiveness of the proposed NLP framework for medical +diagnosis, achieving remarkable results with an accuracy of 99.78%, recall of +99.72%, precision of 99.79%, and an F1-score of 99.75%. These metrics not only +underscore the model's robust performance in classifying medical texts with +exceptional precision and reliability but also highlight its superiority over +existing methods, making it a highly promising tool for automated diagnostic +systems. 
+ +摘要:本文提出了一個創新的自然語言處理 (NLP) 框架,透過整合資料擴充、特徵萃取和分類的進階技術來增強醫療診斷。所提出的方法採用反向翻譯來產生多樣化的同義改寫資料集,提升穩健性並減輕分類任務中的過度擬合。透過利用具有動態脈絡位置閘控 (DCPG) 的解碼增強 BERT 與去糾纏注意力 (DeBERTa),這個模型捕捉細緻的脈絡和位置關係,根據語意脈絡動態調整位置資訊的影響,以產生高品質的文字嵌入。在分類方面,利用基於注意力的前饋神經網路 (ABFNN),有效地關注最相關的特徵,以提高決策準確度。應用於症狀、臨床筆記和其他醫療文本的分類,此架構證明了其處理醫療資料複雜性的能力。資料擴充、脈絡嵌入產生和進階分類機制的結合提供了一個穩健且準確的診斷工具,在自動化醫療診斷和臨床決策支援中具有潛在應用。此方法證明了所提出的 NLP 框架在醫療診斷中的有效性,以 99.78% 的準確度、99.72% 的召回率、99.79% 的精確度和 99.75% 的 F1 分數,取得了顯著的成果。這些指標不僅強調了模型在分類醫療文本時具有卓越的精確度和可靠性,也突顯了它優於現有方法的優越性,使其成為自動化診斷系統中極具前景的工具。 + +##### **Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension** +2502.07752v2 by Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds + +Designing efficient optimizers for large language models (LLMs) with +low-memory requirements and fast convergence is an important and challenging +problem. This paper makes a step towards the systematic design of such +optimizers through the lens of structured Fisher information matrix (FIM) +approximation. We show that many state-of-the-art efficient optimizers can be +viewed as solutions to FIM approximation (under the Frobenius norm) with +specific structural assumptions. Building on these insights, we propose two +design recommendations of practical efficient optimizers for LLMs, involving +the careful selection of structural assumptions to balance generality and +efficiency, and enhancing memory efficiency of optimizers with general +structures through a novel low-rank extension framework. We demonstrate how to +use each design approach by deriving new memory-efficient optimizers: Row and +Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation +(Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate the +effectiveness, showing faster and better convergence than existing +memory-efficient baselines and Adam with little memory overhead. 
Notably, Alice +achieves better than 2x faster convergence over Adam, while RACS delivers +strong performance on the 1B model with SGD-like memory. + +摘要:設計具有低記憶體需求和快速收斂的大型語言模型 (LLM) 的高效最佳化器是一個重要且具有挑戰性的問題。本文透過結構化 Fisher 資訊矩陣 (FIM) 近似的觀點,朝著系統化設計此類最佳化器邁出了一步。我們證明許多最先進的高效最佳化器可以視為 FIM 近似(在 Frobenius 範數下)的解,並具有特定的結構假設。基於這些見解,我們提出了 LLM 的兩個實用高效最佳化器設計建議,包括仔細選擇結構假設以平衡通用性和效率,以及透過新穎的低秩擴充框架增強一般結構最佳化器的記憶體效率。我們透過推導新的記憶體高效最佳化器來展示如何使用每種設計方法:列和欄縮放 SGD (RACS) 和自適應低維子空間估計 (Alice)。在 LLaMA 預訓練(高達 1B 參數)上的實驗驗證了其有效性,顯示比現有的記憶體高效基準和 Adam 更快且更好的收斂,且記憶體開銷很小。值得注意的是,Alice 的收斂速度比 Adam 快 2 倍以上,而 RACS 則在 1B 模型上提供類似 SGD 的記憶體的強勁效能。 + +##### **The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation** +2502.07516v2 by Raman Dutt + +Generative models, particularly text-to-image (T2I) diffusion models, play a +crucial role in medical image analysis. However, these models are prone to +training data memorization, posing significant risks to patient privacy. +Synthetic chest X-ray generation is one of the most common applications in +medical image analysis with the MIMIC-CXR dataset serving as the primary data +repository for this task. This study presents the first systematic attempt to +identify prompts and text tokens in MIMIC-CXR that contribute the most to +training data memorization. Our analysis reveals two unexpected findings: (1) +prompts containing traces of de-identification procedures (markers introduced +to hide Protected Health Information) are the most memorized, and (2) among all +tokens, de-identification markers contribute the most towards memorization. +This highlights a broader issue with the standard anonymization practices and +T2I synthesis with MIMIC-CXR. To make matters worse, existing inference-time +memorization mitigation strategies are ineffective and fail to sufficiently +reduce the model's reliance on memorized text tokens. 
On this front, we propose +actionable strategies for different stakeholders to enhance privacy and improve +the reliability of generative models in medical imaging. Finally, our results +provide a foundation for future work on developing and benchmarking +memorization mitigation techniques for synthetic chest X-ray generation using +the MIMIC-CXR dataset. The anonymized code is available at +https://anonymous.4open.science/r/diffusion_memorization-8011/ + +摘要:生成模型,尤其是文本到影像 (T2I) 擴散模型在醫學影像分析中扮演著至關重要的角色。然而,這些模型容易訓練資料記憶,對病患隱私構成重大風險。合成胸部 X 光影像生成是醫學影像分析中最常見的應用之一,而 MIMIC-CXR 資料集則作為此任務的主要資料儲存庫。本研究提出了第一個系統化的嘗試,以識別 MIMIC-CXR 中對訓練資料記憶貢獻最大的提示和文字代碼。我們的分析揭示了兩個出乎意料的發現:(1) 包含去識別程序痕跡的提示(用於隱藏受保護健康資訊的標記)是最容易被記憶的,以及 (2) 在所有代碼中,去識別標記對記憶的貢獻最大。這突顯了標準匿名化實務和使用 MIMIC-CXR 進行 T2I 合成的更廣泛問題。更糟的是,現有的推論時間記憶減緩策略無效,無法充分降低模型對記憶文字代碼的依賴。在這個方面,我們針對不同的利害關係人提出可行的策略,以增強隱私和改善生成模型在醫學影像中的可靠性。最後,我們的結果為未來開發和評量使用 MIMIC-CXR 資料集進行合成胸部 X 光影像生成的記憶減緩技術奠定了基礎。已匿名化的程式碼可在 https://anonymous.4open.science/r/diffusion_memorization-8011/ 取得。 + +##### **KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level** +2502.07288v1 by Ruining Deng, Tianyuan Yao, Yucheng Tang, Junlin Guo, Siqi Lu, Juming Xiong, Lining Yu, Quan Huu Cap, Pengzhou Cai, Libin Lan, Ze Zhao, Adrian Galdran, Amit Kumar, Gunjan Deotale, Dev Kumar Das, Inyoung Paik, Joonho Lee, Geongyu Lee, Yujia Chen, Wangkai Li, Zhaoyang Li, Xuege Hou, Zeyuan Wu, Shengjin Wang, Maximilian Fischer, Lars Kramer, Anghong Du, Le Zhang, Maria Sanchez Sanchez, Helena Sanchez Ulloa, David Ribalta Heredia, Carlos Perez de Arenaza Garcia, Shuoyu Xu, Bingdou He, Xinping Cheng, Tao Wang, Noemie Moreau, Katarzyna Bozek, Shubham Innani, Ujjwal Baid, Kaura Solomon Kefas, Bennett A. Landman, Yu Wang, Shilin Zhao, Mengmeng Yin, Haichun Yang, Yuankai Huo + +Chronic kidney disease (CKD) is a major global health issue, affecting over +10% of the population and causing significant mortality. 
While kidney biopsy +remains the gold standard for CKD diagnosis and treatment, the lack of +comprehensive benchmarks for kidney pathology segmentation hinders progress in +the field. To address this, we organized the Kidney Pathology Image +Segmentation (KPIs) Challenge, introducing a dataset that incorporates +preclinical rodent models of CKD with over 10,000 annotated glomeruli from 60+ +Periodic Acid Schiff (PAS)-stained whole slide images. The challenge includes +two tasks, patch-level segmentation and whole slide image segmentation and +detection, evaluated using the Dice Similarity Coefficient (DSC) and F1-score. +By encouraging innovative segmentation methods that adapt to diverse CKD models +and tissue conditions, the KPIs Challenge aims to advance kidney pathology +analysis, establish new benchmarks, and enable precise, large-scale +quantification for disease research and diagnosis. + +摘要:慢性腎臟病 (CKD) 是全球主要的健康問題,影響超過 +10% 的人口,並造成顯著的死亡率。雖然腎臟活檢 +仍然是 CKD 診斷和治療的黃金標準,但缺乏 +腎臟病理學分割的全面基準阻礙了該領域的進展。 +為了解決這個問題,我們組織了腎臟病理影像 +分割 (KPIs) 挑戰,引入了包含超過 10,000 個註解的 +CKD 臨床前嚙齒動物模型的資料集,這些註解來自 60 多個 +週期性酸性雪夫 (PAS) 染色的全幻燈片影像。挑戰包括 +兩個任務,修補層級分割和全幻燈片影像分割和 +偵測,使用 Dice 相似係數 (DSC) 和 F1 分數進行評估。 +通過鼓勵創新的分割方法來適應不同的 CKD 模型 +和組織條件,KPIs 挑戰旨在推進腎臟病理 +分析,建立新的基準,並實現精確、大規模的 +疾病研究和診斷量化。 + +##### **Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer** +2502.07158v2 by Jiaying Lu, Stephanie R. Brown, Songyuan Liu, Shifan Zhao, Kejun Dong, Del Bold, Michael Fundora, Alaa Aljiffry, Alex Fedorov, Jocelyn Grunwell, Xiao Hu + +Early prediction of pediatric cardiac arrest (CA) is critical for timely +intervention in high-risk intensive care settings. We introduce PedCA-FT, a +novel transformer-based framework that fuses tabular view of EHR with the +derived textual view of EHR to fully unleash the interactions of +high-dimensional risk factors and their dynamics. 
By employing dedicated +transformer modules for each modality view, PedCA-FT captures complex temporal +and contextual patterns to produce robust CA risk estimates. Evaluated on a +curated pediatric cohort from the CHOA-CICU database, our approach outperforms +ten other artificial intelligence models across five key performance metrics +and identifies clinically meaningful risk factors. These findings underscore +the potential of multimodal fusion techniques to enhance early CA detection and +improve patient care. + +摘要:早期預測小兒心臟驟停 (CA) 對於在高風險的重症照護環境中及時介入至關重要。我們引入了 PedCA-FT,一個新穎的基於轉換器的框架,它將 EHR 的表格視圖與 EHR 的派生文本視圖融合在一起,以充分發揮高維風險因素及其動態的交互作用。通過為每個模態視圖採用專用的轉換器模組,PedCA-FT 捕獲複雜的時間和上下文模式,以產生穩健的 CA 風險估計。在 CHOA-CICU 資料庫中策劃的小兒群體中進行評估,我們的做法在五項關鍵績效指標中優於其他十種人工智慧模型,並找出臨床上有意義的風險因素。這些發現強調了多模式融合技術在增強早期 CA 檢測和改善患者照護方面的潛力。 + +##### **Explaining 3D Computed Tomography Classifiers with Counterfactuals** +2502.07156v1 by Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari + +Counterfactual explanations in medical imaging are critical for understanding +the predictions made by deep learning models. We extend the Latent Shift +counterfactual generation method from 2D applications to 3D computed tomography +(CT) scans. We address the challenges associated with 3D data, such as limited +training samples and high memory demands, by implementing a slice-based +approach. This method leverages a 2D encoder trained on CT slices, which are +subsequently combined to maintain 3D context. We demonstrate this technique on +two models for clinical phenotype prediction and lung segmentation. Our +approach is both memory-efficient and effective for generating interpretable +counterfactuals in high-resolution 3D medical imaging. 
+ +摘要:反事實解釋在醫學影像中對於理解深度學習模型所做的預測至關重要。我們將 Latent Shift 反事實生成方法從 2D 應用程式延伸到 3D 電腦斷層掃描 (CT) 掃描。我們透過實作基於切片的做法,來解決與 3D 資料相關的挑戰,例如受限的訓練樣本和高記憶體需求。此方法利用經過 CT 切片訓練的 2D 編碼器,隨後將這些切片結合起來以維護 3D 背景。我們在兩個用於臨床表型預測和肺部分割的模型上展示此技術。我們的做法對於在高解析度 3D 醫學影像中產生可解釋的反事實,既節省記憶體又有效。 + +##### **Interactive Data Harmonization with LLM Agents** +2502.07132v1 by Aécio Santos, Eduardo H. M. Pena, Roque Lopez, Juliana Freire + +Data harmonization is an essential task that entails integrating datasets +from diverse sources. Despite years of research in this area, it remains a +time-consuming and challenging task due to schema mismatches, varying +terminologies, and differences in data collection methodologies. This paper +presents the case for agentic data harmonization as a means to both empower +experts to harmonize their data and to streamline the process. We introduce +Harmonia, a system that combines LLM-based reasoning, an interactive user +interface, and a library of data harmonization primitives to automate the +synthesis of data harmonization pipelines. We demonstrate Harmonia in a +clinical data harmonization scenario, where it helps to interactively create +reusable pipelines that map datasets to a standard format. Finally, we discuss +challenges and open problems, and suggest research directions for advancing our +vision. + +摘要:資料調和是一項整合不同來源資料集的重要任務。儘管多年來針對此領域的研究不斷,但由於架構不匹配、術語不同,以及資料收集方法的差異,它仍然是一項耗時且具有挑戰性的任務。本文提出代理資料調和,作為賦能專家調和其資料並簡化流程的方法。我們介紹 Harmonia,一個結合了基於 LLM 的推理、互動式使用者介面和資料調和原語庫的系統,以自動化資料調和管線的合成。我們在臨床資料調和場景中展示了 Harmonia,它有助於互動式建立可重複使用的管線,將資料集對應至標準格式。最後,我們討論挑戰和開放性問題,並建議研究方向以推進我們的願景。 + +##### **Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML** +2502.07026v1 by Mohammad Amir Salari, Bahareh Rahmani + +Machine learning (ML) is transforming healthcare by enabling predictive +analytics, personalized treatments, and improved patient outcomes. 
However, +traditional ML workflows require specialized skills, infrastructure, and +resources, limiting accessibility for many healthcare professionals. This paper +explores how Google Cloud's BigQuery ML simplifies the development and +deployment of ML models using SQL, reducing technical barriers. Through a case +study on diabetes prediction using the Diabetes Health Indicators Dataset, we +evaluate three predictive models: Logistic Regression, Boosted Tree, and Deep +Neural Network (DNN). Our results demonstrate that the Boosted Tree model +achieves the highest performance, making it highly effective for diabetes +prediction. This study highlights BigQuery ML's role in democratizing machine +learning by providing a scalable, efficient, and accessible solution for +healthcare analytics. + +摘要:機器學習 (ML) 透過啟用預測分析、個人化治療和改善病患結果,正在轉型醫療保健。然而,傳統的 ML 工作流程需要專業技能、基礎設施和資源,限制了許多醫療保健專業人員的可及性。本文探討 Google Cloud 的 BigQuery ML 如何使用 SQL 簡化 ML 模型的開發和部署,降低技術障礙。透過使用糖尿病健康指標資料集對糖尿病預測進行個案研究,我們評估了三個預測模型:邏輯迴歸、提升樹和深度神經網路 (DNN)。我們的結果證明,提升樹模型達到了最高的效能,使其對於糖尿病預測非常有效。這項研究強調了 BigQuery ML 在民主化機器學習中扮演的角色,提供可擴充、有效率且可存取的醫療保健分析解決方案。 + +##### **AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements** +2502.07022v1 by Adriana Eufrosiana Bora, Pierre-Luc St-Charles, Mirko Bronzi, Arsène Fansi Tchango, Bruno Rousseau, Kerrie Mengersen + +Despite over a decade of legislative efforts to address modern slavery in the +supply chains of large corporations, the effectiveness of government oversight +remains hampered by the challenge of scrutinizing thousands of statements +annually. While Large Language Models (LLMs) can be considered a well +established solution for the automatic analysis and summarization of documents, +recognizing concrete modern slavery countermeasures taken by companies and +differentiating those from vague claims remains a challenging task. 
To help +evaluate and fine-tune LLMs for the assessment of corporate statements, we +introduce a dataset composed of 5,731 modern slavery statements taken from the +Australian Modern Slavery Register and annotated at the sentence level. This +paper details the construction steps for the dataset that include the careful +design of annotation specifications, the selection and preprocessing of +statements, and the creation of high-quality annotation subsets for effective +model evaluations. To demonstrate our dataset's utility, we propose a machine +learning methodology for the detection of sentences relevant to mandatory +reporting requirements set by the Australian Modern Slavery Act. We then follow +this methodology to benchmark modern language models under zero-shot and +supervised learning settings. + +摘要:儘管立法努力超過十年,旨在解決大型企業供應鏈中的現代奴隸制,但政府監督的有效性仍然受到每年審查數千份聲明的挑戰所阻礙。雖然大型語言模型(LLM)可以被認為是文件自動分析和摘要的完善解決方案,但要辨識公司採取的具體現代奴隸制對策,並將其與含糊的聲明區分開來,仍然是一項具有挑戰性的任務。為了幫助評估和微調 LLM 以評估企業聲明,我們引入了一個由 5,731 份現代奴隸制聲明組成的資料集,這些聲明取自澳洲現代奴隸制註冊處,並在句子層級進行註解。本文詳細說明了資料集的建構步驟,其中包括註解規格的仔細設計、聲明的選擇和預處理,以及用於有效模型評估的高品質註解子集的建立。為了展示我們的資料集的效用,我們提出了一種機器學習方法,用於檢測與澳洲現代奴隸制法規定的強制性報告要求相關的句子。然後,我們遵循這種方法,在零次學習和監督學習設定下對現代語言模型進行基準測試。 + +##### **Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium** +2502.06693v1 by Amin Adibi, Xu Cao, Zongliang Ji, Jivat Neet Kaur, Winston Chen, Elizabeth Healey, Brighton Nuwagira, Wenqian Ye, Geoffrey Woollard, Maxwell A Xu, Hejie Cui, Johnny Xi, Trenton Chang, Vasiliki Bikia, Nicole Zhang, Ayush Noori, Yuan Xia, Md. Belal Hossain, Hanna A. Frank, Alina Peluso, Yuan Pu, Shannon Zejiang Shen, John Wu, Adibvafa Fallahpour, Sazan Mahbub, Ross Duncan, Yuwei Zhang, Yurui Cao, Zuheng Xu, Michael Craig, Rahul G. Krishnan, Rahmatollah Beheshti, James M. 
Rehg, Mohammad Ehsanul Karim, Megan Coffee, Leo Anthony Celi, Jason Alan Fries, Mohsen Sadatsafavi, Dennis Shung, Shannon McWeeney, Jessica Dafflon, Sarah Jabbour + +The fourth Machine Learning for Health (ML4H) symposium was held in person on +December 15th and 16th, 2024, in the traditional, ancestral, and unceded +territories of the Musqueam, Squamish, and Tsleil-Waututh Nations in Vancouver, +British Columbia, Canada. The symposium included research roundtable sessions +to foster discussions between participants and senior researchers on timely and +relevant topics for the ML4H community. The organization of the research +roundtables at the conference involved 13 senior and 27 junior chairs across 13 +tables. Each roundtable session included an invited senior chair (with +substantial experience in the field), junior chairs (responsible for +facilitating the discussion), and attendees from diverse backgrounds with an +interest in the session's topic. + +摘要:第四屆醫療機器學習 (ML4H) 研討會於 2024 年 12 月 15 日和 16 日在加拿大不列顛哥倫比亞省溫哥華的 Musqueam、Squamish 和 Tsleil-Waututh 國家的傳統、祖先和未割讓領土上舉行。研討會包括研究圓桌會議,以促進參與者和高級研究人員之間關於 ML4H 社群的及時和相關主題的討論。在會議上組織研究圓桌會議涉及 13 張桌子上的 13 位高級主席和 27 位初級主席。每個圓桌會議都包括一位受邀的高級主席(在該領域擁有豐富的經驗)、初級主席(負責促進討論)以及對會議主題感興趣的來自不同背景的與會者。 + +##### **Automatic Evaluation of Healthcare LLMs Beyond Question-Answering** +2502.06666v1 by Anna Arias-Duart, Pablo Agustin Martin-Torres, Daniel Hinjos, Pablo Bernabeu-Perez, Lucia Urcelay Ganzabal, Marta Gonzalez Mallo, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Sergio Alvarez-Napagao, Dario Garcia-Gasulla + +Current Large Language Models (LLMs) benchmarks are often based on open-ended +or close-ended QA evaluations, avoiding the requirement of human labor. +Close-ended measurements evaluate the factuality of responses but lack +expressiveness. Open-ended capture the model's capacity to produce discourse +responses but are harder to assess for correctness. 
These two approaches are +commonly used, either independently or together, though their relationship +remains poorly understood. This work is focused on the healthcare domain, where +both factuality and discourse matter greatly. It introduces a comprehensive, +multi-axis suite for healthcare LLM evaluation, exploring correlations between +open and close benchmarks and metrics. Findings include blind spots and +overlaps in current methodologies. As an updated sanity check, we release a new +medical benchmark --CareQA-- with both open and closed variants. Finally, we +propose a novel metric for open-ended evaluations -- Relaxed Perplexity -- to +mitigate the identified limitations. + +摘要:當前大型語言模型 (LLM) 基準通常基於開放式或封閉式問答評量,避免了人力需求。封閉式測量評估回應的事實性,但缺乏表達力。開放式測量捕捉模型產生論述回應的能力,但較難評估正確性。這兩種方法通常獨立或合併使用,儘管它們之間的關係仍然知之甚少。這項工作專注於醫療保健領域,在該領域中,事實性和論述都非常重要。它引入了一個全面的多軸套件,用於醫療保健 LLM 評量,探索開放式和封閉式基準和指標之間的關聯性。研究結果包括當前方法中的盲點和重疊。作為更新的健全性檢查,我們發布了一個新的醫療基準--CareQA--,包含開放式和封閉式變體。最後,我們提出了一個用於開放式評量的全新指標--放鬆困惑度--以減輕已識別的限制。 + +##### **Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging** +2502.06632v1 by Mohammed Abdul Hafeez Khan, Samuel Morries Boddepalli, Siddhartha Bhattacharyya, Debasis Mitra + +Accurate classification and anatomical localization are essential for +effective medical diagnostics and research, which may be efficiently performed +using deep learning techniques. However, availability of limited labeled data +poses a significant challenge. To address this, we adapted Prototypical +Networks and the Propagation-Reconstruction Network (PRNet) for few-shot +classification and localization, respectively, in Single Photon Emission +Computed Tomography (SPECT) images. For the proof of concept we used a +2D-sliced image cropped around heart. The Prototypical Network, with a +pre-trained ResNet-18 backbone, classified ventricles, myocardium, and liver +tissues with 96.67% training and 93.33% validation accuracy. 
PRNet, adapted for +2D imaging with an encoder-decoder architecture and skip connections, achieved +a training loss of 1.395, accurately reconstructing patches and capturing +spatial relationships. These results highlight the potential of Prototypical +Networks for tissue classification with limited labeled data and PRNet for +anatomical landmark localization, paving the way for improved performance in +deep learning frameworks. + +摘要:精確的分類和解剖定位對於有效的醫療診斷和研究至關重要,而這可以使用深度學習技術有效執行。然而,標記資料有限的取得會造成重大的挑戰。為了解決這個問題,我們分別調整了原型網路和傳播重建網路 (PRNet),用於單光子發射電腦斷層掃描 (SPECT) 影像中的少量分類和定位。為了證明這個概念,我們使用圍繞心臟裁切的 2D 切片影像。原型網路,使用預先訓練的 ResNet-18 主幹,對心室、心肌和肝臟組織進行分類,訓練準確度為 96.67%,驗證準確度為 93.33%。PRNet,調整為使用編碼器解碼器架構和跳躍連接的 2D 影像,達到了 1.395 的訓練損失,精確地重建了區塊並擷取了空間關係。這些結果突出了原型網路在標記資料有限的情況下進行組織分類的潛力,以及 PRNet 在解剖標誌定位方面的潛力,為深度學習架構中效能的提升鋪平了道路。 + +##### **Illegal Waste Detection in Remote Sensing Images: A Case Study** +2502.06607v2 by Federico Gibellini, Piero Fraternali, Giacomo Boracchi, Luca Morandini, Andrea Diecidue, Simona Malegori + +Environmental crime currently represents the third largest criminal activity +worldwide while threatening ecosystems as well as human health. Among the +crimes related to this activity, improper waste management can nowadays be +countered more easily thanks to the increasing availability and decreasing cost +of Very-High-Resolution Remote Sensing images, which enable semi-automatic +territory scanning in search of illegal landfills. This paper proposes a +pipeline, developed in collaboration with professionals from a local +environmental agency, for detecting candidate illegal dumping sites leveraging +a classifier of Remote Sensing images. To identify the best configuration for +such classifier, an extensive set of experiments was conducted and the impact +of diverse image characteristics and training settings was thoroughly analyzed. 
+The local environmental agency was then involved in an experimental exercise +where outputs from the developed classifier were integrated in the experts' +everyday work, resulting in time savings with respect to manual +photo-interpretation. The classifier was eventually run with valuable results +on a location outside of the training area, highlighting potential for +cross-border applicability of the proposed pipeline. + +摘要:環境犯罪目前是全球第三大犯罪活動,威脅生態系統和人類健康。在與此活動相關的犯罪中,不當廢物管理現在可以更容易地得到解決,這要歸功於超高解析度遙測影像越來越普及且成本下降,這使得半自動領土掃描能夠搜尋非法垃圾掩埋場。本文提出了一條管道,與當地環境機構的專業人士合作開發,用於檢測候選非法傾倒地點,利用遙測影像分類器。為了找出這種分類器的最佳配置,進行了一系列廣泛的實驗,並徹底分析了不同影像特徵和訓練設定的影響。然後,當地環境機構參與了一項實驗練習,其中將已開發分類器的輸出整合到專家的日常工作中,從而節省了人工照片解譯的時間。最後在訓練區域外的某個位置執行分類器,獲得了有價值的結果,突出了所提出管道的跨境適用性潛力。 + +##### **FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model** +2502.06438v1 by Anna Tegon, Thorir Mar Ingolfsson, Xiaying Wang, Luca Benini, Yawei Li + +Accurate and efficient electroencephalography (EEG) analysis is essential for +detecting seizures and artifacts in long-term monitoring, with applications +spanning hospital diagnostics to wearable health devices. Robust EEG analytics +have the potential to greatly improve patient care. However, traditional deep +learning models, especially Transformer-based architectures, are hindered by +their quadratic time and memory complexity, making them less suitable for +resource-constrained environments. To address these challenges, we present +FEMBA (Foundational EEG Mamba + Bidirectional Architecture), a novel +self-supervised framework that establishes new efficiency benchmarks for EEG +analysis through bidirectional state-space modeling. Unlike Transformer-based +models, which incur quadratic time and memory complexity, FEMBA scales linearly +with sequence length, enabling more scalable and efficient processing of +extended EEG recordings. 
 Trained on over 21,000 hours of unlabeled EEG and +fine-tuned on three downstream tasks, FEMBA achieves competitive performance in +comparison with transformer models, with significantly lower computational +cost. Specifically, it reaches 81.82% balanced accuracy (0.8921 AUROC) on TUAB +and 0.949 AUROC on TUAR, while a tiny 7.8M-parameter variant demonstrates +viability for resource-constrained devices. These results pave the way for +scalable, general-purpose EEG analytics in clinical settings and highlight FEMBA as +a promising candidate for wearable applications. + +摘要:準確且有效的腦電圖 (EEG) 分析對於偵測長時間監控中的癲癇發作和偽像至關重要,其應用範圍涵蓋醫院診斷到可穿戴式健康裝置。穩健的 EEG 分析具有大幅改善病患照護的潛力。然而,傳統深度學習模型,特別是基於 Transformer 的架構,受到其二次時間和記憶體複雜度的阻礙,使其不太適合資源受限的環境。為了應對這些挑戰,我們提出 FEMBA (基礎 EEG Mamba + 雙向架構),一種創新的自我監督架構,透過雙向狀態空間建模為 EEG 分析建立新的效率基準。與會產生二次時間和記憶體複雜度的基於 Transformer 的模型不同,FEMBA 隨著序列長度線性縮放,支援更具可擴充性和效率的延伸 EEG 記錄處理。FEMBA 在超過 21,000 小時的未標記 EEG 上訓練並在三個下游任務上進行微調,與Transformer模型相比,在計算成本顯著降低的情況下,實現了具有競爭力的效能。具體來說,它在 TUAB 上達到 81.82% 的平衡準確度 (0.8921 AUROC) 和在 TUAR 上達到 0.949 AUROC,而一個微小的 7.8M 參數變體證明了其在資源受限裝置上的可行性。這些結果為臨床和可穿戴應用中可擴充的通用 EEG 分析鋪平了道路,並突顯 FEMBA 是可穿戴應用中一個有前景的候選者。 + +##### **Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?** +2502.06289v1 by Qingshan Hou, Yukun Zhou, Jocelyn Hui Lin Goh, Ke Zou, Samantha Min Er Yew, Sahana Srinivasan, Meng Wang, Thaddaeus Lo, Xiaofeng Lei, Siegfried K. Wagner, Mark A. Chia, Dawei Yang, Hongyang Jiang, AnRan Ran, Rui Santos, Gabor Mark Somfai, Juan Helen Zhou, Haoyu Chen, Qingyu Chen, Carol Yim-Lui Cheung, Pearse A. Keane, Yih Chung Tham + +The advent of foundation models (FMs) is transforming the medical domain. In +ophthalmology, RETFound, a retina-specific FM pre-trained sequentially on 1.4 +million natural images and 1.6 million retinal images, has demonstrated high +adaptability across clinical applications.
Conversely, DINOv2, a +general-purpose vision FM pre-trained on 142 million natural images, has shown +promise in non-medical domains. However, its applicability to clinical tasks +remains underexplored. To address this, we conducted head-to-head evaluations +by fine-tuning RETFound and three DINOv2 models (large, base, small) for ocular +disease detection and systemic disease prediction tasks, across eight +standardized open-source ocular datasets, as well as the Moorfields AlzEye and +the UK Biobank datasets. DINOv2-large model outperformed RETFound in detecting +diabetic retinopathy (AUROC=0.850-0.952 vs 0.823-0.944, across three datasets, +all P<=0.007) and multi-class eye diseases (AUROC=0.892 vs. 0.846, P<0.001). In +glaucoma, DINOv2-base model outperformed RETFound (AUROC=0.958 vs 0.940, +P<0.001). Conversely, RETFound achieved superior performance over all DINOv2 +models in predicting heart failure, myocardial infarction, and ischaemic stroke +(AUROC=0.732-0.796 vs 0.663-0.771, all P<0.001). These trends persisted even +with 10% of the fine-tuning data. These findings showcase the distinct +scenarios where general-purpose and domain-specific FMs excel, highlighting the +importance of aligning FM selection with task-specific requirements to optimise +clinical performance. 
+ +摘要:基礎模型 (FM) 的出現正在轉變醫療領域。在眼科,RETFound 是一個視網膜專用 FM,依序使用 140 萬張自然影像和 160 萬張視網膜影像進行預訓練,已展現出高度適應性,可應用於各種臨床應用。相反地,DINOv2 是一個通用視覺 FM,使用 1.42 億張自然影像進行預訓練,已展現出在非醫療領域的潛力。然而,其在臨床任務中的適用性仍未被充分探索。為了解決這個問題,我們針對眼部疾病偵測和全身性疾病預測任務,對 RETFound 和三個 DINOv2 模型(大型、基礎、小型)進行微調,並進行一對一的評估,使用八個標準化的開源眼科資料集,以及 Moorfields AlzEye 和 UK Biobank 資料集。DINOv2 大型模型在糖尿病視網膜病變偵測方面優於 RETFound(三個資料集的 AUROC=0.850-0.952,相較於 0.823-0.944,所有 P<=0.007)和多類眼部疾病(AUROC=0.892,相較於 0.846,P<0.001)。在青光眼方面,DINOv2 基礎模型優於 RETFound(AUROC=0.958,相較於 0.940,P<0.001)。相反地,RETFound 在預測心臟衰竭、心肌梗塞和缺血性中風方面優於所有 DINOv2 模型(AUROC=0.732-0.796,相較於 0.663-0.771,所有 P<0.001)。即使使用 10% 的微調資料,這些趨勢仍然持續。這些發現展示了通用和領域專用 FM 各自擅長的場景,突顯了根據任務特定需求調整 FM 選擇,以最佳化臨床表現的重要性。 + +##### **Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning** +2502.06134v1 by Liuqing Chen, Shuhong Xiao, Shixian Ding, Shanhai Hu, Lingyun Sun + +Medical time series are often irregular and face significant missingness, +posing challenges for data analysis and clinical decision-making. Existing +methods typically adopt a single modeling perspective, either treating series +data as sequences or transforming them into image representations for further +classification. In this paper, we propose a joint learning framework that +incorporates both sequence and image representations. We also design three +self-supervised learning strategies to facilitate the fusion of sequence and +image representations, capturing a more generalizable joint representation. The +results indicate that our approach outperforms seven other state-of-the-art +models in three representative real-world clinical datasets. We further +validate our approach by simulating two major types of real-world missingness +through leave-sensors-out and leave-samples-out techniques. The results +demonstrate that our approach is more robust and significantly surpasses other +baselines in terms of classification performance. 
+ +摘要:醫療時間序列通常不規則且會面臨顯著的缺失,對資料分析和臨床決策制定構成挑戰。現有方法通常採用單一建模觀點,將序列資料視為序列或將其轉換為影像表示以進行進一步分類。在本文中,我們提出了一個聯合學習架構,結合序列和影像表示。我們還設計了三種自我監督學習策略,以促進序列和影像表示的融合,捕捉更具概括性的聯合表示。結果表明,我們的做法在三個具有代表性的真實世界臨床資料集中優於其他七個最先進的模型。我們進一步通過留出感測器和留出樣本的技術模擬兩種主要的真實世界缺失類型來驗證我們的做法。結果表明,我們的做法更強大,並且在分類效能方面顯著優於其他基準。 + +##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation** +2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek + +We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS), +an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS +predicts future PHTs using transformer-based architectures. The Adaptive Risk +Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk +probabilities for clinician-defined critical events. ARES incorporates a +personalized explainability module that identifies key clinical factors +influencing risk estimates for individual patients. ARES was evaluated on the +MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its +performance against traditional early warning systems and machine learning +models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs, +with 60% including hospital admissions. The dataset contained over 357 million +tokens. ETHOS outperformed benchmark models in predicting hospital admissions, +ICU admissions, and prolonged hospital stays, achieving superior AUC scores. +ETHOS-based risk estimates demonstrated robustness across demographic subgroups +with strong model reliability, confirmed via calibration curves. The +personalized explainability module provides insights into patient-specific +factors contributing to risk. 
ARES, powered by ETHOS, advances predictive +healthcare AI by providing dynamic, real-time, and personalized risk estimation +with patient-specific explainability to enhance clinician trust. Its +adaptability and superior accuracy position it as a transformative tool for +clinical decision-making, potentially improving patient outcomes and resource +allocation in emergency and inpatient settings. We release the full code at +github.com/ipolharvard/ethos-ares to facilitate future research. + +摘要:我們開發了增強型健康結果模擬轉換器 (ETHOS), +一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS +使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組,可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估,並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT,其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型,並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性,並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估,以及患者特定的可解釋性來增強臨床醫生的信任,從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具,有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼,以利未來的研究。 + +##### **Can ChatGPT Diagnose Alzheimer's Disease?** +2502.06907v1 by Quoc-Toan Nguyen, Linh Le, Xuan-The Tran, Thomas Do, Chin-Teng Lin + +Can ChatGPT diagnose Alzheimer's Disease (AD)? AD is a devastating +neurodegenerative condition that affects approximately 1 in 9 individuals aged +65 and older, profoundly impairing memory and cognitive function. This paper +utilises 9300 electronic health records (EHRs) with data from Magnetic +Resonance Imaging (MRI) and cognitive tests to address an intriguing question: +As a general-purpose task solver, can ChatGPT accurately detect AD using EHRs? +We present an in-depth evaluation of ChatGPT using a black-box approach with +zero-shot and multi-shot methods. 
This study unlocks ChatGPT's capability to +analyse MRI and cognitive test results, as well as its potential as a +diagnostic tool for AD. By automating aspects of the diagnostic process, this +research opens a transformative approach for the healthcare system, +particularly in addressing disparities in resource-limited regions where AD +specialists are scarce. Hence, it offers a foundation for a promising method +for early detection, supporting individuals with timely interventions, which is +paramount for Quality of Life (QoL). + +摘要:ChatGPT 能否診斷出阿茲海默症 (AD)?AD 是一種毀滅性的神經退化性疾病,影響約 1/9 的 65 歲及以上人士,嚴重損害記憶力和認知功能。這篇論文利用了 9300 份電子健康紀錄 (EHR),其中包含磁共振成像 (MRI) 和認知測試的數據,來解決一個有趣的問題:作為一個通用任務解決器,ChatGPT 能否使用 EHR 準確地檢測出 AD?我們使用黑盒方法對 ChatGPT 進行了深入評估,採用零次嘗試和多次嘗試的方法。這項研究揭示了 ChatGPT 分析 MRI 和認知測試結果的能力,以及其作為 AD 診斷工具的潛力。通過自動化診斷過程的各個方面,這項研究為醫療保健系統開啟了一種變革性的方法,特別是在解決資源有限的地區中 AD 專家稀缺的不平等問題方面。因此,它為一種有希望的早期檢測方法奠定了基礎,通過及時干預來支持個人,這對於生活品質 (QoL) 至關重要。 + +##### **Protecting Intellectual Property of EEG-based Neural Networks with Watermarking** +2502.05931v1 by Ahmed Abdelaziz, Ahmed Fathi, Ahmed Fares + +EEG-based neural networks, pivotal in medical diagnosis and brain-computer +interfaces, face significant intellectual property (IP) risks due to their +reliance on sensitive neurophysiological data and resource-intensive +development. Current watermarking methods, particularly those using abstract +trigger sets, lack robust authentication and fail to address the unique +challenges of EEG models. This paper introduces a cryptographic wonder +filter-based watermarking framework tailored for EEG-based neural networks. +Leveraging collision-resistant hashing and public-key encryption, the wonder +filter embeds the watermark during training, ensuring minimal distortion ($\leq +5\%$ drop in EEG task accuracy) and high reliability (100\% watermark +detection). The framework is rigorously evaluated against adversarial attacks, +including fine-tuning, transfer learning, and neuron pruning. 
Results +demonstrate persistent watermark retention, with classification accuracy for +watermarked states remaining above 90\% even after aggressive pruning, while +primary task performance degrades faster, deterring removal attempts. Piracy +resistance is validated by the inability to embed secondary watermarks without +severe accuracy loss ( $>10\%$ in EEGNet and CCNN models). Cryptographic +hashing ensures authentication, reducing brute-force attack success +probabilities. Evaluated on the DEAP dataset across models (CCNN, EEGNet, +TSception), the method achieves $>99.4\%$ null-embedding accuracy, effectively +eliminating false positives. By integrating wonder filters with EEG-specific +adaptations, this work bridges a critical gap in IP protection for +neurophysiological models, offering a secure, tamper-proof solution for +healthcare and biometric applications. The framework's robustness against +adversarial modifications underscores its potential to safeguard sensitive EEG +models while maintaining diagnostic utility. 
+ +摘要:基於 EEG 的神經網路在醫學診斷和腦電腦介面中至關重要,由於其依賴敏感的神經生理資料和資源密集型的開發,面臨重大的智慧財產權 (IP) 風險。目前的浮水印方法,特別是那些使用抽象觸發集的方法,缺乏強健的驗證,且無法解決 EEG 模型的獨特挑戰。本文介紹了一個專為基於 EEG 的神經網路量身打造的密碼學 wonder 濾波器浮水印架構。利用抗碰撞雜湊和公開金鑰加密,wonder 濾波器在訓練期間嵌入浮水印,確保最小的失真(EEG 任務準確度下降 $\leq 5\%$)和高可靠性(100% 浮水印檢測)。該架構針對對抗性攻擊進行了嚴格的評估,包括微調、遷移學習和神經元剪枝。結果證明了持續的浮水印保留,即使在激進的剪枝後,浮水印狀態的分類準確度仍保持在 90% 以上,而主要任務的性能下降得更快,阻止了移除嘗試。盜版抵抗力通過無法嵌入次要浮水印而得到驗證,而不會造成嚴重的準確度損失(在 EEGNet 和 CCNN 模型中 $>10\%$)。密碼學雜湊確保驗證,降低了暴力攻擊成功機率。在 DEAP 資料集上針對模型(CCNN、EEGNet、TSception)進行評估,該方法達到了 $>99.4\%$ 的空嵌入準確度,有效地消除了假陽性。透過將 wonder 濾波器與 EEG 特定的適應相整合,這項工作彌補了神經生理模型 IP 保護中的關鍵差距,為醫療保健和生物特徵應用提供了一個安全、防篡改的解決方案。該架構對抗敵對修改的強健性突顯了其在維護診斷效用的同時保護敏感 EEG 模型的潛力。 + +##### **Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models** +2502.05879v1 by Shiyu Teng, Jiaqing Liu, Rahul Kumar Jain, Shurong Chai, Ruibo Hou, Tomoko Tateyama, Lanfen Lin, Yen-wei Chen + +Depression is one of the leading causes of disability worldwide, posing a +severe burden on individuals, healthcare systems, and society at large. Recent +advancements in Large Language Models (LLMs) have shown promise in addressing +mental health challenges, including the detection of depression through +text-based analysis. However, current LLM-based methods often struggle with +nuanced symptom identification and lack a transparent, step-by-step reasoning +process, making it difficult to accurately classify and explain mental health +conditions. To address these challenges, we propose a Chain-of-Thought +Prompting approach that enhances both the performance and interpretability of +LLM-based depression detection. Our method breaks down the detection process +into four stages: (1) sentiment analysis, (2) binary depression classification, +(3) identification of underlying causes, and (4) assessment of severity. By +guiding the model through these structured reasoning steps, we improve +interpretability and reduce the risk of overlooking subtle clinical indicators. 
+We validate our method on the E-DAIC dataset, where we test multiple +state-of-the-art large language models. Experimental results indicate that our +Chain-of-Thought Prompting technique yields superior performance in both +classification accuracy and the granularity of diagnostic insights, compared to +baseline approaches. + +摘要:憂鬱症是全球殘障的主要原因之一,對個人、醫療保健系統和整個社會造成嚴重負擔。大型語言模型 (LLM) 的最新進展已展現出解決心理健康挑戰的希望,包括透過基於文字的分析來偵測憂鬱症。然而,現有的基於 LLM 的方法通常難以辨識細微的症狀,而且缺乏透明且逐步的推理過程,這使得準確分類和解釋心理健康狀況變得困難。為了應對這些挑戰,我們提出了一種思考鏈提示方法,它增強了基於 LLM 的憂鬱症偵測的效能和可解釋性。我們的這項方法將偵測過程分解為四個階段:(1) 情緒分析,(2) 二元憂鬱症分類,(3) 找出潛在原因,以及 (4) 評估嚴重程度。透過引導模型完成這些結構化的推理步驟,我們提升了可解釋性,並降低了忽略細微臨床指標的風險。我們在 E-DAIC 資料集上驗證了我們的這項方法,並在其中測試了多種最先進的大型語言模型。實驗結果顯示,與基線方法相比,我們的思考鏈提示技術在分類準確度和診斷見解的精細度方面都表現出優異的效能。 + +##### **LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison** +2502.06890v1 by Gabriele De Vito, Filomena Ferrucci, Athanasios Angelakis + +The increasing volume of drug combinations in modern therapeutic regimens +needs reliable methods for predicting drug-drug interactions (DDIs). While +Large Language Models (LLMs) have revolutionized various domains, their +potential in pharmaceutical research, particularly in DDI prediction, remains +largely unexplored. This study thoroughly investigates LLMs' capabilities in +predicting DDIs by uniquely processing molecular structures (SMILES), target +organisms, and gene interaction data as raw text input from the latest DrugBank +dataset. We evaluated 18 different LLMs, including proprietary models (GPT-4, +Claude, Gemini) and open-source variants (from 1.5B to 72B parameters), first +assessing their zero-shot capabilities in DDI prediction. We then fine-tuned +selected models (GPT-4, Phi-3.5 2.7B, Qwen-2.5 3B, Gemma-2 9B, and Deepseek R1 +distilled Qwen 1.5B) to optimize their performance. Our comprehensive +evaluation framework included validation across 13 external DDI datasets, +comparing against traditional approaches such as l2-regularized logistic +regression. 
Fine-tuned LLMs demonstrated superior performance, with Phi-3.5 +2.7B achieving a sensitivity of 0.978 in DDI prediction, with an accuracy of +0.919 on balanced datasets (50% positive, 50% negative cases). This result +represents an improvement over both zero-shot predictions and state-of-the-art +machine-learning methods used for DDI prediction. Our analysis reveals that +LLMs can effectively capture complex molecular interaction patterns and cases +where drug pairs target common genes, making them valuable tools for practical +applications in pharmaceutical research and clinical settings. + +摘要:現代治療方案中藥物組合的數量越來越多,需要可靠的方法來預測藥物間交互作用 (DDI)。儘管大型語言模型 (LLM) 已在各個領域掀起革命,它們在藥物研究中的潛力,特別是在 DDI 預測中的潛力,仍未得到充分探索。本研究通過獨特地處理分子結構 (SMILES)、目標生物和基因交互資料作為來自最新 DrugBank 資料集的原始文字輸入,徹底調查了 LLM 在預測 DDI 中的能力。我們評估了 18 種不同的 LLM,包括專有模型(GPT-4、Claude、Gemini)和開源變體(從 1.5B 到 72B 參數),首先評估它們在 DDI 預測中的零次學習能力。然後,我們微調選定的模型(GPT-4、Phi-3.5 2.7B、Qwen-2.5 3B、Gemma-2 9B 和 Deepseek R1 蒸餾 Qwen 1.5B)以最佳化其效能。我們的全面評估框架包括跨 13 個外部 DDI 資料集進行驗證,並與傳統方法(例如 l2 正則化邏輯迴歸)進行比較。微調後的 LLM 表現出優異的效能,其中 Phi-3.5 2.7B 在 DDI 預測中達到 0.978 的靈敏度,在平衡資料集(50% 正例,50% 反例)上的準確度為 0.919。此結果優於零次學習預測和用於 DDI 預測的最新機器學習方法。我們的分析表明,LLM 可以有效捕捉複雜的分子交互模式和藥物對靶向共同基因的情況,使其成為藥物研究和臨床環境中實用應用的寶貴工具。 + +##### **Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)** +2502.07815v1 by Lokesh Koli, Shubham Kalra, Karanpreet Singh + +Detecting sensitive data such as Personally Identifiable Information (PII) +and Protected Health Information (PHI) is critical for data security platforms. +This study evaluates regex-based pattern matching algorithms and exact-match +search techniques to optimize detection speed, accuracy, and scalability. Our +benchmarking results indicate that Google RE2 provides the best balance of +speed (10-15 ms/MB), memory efficiency (8-16 MB), and accuracy (99.5%) among +regex engines, outperforming PCRE while maintaining broader hardware +compatibility than Hyperscan. 
 For exact matching, Aho-Corasick demonstrated +superior performance (8 ms/MB) and scalability for large datasets. Performance +analysis revealed that regex processing time scales linearly with dataset size +and pattern complexity. A hybrid AI + Regex approach achieved the highest F1 +score (91.6%) by improving recall and minimizing false positives. Device +benchmarking confirmed that our solution maintains efficient CPU and memory +usage on both high-performance and mid-range systems. Despite its +effectiveness, challenges remain, such as limited multilingual support and the +need for regular pattern updates. Future work should focus on expanding +language coverage, integrating data security and privacy management (DSPM) with +data loss prevention (DLP) tools, and enhancing regulatory compliance for +broader global adoption. + +摘要:偵測個人身分資訊 (PII) 和受保護健康資訊 (PHI) 等敏感資料,對於資料安全平台至關重要。本研究評估基於 regex 的模式配對演算法和精確配對搜尋技術,以最佳化偵測速度、準確度和可擴充性。我們的基準測試結果顯示,在 regex 引擎中,Google RE2 在速度 (10-15 ms/MB)、記憶體效率 (8-16 MB) 和準確度 (99.5%) 方面取得最佳平衡,優於 PCRE,同時比 Hyperscan 擁有更廣泛的硬體相容性。對於精確配對,Aho-Corasick 展現出優異的效能 (8 ms/MB) 和大資料集的可擴充性。效能分析顯示,regex 處理時間會隨著資料集大小和模式複雜度線性擴充。混合 AI + Regex 方法透過提升召回率和將假陽性降至最低,達到了最高的 F1 分數 (91.6%)。裝置基準測試確認我們的解決方案在高性能和中階系統上都能維持高效的 CPU 和記憶體使用率。儘管有效,但仍有挑戰存在,例如多語言支援有限,以及需要定期更新模式。未來的研究應著重於擴展語言涵蓋範圍,將資料安全和隱私管理 (DSPM) 與資料遺失防護 (DLP) 工具整合,以及加強法規遵循以利更廣泛的全球採用。 + +##### **WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch** +2502.05783v1 by Ying Lei, Yancheng Cao, Will Wang, Yuanzhe Dong, Changchang Yin, Weidan Cao, Ping Zhang, Jingzhen Yang, Bingsheng Yao, Yifan Peng, Chunhua Weng, Randy Auerbach, Lena Mamykina, Dakuo Wang, Yuntao Wang, Xuhai Xu + +While just-in-time interventions (JITIs) have effectively targeted common +health behaviors, individuals often have unique needs to intervene in personal +undesirable actions that can negatively affect physical, mental, and social +well-being.
 We present WatchGuardian, a smartwatch-based JITI system that +empowers users to define custom interventions for these personal actions with a +small number of samples. For the model to detect new actions based on limited +new data samples, we developed a few-shot learning pipeline that finetuned a +pre-trained inertial measurement unit (IMU) model on public hand-gesture +datasets. We then designed a data augmentation and synthesis process to train +additional classification layers for customization. Our offline evaluation with +26 participants showed that with three, five, and ten examples, our approach +achieved an average accuracy of 76.8%, 84.7%, and 87.7%, and an F1 score of +74.8%, 84.2%, and 87.2%. We then conducted a four-hour intervention study to +compare WatchGuardian against a rule-based intervention. Our results +demonstrated that our system led to a significant reduction by 64.0 ± 22.6% in +undesirable actions, substantially outperforming the baseline by 29.0%. Our +findings underscore the effectiveness of a customizable, AI-driven JITI system +for individuals in need of behavioral intervention in personal undesirable +actions. We envision that our work can inspire broader applications of +user-defined personalized intervention with advanced AI solutions. 
+ +摘要:雖然即時介入(JITIs)有效地針對常見的健康行為,但個人通常有獨特的需求來介入可能會對身心和社會福祉產生負面影響的個人不良行為。我們提出 WatchGuardian,這是一個基於智慧手錶的 JITI 系統,它使用少數樣本讓使用者能夠為這些個人行為定義自訂介入措施。為了讓模型根據有限的新資料樣本偵測新行為,我們開發了一個小樣本學習管道,微調了公共手勢資料集上的預訓練慣性測量單元(IMU)模型。然後,我們設計了一個資料擴充和合成流程,以訓練其他分類層以進行自訂。我們對 26 位參與者進行的離線評估顯示,我們的做法使用三個、五個和十個範例,達到了 76.8%、84.7% 和 87.7% 的平均準確度,以及 74.8%、84.2% 和 87.2% 的 F1 分數。然後,我們進行了一項為時四小時的介入研究,以將 WatchGuardian 與基於規則的介入進行比較。我們的結果表明,我們的系統導致不良行為顯著減少了 64.0 +- 22.6%,大幅優於基線 29.0%。我們的研究結果強調了可自訂、AI 驅動的 JITI 系統對需要行為介入以應對個人不良行為的個人的有效性。我們預計我們的研究可以激勵使用者定義個人化介入的更廣泛應用,並採用先進的 AI 解決方案。 + +##### **RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care** +2502.05740v1 by Ziqi Yang, Yuxuan Lu, Jennifer Bagdasarian, Vedant Das Swain, Ritu Agarwal, Collin Campbell, Waddah Al-Refaire, Jehan El-Bayoumi, Guodong Gao, Dakuo Wang, Bingsheng Yao, Nawar Shara + +Cancer surgery is a key treatment for gastrointestinal (GI) cancers, a group +of cancers that account for more than 35% of cancer-related deaths worldwide, +but postoperative complications are unpredictable and can be life-threatening. +In this paper, we investigate how recent advancements in large language models +(LLMs) can benefit remote patient monitoring (RPM) systems through clinical +integration by designing RECOVER, an LLM-powered RPM system for postoperative +GI cancer care. To closely engage stakeholders in the design process, we first +conducted seven participatory design sessions with five clinical staff and +interviewed five cancer patients to derive six major design strategies for +integrating clinical guidelines and information needs into LLM-based RPM +systems. We then designed and implemented RECOVER, which features an +LLM-powered conversational agent for cancer patients and an interactive +dashboard for clinical staff to enable efficient postoperative RPM. 
Finally, we +used RECOVER as a pilot system to assess the implementation of our design +strategies with four clinical staff and five patients, providing design +implications by identifying crucial design elements, offering insights on +responsible AI, and outlining opportunities for future LLM-powered RPM systems. + +摘要:癌症手術是胃腸道 (GI) 癌症的主要治療方式,這類癌症佔全球癌症相關死亡人數的 35% 以上,但術後併發症無法預測,且可能危及生命。在本文中,我們探討大型語言模型 (LLM) 的近期進展如何透過臨床整合造福遠端病患監控 (RPM) 系統,方法是設計 RECOVER,一個由 LLM 驅動的 RPM 系統,用於術後胃腸道癌症照護。為了讓利害關係人密切參與設計流程,我們首先與五位臨床人員進行七場參與式設計會議,並訪談五位癌症患者,以找出六項整合臨床指南和資訊需求至基於 LLM 的 RPM 系統的主要設計策略。接著,我們設計並實作 RECOVER,其特色在於一個由 LLM 驅動的對話式代理人,供癌症患者使用,以及一個互動式儀表板,供臨床人員使用,以進行有效的術後 RPM。最後,我們使用 RECOVER 作為試點系統,與四位臨床人員和五位患者評估我們設計策略的實作,並透過找出重要的設計元素、提供對負責任 AI 的見解,以及概述未來由 LLM 驅動的 RPM 系統的機會,提出設計意涵。 + +##### **4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis** +2502.05713v1 by An Zhao, Moucheng Xu, Ahmed H. Shahin, Wim Wuyts, Mark G. Jones, Joseph Jacob, Daniel C. Alexander + +Understanding the progression trajectories of diseases is crucial for early +diagnosis and effective treatment planning. This is especially vital for +life-threatening conditions such as Idiopathic Pulmonary Fibrosis (IPF), a +chronic, progressive lung disease with a prognosis comparable to many cancers. +Computed tomography (CT) imaging has been established as a reliable diagnostic +tool for IPF. Accurately predicting future CT scans of early-stage IPF patients +can aid in developing better treatment strategies, thereby improving survival +outcomes. In this paper, we propose 4D Vector Quantised Generative Adversarial +Networks (4D-VQ-GAN), a model capable of generating realistic CT volumes of IPF +patients at any time point. The model is trained using a two-stage approach. In +the first stage, a 3D-VQ-GAN is trained to reconstruct CT volumes. 
In the +second stage, a Neural Ordinary Differential Equation (ODE) based temporal +model is trained to capture the temporal dynamics of the quantised embeddings +generated by the encoder in the first stage. We evaluate different +configurations of our model for generating longitudinal CT scans and compare +the results against ground truth data, both quantitatively and qualitatively. +For validation, we conduct survival analysis using imaging biomarkers derived +from generated CT scans and achieve a C-index comparable to that of biomarkers +derived from the real CT scans. The survival analysis results demonstrate the +potential clinical utility inherent to generated longitudinal CT scans, showing +that they can reliably predict survival outcomes. + +摘要:了解疾病的進程軌跡對於早期診斷和有效的治療計畫至關重要。這對於特發性肺纖維化 (IPF) 等威脅生命的疾病尤其重要,IPF 是一種慢性、進行性肺部疾病,其預後與許多癌症相當。電腦斷層掃描 (CT) 影像已被確立為 IPF 的可靠診斷工具。準確預測早期 IPF 患者的未來 CT 掃描有助於制定更好的治療策略,從而改善存活結果。在本文中,我們提出 4D 向量量化生成對抗網路 (4D-VQ-GAN),這是一個模型,能夠在任何時間點生成 IPF 患者的逼真 CT 體積。該模型使用兩階段方法進行訓練。在第一階段,訓練 3D-VQ-GAN 以重建 CT 體積。在第二階段,訓練基於神經常微分方程 (ODE) 的時間模型,以捕捉第一階段編碼器生成的量化嵌入的時間動態。我們評估了我們的模型的不同配置,以生成縱向 CT 掃描,並在定量和定性方面將結果與真實數據進行比較。為了驗證,我們使用從生成的 CT 掃描中得出的影像生物標記進行存活分析,並獲得與從真實 CT 掃描中得出的生物標記相當的 C 指數。存活分析結果證明了生成縱向 CT 掃描固有的潛在臨床效用,表明它們可以可靠地預測存活結果。 + +##### **KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy** +2502.05651v1 by Hyunjong Kim, Suyeon Lee, Yeongjae Cho, Eunseo Ryu, Yohan Jo, Suran Seong, Sungzoon Cho + +The increasing demand for mental health services has led to the rise of +AI-driven mental health chatbots, though challenges related to privacy, data +collection, and expertise persist. Motivational Interviewing (MI) is gaining +attention as a theoretical basis for boosting expertise in the development of +these chatbots. However, existing datasets are showing limitations for training +chatbots, leading to a substantial demand for publicly available resources in +the field of MI and psychotherapy. 
These challenges are even more pronounced in +non-English languages, where they receive less attention. In this paper, we +propose a novel framework that simulates MI sessions enriched with the +expertise of professional therapists. We train an MI forecaster model that +mimics the behavioral choices of professional therapists and employ Large +Language Models (LLMs) to generate utterances through prompt engineering. Then, +we present KMI, the first synthetic dataset theoretically grounded in MI, +containing 1,000 high-quality Korean Motivational Interviewing dialogues. +Through an extensive expert evaluation of the generated dataset and the +dialogue model trained on it, we demonstrate the quality, expertise, and +practicality of KMI. We also introduce novel metrics derived from MI theory in +order to evaluate dialogues from the perspective of MI. + +摘要:由於對心理健康服務的需求日益增加,導致以人工智慧為基礎的心理健康聊天機器人興起,儘管與隱私、資料蒐集和專業知識相關的挑戰依然存在。動機性訪談 (MI) 正作為提升這些聊天機器人在開發方面專業知識的理論基礎而備受關注。然而,現有的資料集顯示出訓練聊天機器人的限制,導致對 MI 和心理治療領域中公開可用資源的需求大幅增加。這些挑戰在非英語語言中更加明顯,因為它們受到的關注較少。在本文中,我們提出了一個新穎的架構,它模擬了豐富專業治療師專業知識的 MI 課程。我們訓練了一個 MI 預測模型,它模擬了專業治療師的行為選擇,並採用大型語言模型 (LLM) 透過提示工程來產生話語。然後,我們展示了 KMI,這是第一個理論上以 MI 為基礎的合成資料集,其中包含 1,000 個高品質的韓語動機性訪談對話。透過對所產生的資料集和在該資料集上訓練的對話模型進行廣泛的專家評估,我們展示了 KMI 的品質、專業知識和實用性。我們還引入了從 MI 理論中衍生的新指標,以便從 MI 的角度評估對話。 + +##### **ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports** +2502.05638v1 by Aynur Guluzade, Naguib Heiba, Zeyd Boukhers, Florim Hamiti, Jahid Hasan Polash, Yehya Mohamad, Carlos A Velasco + +Europe's healthcare systems require enhanced interoperability and +digitalization, driving a demand for innovative solutions to process legacy +clinical data. This paper presents the results of our project, which aims to +leverage Large Language Models (LLMs) to extract structured information from +unstructured clinical reports, focusing on patient history, diagnoses, +treatments, and other predefined categories. 
We developed a workflow with a +user interface and evaluated LLMs of varying sizes through prompting strategies +and fine-tuning. Our results show that fine-tuned smaller models match or +surpass larger counterparts in performance, offering efficiency for +resource-limited settings. A new dataset of 60,000 annotated English clinical +summaries and 24,000 German translations was validated with automated and +manual checks. The evaluations used ROUGE, BERTScore, and entity-level metrics. +The work highlights the approach's viability and outlines future improvements. + +摘要:歐洲的醫療保健系統需要增強互通性和數位化,這驅動了對創新解決方案的需求,以處理傳統的臨床數據。本文介紹了我們專案的成果,該專案旨在利用大型語言模型 (LLM) 從非結構化的臨床報告中提取結構化的資訊,重點放在病歷、診斷、治療和其他預定義類別上。我們開發了一個具有使用者介面的工作流程,並透過提示策略和微調來評估不同規模的 LLM。我們的結果顯示,微調後的較小模型在效能上與較大的模型相匹配或超越它們,為資源有限的環境提供了效率。一個包含 60,000 個註解英文臨床摘要和 24,000 個德文翻譯的新資料集已透過自動化和手動檢查進行驗證。評估使用了 ROUGE、BERTScore 和實體層級的指標。這項工作突出了這種方法的可行性,並概述了未來的改進。 + diff --git a/docs/index.md b/docs/index.md index 3d4c3c9b19..daca996fb7 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,4474 +1,3751 @@ # arxiv-daily - Automated deployment @ 2025-02-23 20:25:02 Asia/Taipei + Automated deployment @ 2025-02-24 09:08:05 Asia/Taipei > Welcome to contribute! Add your topics and keywords in [`topic.yml`](https://github.com/jawatech/arxiv-daily-in-place/blob/main/database/topic.yml). > You can also view historical data through the [storage](https://github.com/jawatech/arxiv-daily-in-place/blob/main/database/storage). 
## AI -### Medical explainable AI +### LLM |Publish Date|Title|Authors|Homepage|Code| | :---: | :---: | :---: | :---: | :---: | -|**2025-02-19**|**Towards a perturbation-based explanation for medical AI as differentiable programs**|Takeshi Abe et.al.|[2502.14001v1](http://arxiv.org/abs/2502.14001v1)|null| -|**2025-02-14**|**3D ReX: Causal Explanations in 3D Neuroimaging Classification**|Melane Navaratnarajah et.al.|[2502.12181v1](http://arxiv.org/abs/2502.12181v1)|null| -|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null| -|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null| -|**2025-01-27**|**An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases**|Shaheer Ahmad Khan et.al.|[2501.15969v1](http://arxiv.org/abs/2501.15969v1)|null| -|**2025-01-23**|**Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data**|Frederik Pahde et.al.|[2501.13818v1](http://arxiv.org/abs/2501.13818v1)|[link](https://github.com/frederikpahde/medical-ai-safety)| -|**2025-01-19**|**Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model**|Mohaiminul Islam Bhuiyan et.al.|[2501.11094v1](http://arxiv.org/abs/2501.11094v1)|null| -|**2025-01-17**|**SEANN: A Domain-Informed Neural Network for Epidemiological Insights**|Jean-Baptiste Guimbaud et.al.|[2501.10273v1](http://arxiv.org/abs/2501.10273v1)|null| -|**2025-01-16**|**Artificial Intelligence-Driven Clinical Decision Support Systems**|Muhammet Alkan et.al.|[2501.09628v2](http://arxiv.org/abs/2501.09628v2)|null| -|**2025-01-12**|**MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis**|Sadia Kamal 
et.al.|[2501.06887v1](http://arxiv.org/abs/2501.06887v1)|null| -|**2025-01-06**|**Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis**|Mary Ogbuka Kenneth et.al.|[2501.02891v1](http://arxiv.org/abs/2501.02891v1)|null| -|**2024-12-28**|**The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support**|Alessandro De Grandi et.al.|[2412.20068v1](http://arxiv.org/abs/2412.20068v1)|null| -|**2024-12-27**|**A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation**|Jana Zakall et.al.|[2412.19688v1](http://arxiv.org/abs/2412.19688v1)|null| -|**2024-12-23**|**Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models**|Badaru I. Olumuyiwa et.al.|[2412.17527v1](http://arxiv.org/abs/2412.17527v1)|null| -|**2024-12-20**|**Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG**|Hasan Md Tusfiqur Alam et.al.|[2412.16086v2](http://arxiv.org/abs/2412.16086v2)|[link](https://github.com/tifat58/irr-with-cbm-rag)| -|**2024-12-20**|**Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models**|Shamus Sim et.al.|[2412.15748v1](http://arxiv.org/abs/2412.15748v1)|null| -|**2024-12-18**|**Cognition Chain for Explainable Psychological Stress Detection on Social Media**|Xin Wang et.al.|[2412.14009v1](http://arxiv.org/abs/2412.14009v1)|null| -|**2024-11-30**|**2-Factor Retrieval for Improved Human-AI Decision Making in Radiology**|Jim Solomon et.al.|[2412.00372v1](http://arxiv.org/abs/2412.00372v1)|null| -|**2024-11-28**|**Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance**|Philipp Brauner et.al.|[2411.19356v1](http://arxiv.org/abs/2411.19356v1)|null| -|**2024-11-26**|**Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab 
Dataset**|Yujie Dai et.al.|[2411.17645v2](http://arxiv.org/abs/2411.17645v2)|null| -|**2024-11-18**|**Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care**|Jeffrey N. Clark et.al.|[2411.11774v1](http://arxiv.org/abs/2411.11774v1)|null| -|**2024-11-15**|**Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning**|Mohammed Yaseen Jabarulla et.al.|[2411.10255v1](http://arxiv.org/abs/2411.10255v1)|null| -|**2024-11-01**|**Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering**|Mehdi Hosseini Chagahi et.al.|[2411.00916v2](http://arxiv.org/abs/2411.00916v2)|null| -|**2024-10-25**|**A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection**|Muath Alsuhaibani et.al.|[2410.19898v1](http://arxiv.org/abs/2410.19898v1)|null| -|**2024-10-23**|**An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems**|Shruthi Chari et.al.|[2410.17504v1](http://arxiv.org/abs/2410.17504v1)|[link](https://github.com/tetherless-world/metaexplainer)| -|**2024-10-22**|**Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study**|Lukas Hughes-Noehrer et.al.|[2410.16879v1](http://arxiv.org/abs/2410.16879v1)|null| -|**2024-10-19**|**Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer**|Gesa Mittmann et.al.|[2410.15012v1](http://arxiv.org/abs/2410.15012v1)|null| -|**2024-10-15**|**Explainable AI Methods for Multi-Omics Analysis: A Survey**|Ahmad Hussein et.al.|[2410.11910v1](http://arxiv.org/abs/2410.11910v1)|null| -|**2024-10-14**|**Study on the Helpfulness of Explainable Artificial Intelligence**|Tobias Labarta 
et.al.|[2410.11896v1](http://arxiv.org/abs/2410.11896v1)|[link](https://github.com/tlabarta/helpfulnessofxai)| -|**2024-10-12**|**Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health**|Abdullah Mamun et.al.|[2410.09635v1](http://arxiv.org/abs/2410.09635v1)|[link](https://github.com/ab9mamun/aimen)| -|**2024-10-10**|**Artificial intelligence techniques in inherited retinal diseases: A review**|Han Trinh et.al.|[2410.09105v1](http://arxiv.org/abs/2410.09105v1)|null| -|**2024-10-07**|**CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures**|Ekaterina Sviridova et.al.|[2410.05235v2](http://arxiv.org/abs/2410.05235v2)|[link](https://github.com/ixa-ehu/antidote-casimedicos)| -|**2024-10-01**|**Explainable Diagnosis Prediction through Neuro-Symbolic Integration**|Qiuhao Lu et.al.|[2410.01855v2](http://arxiv.org/abs/2410.01855v2)|null| -|**2024-10-01**|**Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare**|Prasenjit Maji et.al.|[2410.00366v1](http://arxiv.org/abs/2410.00366v1)|null| -|**2024-09-20**|**Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study**|Tirtha Chanda et.al.|[2409.13476v1](http://arxiv.org/abs/2409.13476v1)|null| -|**2024-09-19**|**Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data**|Suryansh Vidya et.al.|[2409.15374v1](http://arxiv.org/abs/2409.15374v1)|null| -|**2024-09-19**|**Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition**|Daniel Flores-Araiza et.al.|[2409.12883v1](http://arxiv.org/abs/2409.12883v1)|null| -|**2024-09-18**|**Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques**|Yubo Li et.al.|[2409.12087v3](http://arxiv.org/abs/2409.12087v3)|null| -|**2024-09-13**|**Contextual 
Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases**|Mercy Asiedu et.al.|[2409.09201v3](http://arxiv.org/abs/2409.09201v3)|null| -|**2024-09-09**|**Explainable AI: Definition and attributes of a good explanation for health AI**|Evangelia Kyrimi et.al.|[2409.15338v1](http://arxiv.org/abs/2409.15338v1)|null| -|**2024-08-30**|**Exploring the Effect of Explanation Content and Format on User Comprehension and Trust in Healthcare**|Antonio Rago et.al.|[2408.17401v2](http://arxiv.org/abs/2408.17401v2)|null| -|**2024-08-29**|**A Survey for Large Language Models in Biomedicine**|Chong Wang et.al.|[2409.00133v1](http://arxiv.org/abs/2409.00133v1)|null| -|**2024-08-27**|**Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis**|Francesco Sovrano et.al.|[2408.15121v1](http://arxiv.org/abs/2408.15121v1)|null| -|**2024-08-24**|**Towards Case-based Interpretability for Medical Federated Learning**|Laura Latorre et.al.|[2408.13626v1](http://arxiv.org/abs/2408.13626v1)|null| -|**2024-08-22**|**AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines**|Douwe J. Spaanderman et.al.|[2408.12491v1](http://arxiv.org/abs/2408.12491v1)|null| -|**2024-08-14**|**Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy**|Kimji N. 
Pellano et.al.|[2409.00001v1](http://arxiv.org/abs/2409.00001v1)|null| -|**2024-08-06**|**MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy**|Hanchen David Wang et.al.|[2408.11837v1](http://arxiv.org/abs/2408.11837v1)|null| -|**2024-08-05**|**The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development**|Joshua Morriss et.al.|[2408.05239v1](http://arxiv.org/abs/2408.05239v1)|null| -|**2024-08-05**|**Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns**|Chi Him Ng et.al.|[2408.02709v1](http://arxiv.org/abs/2408.02709v1)|null| -|**2024-08-05**|**Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability**|Masoud Muhammed Hassan et.al.|[2408.02706v1](http://arxiv.org/abs/2408.02706v1)|null| -|**2024-07-26**|**MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI**|Shyam Dongre et.al.|[2407.20284v1](http://arxiv.org/abs/2407.20284v1)|null| -|**2024-07-25**|**Introducing δ-XAI: a novel sensitivity-based method for local AI explanations**|Alessandro De Carlo et.al.|[2407.18343v2](http://arxiv.org/abs/2407.18343v2)|null| -|**2024-07-24**|**Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population**|Nikolaos Ntampakis et.al.|[2407.17324v2](http://arxiv.org/abs/2407.17324v2)|null| -|**2024-07-24**|**Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition**|Michele Fiori et.al.|[2408.06352v1](http://arxiv.org/abs/2408.06352v1)|null| -|**2024-07-21**|**Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions**|Naseem Khan 
et.al.|[2408.03335v1](http://arxiv.org/abs/2408.03335v1)|null| -|**2024-07-18**|**A Comparative Study on Automatic Coding of Medical Letters with Explainability**|Jamie Glen et.al.|[2407.13638v1](http://arxiv.org/abs/2407.13638v1)|[link](https://github.com/Glenj01/Medical-Coding)| -|**2024-07-09**|**Explainable AI for Enhancing Efficiency of DL-based Channel Estimation**|Abdul Karim Gizzini et.al.|[2407.07009v1](http://arxiv.org/abs/2407.07009v1)|null| -|**2024-07-07**|**Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification**|P. N. Karthikayan et.al.|[2407.05440v2](http://arxiv.org/abs/2407.05440v2)|null| -|**2024-07-03**|**A Survey on Trustworthiness in Foundation Models for Medical Image Analysis**|Congzhen Shi et.al.|[2407.15851v2](http://arxiv.org/abs/2407.15851v2)|null| -|**2024-07-01**|**The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data**|Ximing Wen et.al.|[2407.06206v1](http://arxiv.org/abs/2407.06206v1)|null| -|**2024-06-28**|**Can GPT-4 Help Detect Quit Vaping Intentions? 
An Exploration of Automatic Data Annotation Approach**|Sai Krishna Revanth Vuruma et.al.|[2407.00167v1](http://arxiv.org/abs/2407.00167v1)|null| -|**2024-06-25**|**Towards Compositional Interpretability for XAI**|Sean Tull et.al.|[2406.17583v1](http://arxiv.org/abs/2406.17583v1)|null| -|**2024-06-17**|**Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods**|Vincent Olesen et.al.|[2406.12142v2](http://arxiv.org/abs/2406.12142v2)|[link](https://github.com/volesen/slicing-through-bias)| -|**2024-06-11**|**Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health**|Fatemeh Ebrahimzadeh et.al.|[2406.07114v2](http://arxiv.org/abs/2406.07114v2)|null| -|**2024-06-10**|**AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI**|K M Tawsik Jawad et.al.|[2406.06728v2](http://arxiv.org/abs/2406.06728v2)|null| -|**2024-06-10**|**Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook**|Yusif Ibrahimov et.al.|[2406.05984v1](http://arxiv.org/abs/2406.05984v1)|null| -|**2024-06-09**|**Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance**|Zhan Zhang et.al.|[2406.05746v1](http://arxiv.org/abs/2406.05746v1)|null| -|**2024-06-07**|**Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability**|Faseela Abdullakutty et.al.|[2406.12897v1](http://arxiv.org/abs/2406.12897v1)|null| -|**2024-06-04**|**Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection**|Dinuka Sandun Udayantha et.al.|[2406.16908v3](http://arxiv.org/abs/2406.16908v3)|[link](https://github.com/dinuka-1999/braineocare)| -|**2024-06-01**|**Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques**|Samita Bai 
et.al.|[2406.00532v1](http://arxiv.org/abs/2406.00532v1)|null| -|**2024-06-01**|**Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition**|Alaa Nfissi et.al.|[2406.01624v2](http://arxiv.org/abs/2406.01624v2)|[link](https://github.com/alaanfissi/unveiling-hidden-factors-explainable-ai-for-feature-boosting-in-speech-emotion-recognition)| -|**2024-05-31**|**The Explanation Necessity for Healthcare AI**|Michail Mamalakis et.al.|[2406.00216v1](http://arxiv.org/abs/2406.00216v1)|null| -|**2024-05-29**|**Interdisciplinary Expertise to Advance Equitable Explainable AI**|Chloe R. Bennett et.al.|[2406.18563v1](http://arxiv.org/abs/2406.18563v1)|null| -|**2024-05-27**|**"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts**|Hubert D. Zając et.al.|[2407.11978v1](http://arxiv.org/abs/2407.11978v1)|null| -|**2024-05-26**|**Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making**|Min Hun Lee et.al.|[2405.16424v1](http://arxiv.org/abs/2405.16424v1)|null| -|**2024-05-26**|**Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach**|Ziming Liu et.al.|[2405.17502v1](http://arxiv.org/abs/2405.17502v1)|null| -|**2024-05-24**|**Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone**|Catalina Gomez et.al.|[2407.11974v1](http://arxiv.org/abs/2407.11974v1)|null| -|**2024-05-23**|**Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery**|Yingying Fang et.al.|[2406.18552v1](http://arxiv.org/abs/2406.18552v1)|null| -|**2024-05-21**|**The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach**|Mohsen Jozani et.al.|[2405.13099v1](http://arxiv.org/abs/2405.13099v1)|null| -|**2024-05-17**|**ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education**|Harris Bin Munawar 
et.al.|[2405.10645v1](http://arxiv.org/abs/2405.10645v1)|null| -|**2024-05-13**|**Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data**|Camelia Oprea et.al.|[2405.07590v1](http://arxiv.org/abs/2405.07590v1)|null| -|**2024-05-10**|**XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare**|Fatemeh Nazary et.al.|[2405.06270v3](http://arxiv.org/abs/2405.06270v3)|null| -|**2024-05-09**|**To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems**|Miquel Miró-Nicolau et.al.|[2405.05766v1](http://arxiv.org/abs/2405.05766v1)|null| -|**2024-05-05**|**Region-specific Risk Quantification for Interpretable Prognosis of COVID-19**|Zhusi Zhong et.al.|[2405.02815v1](http://arxiv.org/abs/2405.02815v1)|[link](https://github.com/zzs95/RSP_COVID)| -|**2024-04-26**|**Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics**|Francesco Prinzi et.al.|[2405.02334v2](http://arxiv.org/abs/2405.02334v2)|null| -|**2024-04-25**|**Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability**|Yunfei Ge et.al.|[2404.16957v1](http://arxiv.org/abs/2404.16957v1)|null| -|**2024-04-19**|**Explainable AI for Fair Sepsis Mortality Predictive Model**|Chia-Hsuan Chang et.al.|[2404.13139v1](http://arxiv.org/abs/2404.13139v1)|null| -|**2024-04-19**|**Multi Class Depression Detection Through Tweets using Artificial Intelligence**|Muhammad Osama Nusrat et.al.|[2404.13104v1](http://arxiv.org/abs/2404.13104v1)|[link](https://github.com/mnusrat786/masters-thesis)| -|**2024-04-19**|**COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images**|Dmytro Shvetsov et.al.|[2404.12832v2](http://arxiv.org/abs/2404.12832v2)|[link](https://github.com/dmytro-shvetsov/counterfactual-search)| -|**2024-04-15**|**Hybrid Intelligence for Digital 
Humanities**|Victor de Boer et.al.|[2406.15374v1](http://arxiv.org/abs/2406.15374v1)|null| -|**2024-04-14**|**Ethical Framework for Responsible Foundational Models in Medical Imaging**|Abhijit Das et.al.|[2406.11868v1](http://arxiv.org/abs/2406.11868v1)|null| -|**2024-04-09**|**Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis**|Milad Yousefi et.al.|[2404.07239v1](http://arxiv.org/abs/2404.07239v1)|null| -|**2024-04-06**|**Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI**|Taminul Islam et.al.|[2404.04686v1](http://arxiv.org/abs/2404.04686v1)|null| -|**2024-04-05**|**Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI**|Maryam Ahmed et.al.|[2404.03892v3](http://arxiv.org/abs/2404.03892v3)|null| -|**2024-03-30**|**Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives**|Xingrui Gu et.al.|[2404.00320v2](http://arxiv.org/abs/2404.00320v2)|null| -|**2024-03-26**|**Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach**|Andrea Ferrario et.al.|[2403.17873v1](http://arxiv.org/abs/2403.17873v1)|null| -|**2024-03-26**|**Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification**|Han Yuan et.al.|[2403.18871v1](http://arxiv.org/abs/2403.18871v1)|[link](https://github.com/han-yuan-med/template-explanation)| -|**2024-03-03**|**Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures**|Séamus Lankford et.al.|[2403.01580v1](http://arxiv.org/abs/2403.01580v1)|null| -|**2024-02-28**|**Cause and Effect: Can Large Language Models Truly Understand Causality?**|Swagata Ashwani et.al.|[2402.18139v3](http://arxiv.org/abs/2402.18139v3)|null| 
-|**2024-02-28**|**Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina**|Yasin Sadeghi Bazargani et.al.|[2402.18600v1](http://arxiv.org/abs/2402.18600v1)|null| - -#### Abstracts -##### **Towards a perturbation-based explanation for medical AI as differentiable programs** -2502.14001v1 by Takeshi Abe, Yoshiyuki Asai - -Recent advancement in machine learning algorithms reaches a point where -medical devices can be equipped with artificial intelligence (AI) models for -diagnostic support and routine automation in clinical settings. In medicine and -healthcare, there is a particular demand for sufficient and objective -explainability of the outcome generated by AI models. However, AI models are -generally considered as black boxes due to their complexity, and the -computational process leading to their response is often opaque. Although -several methods have been proposed to explain the behavior of models by -evaluating the importance of each feature in discrimination and prediction, -they may suffer from biases and opacities arising from the scale and sampling -protocol of the dataset used for training or testing. To overcome the -shortcomings of existing methods, we explore an alternative approach to provide -an objective explanation of AI models that can be defined independently of the -learning process and does not require additional data. As a preliminary study -for this direction of research, this work examines a numerical availability of -the Jacobian matrix of deep learning models that measures how stably a model -responses against small perturbations added to the input. The indicator, if -available, are calculated from a trained AI model for a given target input. -This is a first step towards a perturbation-based explanation, which will -assist medical practitioners in understanding and interpreting the response of -the AI model in its clinical application. 
- -摘要:機器學習演算法的最新進展已達到一個階段,醫療裝置可以配備人工智慧 (AI) 模型,以在臨床環境中提供診斷支援和例行自動化。在醫學和保健領域,對於 AI 模型產生的結果有足夠且客觀的可解釋性有特別的需求。然而,由於 AI 模型的複雜性,它們通常被視為黑盒子,而導致其反應的運算過程通常是不透明的。儘管已經提出多種方法來解釋模型的行為,方法是評估每個特徵在判別和預測中的重要性,但它們可能會受到訓練或測試所用資料集的規模和抽樣協定的偏差和不透明性的影響。為了克服現有方法的缺點,我們探索一種替代方法,以提供 AI 模型的客觀解釋,這種方法可以獨立於學習過程定義,而且不需要額外的資料。作為這個研究方向的初步研究,這項工作探討了深度學習模型的雅可比矩陣的數值可用性,它衡量了模型對輸入中新增的小擾動的穩定反應程度。如果可用,指標會從訓練好的 AI 模型計算得出,以取得給定的目標輸入。這是基於擾動的解釋的第一步,它將協助醫療從業人員了解和詮釋 AI 模型在其臨床應用中的反應。 - -##### **3D ReX: Causal Explanations in 3D Neuroimaging Classification** -2502.12181v1 by Melane Navaratnarajah, Sophie A. Martin, David A. Kelly, Nathan Blake, Hana Chocker - -Explainability remains a significant problem for AI models in medical -imaging, making it challenging for clinicians to trust AI-driven predictions. -We introduce 3D ReX, the first causality-based post-hoc explainability tool for -3D models. 3D ReX uses the theory of actual causality to generate -responsibility maps which highlight the regions most crucial to the model's -decision. We test 3D ReX on a stroke detection model, providing insight into -the spatial distribution of features relevant to stroke. - -摘要:解釋性仍然是醫療影像中 AI 模型的一大問題,這使得臨床醫生難以信任 AI 驅動的預測。 -我們引入了 3D ReX,這是第一個用於 3D 模型的基於因果關係的事後解釋性工具。3D ReX 使用實際因果關係理論來生成責任圖,該圖突出了對模型決策至關重要的區域。我們在中風檢測模型上測試了 3D ReX,提供了與中風相關特徵的空間分佈的見解。 - -##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration** -2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano - -This paper presents a complete explainable system that interprets a set of -data, abstracts the underlying features and describes them in a natural -language of choice. The system relies on two crucial stages: (i) identifying -emerging properties from data and transforming them into abstract concepts, and -(ii) converting these concepts into natural language. 
Despite the impressive -natural language generation capabilities demonstrated by Large Language Models, -their statistical nature and the intricacy of their internal mechanism still -force us to employ these techniques as black boxes, forgoing trustworthiness. -Developing an explainable pipeline for data interpretation would allow -facilitating its use in safety-critical environments like processing medical -information and allowing non-experts and visually impaired people to access -narrated information. To this end, we believe that the fields of knowledge -representation and automated reasoning research could present a valid -alternative. Expanding on prior research that tackled the first stage (i), we -focus on the second stage, named Concept2Text. Being explainable, data -translation is easily modeled through logic-based rules, once again emphasizing -the role of declarative programming in achieving AI explainability. This paper -explores a Prolog/CLP-based rewriting system to interpret concepts-articulated -in terms of classes and relations, plus common knowledge-derived from a generic -ontology, generating natural language text. Its main features include -hierarchical tree rewritings, modular multilingual generation, support for -equivalent variants across semantic, grammar, and lexical levels, and a -transparent rule-based system. We outline the architecture and demonstrate its -flexibility through some examples capable of generating numerous diverse and -equivalent rewritings based on the input concept. 
- -摘要:這篇論文提出了一個完整的可解釋系統,它可以解釋一組資料,抽象出基礎特徵,並以選擇的自然語言描述它們。系統依賴兩個關鍵階段:(i) 從資料中識別新興屬性,並將它們轉換為抽象概念,以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力,但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子,放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它,例如處理醫療資訊,並允許非專家和視障人士存取敘述資訊。為此,我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上,我們專注於第二階段,稱為 Concept2Text。由於具有可解釋性,資料翻譯很容易透過基於邏輯的規則建模,再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統,以解釋概念,這些概念以類別和關係的形式表達,再加上從通用本体衍生的常識,產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體,以及一個透明的基於規則的系統。我們概述了架構,並透過一些範例展示了它的靈活性,這些範例能夠根據輸入概念生成許多不同的等效重寫。 - -##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation** -2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek - -We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS), -an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS -predicts future PHTs using transformer-based architectures. The Adaptive Risk -Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk -probabilities for clinician-defined critical events. ARES incorporates a -personalized explainability module that identifies key clinical factors -influencing risk estimates for individual patients. ARES was evaluated on the -MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its -performance against traditional early warning systems and machine learning -models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs, -with 60% including hospital admissions. The dataset contained over 357 million -tokens. ETHOS outperformed benchmark models in predicting hospital admissions, -ICU admissions, and prolonged hospital stays, achieving superior AUC scores. 
-ETHOS-based risk estimates demonstrated robustness across demographic subgroups -with strong model reliability, confirmed via calibration curves. The -personalized explainability module provides insights into patient-specific -factors contributing to risk. ARES, powered by ETHOS, advances predictive -healthcare AI by providing dynamic, real-time, and personalized risk estimation -with patient-specific explainability to enhance clinician trust. Its -adaptability and superior accuracy position it as a transformative tool for -clinical decision-making, potentially improving patient outcomes and resource -allocation in emergency and inpatient settings. We release the full code at -github.com/ipolharvard/ethos-ares to facilitate future research. - -摘要:我們開發了增強型健康結果模擬轉換器 (ETHOS), -一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS -使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組,可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估,並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT,其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型,並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性,並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估,以及患者特定的可解釋性來增強臨床醫生的信任,從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具,有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼,以利未來的研究。 - -##### **An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases** -2501.15969v1 by Shaheer Ahmad Khan, Muhammad Usamah Shahid, Ahmad Abdullah, Ibrahim Hashmat, Muddassar Farooq - -This study addresses a critical gap in the healthcare system by developing a -clinically meaningful, practical, and explainable disease surveillance system -for multiple chronic diseases, utilizing routine EHR data from multiple U.S. -practices integrated with CureMD's EMR/EHR system. 
Unlike traditional -systems--using AI models that rely on features from patients' labs--our -approach focuses on routinely available data, such as medical history, vitals, -diagnoses, and medications, to preemptively assess the risks of chronic -diseases in the next year. We trained three distinct models for each chronic -disease: prediction models that forecast the risk of a disease 3, 6, and 12 -months before a potential diagnosis. We developed Random Forest models, which -were internally validated using F1 scores and AUROC as performance metrics and -further evaluated by a panel of expert physicians for clinical relevance based -on inferences grounded in medical knowledge. Additionally, we discuss our -implementation of integrating these models into a practical EMR system. Beyond -using Shapley attributes and surrogate models for explainability, we also -introduce a new rule-engineering framework to enhance the intrinsic -explainability of Random Forests. - -摘要:本研究透過開發一個臨床有意義、實用且可解釋的多重慢性疾病疾病監測系統,來解決醫療保健系統中的重大缺口,利用整合 CureMD 的 EMR/EHR 系統,來自多個美國實務的例行 EHR 資料。與傳統系統不同的是,我們的做法著重在例行可得的資料,例如病歷、生命徵象、診斷和藥物,以預先評估未來一年慢性疾病的風險,而非仰賴病患實驗室特徵的 AI 模型。我們針對每種慢性疾病訓練了三個不同的模型:預測模型,用以預測在潛在診斷前 3、6 和 12 個月的疾病風險。我們開發了隨機森林模型,並使用 F1 分數和 AUROC 作為效能指標,進行內部驗證,並進一步由專家醫師小組根據植基於醫學知識的推論,評估其臨床相關性。此外,我們討論了將這些模型整合到實用 EMR 系統中的實作方式。除了使用 Shapley 屬性和代理模型來解釋外,我們還引進了一個新的規則工程架構,以增強隨機森林的內在可解釋性。 - -##### **Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data** -2501.13818v1 by Frederik Pahde, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek - -Deep neural networks are increasingly employed in high-stakes medical -applications, despite their tendency for shortcut learning in the presence of -spurious correlations, which can have potentially fatal consequences in -practice. Detecting and mitigating shortcut behavior is a challenging task that -often requires significant labeling efforts from domain experts. 
To alleviate -this problem, we introduce a semi-automated framework for the identification of -spurious behavior from both data and model perspective by leveraging insights -from eXplainable Artificial Intelligence (XAI). This allows the retrieval of -spurious data points and the detection of model circuits that encode the -associated prediction rules. Moreover, we demonstrate how these shortcut -encodings can be used for XAI-based sample- and pixel-level data annotation, -providing valuable information for bias mitigation methods to unlearn the -undesired shortcut behavior. We show the applicability of our framework using -four medical datasets across two modalities, featuring controlled and -real-world spurious correlations caused by data artifacts. We successfully -identify and mitigate these biases in VGG16, ResNet50, and contemporary Vision -Transformer models, ultimately increasing their robustness and applicability -for real-world medical tasks. - -摘要:深度神经网络越来越多地用于高风险医疗应用中,尽管它们在存在虚假相关性的情况下倾向于捷径学习,这在实践中可能产生致命的后果。检测和缓解捷径行为是一项艰巨的任务,通常需要领域专家的大量标记工作。为了缓解这个问题,我们引入了一个半自动框架,用于从数据和模型的角度识别虚假行为,方法是利用可解释人工智能 (XAI) 的见解。这允许检索虚假数据点并检测对关联预测规则进行编码的模型电路。此外,我们演示了如何使用这些捷径编码进行基于 XAI 的样本和像素级数据注释,为偏差缓解方法提供有价值的信息,以消除不需要的捷径行为。我们使用跨越两种方式的四个医学数据集展示了我们框架的适用性,这些数据集具有由数据伪像引起的受控和真实世界虚假相关性。我们成功地识别并减轻了 VGG16、ResNet50 和当代 Vision Transformer 模型中的这些偏差,最终提高了它们的鲁棒性和在真实世界医疗任务中的适用性。 - -##### **Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model** -2501.11094v1 by Mohaiminul Islam Bhuiyan, Nur Shazwani Kamarudin, Nur Hafieza Ismail - -Suicidal ideation detection is crucial for preventing suicides, a leading -cause of death worldwide. Many individuals express suicidal thoughts on social -media, offering a vital opportunity for early detection through advanced -machine learning techniques. 
The identification of suicidal ideation in social -media text is improved by utilising a hybrid framework that integrates -Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory -(BiLSTM), enhanced with an attention mechanism. To enhance the interpretability -of the model's predictions, Explainable AI (XAI) methods are applied, with a -particular focus on SHapley Additive exPlanations (SHAP), are incorporated. At -first, the model managed to reach an accuracy of 92.81%. By applying -fine-tuning and early stopping techniques, the accuracy improved to 94.29%. The -SHAP analysis revealed key features influencing the model's predictions, such -as terms related to mental health struggles. This level of transparency boosts -the model's credibility while helping mental health professionals understand -and trust the predictions. This work highlights the potential for improving the -accuracy and interpretability of detecting suicidal tendencies, making a -valuable contribution to the progress of mental health monitoring systems. It -emphasizes the significance of blending powerful machine learning methods with -explainability to develop reliable and impactful mental health solutions. 
- -摘要:自殺意念偵測對於預防自殺至關重要,而自殺是全球主要的死亡原因。許多人在社群媒體上表達自殺念頭,這提供了透過進階機器學習技術進行早期偵測的重要機會。透過整合卷積神經網路 (CNN) 和雙向長短期記憶 (BiLSTM) 的混合架構,並加入注意力機制,可以提升在社群媒體文字中辨識自殺意念的能力。為了加強模型預測的可解釋性,我們採用可解釋人工智慧 (XAI) 方法,特別著重於 SHapley 加法解釋 (SHAP)。一開始,模型成功達到 92.81% 的準確度。透過套用微調和早期停止技術,準確度提升至 94.29%。SHAP 分析揭露了影響模型預測的關鍵特徵,例如與心理健康困境相關的詞彙。這種透明度提升了模型的可信度,同時協助心理健康專業人員理解和信賴預測結果。這項工作突顯了提升偵測自殺傾向的準確度和可解釋性的潛力,為心理健康監控系統的進展做出寶貴的貢獻。它強調了將強大的機器學習方法與可解釋性相結合以開發可靠且有影響力的心理健康解決方案的重要性。 - -##### **SEANN: A Domain-Informed Neural Network for Epidemiological Insights** -2501.10273v1 by Jean-Baptiste Guimbaud, Marc Plantevit, Léa Maître, Rémy Cazabet - -In epidemiology, traditional statistical methods such as logistic regression, -linear regression, and other parametric models are commonly employed to -investigate associations between predictors and health outcomes. However, -non-parametric machine learning techniques, such as deep neural networks -(DNNs), coupled with explainable AI (XAI) tools, offer new opportunities for -this task. Despite their potential, these methods face challenges due to the -limited availability of high-quality, high-quantity data in this field. To -address these challenges, we introduce SEANN, a novel approach for informed -DNNs that leverages a prevalent form of domain-specific knowledge: Pooled -Effect Sizes (PES). PESs are commonly found in published Meta-Analysis studies, -in different forms, and represent a quantitative form of a scientific -consensus. By direct integration within the learning procedure using a custom -loss, we experimentally demonstrate significant improvements in the -generalizability of predictive performances and the scientific plausibility of -extracted relationships compared to a domain-knowledge agnostic neural network -in a scarce and noisy data setting. 
- -摘要:在流行病學中,傳統的統計方法,例如邏輯迴歸、線性迴歸和其他參數模型通常用於調查預測因子與健康結果之間的關聯。然而,非參數機器學習技術,例如深度神經網路 (DNN),結合可解釋的 AI (XAI) 工具,為這項任務提供了新的機會。儘管這些方法具有潛力,但由於該領域缺乏高品質、高數量資料,因此這些方法面臨挑戰。為了應對這些挑戰,我們引入了 SEANN,這是一種新穎的方法,用於獲取知識的 DNN,它利用了一種流行的領域特定知識形式:彙總效應量 (PES)。PES 通常以不同的形式出現在已發表的 Meta 分析研究中,並代表科學共識的量化形式。通過使用自訂損失函數直接整合在學習程序中,我們以實驗方式證明了預測效能的概括性以及與從缺乏領域知識的神經網路中提取的關係相比,科學合理性的顯著提升,且是在稀少且有雜訊的資料設定中。 - -##### **Artificial Intelligence-Driven Clinical Decision Support Systems** -2501.09628v2 by Muhammet Alkan, Idris Zakariyya, Samuel Leighton, Kaushik Bhargav Sivangi, Christos Anagnostopoulos, Fani Deligianni - -As artificial intelligence (AI) becomes increasingly embedded in healthcare -delivery, this chapter explores the critical aspects of developing reliable and -ethical Clinical Decision Support Systems (CDSS). Beginning with the -fundamental transition from traditional statistical models to sophisticated -machine learning approaches, this work examines rigorous validation strategies -and performance assessment methods, including the crucial role of model -calibration and decision curve analysis. The chapter emphasizes that creating -trustworthy AI systems in healthcare requires more than just technical -accuracy; it demands careful consideration of fairness, explainability, and -privacy. The challenge of ensuring equitable healthcare delivery through AI is -stressed, discussing methods to identify and mitigate bias in clinical -predictive models. The chapter then delves into explainability as a cornerstone -of human-centered CDSS. This focus reflects the understanding that healthcare -professionals must not only trust AI recommendations but also comprehend their -underlying reasoning. The discussion advances in an analysis of privacy -vulnerabilities in medical AI systems, from data leakage in deep learning -models to sophisticated attacks against model explanations. 
The text explores -privacy-preservation strategies such as differential privacy and federated -learning, while acknowledging the inherent trade-offs between privacy -protection and model performance. This progression, from technical validation -to ethical considerations, reflects the multifaceted challenges of developing -AI systems that can be seamlessly and reliably integrated into daily clinical -practice while maintaining the highest standards of patient care and data -protection. - -摘要:隨著人工智慧(AI)在醫療保健服務中日益普及,本章探討了開發可靠且符合道德的臨床決策支援系統 (CDSS) 的關鍵面向。從傳統統計模型轉變到複雜機器學習方法的基本原理開始,這項工作探討了嚴謹的驗證策略和效能評估方法,包括模型校準和決策曲線分析的關鍵角色。本章強調,在醫療保健中建立值得信賴的 AI 系統不僅需要技術準確性;它需要仔細考量公平性、可解釋性和隱私。本章強調了透過 AI 確保公平醫療保健服務的挑戰,並討論了識別和減輕臨床預測模型中偏差的方法。接著,本章深入探討可解釋性作為以人為中心的 CDSS 的基石。這種關注反映了對醫療保健專業人員不僅必須信任 AI 建議,還必須理解其背後推理的理解。討論進展到對醫療 AI 系統中隱私漏洞的分析,從深度學習模型中的資料外洩到針對模型解釋的複雜攻擊。本文探討了隱私保護策略,例如差分隱私和聯合學習,同時承認隱私保護和模型效能之間的固有權衡。從技術驗證到道德考量,這種進展反映了開發 AI 系統的多方面挑戰,這些系統可以無縫且可靠地整合到日常臨床實務中,同時維持最高標準的患者照護和資料保護。 - -##### **MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis** -2501.06887v1 by Sadia Kamal, Tim Oates - -As deep learning models gain traction in medical data, ensuring transparent -and trustworthy decision-making is essential. In skin cancer diagnosis, while -advancements in lesion detection and classification have improved accuracy, the -black-box nature of these methods poses challenges in understanding their -decision processes, leading to trust issues among physicians. This study -leverages the CLIP (Contrastive Language-Image Pretraining) model, trained on -different skin lesion datasets, to capture meaningful relationships between -visual features and diagnostic criteria terms. To further enhance transparency, -we propose a method called MedGrad E-CLIP, which builds on gradient-based -E-CLIP by incorporating a weighted entropy mechanism designed for complex -medical imaging like skin lesions. 
This approach highlights critical image -regions linked to specific diagnostic descriptions. The developed integrated -pipeline not only classifies skin lesions by matching corresponding -descriptions but also adds an essential layer of explainability developed -especially for medical data. By visually explaining how different features in -an image relate to diagnostic criteria, this approach demonstrates the -potential of advanced vision-language models in medical image analysis, -ultimately improving transparency, robustness, and trust in AI-driven -diagnostic systems. - -摘要:随着深度学习模型在医学数据中获得关注,确保透明且值得信赖的决策至关重要。在皮肤癌诊断中,虽然病灶检测和分类的进步提高了准确性,但这些方法的黑盒性质对理解其决策过程构成了挑战,导致医生之间的信任问题。本研究利用在不同皮肤病变数据集上训练的 CLIP(对比语言图像预训练)模型,以捕捉视觉特征和诊断标准术语之间的有意义关系。为了进一步提高透明度,我们提出了一种名为 MedGrad E-CLIP 的方法,该方法通过结合专为皮肤病变等复杂医学影像设计的加权熵机制,建立在基于梯度的 E-CLIP 之上。此方法突出了与特定诊断描述相关联的关键图像区域。开发的集成管道不仅通过匹配相应的描述对皮肤病变进行分类,还添加了一层专门为医学数据开发的基本可解释性。通过直观地解释图像中不同特征与诊断标准的关系,这种方法展示了高级视觉语言模型在医学图像分析中的潜力,最终提高了透明度、稳健性和对人工智能驱动的诊断系统的信任。 - -##### **Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis** -2501.02891v1 by Mary Ogbuka Kenneth, Foaad Khosmood, Abbas Edalat - -Humour styles can have either a negative or a positive impact on well-being. -Given the importance of these styles to mental health, significant research has -been conducted on their automatic identification. However, the automated -machine learning models used for this purpose are black boxes, making their -prediction decisions opaque. Clarity and transparency are vital in the field of -mental health. This paper presents an explainable AI (XAI) framework for -understanding humour style classification, building upon previous work in -computational humour analysis. Using the best-performing single model -(ALI+XGBoost) from prior research, we apply comprehensive XAI techniques to -analyse how linguistic, emotional, and semantic features contribute to humour -style classification decisions. 
Our analysis reveals distinct patterns in how -different humour styles are characterised and misclassified, with particular -emphasis on the challenges in distinguishing affiliative humour from other -styles. Through detailed examination of feature importance, error patterns, and -misclassification cases, we identify key factors influencing model decisions, -including emotional ambiguity, context misinterpretation, and target -identification. The framework demonstrates significant utility in understanding -model behaviour, achieving interpretable insights into the complex interplay of -features that define different humour styles. Our findings contribute to both -the theoretical understanding of computational humour analysis and practical -applications in mental health, content moderation, and digital humanities -research. - -摘要:幽默風格對幸福感可能產生負面或正面的影響。 -鑑於這些風格對心理健康的重要性,已經對其自動識別進行了大量研究。然而,用於此目的的自動機器學習模型是黑盒子,使得其預測決策不透明。清晰度和透明度在心理健康領域至關重要。本文提出了一個可解釋的 AI (XAI) 框架,用於理解幽默風格分類,建立在計算幽默分析的先前工作之上。使用先前研究中表現最好的單一模型 (ALI+XGBoost),我們應用全面的 XAI 技術來分析語言、情緒和語義特徵如何影響幽默風格分類決策。我們的分析揭示了不同幽默風格如何被表徵和錯誤分類的不同模式,特別強調了區分聯屬幽默與其他風格的挑戰。通過仔細檢查特徵重要性、錯誤模式和錯誤分類案例,我們確定了影響模型決策的關鍵因素,包括情緒模糊、情境誤解和目標識別。該框架展示了在理解模型行為方面的顯著效用,實現了對定義不同幽默風格的特徵之間複雜相互作用的可解釋見解。我們的發現有助於計算幽默分析的理論理解和心理健康、內容審核和數字人文研究中的實際應用。 - -##### **The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support** -2412.20068v1 by Alessandro De Grandi, Federico Ravenda, Andrea Raballo, Fabio Crestani - -The increasing demand for mental health services has highlighted the need for -innovative solutions, particularly in the realm of psychological conversational -AI, where the availability of sensitive data is scarce. 
In this work, we -explored the development of a system tailored for mental health support with a -novel approach to psychological assessment based on explainable emotional -profiles in combination with empathetic conversational models, offering a -promising tool for augmenting traditional care, particularly where immediate -expertise is unavailable. Our work can be divided into two main parts, -intrinsically connected to each other. First, we present RACLETTE, a -conversational system that demonstrates superior emotional accuracy compared to -state-of-the-art benchmarks in both understanding users' emotional states and -generating empathetic responses during conversations, while progressively -building an emotional profile of the user through their interactions. Second, -we show how the emotional profiles of a user can be used as interpretable -markers for mental health assessment. These profiles can be compared with -characteristic emotional patterns associated with different mental disorders, -providing a novel approach to preliminary screening and support. - -摘要:隨著對心理健康服務需求的增加,凸顯了創新解決方案的需求,特別是在心理對話式人工智慧領域,那裡缺乏敏感資料。在這項工作中,我們探索了開發一個針對心理健康支持的系統,採用一種基於可解釋的情緒特徵的新方法進行心理評估,結合同理心對話模式,提供了一個有前途的工具,用於擴充傳統照護,特別是在無法立即獲得專業知識的情況下。我們的工作可以分為兩個主要部分,彼此內在相關。首先,我們展示了 RACLETTE,一個對話系統,與最先進的基準相比,在理解使用者情緒狀態和在對話中產生同理心回應方面表現出優越的情緒準確性,同時透過他們的互動逐漸建立使用者的情緒特徵。其次,我們展示了使用者的情緒特徵如何可用作心理健康評估的可解釋標記。這些特徵可以與與不同心理疾病相關的典型情緒模式進行比較,提供了一種初步篩選和支持的新方法。 - -##### **A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation** -2412.19688v1 by Jana Zakall, Birgit Pohn, Antonia Graf, Daniel Kovatchki, Arezoo Borji, Ragib Shahriar Islam, Hossam Haick, Heinz Strohmer, Sepideh Hatamikia - -Artificial intelligence (AI) has emerged as a powerful tool to enhance -decision-making and optimize treatment protocols in in vitro fertilization -(IVF). In particular, AI shows significant promise in supporting -decision-making during the ovarian stimulation phase of the IVF process. 
This -review evaluates studies focused on the applications of AI combined with -medical imaging in ovarian stimulation, examining methodologies, outcomes, and -current limitations. Our analysis of 13 studies on this topic reveals that, -while AI algorithms demonstrated notable potential in predicting -optimal hormonal dosages, trigger timing, and oocyte retrieval outcomes, the -medical imaging data utilized predominantly came from two-dimensional (2D) -ultrasound which mainly involved basic quantifications, such as follicle size -and number, with limited use of direct feature extraction or advanced image -analysis techniques. This points to an underexplored opportunity where advanced -image analysis approaches, such as deep learning, and more diverse imaging -modalities, like three-dimensional (3D) ultrasound, could unlock deeper -insights. Additionally, the lack of explainable AI (XAI) in most studies raises -concerns about the transparency and traceability of AI-driven decisions - key -factors for clinical adoption and trust. Furthermore, many studies relied on -single-center designs and small datasets, which limit the generalizability of -their findings. This review highlights the need for integrating advanced -imaging analysis techniques with explainable AI methodologies, as well as the -importance of leveraging multicenter collaborations and larger datasets. -Addressing these gaps has the potential to enhance ovarian stimulation -management, paving the way for efficient, personalized, and data-driven -treatment pathways that improve IVF outcomes. 
- -摘要:人工智慧(AI)已成為增強體外受精(IVF)決策制定和優化治療方案的強大工具。特別是,AI 在支持 IVF 過程中卵巢刺激階段的決策制定方面顯示出顯著的前景。本綜述評估了專注於 AI 結合卵巢刺激中的醫學影像應用、檢驗方法、結果和當前限制的研究。我們對 13 項關於此主題的研究分析顯示,雖然 AI 演算法在預測最佳荷爾蒙劑量、觸發時機和卵子取出結果方面表現出顯著的潛力,但所利用的醫學影像數據主要來自於二次元(2D)超音波,而二次元超音波主要涉及基本量化,例如濾泡大小和數量,且有限使用直接特徵提取或進階影像分析技術。這指向一個尚未探索的機會,例如深度學習等進階影像分析方法,以及更多元的影像模式,例如三維(3D)超音波,可以解鎖更深入的見解。此外,大多數研究缺乏可解釋 AI(XAI),這引起了人們對 AI 驅動決策的透明度和可追溯性的擔憂,而透明度和可追溯性是臨床採用和信任的關鍵因素。此外,許多研究依賴於單中心設計和小型數據集,這限制了其發現的普遍性。本綜述強調了將進階影像分析技術與可解釋 AI 方法整合起來的必要性,以及利用多中心合作和大型數據集的重要性。解決這些差距有可能增強卵巢刺激管理,為有效、個人化和數據驅動的治療途徑鋪平道路,進而改善 IVF 結果。 - -##### **Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models** -2412.17527v1 by Badaru I. Olumuyiwa, The Anh Han, Zia U. Shamszaman - -This research presents an innovative approach to cancer diagnosis and -prediction using explainable Artificial Intelligence (XAI) and deep learning -techniques. With cancer causing nearly 10 million deaths globally in 2020, -early and accurate diagnosis is crucial. Traditional methods often face -challenges in cost, accuracy, and efficiency. Our study develops an AI model -that provides precise outcomes and clear insights into its decision-making -process, addressing the "black box" problem of deep learning models. By -employing XAI techniques, we enhance interpretability and transparency, -building trust among healthcare professionals and patients. Our approach -leverages neural networks to analyse extensive datasets, identifying patterns -for cancer detection. This model has the potential to revolutionise diagnosis -by improving accuracy, accessibility, and clarity in medical decision-making, -possibly leading to earlier detection and more personalised treatment -strategies. Furthermore, it could democratise access to high-quality -diagnostics, particularly in resource-limited settings, contributing to global -health equity. 
The model's applications extend beyond cancer diagnosis, -potentially transforming various aspects of medical decision-making and saving -millions of lives worldwide. - -摘要:本研究提出了一個創新的癌症診斷和預測方法,使用可解釋的人工智慧 (XAI) 和深度學習技術。由於癌症在 2020 年造成全球近 1,000 萬人死亡,因此早期準確的診斷至關重要。傳統方法通常面臨成本、準確性和效率方面的挑戰。我們的研究開發了一個 AI 模型,它提供精確的結果並清楚地了解其決策過程,解決了深度學習模型的「黑箱」問題。通過採用 XAI 技術,我們增強了解釋性和透明度,在醫療專業人員和患者之間建立信任。我們的做法利用神經網路分析廣泛的數據集,識別癌症檢測模式。這個模型有可能通過提高醫療決策的準確性、可及性和清晰度來革新診斷,可能導致更早的檢測和更個性化的治療策略。此外,它可以使更多人獲得高品質的診斷,特別是在資源有限的環境中,有助於全球健康公平。該模型的應用範圍不僅限於癌症診斷,還可能轉變醫療決策的各個方面,並拯救全球數百萬人的生命。 - -##### **Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG** -2412.16086v2 by Hasan Md Tusfiqur Alam, Devansh Srivastav, Md Abdul Kadir, Daniel Sonntag - -Deep learning has advanced medical image classification, but interpretability -challenges hinder its clinical adoption. This study enhances interpretability -in Chest X-ray (CXR) classification by using concept bottleneck models (CBMs) -and a multi-agent Retrieval-Augmented Generation (RAG) system for report -generation. By modeling relationships between visual features and clinical -concepts, we create interpretable concept vectors that guide a multi-agent RAG -system to generate radiology reports, enhancing clinical relevance, -explainability, and transparency. Evaluation of the generated reports using an -LLM-as-a-judge confirmed the interpretability and clinical utility of our -model's outputs. On the COVID-QU dataset, our model achieved 81% classification -accuracy and demonstrated robust report generation performance, with five key -metrics ranging between 84% and 90%. This interpretable multi-agent framework -bridges the gap between high-performance AI and the explainability required for -reliable AI-driven CXR analysis in clinical settings. Our code is available at -https://github.com/tifat58/IRR-with-CBM-RAG.git. 
- -摘要:深度學習已提升醫學影像分類,但可解釋性挑戰阻礙其臨床應用。本研究透過使用概念瓶頸模型 (CBM) 和多代理檢索增強生成 (RAG) 系統進行報告生成,來增強胸部 X 光 (CXR) 分類的可解釋性。透過建模視覺特徵與臨床概念之間的關係,我們建立可解釋的概念向量,引導多代理 RAG 系統生成放射報告,增強臨床相關性、可解釋性和透明度。使用 LLM 作為評審員對生成報告進行評估,確認了我們模型輸出的可解釋性和臨床效用。在 COVID-QU 資料集上,我們的模型達到了 81% 的分類準確率,並展示了穩健的報告生成效能,五項關鍵指標介於 84% 至 90% 之間。這個可解釋的多代理架構彌合了高性能 AI 與臨床環境中可靠的 AI 驅動 CXR 分析所需的解釋性之間的差距。我們的程式碼可於 https://github.com/tifat58/IRR-with-CBM-RAG.git 取得。 - -##### **Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models** -2412.15748v1 by Shamus Sim, Tyrone Chen - -Background: Despite the current ubiquity of Large Language Models (LLMs) -across the medical domain, there is a surprising lack of studies which address -their reasoning behaviour. We emphasise the importance of understanding -reasoning behaviour as opposed to high-level prediction accuracies, since it is -equivalent to explainable AI (XAI) in this context. In particular, achieving -XAI in medical LLMs used in the clinical domain will have a significant impact -across the healthcare sector. Results: Therefore, we define the concept of -reasoning behaviour in the specific context of medical LLMs. We then categorise -and discuss the current state of the art of methods which evaluate reasoning -behaviour in medical LLMs. Finally, we propose theoretical frameworks which can -empower medical professionals or machine learning engineers to gain insight -into the low-level reasoning operations of these previously obscure models. 
-Conclusion: The subsequent increased transparency and trust in medical machine -learning models by clinicians as well as patients will accelerate the -integration, application as well as further development of medical AI for the -healthcare system as a whole - -摘要:背景:儘管大型語言模型 (LLM) 目前在醫療領域無所不在,但令人驚訝的是,探討其推理行為的研究卻相當缺乏。我們強調了解推理行為而非高層級的預測準確度非常重要,因為在這種情況下,這等同於可解釋 AI (XAI)。尤其是在臨床領域中使用的醫療 LLM 中實現 XAI,將對整個醫療保健產業產生重大影響。結果:因此,我們在醫療 LLM 的特定背景下定義了推理行為的概念。接著我們分類並探討當前評估醫療 LLM 中推理行為的方法的最新技術。最後,我們提出理論架構,讓醫療專業人員或機器學習工程師得以深入了解這些先前模糊模型的低層級推理運算。結論:臨床醫生和患者對醫療機器學習模型的透明度和信任度隨之提升,將加速醫療 AI 在整個醫療保健系統中的整合、應用和進一步發展。 - -##### **Cognition Chain for Explainable Psychological Stress Detection on Social Media** -2412.14009v1 by Xin Wang, Boyan Gao, Yi Dai, Lei Cao, Liang Zhao, Yibo Yang, David Clifton - -Stress is a pervasive global health issue that can lead to severe mental -health problems. Early detection offers timely intervention and prevention of -stress-related disorders. The current early detection models perform "black -box" inference suffering from limited explainability and trust which blocks the -real-world clinical application. Thanks to the generative properties introduced -by the Large Language Models (LLMs), the decision and the prediction from such -models are semi-interpretable through the corresponding description. However, -the existing LLMs are mostly trained for general purposes without the guidance -of psychological cognitive theory. To this end, we first highlight the -importance of prior theory with the observation of performance boosted by the -chain-of-thoughts tailored for stress detection. This method termed Cognition -Chain explicates the generation of stress through a step-by-step cognitive -perspective based on cognitive appraisal theory with a progress pipeline: -Stimulus $\rightarrow$ Evaluation $\rightarrow$ Reaction $\rightarrow$ Stress -State, guiding LLMs to provide comprehensive reasoning explanations. 
We further -study the benefits brought by the proposed Cognition Chain format by utilising -it as a synthetic dataset generation template for LLMs instruction-tuning and -introduce CogInstruct, an instruction-tuning dataset for stress detection. This -dataset is developed using a three-stage self-reflective annotation pipeline -that enables LLMs to autonomously generate and refine instructional data. By -instruction-tuning Llama3 with CogInstruct, we develop CogLLM, an explainable -stress detection model. Evaluations demonstrate that CogLLM achieves -outstanding performance while enhancing explainability. Our work contributes a -novel approach by integrating cognitive theories into LLM reasoning processes, -offering a promising direction for future explainable AI research. - -摘要:壓力是一個普遍的全球性健康問題,可能會導致嚴重的精神 -健康問題。早期發現提供及時的干預和預防 -壓力相關疾病。目前的早期發現模型執行「黑 -盒子」推論,存在可解釋性和信任度有限的問題,阻礙了 -現實世界的臨床應用。多虧了大型語言模型 (LLM) 引入的生成屬性,此類 -模型的決策和預測通過對應描述具有半可解釋性。然而, -現有的 LLM 主要針對一般用途進行訓練,沒有心理認知理論的指導。為此,我們首先強調 -先驗理論的重要性,並觀察到針對壓力檢測量身定制的思想鏈提升了性能。這種方法稱為認知 -鏈通過基於認知評估理論的循序漸進的認知視角闡明了壓力的產生,並具有進度管道: -刺激 $\rightarrow$ 評估 $\rightarrow$ 反應 $\rightarrow$ 壓力 -狀態,指導 LLM 提供全面的推理解釋。我們進一步 -通過將其用作 LLM 指令調整的合成數據集生成模板來研究所提出的認知鏈格式帶來的優點,並介紹 CogInstruct,這是一個針對壓力檢測的指令調整數據集。這個 -數據集是使用一個三階段的自省標註管道開發的,使 LLM 能夠自主生成和優化指令數據。通過 -使用 CogInstruct 對 Llama3 進行指令調整,我們開發了 CogLLM,這是一個可解釋的 -壓力檢測模型。評估表明,CogLLM 在提高可解釋性的同時實現了出色的性能。我們的研究通過將認知理論整合到 LLM 推理過程中,提出了一種新穎的方法, -為未來的可解釋人工智能研究提供了一個有希望的方向。 - -##### **2-Factor Retrieval for Improved Human-AI Decision Making in Radiology** -2412.00372v1 by Jim Solomon, Laleh Jalilian, Alexander Vilesov, Meryl Mathew, Tristan Grogan, Arash Bedayat, Achuta Kadambi - -Human-machine teaming in medical AI requires us to understand to what degree -a trained clinician should weigh AI predictions. 
While previous work has shown -the potential of AI assistance at improving clinical predictions, existing -clinical decision support systems either provide no explainability of their -predictions or use techniques like saliency and Shapley values, which do not -allow for physician-based verification. To address this gap, this study -compares previously used explainable AI techniques with a newly proposed -technique termed '2-factor retrieval (2FR)', which is a combination of -interface design and search retrieval that returns similarly labeled data -without processing this data. This results in a 2-factor security blanket -where: (a) correct images need to be retrieved by the AI; and (b) humans should -associate the retrieved images with the current pathology under test. We find -that when tested on chest X-ray diagnoses, 2FR leads to increases in clinician -accuracy, with particular improvements when clinicians are radiologists and -have low confidence in their decision. Our results highlight the importance of -understanding how different modes of human-AI decision making may impact -clinician accuracy in clinical decision support systems. 
- -摘要:人機協作在醫療 AI 中,需要我們理解受過訓練的臨床醫生在多大程度上應重視 AI 預測。雖然先前的研究顯示 AI 輔助在改善臨床預測方面的潛力,但現有的臨床決策支援系統,要不就沒有提供預測的可解釋性,要不就是使用像顯著性和 Shapley 值之類的技術,這些技術不允許基於醫生的驗證。為了解決這個差距,本研究將先前使用的可解釋 AI 技術與一種新提出的稱為「2 因子檢索 (2FR)」的技術進行比較,後者是一種介面設計和搜尋檢索的組合,它會傳回標籤相似的資料,而不會處理這些資料。這會產生一個 2 因子安全機制,其中:(a) 正確的影像需要由 AI 檢索;(b) 人類應將檢索的影像與正在測試中的病理聯想起來。我們發現,當在胸部 X 光診斷上進行測試時,2FR 會提高臨床醫生的準確度,特別是在臨床醫生是放射科醫生且對其決策信心不足時,會有顯著的改善。我們的結果強調了理解人機決策的不同模式如何影響臨床醫生在臨床決策支援系統中的準確性的重要性。 - -##### **Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance** -2411.19356v1 by Philipp Brauner, Felix Glawe, Gian Luca Liehner, Luisa Vervier, Martina Ziefle - -Understanding public perception of artificial intelligence (AI) and the -tradeoffs between potential risks and benefits is crucial, as these perceptions -might shape policy decisions, influence innovation trajectories for successful -market strategies, and determine individual and societal acceptance of AI -technologies. Using a representative sample of 1100 participants from Germany, -this study examines mental models of AI. Participants quantitatively evaluated -71 statements about AI's future capabilities (e.g., autonomous driving, medical -care, art, politics, warfare, and societal divides), assessing the expected -likelihood of occurrence, perceived risks, benefits, and overall value. We -present rankings of these projections alongside visual mappings illustrating -public risk-benefit tradeoffs. While many scenarios were deemed likely, -participants often associated them with high risks, limited benefits, and low -overall value. Across all scenarios, 96.4% ($r^2=96.4\%$) of the variance in -value assessment can be explained by perceived risks ($\beta=-.504$) and -perceived benefits ($\beta=+.710$), with no significant relation to expected -likelihood. 
Demographics and personality traits influenced perceptions of -risks, benefits, and overall evaluations, underscoring the importance of -increasing AI literacy and tailoring public information to diverse user needs. -These findings provide actionable insights for researchers, developers, and -policymakers by highlighting critical public concerns and individual factors -essential to align AI development with individual values. - -摘要:了解公眾對人工智慧 (AI) 的認知以及潛在風險與好處之間的權衡至關重要,因為這些認知可能會影響政策決策、影響成功市場策略的創新軌跡,並決定個人和社會對 AI 技術的接受度。本研究使用來自德國的 1100 名參與者的代表性樣本,探討了 AI 的心智模型。參與者對 71 項關於 AI 未來能力的陳述(例如,自動駕駛、醫療保健、藝術、政治、戰爭和社會分歧)進行了定量評估,評估預期的發生可能性、感知風險、好處和整體價值。我們展示了這些預測的排名,並附上視覺化映射,說明了公眾的風險收益權衡。儘管許多場景被認為是可能的,但參與者通常將它們與高風險、有限的好處和低整體價值聯繫起來。在所有場景中,96.4% ($r^2=96.4\%$) 的價值評估差異可以用感知風險 ($\beta=-.504$) 和感知好處 ($\beta=+.710$) 來解釋,與預期的可能性沒有顯著關係。人口統計和人格特質影響了對風險、好處和整體評估的看法,這凸顯了提高 AI 素養和根據不同的使用者需求調整公共資訊的重要性。這些發現通過強調關鍵的公共關注和與個人價值觀一致的 AI 開發必不可少的個人因素,為研究人員、開發人員和政策制定者提供了可行的見解。 - -##### **Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset** -2411.17645v2 by Yujie Dai, Brian Sullivan, Axel Montout, Amy Dillon, Chris Waller, Peter Acs, Rachel Denholm, Philip Williams, Alastair D Hay, Raul Santos-Rodriguez, Andrew Dowsey - -The use of machine learning and AI on electronic health records (EHRs) holds -substantial potential for clinical insight. However, this approach faces -challenges due to data heterogeneity, sparsity, temporal misalignment, and -limited labeled outcomes. In this context, we leverage a linked EHR dataset of -approximately one million de-identified individuals from Bristol, North -Somerset, and South Gloucestershire, UK, to characterize urinary tract -infections (UTIs). We implemented a data pre-processing and curation pipeline -that transforms the raw EHR data into a structured format suitable for -developing predictive models focused on data fairness, accountability and -transparency. 
Given the limited availability and biases of ground truth UTI -outcomes, we introduce a UTI risk estimation framework informed by clinical -expertise to estimate UTI risk across individual patient timelines. Pairwise -XGBoost models are trained using this framework to differentiate UTI risk -categories with explainable AI techniques applied to identify key predictors -and support interpretability. Our findings reveal differences in clinical and -demographic predictors across risk groups. While this study highlights the -potential of AI-driven insights to support UTI clinical decision-making, -further investigation of patient sub-strata and extensive validation are needed -to ensure robustness and applicability in clinical practice. - -摘要:電子健康紀錄 (EHR) 中機器學習和 AI 的使用對於臨床見解具有相當大的潛力。然而,由於資料異質性、稀疏性、時間錯位和標籤結果有限,此方法面臨挑戰。在此背景下,我們利用來自英國布里斯托、北薩默塞特和南格洛斯特郡約一百萬名去識別個人連結的 EHR 資料集,來描述尿路感染 (UTI)。我們實施了將原始 EHR 資料轉換為結構化格式的資料前處理和整理管線,適合開發專注於資料公平性、問責制和透明度的預測模型。鑑於 UTI 真實結果的可用性有限和偏差,我們引入了由臨床專業知識告知的 UTI 風險評估架構,以估計個別患者時間軸上的 UTI 風險。成對的 XGBoost 模型使用此架構進行訓練,以區分 UTI 風險類別,並應用可解釋的 AI 技術來識別關鍵預測因子並支持可解釋性。我們的研究結果揭示了不同風險群組在臨床和人口統計預測因子上的差異。雖然這項研究強調了 AI 驅動見解在支援 UTI 臨床決策制定方面的潛力,但仍需要進一步調查患者子群體和廣泛驗證,以確保在臨床實務中的穩健性和適用性。 - -##### **Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care** -2411.11774v1 by Jeffrey N. Clark, Matthew Wragg, Emily Nielsen, Miquel Perello-Nieto, Nawid Keshtmand, Michael Ambler, Shiv Sharma, Christopher P. Bourdeaux, Amberly Brigden, Raul Santos-Rodriguez - -There is a growing need to understand how digital systems can support -clinical decision-making, particularly as artificial intelligence (AI) models -become increasingly complex and less human-interpretable. This complexity -raises concerns about trustworthiness, impacting safe and effective adoption of -such technologies. 
Improved understanding of decision-making processes and -requirements for explanations coming from decision support tools is a vital -component in providing effective explainable solutions. This is particularly -relevant in the data-intensive, fast-paced environments of intensive care units -(ICUs). To explore these issues, group interviews were conducted with seven ICU -clinicians, representing various roles and experience levels. Thematic analysis -revealed three core themes: (T1) ICU decision-making relies on a wide range of -factors, (T2) the complexity of patient state is challenging for shared -decision-making, and (T3) requirements and capabilities of AI decision support -systems. We include design recommendations from clinical input, providing -insights to inform future AI systems for intensive care. - -摘要:隨著人工智慧 (AI) 模型變得越來越複雜,且越來越難以被人理解,了解數位系統如何支援臨床決策的需求也日益增加。這種複雜性引發了對可信度的疑慮,影響了此類技術的安全且有效採用。改善對決策制定流程的理解,以及對決策支援工具所提供說明的要求,是提供有效可解釋解決方案的重要組成部分。這在資料密集、快節奏的加護病房 (ICU) 環境中特別相關。為了探討這些問題,對七位 ICU 臨床醫師進行了小組訪談,這些醫師代表了不同的角色和經驗層級。主題分析揭露了三個核心主題:(T1) ICU 決策制定依賴於廣泛的因素,(T2) 病患狀態的複雜性對共同決策制定構成挑戰,以及 (T3) AI 決策支援系統的要求和能力。我們納入了臨床輸入的設計建議,提供見解以提供資訊給未來用於加護的 AI 系統。 - -##### **Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning** -2411.10255v1 by Mohammed Yaseen Jabarulla, Theodor Uden, Thomas Jack, Philipp Beerbaum, Steffen Oeltze-Jafra - -Pediatric heart diseases present a broad spectrum of congenital and acquired -diseases. More complex congenital malformations require a differentiated and -multimodal decision-making process, usually including echocardiography as a -central imaging method. Artificial intelligence (AI) offers considerable -promise for clinicians by facilitating automated interpretation of pediatric -echocardiography data. 
However, adapting AI technologies for pediatric -echocardiography analysis has challenges such as limited public data -availability, data privacy, and AI model transparency. Recently, researchers -have focused on disruptive technologies, such as federated learning (FL) and -explainable AI (XAI), to improve automatic diagnostic and decision support -workflows. This study offers a comprehensive overview of the limitations and -opportunities of AI in pediatric echocardiography, emphasizing the synergistic -workflow and role of XAI and FL, identifying research gaps, and exploring -potential future developments. Additionally, three relevant clinical use cases -demonstrate the functionality of XAI and FL with a focus on (i) view -recognition, (ii) disease classification, (iii) segmentation of cardiac -structures, and (iv) quantitative assessment of cardiac function. - -摘要:小兒心臟疾病呈現先天性與後天性疾病的廣泛光譜。較複雜的先天性畸形需要一個差異化且多模式的決策過程,通常包括超音波檢查作為主要的影像方法。人工智慧 (AI) 為臨床醫生提供了相當大的希望,因為它可以促進小兒超音波檢查資料的自動化解讀。然而,將人工智慧技術應用於小兒超音波檢查分析有許多挑戰,例如有限的公開資料可用性、資料隱私和人工智慧模型透明度。最近,研究人員專注於破壞性技術,例如聯合學習 (FL) 和可解釋人工智慧 (XAI),以改善自動診斷和決策支援工作流程。本研究提供了人工智慧在小兒超音波檢查中的限制和機會的全面概述,強調了 XAI 和 FL 的協同工作流程和角色,找出研究差距並探討潛在的未來發展。此外,三個相關的臨床使用案例展示了 XAI 和 FL 的功能,重點在於 (i) 檢視辨識、(ii) 疾病分類、(iii) 心臟結構分割和 (iv) 心臟功能的量化評估。 - -##### **Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering** -2411.00916v2 by Mehdi Hosseini Chagahi, Saeed Mohammadi Dashtaki, Niloufar Delfan, Nadia Mohammadi, Alireza Samari, Behzad Moshiri, Md. Jalil Piran, Oliver Faust - -Osteoporosis is a common condition that increases fracture risk, especially -in older adults. Early diagnosis is vital for preventing fractures, reducing -treatment costs, and preserving mobility. However, healthcare providers face -challenges like limited labeled data and difficulties in processing medical -images. 
This study presents a novel multi-modal learning framework that -integrates clinical and imaging data to improve diagnostic accuracy and model -interpretability. The model utilizes three pre-trained networks-VGG19, -InceptionV3, and ResNet50-to extract deep features from X-ray images. These -features are transformed using PCA to reduce dimensionality and focus on the -most relevant components. A clustering-based selection process identifies the -most representative components, which are then combined with preprocessed -clinical data and processed through a fully connected network (FCN) for final -classification. A feature importance plot highlights key variables, showing -that Medical History, BMI, and Height were the main contributors, emphasizing -the significance of patient-specific data. While imaging features were -valuable, they had lower importance, indicating that clinical data are crucial -for accurate predictions. This framework promotes precise and interpretable -predictions, enhancing transparency and building trust in AI-driven diagnoses -for clinical integration. - -摘要:骨質疏鬆症是一種常見的疾病,會增加骨折的風險,特別是老年人。早期診斷對於預防骨折、降低治療成本和維持行動能力至關重要。然而,醫療保健提供者面臨著標記數據有限和處理醫學影像困難等挑戰。本研究提出了一個新穎的多模式學習框架,該框架整合了臨床和影像數據,以提高診斷準確性和模型可解釋性。該模型利用三個預訓練的網路,VGG19、InceptionV3 和 ResNet50,從 X 射線影像中提取深度特徵。這些特徵使用 PCA 轉換以降低維度並專注於最相關的組成部分。基於聚類的選擇過程識別出最具代表性的組成部分,然後將這些組成部分與預處理的臨床數據結合,並通過全連接網路 (FCN) 進行最終分類。特徵重要性圖突出了關鍵變數,表明病史、BMI 和身高是主要貢獻因素,強調了患者特定數據的重要性。雖然影像特徵很有價值,但它們的重要性較低,這表明臨床數據對於準確預測至關重要。此框架促进了準確且可解釋的預測,提高了透明度,並建立了對 AI 驅動診斷在臨床整合中的信任。 - -##### **A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection** -2410.19898v1 by Muath Alsuhaibani, Ali Pourramezan Fard, Jian Sun, Farida Far Poor, Peter S. Pressman, Mohammad H. Mahoor - -This review paper explores recent advances in deep learning approaches for -non-invasive cognitive impairment detection. 
We examine various non-invasive -indicators of cognitive decline, including speech and language, facial, and -motoric mobility. The paper provides an overview of relevant datasets, -feature-extracting techniques, and deep-learning architectures applied to this -domain. We have analyzed the performance of different methods across modalities -and observed that speech and language-based methods generally achieved the -highest detection performance. Studies combining acoustic and linguistic -features tended to outperform those using a single modality. Facial analysis -methods showed promise for visual modalities but were less extensively studied. -Most papers focused on binary classification (impaired vs. non-impaired), with -fewer addressing multi-class or regression tasks. Transfer learning and -pre-trained language models emerged as popular and effective techniques, -especially for linguistic analysis. Despite significant progress, several -challenges remain, including data standardization and accessibility, model -explainability, longitudinal analysis limitations, and clinical adaptation. -Lastly, we propose future research directions, such as investigating -language-agnostic speech analysis methods, developing multi-modal diagnostic -systems, and addressing ethical considerations in AI-assisted healthcare. By -synthesizing current trends and identifying key obstacles, this review aims to -guide further development of deep learning-based cognitive impairment detection -systems to improve early diagnosis and ultimately patient outcomes. 
- -摘要:本篇評論探討了深度學習方法在非侵入式認知功能障礙檢測上的最新進展。我們檢視了各種非侵入式的認知衰退指標,包括語言和語言、面部和運動機能。本文概述了與此領域相關的資料集、特徵提取技術和深度學習架構。我們分析了不同方法在不同方式上的表現,並觀察到基於語言和語言的方法通常能達到最高的檢測表現。結合聲學和語言特徵的研究往往優於使用單一方式的研究。面部分析方法顯示出視覺方式的潛力,但研究較少。大多數論文專注於二元分類(受損與未受損),較少探討多類或回歸任務。遷移學習和預訓練語言模型已成為流行且有效的技術,特別是對於語言分析。儘管取得了重大進展,但仍存在一些挑戰,包括資料標準化和可及性、模型可解釋性、縱向分析限制和臨床適應性。最後,我們提出了未來的研究方向,例如調查與語言無關的語音分析方法、開發多模式診斷系統,以及解決人工智慧輔助醫療保健中的倫理考量。透過綜合目前的趨勢和找出關鍵障礙,本篇評論旨在引導深度學習為基礎的認知功能障礙檢測系統的進一步發展,以改善早期診斷,並最終改善患者的治療結果。 - -##### **An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems** -2410.17504v1 by Shruthi Chari - -Explainable Artificial Intelligence (AI) focuses on helping humans understand -the working of AI systems or their decisions and has been a cornerstone of AI -for decades. Recent research in explainability has focused on explaining the -workings of AI models or model explainability. There have also been several -position statements and review papers detailing the needs of end-users for -user-centered explainability but fewer implementations. Hence, this thesis -seeks to bridge some gaps between model and user-centered explainability. We -create an explanation ontology (EO) to represent literature-derived explanation -types via their supporting components. We implement a knowledge-augmented -question-answering (QA) pipeline to support contextual explanations in a -clinical setting. Finally, we are implementing a system to combine explanations -from different AI methods and data modalities. Within the EO, we can represent -fifteen different explanation types, and we have tested these representations -in six exemplar use cases. We find that knowledge augmentations improve the -performance of base large language models in the contextualized QA, and the -performance is variable across disease groups. In the same setting, clinicians -also indicated that they prefer to see actionability as one of the main foci in -explanations. 
In our explanations combination method, we plan to use similarity -metrics to determine the similarity of explanations in a chronic disease -detection setting. Overall, through this thesis, we design methods that can -support knowledge-enabled explanations across different use cases, accounting -for the methods in today's AI era that can generate the supporting components -of these explanations and domain knowledge sources that can enhance them. - -摘要:可解釋人工智慧(AI)專注於協助人類了解 AI 系統運作或其決策,數十年來一直是 AI 的基石。最近的可解釋性研究專注於解釋 AI 模型或模型可解釋性的運作。也有幾份立場聲明和評論論文詳細說明了最終使用者對以使用者為中心的可解釋性的需求,但實作較少。因此,本論文旨在彌補模型和以使用者為中心的可解釋性之間的一些差距。我們建立一個解釋本體(EO)以透過其支援元件來表示從文獻中衍生的解釋類型。我們實作一個知識增強的問答(QA)管線,以在臨床環境中支援情境解釋。最後,我們正在實作一個系統,以結合來自不同 AI 方法和資料模式的解釋。在 EO 中,我們可以表示 15 種不同的解釋類型,並且我們已在六個範例使用案例中測試這些表示。我們發現,知識增強改善了基礎大型語言模型在情境化 QA 中的效能,並且效能因疾病群組而異。在相同的環境中,臨床醫生也表示他們希望將可操作性視為解釋中的主要焦點之一。在我們的解釋組合方法中,我們計畫使用相似性指標來確定慢性病偵測環境中解釋的相似性。總體而言,透過本論文,我們設計了可以在不同使用案例中支援知識啟用解釋的方法,考量到當今 AI 時代中可以產生這些解釋的支援元件和可以增強這些解釋的領域知識來源的方法。 - -##### **Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study** -2410.16879v1 by Lukas Hughes-Noehrer, Leda Channer, Gabriel Strain, Gregory Yates, Richard Body, Caroline Jay - -Objectives: To investigate clinicians' attitudes towards current automated -interpretation of ECG and novel AI technologies and their perception of -computer-assisted interpretation. Materials and Methods: We conducted a series -of interviews with clinicians in the UK. Our study: (i) explores the potential -for AI, specifically future 'human-like' computing approaches, to facilitate -ECG interpretation and support clinical decision making, and (ii) elicits their -opinions about the importance of explainability and trustworthiness of AI -algorithms. 
Results: We performed inductive thematic analysis on interview -transcriptions from 23 clinicians and identified the following themes: (i) a -lack of trust in current systems, (ii) positive attitudes towards future AI -applications and requirements for these, (iii) the relationship between the -accuracy and explainability of algorithms, and (iv) opinions on education, -possible deskilling, and the impact of AI on clinical competencies. Discussion: -Clinicians do not trust current computerised methods, but welcome future 'AI' -technologies. Where clinicians trust future AI interpretation to be accurate, -they are less concerned that it is explainable. They also preferred ECG -interpretation that demonstrated the results of the algorithm visually. Whilst -clinicians do not fear job losses, they are concerned about deskilling and the -need to educate the workforce to use AI responsibly. Conclusion: Clinicians are -positive about the future application of AI in clinical decision-making. -Accuracy is a key factor of uptake and visualisations are preferred over -current computerised methods. This is viewed as a potential means of training -and upskilling, in contrast to the deskilling that automation might be -perceived to bring. - -摘要:目的:調查臨床醫生對目前自動化心電圖解讀和新的人工智慧技術的態度,以及他們對電腦輔助解讀的看法。材料和方法:我們對英國的臨床醫生進行了一系列訪談。我們的研究:(i) 探討人工智慧的潛力,特別是未來的「類人類」運算方法,以促進心電圖解讀並支持臨床決策制定,以及 (ii) 徵求他們對人工智慧演算法的可解釋性和可信度的看法。結果:我們對 23 位臨床醫生的訪談記錄進行了歸納主題分析,並找出以下主題:(i) 對目前系統缺乏信任,(ii) 對未來人工智慧應用和對這些應用的要求持正面態度,(iii) 演算法的準確性和可解釋性之間的關係,以及 (iv) 對教育、可能的技能退化,以及人工智慧對臨床能力的影響的看法。討論:臨床醫生不信任目前的電腦化方法,但歡迎未來的「人工智慧」技術。在臨床醫生相信未來的 AI 解讀準確的情況下,他們不太擔心它是否可解釋。他們也比較喜歡能以視覺方式呈現演算法結果的心電圖解讀。雖然臨床醫生不害怕失業,但他們擔心技能退化,以及需要教育員工負責任地使用人工智慧。結論:臨床醫生對人工智慧在臨床決策制定中的未來應用持正面態度。準確性是採用人工智慧的一個關鍵因素,而視覺化比目前的電腦化方法更受青睞。這被視為一種潛在的培訓和提升技能的方法,與自動化可能帶來的技能退化形成對比。 - -##### **Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer** -2410.15012v1 by Gesa Mittmann, Sara Laiouar-Pedari, Hendrik A. 
Mehrtens, Sarah Haggenmüller, Tabea-Clara Bucher, Tirtha Chanda, Nadine T. Gaisa, Mathias Wagner, Gilbert Georg Klamminger, Tilman T. Rau, Christina Neppl, Eva Maria Compérat, Andreas Gocht, Monika Hämmerle, Niels J. Rupp, Jula Westhoff, Irene Krücken, Maximillian Seidl, Christian M. Schürch, Marcus Bauer, Wiebke Solass, Yu Chun Tam, Florian Weber, Rainer Grobholz, Jaroslaw Augustyniak, Thomas Kalinski, Christian Hörner, Kirsten D. Mertz, Constanze Döring, Andreas Erbersdobler, Gabriele Deubler, Felix Bremmer, Ulrich Sommer, Michael Brodhun, Jon Griffin, Maria Sarah L. Lenon, Kiril Trpkov, Liang Cheng, Fei Chen, Angelique Levi, Guoping Cai, Tri Q. Nguyen, Ali Amin, Alessia Cimadamore, Ahmed Shabaik, Varsha Manucha, Nazeel Ahmad, Nidia Messias, Francesca Sanguedolce, Diana Taheri, Ezra Baraban, Liwei Jia, Rajal B. Shah, Farshid Siadat, Nicole Swarbrick, Kyung Park, Oudai Hassan, Siamak Sakhaie, Michelle R. Downes, Hiroshi Miyamoto, Sean R. Williamson, Tim Holland-Letz, Carolin V. Schneider, Jakob Nikolas Kather, Yuri Tolkach, Titus J. Brinker - -The aggressiveness of prostate cancer, the most common cancer in men -worldwide, is primarily assessed based on histopathological data using the -Gleason scoring system. While artificial intelligence (AI) has shown promise in -accurately predicting Gleason scores, these predictions often lack inherent -explainability, potentially leading to distrust in human-machine interactions. -To address this issue, we introduce a novel dataset of 1,015 tissue microarray -core images, annotated by an international group of 54 pathologists. The -annotations provide detailed localized pattern descriptions for Gleason grading -in line with international guidelines. Utilizing this dataset, we develop an -inherently explainable AI system based on a U-Net architecture that provides -predictions leveraging pathologists' terminology. 
This approach circumvents -post-hoc explainability methods while maintaining or exceeding the performance -of methods trained directly for Gleason pattern segmentation (Dice score: 0.713 -$\pm$ 0.003 trained on explanations vs. 0.691 $\pm$ 0.010 trained on Gleason -patterns). By employing soft labels during training, we capture the intrinsic -uncertainty in the data, yielding strong results in Gleason pattern -segmentation even in the context of high interobserver variability. With the -release of this dataset, we aim to encourage further research into segmentation -in medical tasks with high levels of subjectivity and to advance the -understanding of pathologists' reasoning processes. +|**2025-02-20**|**LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention**|Shang Yang et.al.|[2502.14866v1](http://arxiv.org/abs/2502.14866v1)|null| +|**2025-02-20**|**Interpretable Text Embeddings and Text Similarity Explanation: A Primer**|Juri Opitz et.al.|[2502.14862v1](http://arxiv.org/abs/2502.14862v1)|null| +|**2025-02-20**|**Aligning LLMs to Ask Good Questions A Case Study in Clinical Reasoning**|Shuyue Stella Li et.al.|[2502.14860v1](http://arxiv.org/abs/2502.14860v1)|null| +|**2025-02-20**|**FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling**|Weilin Zhao et.al.|[2502.14856v1](http://arxiv.org/abs/2502.14856v1)|null| +|**2025-02-20**|**Prompt-to-Leaderboard**|Evan Frick et.al.|[2502.14855v1](http://arxiv.org/abs/2502.14855v1)|null| +|**2025-02-20**|**CLIPPER: Compression enables long-context synthetic data generation**|Chau Minh Pham et.al.|[2502.14854v1](http://arxiv.org/abs/2502.14854v1)|null| +|**2025-02-20**|**GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks**|Jianwen Luo et.al.|[2502.14848v1](http://arxiv.org/abs/2502.14848v1)|null| +|**2025-02-20**|**Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation**|Yue Yang 
et.al.|[2502.14846v1](http://arxiv.org/abs/2502.14846v1)|null| +|**2025-02-20**|**Revealing and Mitigating Over-Attention in Knowledge Editing**|Pinzheng Wang et.al.|[2502.14838v1](http://arxiv.org/abs/2502.14838v1)|null| +|**2025-02-20**|**Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs**|Tao Ji et.al.|[2502.14837v1](http://arxiv.org/abs/2502.14837v1)|null| +|**2025-02-20**|**LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models**|Shangqing Tu et.al.|[2502.14834v1](http://arxiv.org/abs/2502.14834v1)|null| +|**2025-02-20**|**Improving the Diffusability of Autoencoders**|Ivan Skorokhodov et.al.|[2502.14831v1](http://arxiv.org/abs/2502.14831v1)|null| +|**2025-02-20**|**Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs**|Danni Liu et.al.|[2502.14830v1](http://arxiv.org/abs/2502.14830v1)|null| +|**2025-02-20**|**Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps**|Martin Tutek et.al.|[2502.14829v1](http://arxiv.org/abs/2502.14829v1)|null| +|**2025-02-20**|**Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison**|Aiswarya Baby et.al.|[2502.14827v1](http://arxiv.org/abs/2502.14827v1)|null| +|**2025-02-20**|**eC-Tab2Text: Aspect-Based Text Generation from e-Commerce Product Tables**|Luis Antonio Gutiérrez Guanilo et.al.|[2502.14820v1](http://arxiv.org/abs/2502.14820v1)|null| +|**2025-02-20**|**Optimizing Model Selection for Compound AI Systems**|Lingjiao Chen et.al.|[2502.14815v1](http://arxiv.org/abs/2502.14815v1)|null| +|**2025-02-20**|**FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis**|Fadillah Maani et.al.|[2502.14807v1](http://arxiv.org/abs/2502.14807v1)|null| +|**2025-02-20**|**From RAG to Memory: Non-Parametric Continual Learning for Large Language Models**|Bernal Jiménez Gutiérrez 
et.al.|[2502.14802v1](http://arxiv.org/abs/2502.14802v1)|[link](https://github.com/osu-nlp-group/hipporag)| +|**2025-02-20**|**A Survey on Text-Driven 360-Degree Panorama Generation**|Hai Wang et.al.|[2502.14799v1](http://arxiv.org/abs/2502.14799v1)|null| +|**2025-02-20**|**Rapid Word Learning Through Meta In-Context Learning**|Wentao Wang et.al.|[2502.14791v1](http://arxiv.org/abs/2502.14791v1)|null| +|**2025-02-20**|**SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features**|Michael Tschannen et.al.|[2502.14786v1](http://arxiv.org/abs/2502.14786v1)|[link](https://github.com/google-research/big_vision)| +|**2025-02-20**|**ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting**|Abhijit Mishra et.al.|[2502.14780v1](http://arxiv.org/abs/2502.14780v1)|null| +|**2025-02-20**|**Harnessing PDF Data for Improving Japanese Large Multimodal Models**|Jeonghun Baek et.al.|[2502.14778v1](http://arxiv.org/abs/2502.14778v1)|null| +|**2025-02-20**|**Making Universal Policies Universal**|Niklas Höpner et.al.|[2502.14777v1](http://arxiv.org/abs/2502.14777v1)|null| +|**2025-02-20**|**SurveyX: Academic Survey Automation via Large Language Models**|Xun Liang et.al.|[2502.14776v1](http://arxiv.org/abs/2502.14776v1)|null| +|**2025-02-20**|**Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning**|Tian Xie et.al.|[2502.14768v1](http://arxiv.org/abs/2502.14768v1)|null| +|**2025-02-20**|**Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis**|Priyanka Kargupta et.al.|[2502.14767v1](http://arxiv.org/abs/2502.14767v1)|null| +|**2025-02-20**|**Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning**|Juraj Vladika et.al.|[2502.14765v1](http://arxiv.org/abs/2502.14765v1)|null| +|**2025-02-20**|**EquivaMap: Leveraging LLMs for Automatic Equivalence Checking of Optimization 
Formulations**|Haotian Zhai et.al.|[2502.14760v1](http://arxiv.org/abs/2502.14760v1)|null| +|**2025-02-20**|**On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation Systems**|Juraj Vladika et.al.|[2502.14759v1](http://arxiv.org/abs/2502.14759v1)|null| +|**2025-02-20**|**MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders**|Maya Varma et.al.|[2502.14753v1](http://arxiv.org/abs/2502.14753v1)|null| +|**2025-02-20**|**TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators**|Jianling Li et.al.|[2502.14752v1](http://arxiv.org/abs/2502.14752v1)|null| +|**2025-02-20**|**Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of LLMs**|Zongxia Li et.al.|[2502.14748v1](http://arxiv.org/abs/2502.14748v1)|null| +|**2025-02-20**|**HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States**|Yilei Jiang et.al.|[2502.14744v1](http://arxiv.org/abs/2502.14744v1)|null| +|**2025-02-20**|**Multi-Agent Coordination across Diverse Applications: A Survey**|Lijun Sun et.al.|[2502.14743v1](http://arxiv.org/abs/2502.14743v1)|null| +|**2025-02-20**|**YOLOv12: A Breakdown of the Key Architectural Features**|Mujadded Al Rabbani Alif et.al.|[2502.14740v1](http://arxiv.org/abs/2502.14740v1)|null| +|**2025-02-20**|**SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines**|M-A-P Team et.al.|[2502.14739v1](http://arxiv.org/abs/2502.14739v1)|null| +|**2025-02-20**|**EAGER-LLM: Enhancing Large Language Models as Recommenders through Exogenous Behavior-Semantic Integration**|Minjie Hong et.al.|[2502.14735v1](http://arxiv.org/abs/2502.14735v1)|null| +|**2025-02-20**|**Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models**|Hongji Li et.al.|[2502.14734v1](http://arxiv.org/abs/2502.14734v1)|null| 
+|**2025-02-20**|**WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models**|Yifu Chen et.al.|[2502.14727v1](http://arxiv.org/abs/2502.14727v1)|null| +|**2025-02-20**|**Entity Framing and Role Portrayal in the News**|Tarek Mahmoud et.al.|[2502.14718v1](http://arxiv.org/abs/2502.14718v1)|null| +|**2025-02-20**|**From Knowledge Generation to Knowledge Verification: Examining the BioMedical Generative Capabilities of ChatGPT**|Ahmed Abdeen Hamed et.al.|[2502.14714v1](http://arxiv.org/abs/2502.14714v1)|null| +|**2025-02-20**|**Data-Efficient Pretraining with Group-Level Data Influence Modeling**|Zichun Yu et.al.|[2502.14709v1](http://arxiv.org/abs/2502.14709v1)|null| +|**2025-02-20**|**Human Misperception of Generative-AI Alignment: A Laboratory Experiment**|Kevin He et.al.|[2502.14708v1](http://arxiv.org/abs/2502.14708v1)|null| +|**2025-02-20**|**Not All Data are Good Labels: On the Self-supervised Labeling for Time Series Forecasting**|Yuxuan Yang et.al.|[2502.14704v1](http://arxiv.org/abs/2502.14704v1)|null| +|**2025-02-20**|**I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search**|Zujie Liang et.al.|[2502.14693v1](http://arxiv.org/abs/2502.14693v1)|null| +|**2025-02-20**|**Bridging the Gap: Transforming Natural Language Questions into SQL Queries via Abstract Query Pattern and Contextual Schema Markup**|Yonghui Kong et.al.|[2502.14682v1](http://arxiv.org/abs/2502.14682v1)|null| +|**2025-02-20**|**How to Get Your LLM to Generate Challenging Problems for Evaluation**|Arkil Patel et.al.|[2502.14678v1](http://arxiv.org/abs/2502.14678v1)|null| +|**2025-02-20**|**Data-Constrained Synthesis of Training Data for De-Identification**|Thomas Vakili et.al.|[2502.14677v1](http://arxiv.org/abs/2502.14677v1)|null| +|**2025-02-20**|**BP-SGCN: Behavioral Pseudo-Label Informed Sparse Graph Convolution Network for Pedestrian and Heterogeneous Trajectory Prediction**|Ruochen Li 
et.al.|[2502.14676v1](http://arxiv.org/abs/2502.14676v1)|null| +|**2025-02-20**|**Explanations of Deep Language Models Explain Language Representations in the Brain**|Maryam Rahimi et.al.|[2502.14671v1](http://arxiv.org/abs/2502.14671v1)|null| +|**2025-02-20**|**AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO**|Alan Dao et.al.|[2502.14669v1](http://arxiv.org/abs/2502.14669v1)|null| +|**2025-02-20**|**InstructAgent: Building User Controllable Recommender via LLM Agent**|Wujiang Xu et.al.|[2502.14662v1](http://arxiv.org/abs/2502.14662v1)|null| +|**2025-02-20**|**Edit Once, Update Everywhere: A Simple Framework for Cross-Lingual Knowledge Synchronization in LLMs**|Yuchen Wu et.al.|[2502.14645v1](http://arxiv.org/abs/2502.14645v1)|null| +|**2025-02-20**|**LIFT: Improving Long Context Understanding of Large Language Models through Long Input Fine-Tuning**|Yansheng Mao et.al.|[2502.14644v1](http://arxiv.org/abs/2502.14644v1)|null| +|**2025-02-20**|**Length-Controlled Margin-Based Preference Optimization without Reference Model**|Gengxu Li et.al.|[2502.14643v1](http://arxiv.org/abs/2502.14643v1)|null| +|**2025-02-20**|**How Far are LLMs from Being Our Digital Twins? 
A Benchmark for Persona-Based Behavior Chain Simulation**|Rui Li et.al.|[2502.14642v1](http://arxiv.org/abs/2502.14642v1)|null| +|**2025-02-20**|**NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization**|Zheyuan Zhang et.al.|[2502.14638v1](http://arxiv.org/abs/2502.14638v1)|null| +|**2025-02-20**|**ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation**|Angxiao Yue et.al.|[2502.14637v1](http://arxiv.org/abs/2502.14637v1)|[link](https://github.com/AngxiaoYue/ReQFlow)| +|**2025-02-20**|**PEARL: Towards Permutation-Resilient LLMs**|Liang Chen et.al.|[2502.14628v1](http://arxiv.org/abs/2502.14628v1)|null| +|**2025-02-20**|**ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors**|Yuguo Yin et.al.|[2502.14627v1](http://arxiv.org/abs/2502.14627v1)|null| +|**2025-02-20**|**Multi-Record Web Page Information Extraction From News Websites**|Alexander Kustenkov et.al.|[2502.14625v1](http://arxiv.org/abs/2502.14625v1)|null| +|**2025-02-20**|**Exploring RWKV for Sentence Embeddings: Layer-wise Analysis and Baseline Comparison for Semantic Similarity**|Xinghan Pan et.al.|[2502.14620v1](http://arxiv.org/abs/2502.14620v1)|[link](https://github.com/PStarH/RWKV-embedding)| +|**2025-02-20**|**Reward Models Identify Consistency, Not Causality**|Yuhui Xu et.al.|[2502.14619v1](http://arxiv.org/abs/2502.14619v1)|null| +|**2025-02-20**|**FIND: Fine-grained Information Density Guided Adaptive Retrieval-Augmented Generation for Disease Diagnosis**|Mingyi Jia et.al.|[2502.14614v1](http://arxiv.org/abs/2502.14614v1)|null| +|**2025-02-20**|**Behavioral Analysis of Information Salience in Large Language Models**|Jan Trienes et.al.|[2502.14613v1](http://arxiv.org/abs/2502.14613v1)|null| +|**2025-02-20**|**A Theory for Conditional Generative Modeling on Multiple Data Sources**|Rongzhen Wang et.al.|[2502.14583v1](http://arxiv.org/abs/2502.14583v1)|null| 
+|**2025-02-20**|**A Statistical Case Against Empirical Human-AI Alignment**|Julian Rodemann et.al.|[2502.14581v1](http://arxiv.org/abs/2502.14581v1)|null| +|**2025-02-20**|**ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification**|Hyunseok Lee et.al.|[2502.14565v1](http://arxiv.org/abs/2502.14565v1)|null| +|**2025-02-20**|**Plan-over-Graph: Towards Parallelable LLM Agent Schedule**|Shiqi Zhang et.al.|[2502.14563v1](http://arxiv.org/abs/2502.14563v1)|null| +|**2025-02-20**|**Can LLMs Predict Citation Intent? An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs**|Paris Koloveas et.al.|[2502.14561v1](http://arxiv.org/abs/2502.14561v1)|null| +|**2025-02-20**|**Less is More: Improving LLM Alignment via Preference Data Selection**|Xun Deng et.al.|[2502.14560v1](http://arxiv.org/abs/2502.14560v1)|null| +|**2025-02-20**|**FUIA: Model Inversion Attack against Federated Unlearning**|Lei Zhou et.al.|[2502.14558v1](http://arxiv.org/abs/2502.14558v1)|null| +|**2025-02-20**|**Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling**|Eric Egli et.al.|[2502.14553v1](http://arxiv.org/abs/2502.14553v1)|null| +|**2025-02-20**|**Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks**|Maya Bechler-Speicher et.al.|[2502.14546v1](http://arxiv.org/abs/2502.14546v1)|null| +|**2025-02-20**|**LLM-based User Profile Management for Recommender System**|Seunghwan Bang et.al.|[2502.14541v1](http://arxiv.org/abs/2502.14541v1)|null| +|**2025-02-20**|**LoRA-GGPO: Mitigating Double Descent in LoRA Fine-Tuning via Gradient-Guided Perturbation Optimization**|Yupeng Chang et.al.|[2502.14538v1](http://arxiv.org/abs/2502.14538v1)|null| +|**2025-02-20**|**CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models**|Zhenhong Zhou et.al.|[2502.14529v1](http://arxiv.org/abs/2502.14529v1)|null| +|**2025-02-20**|**Small Graph Is All You Need: DeepStateGNN for 
Scalable Traffic Forecasting**|Yannick Wölker et.al.|[2502.14525v1](http://arxiv.org/abs/2502.14525v1)|null| +|**2025-02-20**|**Generative adversarial networks vs large language models: a comparative study on synthetic tabular data generation**|Austin A. Barr et.al.|[2502.14523v1](http://arxiv.org/abs/2502.14523v1)|null| +|**2025-02-20**|**MultiSlav: Using Cross-Lingual Knowledge Transfer to Combat the Curse of Multilinguality**|Artur Kot et.al.|[2502.14509v1](http://arxiv.org/abs/2502.14509v1)|null| +|**2025-02-20**|**Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases**|Rena Gao et.al.|[2502.14507v1](http://arxiv.org/abs/2502.14507v1)|null| +|**2025-02-20**|**PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models**|Yu Meng et.al.|[2502.14504v1](http://arxiv.org/abs/2502.14504v1)|null| +|**2025-02-20**|**How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?**|Sergey Pletenev et.al.|[2502.14502v1](http://arxiv.org/abs/2502.14502v1)|null| +|**2025-02-20**|**Towards a Perspectivist Turn in Argument Quality Assessment**|Julia Romberg et.al.|[2502.14501v1](http://arxiv.org/abs/2502.14501v1)|null| +|**2025-02-20**|**MLGym: A New Framework and Benchmark for Advancing AI Research Agents**|Deepak Nathani et.al.|[2502.14499v1](http://arxiv.org/abs/2502.14499v1)|null| +|**2025-02-20**|**Stories that (are) Move(d by) Markets: A Causal Exploration of Market Shocks and Semantic Shifts across Different Partisan Groups**|Felix Drinkall et.al.|[2502.14497v1](http://arxiv.org/abs/2502.14497v1)|null| +|**2025-02-20**|**Enhancing Language Multi-Agent Learning with Multi-Agent Credit Re-Assignment for Interactive Environment Generalization**|Zhitao He et.al.|[2502.14496v1](http://arxiv.org/abs/2502.14496v1)|null| +|**2025-02-20**|**StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following**|Jinnan Li et.al.|[2502.14494v1](http://arxiv.org/abs/2502.14494v1)|null| 
+|**2025-02-20**|**Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk**|Elija Perrier et.al.|[2502.14491v1](http://arxiv.org/abs/2502.14491v1)|null| +|**2025-02-20**|**Temporal Misalignment and Probabilistic Neurons**|Velibor Bojković et.al.|[2502.14487v1](http://arxiv.org/abs/2502.14487v1)|null| +|**2025-02-20**|**How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation**|Zhuohang Long et.al.|[2502.14486v1](http://arxiv.org/abs/2502.14486v1)|null| +|**2025-02-20**|**NLoRA: Nyström-Initiated Low-Rank Adaptation for Large Language Models**|Chenlu Guo et.al.|[2502.14482v1](http://arxiv.org/abs/2502.14482v1)|null| +|**2025-02-20**|**Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression**|Haoyu Wang et.al.|[2502.14477v1](http://arxiv.org/abs/2502.14477v1)|null| +|**2025-02-20**|**Argument-Based Comparative Question Answering Evaluation Benchmark**|Irina Nikishina et.al.|[2502.14476v1](http://arxiv.org/abs/2502.14476v1)|null| +|**2025-02-20**|**Enhancing Smart Environments with Context-Aware Chatbots using Large Language Models**|Aurora Polo-Rodríguez et.al.|[2502.14469v1](http://arxiv.org/abs/2502.14469v1)|null| +|**2025-02-20**|**Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing**|Aviv Bick et.al.|[2502.14458v1](http://arxiv.org/abs/2502.14458v1)|null| +|**2025-02-20**|**Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization**|Ran Ding et.al.|[2502.14456v1](http://arxiv.org/abs/2502.14456v1)|null| +|**2025-02-20**|**Optimal word order for non-causal text generation with Large Language Models: the Spanish case**|Andrea Busto-Castiñeira et.al.|[2502.14451v1](http://arxiv.org/abs/2502.14451v1)|null| -摘要:前列腺癌是全球男性最常見的癌症,其惡性程度主要根據 Gleason 評分系統使用組織病理學數據進行評估。雖然人工智慧 (AI) 在準確預測 Gleason 評分方面已展現潛力,但這些預測通常缺乏內在的可解釋性,可能會導致對人機互動的不信任。為了解決這個問題,我們引進了一個由 54 位病理學家組成的國際團隊註解的 1,015 
個組織微陣列核心影像的新穎資料集。這些註解提供了詳細的局部模式描述,用於符合國際準則的 Gleason 分級。利用這個資料集,我們開發了一個基於 U-Net 架構的內在可解釋 AI 系統,該系統提供了利用病理學家術語進行預測。這種方法規避了事後可解釋性方法,同時維持或超越了直接訓練用於 Gleason 模式分割的方法的效能(Dice 分數:0.713 ± 0.003,訓練於解釋,相對於 0.691 ± 0.010,訓練於 Gleason 模式)。透過在訓練期間採用軟標籤,我們捕捉了資料中的內在不確定性,即使在觀察者間變異性高的情況下,也能在 Gleason 模式分割中產生強大的結果。透過釋出這個資料集,我們旨在鼓勵進一步研究主觀性高的醫療任務中的分割,並增進對病理學家推理過程的理解。 +#### Abstracts +##### **LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention** +2502.14866v1 by Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han -##### **Explainable AI Methods for Multi-Omics Analysis: A Survey** -2410.11910v1 by Ahmad Hussein, Mukesh Prasad, Ali Braytee +Large language models (LLMs) have shown remarkable potential in processing +long sequences, yet efficiently serving these long-context models remains +challenging due to the quadratic computational complexity of attention in the +prefilling stage and the large memory footprint of the KV cache in the decoding +stage. To address these issues, we introduce LServe, an efficient system that +accelerates long-sequence LLM serving via hybrid sparse attention. This method +unifies different hardware-friendly, structured sparsity patterns for both +prefilling and decoding attention into a single framework, where computations +on less important tokens are skipped block-wise. LServe demonstrates the +compatibility of static and dynamic sparsity in long-context LLM attention. +This design enables multiplicative speedups by combining these optimizations. +Specifically, we convert half of the attention heads to nearly free streaming +heads in both the prefilling and decoding stages. Additionally, we find that +only a constant number of KV pages is required to preserve long-context +capabilities, irrespective of context length. We then design a hierarchical KV +page selection policy that dynamically prunes KV pages based on query-centric +similarity. 
On average, LServe accelerates LLM prefilling by up to 2.9x and +decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy. Code is +released at https://github.com/mit-han-lab/omniserve. -Advancements in high-throughput technologies have led to a shift from -traditional hypothesis-driven methodologies to data-driven approaches. -Multi-omics refers to the integrative analysis of data derived from multiple -'omes', such as genomics, proteomics, transcriptomics, metabolomics, and -microbiomics. This approach enables a comprehensive understanding of biological -systems by capturing different layers of biological information. Deep learning -methods are increasingly utilized to integrate multi-omics data, offering -insights into molecular interactions and enhancing research into complex -diseases. However, these models, with their numerous interconnected layers and -nonlinear relationships, often function as black boxes, lacking transparency in -decision-making processes. To overcome this challenge, explainable artificial -intelligence (xAI) methods are crucial for creating transparent models that -allow clinicians to interpret and work with complex data more effectively. This -review explores how xAI can improve the interpretability of deep learning -models in multi-omics research, highlighting its potential to provide -clinicians with clear insights, thereby facilitating the effective application -of such models in clinical settings. 
+摘要:大型語言模型 (LLM) 在處理長序列方面展現出驚人的潛力,但由於預填充階段注意力的二次計算複雜度和解碼階段 KV 快取的大量記憶體使用量,有效提供這些長語境模型服務仍然具有挑戰性。為了解決這些問題,我們引入了 LServe,一個透過混合稀疏注意力加速長序列 LLM 服務的高效系統。此方法將不同的硬體友善的結構化稀疏模式統一到一個單一的架構中,用於預填充和解碼注意力,其中對較不重要的符號的運算會以區塊方式略過。LServe 證明了靜態和動態稀疏性在長語境 LLM 注意力中的相容性。此設計透過結合這些最佳化來實現倍增加速。具體來說,我們將一半的注意力頭轉換為預填充和解碼階段中幾乎免費的串流頭。此外,我們發現僅需要恆定的 KV 頁數來保留長語境功能,而與語境長度無關。然後,我們設計了一個分層式 KV 頁面選擇策略,根據以查詢為中心的相似性動態刪除 KV 頁面。平均而言,LServe 將 LLM 預填充加速了 2.9 倍,將解碼加速了 1.3-2.1 倍,同時維持長語境的準確性。程式碼已發布在 https://github.com/mit-han-lab/omniserve。 -摘要:高通量技術的進步導致從傳統的假設驅動方法轉變為資料驅動的方法。多組學是指整合分析來自多個「組學」的資料,例如基因組學、蛋白質組學、轉錄組學、代謝組學和微生物組學。此方法透過擷取生物資訊的不同層面,能全面了解生物系統。深度學習方法愈來愈常被用於整合多組學資料,提供分子交互作用的洞察力,並加強對複雜疾病的研究。然而,這些模型具有許多相互連接的層級和非線性關係,通常會像黑盒子一樣運作,缺乏決策過程的透明度。為了克服此挑戰,可解釋人工智慧 (xAI) 方法對於建立透明模型至關重要,讓臨床醫生可以更有效地解釋和處理複雜資料。此評論探討 xAI 如何能改善多組學研究中深度學習模型的可解釋性,強調其提供臨床醫生明確見解的潛力,進而促進此類模型在臨床環境中的有效應用。 +##### **Interpretable Text Embeddings and Text Similarity Explanation: A Primer** +2502.14862v1 by Juri Opitz, Lucas Möller, Andrianos Michail, Simon Clematide -##### **Study on the Helpfulness of Explainable Artificial Intelligence** -2410.11896v1 by Tobias Labarta, Elizaveta Kulicheva, Ronja Froelian, Christian Geißler, Xenia Melman, Julian von Klitzing +Text embeddings and text embedding models are a backbone of many AI and NLP +systems, particularly those involving search. However, interpretability +challenges persist, especially in explaining obtained similarity scores, which +is crucial for applications requiring transparency. In this paper, we give a +structured overview of interpretability methods specializing in explaining +those similarity scores, an emerging research area. We study the methods' +individual ideas and techniques, evaluating their potential for improving +interpretability of text embeddings and explaining predicted similarities. 
-Explainable Artificial Intelligence (XAI) is essential for building advanced -machine learning-powered applications, especially in critical domains such as -medical diagnostics or autonomous driving. Legal, business, and ethical -requirements motivate using effective XAI, but the increasing number of -different methods makes it challenging to pick the right ones. Further, as -explanations are highly context-dependent, measuring the effectiveness of XAI -methods without users can only reveal a limited amount of information, -excluding human factors such as the ability to understand it. We propose to -evaluate XAI methods via the user's ability to successfully perform a proxy -task, designed such that a good performance is an indicator for the explanation -to provide helpful information. In other words, we address the helpfulness of -XAI for human decision-making. Further, a user study on state-of-the-art -methods was conducted, showing differences in their ability to generate trust -and skepticism and the ability to judge the rightfulness of an AI decision -correctly. Based on the results, we highly recommend using and extending this -approach for more objective-based human-centered user studies to measure XAI -performance in an end-to-end fashion. +摘要:文字嵌入和文字嵌入模型是許多 AI 和 NLP 系統的骨幹,特別是那些涉及搜尋的系統。然而,可解釋性的挑戰依然存在,特別是在解釋獲得的相似度分數時,這對於需要透明度的應用程式至關重要。在本文中,我們對專門用於解釋這些相似度分數的可解釋性方法給予結構化的概述,這是一個新興的研究領域。我們研究了這些方法的個別想法和技術,評估它們改善文字嵌入的可解釋性和解釋預測相似度的潛力。 -摘要:可解釋人工智慧 (XAI) 對於建構先進的機器學習驅動應用程式至關重要,特別是在醫療診斷或自動駕駛等關鍵領域。法律、商業和倫理要求促使使用有效的 XAI,但數量日益增加的不同方法使得挑選正確的方法具有挑戰性。此外,由於解釋高度依賴於背景,在沒有使用者的情況下衡量 XAI 方法的有效性只能揭示有限的資訊,排除人類因素,例如理解它的能力。我們建議透過使用者成功執行代理任務的能力來評估 XAI 方法,設計使得良好的執行表現是解釋提供有用資訊的指標。換句話說,我們探討 XAI 對人類決策制定的幫助。此外,對最先進的方法進行使用者研究,顯示出它們在產生信任和懷疑的能力以及正確判斷 AI 決策是否正確的能力方面存在差異。根據結果,我們強烈建議使用和擴充這種方法,以進行更多以目標為基礎的人為中心使用者研究,以終端到終端的方式衡量 XAI 效能。 +##### **Aligning LLMs to Ask Good Questions A Case Study in Clinical Reasoning** +2502.14860v1 by Shuyue Stella Li, Jimin Mun, Faeze Brahman, Jonathan S. 
Ilgen, Yulia Tsvetkov, Maarten Sap -##### **Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health** -2410.09635v1 by Abdullah Mamun, Lawrence D. Devoe, Mark I. Evans, David W. Britt, Judith Klein-Seetharaman, Hassan Ghasemzadeh +Large language models (LLMs) often fail to ask effective questions under +uncertainty, making them unreliable in domains where proactive +information-gathering is essential for decision-making. We present ALFA, a +framework that improves LLM question-asking by (i) decomposing the notion of a +"good" question into a set of theory-grounded attributes (e.g., clarity, +relevance), (ii) controllably synthesizing attribute-specific question +variations, and (iii) aligning models via preference-based optimization to +explicitly learn to ask better questions along these fine-grained attributes. +Focusing on clinical reasoning as a case study, we introduce the MediQ-AskDocs +dataset, composed of 17k real-world clinical interactions augmented with 80k +attribute-specific preference pairs of follow-up questions, as well as a novel +expert-annotated interactive healthcare QA task to evaluate question-asking +abilities. Models aligned with ALFA reduce diagnostic errors by 56.6% on +MediQ-AskDocs compared to SOTA instruction-tuned LLMs, with a question-level +win-rate of 64.4% and strong generalizability. Our findings suggest that +explicitly guiding question-asking with structured, fine-grained attributes +offers a scalable path to improve LLMs, especially in expert application +domains. -Early detection of intrapartum risk enables interventions to potentially -prevent or mitigate adverse labor outcomes such as cerebral palsy. Currently, -there is no accurate automated system to predict such events to assist with -clinical decision-making.
To fill this gap, we propose "Artificial Intelligence -(AI) for Modeling and Explaining Neonatal Health" (AIMEN), a deep learning -framework that not only predicts adverse labor outcomes from maternal, fetal, -obstetrical, and intrapartum risk factors but also provides the model's -reasoning behind the predictions made. The latter can provide insights into -what modifications in the input variables of the model could have changed the -predicted outcome. We address the challenges of imbalance and small datasets by -synthesizing additional training data using Adaptive Synthetic Sampling -(ADASYN) and Conditional Tabular Generative Adversarial Networks (CTGAN). AIMEN -uses an ensemble of fully-connected neural networks as the backbone for its -classification with the data augmentation supported by either ADASYN or CTGAN. -AIMEN, supported by CTGAN, outperforms AIMEN supported by ADASYN in -classification. AIMEN can predict a high risk for adverse labor outcomes with -an average F1 score of 0.784. It also provides counterfactual explanations that -can be achieved by changing 2 to 3 attributes on average. Resources available: -https://github.com/ab9mamun/AIMEN. 
+摘要:大型語言模型 (LLM) 經常在不確定性下無法提出有效問題,這使得它們在主動收集資訊對於決策制定至關重要的領域中不可靠。我們提出 ALFA,一個透過 (i) 將「良好」問題的概念分解成一組以理論為基礎的屬性(例如,清晰度、相關性),(ii) 可控地合成屬性特定的問題變體,以及 (iii) 透過基於偏好的最佳化調整模型,明確學習沿著這些細緻屬性提出更好的問題,來改善 LLM 提問的架構。專注於臨床推理作為案例研究,我們引入了 MediQ-AskDocs 資料集,由 17k 個真實世界的臨床互動組成,並增加了 80k 個屬性特定的後續問題偏好配對,以及一個由專家註解的互動式醫療保健問答任務來評估提問能力。與 SOTA 指令調整的 LLM 相比,與 ALFA 對齊的模型將 MediQ-AskDocs 上的診斷錯誤減少了 56.6%,問題層級的勝率為 64.4%,並且具有很強的普遍性。我們的研究結果表明,明確地以結構化、細緻的屬性來引導提問,提供了一條可擴充的途徑來改善 LLM,特別是在專家應用領域。 -摘要:產程中風險的早期偵測有助於進行干預措施,以預防或減輕不利的生產結果,例如腦性麻痺。目前,沒有準確的自動化系統可以預測此類事件,以協助臨床決策。為了填補這一空白,我們提出「用於建模和解釋新生兒健康的人工智慧」(AIMEN),這是一個深度學習架構,它不僅可以根據孕產婦、胎兒、產科和產程風險因素預測不利的生產結果,還能提供模型做出預測背後的原因。後者可以提供見解,說明模型輸入變數中的哪些修改可能會改變預測結果。我們透過使用適應性合成抽樣 (ADASYN) 和條件表格生成對抗網路 (CTGAN) 來合成額外的訓練資料,以解決不平衡和小型資料集的挑戰。AIMEN 使用全連接神經網路的集合作為其分類的骨幹,並透過 ADASYN 或 CTGAN 支援資料擴充。由 CTGAN 支援的 AIMEN 在分類方面優於由 ADASYN 支援的 AIMEN。AIMEN 可以預測不利的生產結果的高風險,平均 F1 分數為 0.784。它還提供反事實解釋,可透過平均變更 2 至 3 個屬性來達成。可用資源:https://github.com/ab9mamun/AIMEN。 +##### **FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling** +2502.14856v1 by Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, Yuxiang Huang, Kaihuo Zhang, Weilun Zhao, Yuxuan Li, Jianyong Wang, Zhiyuan Liu, Maosong Sun -##### **Artificial intelligence techniques in inherited retinal diseases: A review** -2410.09105v1 by Han Trinh, Jordan Vice, Jason Charng, Zahra Tajbakhsh, Khyber Alam, Fred K. Chen, Ajmal Mian +Speculative sampling has emerged as an important technique for accelerating +the auto-regressive generation process of large language models (LLMs) by +utilizing a draft-then-verify mechanism to produce multiple tokens per forward +pass. While state-of-the-art speculative sampling methods use only a single +layer and a language modeling (LM) head as the draft model to achieve +impressive layer compression, their efficiency gains are substantially reduced +for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. 
+To address this, we present FR-Spec, a frequency-ranked speculative sampling +framework that optimizes draft candidate selection through vocabulary space +compression. By constraining the draft search to a frequency-prioritized token +subset, our method reduces LM Head computation overhead by 75% while ensuring +the equivalence of the final output distribution. Experiments across multiple +datasets demonstrate an average of 1.12$\times$ speedup over the +state-of-the-art speculative sampling method EAGLE-2. -Inherited retinal diseases (IRDs) are a diverse group of genetic disorders -that lead to progressive vision loss and are a major cause of blindness in -working-age adults. The complexity and heterogeneity of IRDs pose significant -challenges in diagnosis, prognosis, and management. Recent advancements in -artificial intelligence (AI) offer promising solutions to these challenges. -However, the rapid development of AI techniques and their varied applications -have led to fragmented knowledge in this field. This review consolidates -existing studies, identifies gaps, and provides an overview of AI's potential -in diagnosing and managing IRDs. It aims to structure pathways for advancing -clinical applications by exploring AI techniques like machine learning and deep -learning, particularly in disease detection, progression prediction, and -personalized treatment planning. Special focus is placed on the effectiveness -of convolutional neural networks in these areas. Additionally, the integration -of explainable AI is discussed, emphasizing its importance in clinical settings -to improve transparency and trust in AI-based systems. The review addresses the -need to bridge existing gaps in focused studies on AI's role in IRDs, offering -a structured analysis of current AI techniques and outlining future research -directions. 
It concludes with an overview of the challenges and opportunities -in deploying AI for IRDs, highlighting the need for interdisciplinary -collaboration and the continuous development of robust, interpretable AI models -to advance clinical applications. +摘要:推測取樣已成為一種重要的技術,可用於透過利用先起草後驗證的機制來加速大型語言模型 (LLM) 的自迴歸生成過程,並在每次前向傳遞中產生多個代幣。儘管最先進的推測取樣方法只使用單一層和語言建模 (LM) 頭作為起草模型,以達成令人印象深刻的層壓縮,但對於大型詞彙表 LLM(例如詞彙表包含 128k 個代幣的 Llama-3-8B),其效率提升會大幅降低。為了解決這個問題,我們提出了 FR-Spec,這是一種頻率排序推測取樣架構,它透過詞彙空間壓縮來最佳化起草候選選取。我們的這個方法透過將起草搜尋限制在優先於頻率的代幣子集中,將 LM 頭部運算開銷減少了 75%,同時確保最終輸出分佈的等效性。透過多個資料集的實驗證明,與最先進的推測取樣方法 EAGLE-2 相比,平均提速了 1.12 倍。 -摘要:遺傳性視網膜疾病 (IRD) 是一組多樣化的遺傳疾病, -會導致視力逐漸喪失,是工作年齡成人失明的主要原因。IRD 的複雜性和異質性對診斷、預後和管理提出了重大挑戰。最近人工智能 (AI) 的進步為這些挑戰提供了有希望的解決方案。 -然而,AI 技術的快速發展及其多種應用導致了該領域的知識分散。本綜述整合了現有研究,找出差距,並概述了 AI 在診斷和管理 IRD 中的潛力。它旨在通過探索機器學習和深度學習等 AI 技術,特別是在疾病檢測、進程預測和個性化治療計劃中,為推進臨床應用構建途徑。特別關注這些領域中卷積神經網路的有效性。此外,討論了可解釋 AI 的整合,強調了其在臨床環境中提高透明度和對基於 AI 的系統的信任的重要性。該綜述解決了彌合 AI 在 IRD 中作用的重點研究中現有差距的必要性,提供了對當前 AI 技術的結構化分析,並概述了未來的研究方向。最後概述了在 IRD 中部署 AI 的挑戰和機遇,強調了跨學科合作和持續開發強大、可解釋的 AI 模型以推進臨床應用的必要性。 +##### **Prompt-to-Leaderboard** +2502.14855v1 by Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anastasios N. Angelopoulos, Ion Stoica -##### **CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures** -2410.05235v2 by Ekaterina Sviridova, Anar Yeginbergen, Ainara Estarrona, Elena Cabrio, Serena Villata, Rodrigo Agerri +Large language model (LLM) evaluations typically rely on aggregated metrics +like accuracy or human preference, averaging across users and prompts. This +averaging obscures user- and prompt-specific variations in model performance. +To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces +leaderboards specific to a prompt. The core idea is to train an LLM taking +natural language prompts as input to output a vector of Bradley-Terry +coefficients which are then used to predict the human preference vote. 
The +resulting prompt-dependent leaderboards allow for unsupervised task-specific +evaluation, optimal routing of queries to models, personalization, and +automated evaluation of model strengths and weaknesses. Data from Chatbot Arena +suggest that P2L better captures the nuanced landscape of language model +performance than the averaged leaderboard. Furthermore, our findings suggest +that P2L's ability to produce prompt-specific evaluations follows a power law +scaling similar to that observed in LLMs themselves. In January 2025, the +router we trained based on this methodology achieved the \#1 spot in the +Chatbot Arena leaderboard. Our code is available at this GitHub link: +https://github.com/lmarena/p2l. -Explaining Artificial Intelligence (AI) decisions is a major challenge -nowadays in AI, in particular when applied to sensitive scenarios like medicine -and law. However, the need to explain the rationale behind decisions is a main -issue also for human-based deliberation as it is important to justify -\textit{why} a certain decision has been taken. Resident medical doctors for -instance are required not only to provide a (possibly correct) diagnosis, but -also to explain how they reached a certain conclusion. Developing new tools to -aid residents to train their explanation skills is therefore a central -objective of AI in education. In this paper, we follow this direction, and we -present, to the best of our knowledge, the first multilingual dataset for -Medical Question Answering where correct and incorrect diagnoses for a clinical -case are enriched with a natural language explanation written by doctors. 
These -explanations have been manually annotated with argument components (i.e., -premise, claim) and argument relations (i.e., attack, support), resulting in -the Multilingual CasiMedicos-Arg dataset which consists of 558 clinical cases -in four languages (English, Spanish, French, Italian) with explanations, where -we annotated 5021 claims, 2313 premises, 2431 support relations, and 1106 -attack relations. We conclude by showing how competitive baselines perform over -this challenging dataset for the argument mining task. +摘要:大型語言模型 (LLM) 評估通常依賴於彙總的指標,例如準確性或人類偏好,平均值跨使用者和提示。此平均值模糊了使用者和提示特定的模型效能變異。為了解決此問題,我們提出提示到排行榜 (P2L),一種產生特定於提示的排行榜的方法。核心概念是訓練 LLM,將自然語言提示作為輸入,以輸出 Bradley-Terry 係數向量,然後用於預測人類偏好投票。產生的提示相關排行榜允許無監督任務特定評估、最佳查詢路由至模型、個人化以及模型優缺點的自動化評估。來自 Chatbot Arena 的資料表明,P2L 比平均排行榜更能捕捉語言模型效能的細微變化。此外,我們的研究結果表明,P2L 產生提示特定評估的能力遵循類似於 LLM 本身觀察到的冪律縮放。2025 年 1 月,我們根據此方法訓練的路由器在 Chatbot Arena 排行榜中獲得了第一名。我們的程式碼可在 GitHub 連結取得:https://github.com/lmarena/p2l。 -摘要:解釋人工智慧 (AI) 的決策是現在 AI 的一項重大挑戰,特別是應用於像醫學和法律等敏感情境時。然而,解釋決策背後理由的需求也是基於人類的考量的一個主要問題,因為有必要證明為什麼做出某個決策。例如,住院醫師不僅需要提供(可能是正確的)診斷,還需要解釋他們如何達成某個結論。因此,開發新的工具來幫助住院醫師訓練他們的解釋技巧是教育中 AI 的一項核心目標。在本文中,我們遵循這個方向,並且根據我們的了解,提出第一個多語言醫學問答資料集,其中臨床病例的正確和不正確診斷都附有由醫生撰寫的自然語言解釋。這些解釋已使用論證組成(即前提、主張)和論證關係(即攻擊、支持)進行手動註解,產生多語言 CasiMedicos-Arg 資料集,其中包含 558 個具有解釋的四種語言(英語、西班牙語、法語、義大利語)的臨床病例,我們註解了 5021 個主張、2313 個前提、2431 個支持關係和 1106 個攻擊關係。我們最後展示了競爭基準如何針對論證探勘任務執行此具挑戰性的資料集。 +##### **CLIPPER: Compression enables long-context synthetic data generation** +2502.14854v1 by Chau Minh Pham, Yapei Chang, Mohit Iyyer -##### **Explainable Diagnosis Prediction through Neuro-Symbolic Integration** -2410.01855v2 by Qiuhao Lu, Rui Li, Elham Sagheb, Andrew Wen, Jinlian Wang, Liwei Wang, Jungwei W. Fan, Hongfang Liu +LLM developers are increasingly reliant on synthetic data, but generating +high-quality data for complex long-context reasoning tasks remains challenging. 
+We introduce CLIPPER, a compression-based approach for generating synthetic +data tailored to narrative claim verification - a task that requires reasoning +over a book to verify a given claim. Instead of generating claims directly from +the raw text of the book, which results in artifact-riddled claims, CLIPPER +first compresses the book into chapter outlines and book summaries and then +uses these intermediate representations to generate complex claims and +corresponding chain-of-thoughts. Compared to naive approaches, CLIPPER produces +claims that are more valid, grounded, and complex. Using CLIPPER, we construct +a dataset of 19K synthetic book claims paired with their source texts and +chain-of-thought reasoning, and use it to fine-tune three open-weight models. +Our best model achieves breakthrough results on narrative claim verification +(from 28% to 76% accuracy on our test set) and sets a new state-of-the-art for +sub-10B models on the NoCha leaderboard. Further analysis shows that our models +generate more detailed and grounded chain-of-thought reasoning while also +improving performance on other narrative understanding tasks (e.g., +NarrativeQA). -Diagnosis prediction is a critical task in healthcare, where timely and -accurate identification of medical conditions can significantly impact patient -outcomes. Traditional machine learning and deep learning models have achieved -notable success in this domain but often lack interpretability which is a -crucial requirement in clinical settings. In this study, we explore the use of -neuro-symbolic methods, specifically Logical Neural Networks (LNNs), to develop -explainable models for diagnosis prediction. Essentially, we design and -implement LNN-based models that integrate domain-specific knowledge through -logical rules with learnable thresholds. 
Our models, particularly -$M_{\text{multi-pathway}}$ and $M_{\text{comprehensive}}$, demonstrate superior -performance over traditional models such as Logistic Regression, SVM, and -Random Forest, achieving higher accuracy (up to 80.52\%) and AUROC scores (up -to 0.8457) in the case study of diabetes prediction. The learned weights and -thresholds within the LNN models provide direct insights into feature -contributions, enhancing interpretability without compromising predictive -power. These findings highlight the potential of neuro-symbolic approaches in -bridging the gap between accuracy and explainability in healthcare AI -applications. By offering transparent and adaptable diagnostic models, our work -contributes to the advancement of precision medicine and supports the -development of equitable healthcare solutions. Future research will focus on -extending these methods to larger and more diverse datasets to further validate -their applicability across different medical conditions and populations. 
+摘要:LLM 開發人員越來越依賴合成資料,但為複雜的長語境推理任務生成高品質資料仍然具有挑戰性。我們引入了 CLIPPER,一種基於壓縮的方法,用於生成針對敘事性聲明驗證量身打造的合成資料,這項任務需要對一本書進行推理才能驗證給定的聲明。CLIPPER 沒有直接從書籍的原始文字生成聲明,這會產生充滿人工製品的聲明,而是先將書籍壓縮成章節大綱和書籍摘要,然後使用這些中間表示來生成複雜的聲明和對應的思維鏈。與天真的方法相比,CLIPPER 產生的聲明更有效、更有根據且更複雜。使用 CLIPPER,我們構建了一個包含 19K 個合成書籍聲明及其原始文字和思維鏈推理的資料集,並用於微調三個開放權重模型。我們最好的模型在敘事性聲明驗證方面取得了突破性的結果(在我們的測試集中準確率從 28% 提升到 76%),並在 NoCha 排行榜上為低於 10B 的模型設定了新的技術水準。進一步的分析表明,我們的模型生成了更詳細且有根據的思維鏈推理,同時也提高了其他敘事理解任務(例如 NarrativeQA)的效能。 -摘要:診斷預測是醫療保健中的關鍵任務,及時且準確地識別醫療狀況會顯著影響患者的結果。傳統的機器學習和深度學習模型已在這個領域取得顯著成功,但通常缺乏可解釋性,這在臨床環境中是一項關鍵要求。在本研究中,我們探討了神經符號方法的應用,特別是邏輯神經網路 (LNN),以開發用於診斷預測的可解釋模型。基本上,我們設計並實作了基於 LNN 的模型,這些模型透過具有可學習閾值的邏輯規則整合領域特定知識。我們的模型,特別是 $M_{\text{multi-pathway}}$ 和 $M_{\text{comprehensive}}$,表現出優於傳統模型(例如邏輯迴歸、SVM 和隨機森林)的優異效能,在糖尿病預測的案例研究中達到了更高的準確度(高達 80.52%)和 AUROC 分數(高達 0.8457)。LNN 模型中學習到的權重和閾值提供了對特徵貢獻的直接見解,增強了可解釋性,同時不影響預測能力。這些發現突顯了神經符號方法在彌合醫療保健 AI 應用中準確性和可解釋性差距方面的潛力。透過提供透明且適應性強的診斷模型,我們的研究有助於推進精準醫療,並支援公平醫療保健解決方案的開發。未來的研究將專注於將這些方法擴展到更大且更多樣化的資料集,以進一步驗證其在不同醫療狀況和人群中的適用性。 +##### **GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks** +2502.14848v1 by Jianwen Luo, Yiming Huang, Jinxiang Meng, Fangyu Lei, Shizhu He, Xiao Liu, Shanshan Jiang, Bin Dong, Jun Zhao, Kang Liu -##### **Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare** -2410.00366v1 by Prasenjit Maji, Amit Kumar Mondal, Hemanta Kumar Mondal, Saraju P. Mohanty +Large Language Models (LLMs) have shown great promise in tool-making, yet +existing frameworks often struggle to efficiently construct reliable toolsets +and are limited to single-task settings. To address these challenges, we +propose GATE (Graph-based Adaptive Tool Evolution), an adaptive framework that +dynamically constructs and evolves a hierarchical graph of reusable tools +across multiple scenarios. We evaluate GATE on open-ended tasks (Minecraft), +agent-based tasks (TextCraft, DABench), and code generation tasks (MATH, Date, +TabMWP). 
Our results show that GATE achieves up to 4.3x faster milestone +completion in Minecraft compared to the previous SOTA, and provides an average +improvement of 9.23% over existing tool-making methods in code generation tasks +and 10.03% in agent tasks. GATE demonstrates the power of adaptive evolution, +balancing tool quantity, complexity, and functionality while maintaining high +efficiency. Code and data are available at +https://github.com/ayanami2003/GATE. -The rapid advancements in artificial intelligence (AI) have revolutionized -smart healthcare, driving innovations in wearable technologies, continuous -monitoring devices, and intelligent diagnostic systems. However, security, -explainability, robustness, and performance optimization challenges remain -critical barriers to widespread adoption in clinical environments. This -research presents an innovative algorithmic method using the Adaptive Feature -Evaluator (AFE) algorithm to improve feature selection in healthcare datasets -and overcome problems. AFE integrating Genetic Algorithms (GA), Explainable -Artificial Intelligence (XAI), and Permutation Combination Techniques (PCT), -the algorithm optimizes Clinical Decision Support Systems (CDSS), thereby -enhancing predictive accuracy and interpretability. The proposed method is -validated across three diverse healthcare datasets using six distinct machine -learning algorithms, demonstrating its robustness and superiority over -conventional feature selection techniques. The results underscore the -transformative potential of AFE in smart healthcare, enabling personalized and -transparent patient care. Notably, the AFE algorithm, when combined with a -Multi-layer Perceptron (MLP), achieved an accuracy of up to 98.5%, highlighting -its capability to improve clinical decision-making processes in real-world -healthcare applications.
+摘要:大型語言模型 (LLM) 在工具製作方面展現出極大的潛力,然而現有的框架經常難以有效地建構可靠的工具組,並且僅限於單一任務設定。為了應對這些挑戰,我們提出了 GATE(基於圖形的自適應工具演化),這是一個自適應框架,可跨多個場景動態建構和演化可重複使用的工具階層圖。我們在開放式任務(Minecraft)、基於代理的任務(TextCraft、DABench)和程式碼生成任務(MATH、Date、TabMWP)上評估了 GATE。我們的結果顯示,與先前的 SOTA 相比,GATE 在 Minecraft 中實現了高達 4.3 倍的里程碑完成速度,並且在程式碼生成任務中提供了比現有工具製作方法平均提升 9.23%,在代理任務中提升了 10.03%。GATE 展示了自適應演化的力量,在保持高效率的同時,平衡了工具數量、複雜性和功能性。程式碼和資料可在 https://github.com/ayanami2003/GATE 取得。 -摘要:人工智慧 (AI) 的快速進展徹底改變了智慧醫療保健,推動了可穿戴技術、持續監控裝置和智慧診斷系統的創新。然而,安全性、可解釋性、穩健性和效能最佳化挑戰仍然是臨床環境中廣泛採用的關鍵障礙。本研究提出一個創新的演算法方法,使用自適應特徵評估器 (AFE) 演算法來改善醫療保健資料集中的特徵選取並克服問題。AFE 整合了遺傳演算法 (GA)、可解釋人工智慧 (XAI) 和排列組合技術 (PCT),該演算法最佳化了臨床決策支援系統 (CDSS),從而提高了預測準確性和可解釋性。所提出的方法使用六種不同的機器學習演算法驗證了三個不同的醫療保健資料集,證明了其穩健性和優於傳統特徵選取技術。結果強調了 AFE 在智慧醫療保健中的轉變潛力,實現了個人化和透明的患者照護。值得注意的是,AFE 演算法與多層感知器 (MLP) 結合使用時,準確度高達 98.5%,突顯了其改善實際醫療保健應用中臨床決策制定流程的能力。 +##### **Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation** +2502.14846v1 by Yue Yang, Ajay Patel, Matt Deitke, Tanmay Gupta, Luca Weihs, Andrew Head, Mark Yatskar, Chris Callison-Burch, Ranjay Krishna, Aniruddha Kembhavi, Christopher Clark -##### **Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study** -2409.13476v1 by Tirtha Chanda, Sarah Haggenmueller, Tabea-Clara Bucher, Tim Holland-Letz, Harald Kittler, Philipp Tschandl, Markus V. Heppt, Carola Berking, Jochen S. Utikal, Bastian Schilling, Claudia Buerger, Cristian Navarrete-Dechent, Matthias Goebeler, Jakob Nikolas Kather, Carolin V. Schneider, Benjamin Durani, Hendrike Durani, Martin Jansen, Juliane Wacker, Joerg Wacker, Reader Study Consortium, Titus J. Brinker +Reasoning about images with rich text, such as charts and documents, is a +critical application of vision-language models (VLMs). However, VLMs often +struggle in these domains due to the scarcity of diverse text-rich +vision-language data.
To address this challenge, we present CoSyn, a framework +that leverages the coding capabilities of text-only large language models +(LLMs) to automatically create synthetic text-rich multimodal data. Given input +text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts +an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic +images. With the underlying code as textual representations of the synthetic +images, CoSyn can generate high-quality instruction-tuning data, again relying +on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K +images and 2.7M rows of vision-language instruction-tuning data. Comprehensive +experiments on seven benchmarks demonstrate that models trained on our +synthetic data achieve state-of-the-art performance among competitive +open-source models, including Llama 3.2, and surpass proprietary models such as +GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing +data, enabling VLMs to ground information within input images, showcasing its +potential for developing multimodal agents capable of acting in real-world +environments. -Artificial intelligence (AI) systems have substantially improved -dermatologists' diagnostic accuracy for melanoma, with explainable AI (XAI) -systems further enhancing clinicians' confidence and trust in AI-driven -decisions. Despite these advancements, there remains a critical need for -objective evaluation of how dermatologists engage with both AI and XAI tools. -In this study, 76 dermatologists participated in a reader study, diagnosing 16 -dermoscopic images of melanomas and nevi using an XAI system that provides -detailed, domain-specific explanations. Eye-tracking technology was employed to -assess their interactions. Diagnostic performance was compared with that of a -standard AI system lacking explanatory features. 
Our findings reveal that XAI -systems improved balanced diagnostic accuracy by 2.8 percentage points relative -to standard AI. Moreover, diagnostic disagreements with AI/XAI systems and -complex lesions were associated with elevated cognitive load, as evidenced by -increased ocular fixations. These insights have significant implications for -clinical practice, the design of AI tools for visual tasks, and the broader -development of XAI in medical diagnostics. +摘要:透過豐富文字(例如圖表和文件)對影像進行推理,是視覺語言模型 (VLM) 的重要應用。然而,由於多元化文字豐富的視覺語言資料稀少,VLM 在這些領域中經常會遇到困難。為了應對這個挑戰,我們提出了 CoSyn,一個利用純文字大型語言模型 (LLM) 的編碼能力來自動建立合成文字豐富多模態資料的架構。給定描述目標網域的輸入文字(例如「營養成分標籤」),CoSyn 會提示 LLM 產生用於合成影像渲染的程式碼(Python、HTML、LaTeX 等)。透過將底層程式碼作為合成影像的文字表示,CoSyn 可以產生高品質的指令調整資料,再次依賴純文字 LLM。使用 CoSyn,我們建構了一個包含 40 萬張影像和 270 萬列視覺語言指令調整資料的資料集。在七個基準上的全面實驗證明,在我們的合成資料上訓練的模型在競爭對手的開源模型(包括 Llama 3.2)中達到了最先進的效能,並超越了 GPT-4V 和 Gemini 1.5 Flash 等專有模型。此外,CoSyn 可以產生合成指向資料,讓 VLM 能在輸入影像中建立資訊基礎,展示其在開發能夠在真實世界環境中運作的多模態代理方面的潛力。 -摘要:人工智慧 (AI) 系統已大幅改善皮膚科醫師對黑色素瘤的診斷準確度,而可解釋 AI (XAI) 系統進一步提升臨床醫師對 AI 驅動決策的信心與信賴。儘管有這些進展,對於皮膚科醫師如何使用 AI 和 XAI 工具,仍有客觀評估的迫切需求。在這項研究中,76 位皮膚科醫師參與了一項讀者研究,使用 XAI 系統診斷 16 張黑色素瘤和痣的皮膚鏡影像,該系統提供詳細的領域特定說明。採用眼球追蹤技術來評估他們的互動。將診斷表現與缺乏說明功能的標準 AI 系統進行比較。我們的研究結果顯示,XAI 系統相較於標準 AI,將平衡診斷準確度提升了 2.8 個百分點。此外,與 AI/XAI 系統的診斷分歧和複雜的病灶與認知負擔升高有關,這由增加的眼睛注視次數所證實。這些見解對臨床實務、視覺任務 AI 工具的設計和醫學診斷中 XAI 的廣泛發展具有重大意義。 +##### **Revealing and Mitigating Over-Attention in Knowledge Editing** +2502.14838v1 by Pinzheng Wang, Zecheng Tang, Keyan Zhou, Juntao Li, Qiaoming Zhu, Min Zhang -##### **Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data** -2409.15374v1 by Suryansh Vidya, Kush Gupta, Amir Aly, Andy Wills, Emmanuel Ifeachor, Rohit Shankar +Large Language Models have demonstrated superior performance across a wide +range of tasks, but they still exhibit undesirable errors due to incorrect +knowledge learned from the training data. 
To avoid this, knowledge editing +methods emerged to precisely edit the specific model knowledge via efficiently +modifying a very small percentage of parameters. However, those methods can +lead to the problem of Specificity Failure, where the existing knowledge and +capabilities are severely degraded due to editing. Our preliminary study +indicates that Specificity Failure +primarily stems from the model's attention heads assigning excessive attention +scores to entities related to the edited knowledge, thereby unduly focusing on +specific snippets within the context, which we denote as the Attention Drift +phenomenon. To mitigate this Attention Drift issue, we introduce a simple yet +effective method, Selective Attention Drift Restriction (SADR), which introduces +an additional regularization term during the knowledge editing process to +restrict changes in the attention weight distribution, thereby preventing undue +focus on the edited entity. Experiments on five frequently used strong LLMs +demonstrate the effectiveness of our method, where SADR can significantly +mitigate Specificity Failure in the predominant knowledge editing tasks. -Early diagnosis and intervention for Autism Spectrum Disorder (ASD) has been -shown to significantly improve the quality of life of autistic individuals. -However, diagnostics methods for ASD rely on assessments based on clinical -presentation that are prone to bias and can be challenging to arrive at an -early diagnosis. There is a need for objective biomarkers of ASD which can help -improve diagnostic accuracy. Deep learning (DL) has achieved outstanding -performance in diagnosing diseases and conditions from medical imaging data.
-Extensive research has been conducted on creating models that classify ASD -using resting-state functional Magnetic Resonance Imaging (fMRI) data. However, -existing models lack interpretability. This research aims to improve the -accuracy and interpretability of ASD diagnosis by creating a DL model that can -not only accurately classify ASD but also provide explainable insights into its -working. The dataset used is a preprocessed version of the Autism Brain Imaging -Data Exchange (ABIDE) with 884 samples. Our findings show a model that can -accurately classify ASD and highlight critical brain regions differing between -ASD and typical controls, with potential implications for early diagnosis and -understanding of the neural basis of ASD. These findings are validated by -studies in the literature that use different datasets and modalities, -confirming that the model actually learned characteristics of ASD and not just -the dataset. This study advances the field of explainable AI in medical imaging -by providing a robust and interpretable model, thereby contributing to a future -with objective and reliable ASD diagnostics. 
+摘要:大型語言模型已在廣泛任務中展現出卓越的效能,但由於從訓練資料中學習到不正確的知識,它們仍會出現令人不滿意的錯誤。為避免此情況,知識編輯方法應運而生,透過有效修改極少數參數來精準編輯特定模型知識。然而,這些方法可能會導致特異性失敗問題,即現有知識和能力會因編輯而嚴重降低。我們的初步研究表明,特異性失敗主要源於模型的注意力頭將過高的注意力分數分配給與已編輯知識相關的實體,從而過度關注文中特定的片段,我們將此現象稱為注意力偏移。為減輕這種注意力偏移問題,我們引入了一個簡單但有效的方法:選擇性注意力偏移限制 (SADR),在知識編輯過程中引入一個額外的正則化項來限制注意力權重分配的變動,從而防止過度關注已編輯實體。在五個經常使用的強大 LLM 上進行的實驗證明了我們方法的有效性,其中 SADR 可以顯著減輕主要知識編輯任務中的特異性失敗。 -摘要:自閉症譜系障礙 (ASD) 的早期診斷和介入已被證實能顯著改善自閉症患者的生活品質。然而,ASD 的診斷方法依賴於基於臨床表現的評估,容易產生偏見,且可能難以做出早期診斷。有必要找出 ASD 的客觀生物標記,以幫助提高診斷準確性。深度學習 (DL) 在從醫學影像資料診斷疾病和病症方面取得傑出的表現。已經針對建立使用靜態功能性磁振造影 (fMRI) 資料對 ASD 進行分類的模型進行廣泛的研究。然而,現有的模型缺乏可解釋性。本研究旨在透過建立一個不僅能準確分類 ASD,還能提供可解釋見解說明其運作原理的 DL 模型,來改善 ASD 診斷的準確性和可解釋性。所使用的資料集是自閉症大腦影像資料交換 (ABIDE) 的預處理版本,包含 884 個樣本。我們的研究結果顯示,該模型能準確分類 ASD,並強調 ASD 與典型對照組之間存在差異的關鍵腦區,對於 ASD 的早期診斷和神經基礎的理解具有潛在的意義。這些研究結果已由使用不同資料集和方式的文獻研究驗證,證實該模型實際上學習了 ASD 的特徵,而不僅僅是資料集。本研究透過提供一個強健且可解釋的模型,推動了醫學影像中可解釋 AI 的領域,從而為未來提供客觀且可靠的 ASD 診斷做出貢獻。 +##### **Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs** +2502.14837v1 by Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui -##### **Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition** -2409.12883v1 by Daniel Flores-Araiza, Francisco Lopez-Tiro, Clément Larose, Salvador Hinojosa, Andres Mendez-Vazquez, Miguel Gonzalez-Mendoza, Gilberto Ochoa-Ruiz, Christian Daul +Multi-head Latent Attention (MLA) is an innovative architecture proposed by +DeepSeek, designed to ensure efficient and economical inference by +significantly compressing the Key-Value (KV) cache into a latent vector. +Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its +variants such as Grouped-Query Attention (GQA) exhibit significant cost +disadvantages.
Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA +without pre-training from scratch is both meaningful and challenging. This +paper proposes the first data-efficient fine-tuning method for transitioning +from MHA to MLA (MHA2MLA), which includes two key components: for partial-RoPE, +we remove RoPE from dimensions of queries and keys that contribute less to the +attention scores, for low-rank approximation, we introduce joint SVD +approximations based on the pre-trained parameters of keys and values. These +carefully designed strategies enable MHA2MLA to recover performance using only +a small fraction (0.3% to 0.6%) of the data, significantly reducing inference +costs while seamlessly integrating with compression techniques such as KV cache +quantization. For example, the KV cache size of Llama2-7B is reduced by 92.19%, +with only a 0.5% drop in LongBench performance. -The in-vivo identification of the kidney stone types during an ureteroscopy -would be a major medical advance in urology, as it could reduce the time of the -tedious renal calculi extraction process, while diminishing infection risks. -Furthermore, such an automated procedure would make possible to prescribe -anti-recurrence treatments immediately. Nowadays, only few experienced -urologists are able to recognize the kidney stone types in the images of the -videos displayed on a screen during the endoscopy. Thus, several deep learning -(DL) models have recently been proposed to automatically recognize the kidney -stone types using ureteroscopic images. However, these DL models are of black -box nature whicl limits their applicability in clinical settings. This -contribution proposes a case-based reasoning DL model which uses prototypical -parts (PPs) and generates local and global descriptors. The PPs encode for each -class (i.e., kidney stone type) visual feature information (hue, saturation, -intensity and textures) similar to that used by biologists. 
The PPs are -optimally generated due a new loss function used during the model training. -Moreover, the local and global descriptors of PPs allow to explain the -decisions ("what" information, "where in the images") in an understandable way -for biologists and urologists. The proposed DL model has been tested on a -database including images of the six most widespread kidney stone types. The -overall average classification accuracy was 90.37. When comparing this results -with that of the eight other DL models of the kidney stone state-of-the-art, it -can be seen that the valuable gain in explanability was not reached at the -expense of accuracy which was even slightly increased with respect to that -(88.2) of the best method of the literature. These promising and interpretable -results also encourage urologists to put their trust in AI-based solutions. +摘要:多頭潛在注意力 (MLA) 是 DeepSeek 提出的一種創新架構,旨在通過將鍵值 (KV) 快取大幅壓縮成潛在向量,確保有效率且經濟的推論。與 MLA 相比,採用多頭注意力 (MHA) 及其變體(例如分組查詢注意力 (GQA))的標準 LLM 會出現顯著的成本劣勢。讓訓練完善的 LLM(例如 Llama)能夠快速適應 MLA,而無需從頭開始預訓練,這既有意義又具有挑戰性。本文提出了第一個資料有效微調方法,用於從 MHA 轉換到 MLA (MHA2MLA),其中包含兩個關鍵組成部分:對於部分 RoPE,我們從查詢和鍵的維度中移除對注意力分數貢獻較小的 RoPE,對於低秩近似,我們基於鍵和值的預訓練參數引入聯合 SVD 近似。這些經過仔細設計的策略讓 MHA2MLA 能夠僅使用一小部分資料 (0.3% 至 0.6%) 來恢復效能,大幅降低推論成本,同時與壓縮技術(例如 KV 快取量化)無縫整合。例如,Llama2-7B 的 KV 快取大小減少了 92.19%,而 LongBench 效能僅下降了 0.5%。 -摘要:尿路鏡檢查中腎結石類型的體內識別將是泌尿科的一項重大進展,因為它可以減少繁瑣的腎結石取出過程的時間,同時降低感染風險。此外,這種自動化程序將使立即開立抗復發治療成為可能。如今,只有少數經驗豐富的泌尿科醫生能夠在內視鏡檢查期間屏幕上顯示的視頻圖像中識別腎結石類型。因此,最近已提出多種深度學習 (DL) 模型,以使用輸尿管鏡圖像自動識別腎結石類型。然而,這些 DL 模型本質上是黑盒子,這限制了它們在臨床環境中的應用性。本文提出了一個基於案例推理的 DL 模型,它使用原型部分 (PP) 並生成局部和全局描述符。PP 為每種類型(即腎結石類型)編碼視覺特徵信息(色調、飽和度、強度和紋理),類似於生物學家使用的信息。由於在模型訓練期間使用的新損失函數,PP 得到了最佳生成。此外,PP 的局部和全局描述符允許以生物學家和泌尿科醫生可以理解的方式解釋決策(“什麼”信息,“圖像中的什麼位置”)。所提出的 DL 模型已在一個包含六種最廣泛的腎結石類型圖像的數據庫上進行了測試。總體平均分類準確率為 90.37。將此結果與腎結石最先進的八個其他 DL 模型的結果進行比較時,可以看出,可解釋性的寶貴增益並未以準確性為代價,甚至略有增加與文獻中最好的方法 (88.2) 相比。這些有希望且可解釋的結果也鼓勵泌尿科醫生相信基於人工智能的解決方案。 +##### **LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language 
Models** +2502.14834v1 by Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li -##### **Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques** -2409.12087v3 by Yubo Li, Saba Al-Sayouri, Rema Padman +Existing Large Vision-Language Models (LVLMs) can process inputs with context +lengths up to 128k visual and text tokens, yet they struggle to generate +coherent outputs beyond 1,000 words. We find that the primary limitation is the +absence of long output examples during supervised fine-tuning (SFT). To tackle +this issue, we introduce LongWriter-V-22k, a SFT dataset comprising 22,158 +examples, each with multiple input images, an instruction, and corresponding +outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that +maintain high-fidelity to the input images, we employ Direct Preference +Optimization (DPO) to the SFT model. Given the high cost of collecting human +feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which +breaks long outputs into segments and uses iterative corrections to form +preference pairs with the original outputs. Additionally, we develop +MMLongBench-Write, a benchmark featuring six tasks to evaluate the +long-generation capabilities of VLMs. Our 7B parameter model, trained with +LongWriter-V-22k and IterDPO, achieves impressive performance on this +benchmark, outperforming larger proprietary models like GPT-4o. Code and data: +https://github.com/THU-KEG/LongWriter-V -This study explores the potential of utilizing administrative claims data, -combined with advanced machine learning and deep learning techniques, to -predict the progression of Chronic Kidney Disease (CKD) to End-Stage Renal -Disease (ESRD). 
We analyze a comprehensive, 10-year dataset provided by a major -health insurance organization to develop prediction models for multiple -observation windows using traditional machine learning methods such as Random -Forest and XGBoost as well as deep learning approaches such as Long Short-Term -Memory (LSTM) networks. Our findings demonstrate that the LSTM model, -particularly with a 24-month observation window, exhibits superior performance -in predicting ESRD progression, outperforming existing models in the -literature. We further apply SHapley Additive exPlanations (SHAP) analysis to -enhance interpretability, providing insights into the impact of individual -features on predictions at the individual patient level. This study underscores -the value of leveraging administrative claims data for CKD management and -predicting ESRD progression. +摘要:現有的大型視覺語言模型 (LVLMs) 能處理長度達 128k 視覺和文字符號的輸入內容,但卻難以產生超過 1,000 字的連貫輸出。我們發現,主要限制在於監督微調 (SFT) 期間缺少長輸出範例。為了解決此問題,我們引入了 LongWriter-V-22k,這是一個 SFT 資料集,包含 22,158 個範例,每個範例都有多個輸入影像、一個說明和對應的輸出,範圍從 0 到 10,000 字。此外,為了產生與輸入影像高度保真的長輸出,我們對 SFT 模型採用直接偏好最佳化 (DPO)。考量到收集人類回饋的成本很高(例如 3,000 字),我們提出 IterDPO,它會將長輸出區分成幾個區塊,並使用反覆修正來形成與原始輸出的偏好配對。此外,我們開發了 MMLongBench-Write,這是一個基準,包含六項任務,用於評估 VLM 的長生成能力。我們的 7B 參數模型使用 LongWriter-V-22k 和 IterDPO 進行訓練,在這個基準上取得令人印象深刻的效能,超越了 GPT-4o 等大型專有模型。程式碼和資料:https://github.com/THU-KEG/LongWriter-V -摘要:本研究探討利用行政申報資料,結合先進機器學習與深度學習技術,預測慢性腎臟病 (CKD) 進展至末期腎臟疾病 (ESRD) 的可能性。我們分析一家大型健康保險組織提供的 10 年綜合資料集,使用傳統機器學習方法(例如隨機森林和 XGBoost)以及深度學習方法(例如長期短期記憶 (LSTM) 網路)開發多個觀察視窗的預測模型。我們的研究結果顯示,LSTM 模型(尤其是 24 個月觀察視窗)在預測 ESRD 進展方面表現優異,優於文獻中的現有模型。我們進一步應用 SHapley 可加性解釋 (SHAP) 分析以增強可解釋性,深入了解個別特徵對個別患者層級預測的影響。本研究強調了利用行政申報資料進行 CKD 管理和預測 ESRD 進展的價值。 +##### **Improving the Diffusability of Autoencoders** +2502.14831v1 by Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, Aliaksandr Siarohin -##### **Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases** 
-2409.09201v3 by Mercy Asiedu, Nenad Tomasev, Chintan Ghate, Tiya Tiyasirichokchai, Awa Dieng, Oluwatosin Akande, Geoffrey Siwo, Steve Adudans, Sylvanus Aitkins, Odianosen Ehiakhamen, Eric Ndombi, Katherine Heller +Latent diffusion models have emerged as the leading approach for generating +high-quality images and videos, utilizing compressed latent representations to +reduce the computational burden of the diffusion process. While recent +advancements have primarily focused on scaling diffusion backbones and +improving autoencoder reconstruction quality, the interaction between these +components has received comparatively less attention. In this work, we perform +a spectral analysis of modern autoencoders and identify inordinate +high-frequency components in their latent spaces, which are especially +pronounced in the autoencoders with a large bottleneck channel size. We +hypothesize that this high-frequency component interferes with the +coarse-to-fine nature of the diffusion synthesis process and hinders the +generation quality. To mitigate the issue, we propose scale equivariance: a +simple regularization strategy that aligns latent and RGB spaces across +frequencies by enforcing scale equivariance in the decoder. It requires minimal +code changes and only up to 20K autoencoder fine-tuning steps, yet +significantly improves generation quality, reducing FID by 19% for image +generation on ImageNet-1K 256x256 and FVD by at least 44% for video generation +on Kinetics-700 17x256x256. -While large language models (LLMs) have shown promise for medical question -answering, there is limited work focused on tropical and infectious -disease-specific exploration. We build on an opensource tropical and infectious -diseases (TRINDs) dataset, expanding it to include demographic and semantic -clinical and consumer augmentations yielding 11000+ prompts. We evaluate LLM -performance on these, comparing generalist and medical LLMs, as well as LLM -outcomes to human experts. 
We demonstrate through systematic experimentation, -the benefit of contextual information such as demographics, location, gender, -risk factors for optimal LLM response. Finally we develop a prototype of -TRINDs-LM, a research tool that provides a playground to navigate how context -impacts LLM outputs for health. +摘要:潛在擴散模型已成為生成高品質影像和影片的主流方法,利用壓縮潛在表示來降低擴散過程的計算負擔。雖然近期的進展主要集中在擴充擴散主幹並提升自編碼器重建品質,但這些組成之間的交互作用卻鮮少受到關注。在這項研究中,我們對現代自編碼器進行頻譜分析,並在它們的潛在空間中找出不適當的高頻率組成,這在瓶頸通道尺寸較大的自編碼器中特別明顯。我們假設這種高頻率組成會干擾擴散合成過程由粗到細的性質,並阻礙生成品質。為了緩解這個問題,我們提出規模等變性:一種簡單的正則化策略,透過在解碼器中強制執行規模等變性,使潛在空間和 RGB 空間在各個頻率中保持一致。它只需要最小的程式碼變更,且僅需最多 20K 個自編碼器微調步驟,就能顯著提升生成品質,將 ImageNet-1K 256x256 上的影像生成的 FID 降低 19%,並將 Kinetics-700 17x256x256 上的影片生成的 FVD 降低至少 44%。 -摘要:儘管大型語言模型 (LLM) 在醫療問題解答方面展現出前景,但專注於熱帶和傳染病特定探索的研究有限。我們建立在一個開放原始碼熱帶和傳染病 (TRINDs) 資料集上,並將其擴展為納入人口統計和語義臨床和消費者擴充,產生超過 11000 個提示。我們評估了 LLM 在這些方面的效能,比較了通才和醫療 LLM,以及 LLM 結果與人類專家的比較。我們透過系統性實驗證明了背景資訊(例如人口統計、位置、性別、最佳 LLM 回應的風險因素)的好處。最後,我們開發了 TRINDs-LM 的原型,這是一個研究工具,提供一個探索背景如何影響 LLM 健康輸出的平台。 +##### **Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs** +2502.14830v1 by Danni Liu, Jan Niehues -##### **Explainable AI: Definition and attributes of a good explanation for health AI** -2409.15338v1 by Evangelia Kyrimi, Scott McLachlan, Jared M Wohlgemut, Zane B Perkins, David A. Lagnado, William Marsh, the ExAIDSS Expert Group +While large language models demonstrate remarkable capabilities at +task-specific applications through fine-tuning, extending these benefits across +diverse languages is essential for broad accessibility. However, effective +cross-lingual transfer is hindered by LLM performance gaps across languages and +the scarcity of fine-tuning data in many languages. Through analysis of LLM +internal representations from over 1,000+ language pairs, we discover that +middle layers exhibit the strongest potential for cross-lingual alignment. 
+Building on this finding, we propose a middle-layer alignment objective +integrated into task-specific training. Our experiments on slot filling, +machine translation, and structured text generation show consistent +improvements in cross-lingual transfer, especially to lower-resource languages. +The method is robust to the choice of alignment languages and generalizes to +languages unseen during alignment. Furthermore, we show that separately trained +alignment modules can be merged with existing task-specific modules, improving +cross-lingual capabilities without full re-training. Our code is publicly +available (https://github.com/dannigt/mid-align). -Proposals of artificial intelligence (AI) solutions based on increasingly -complex and accurate predictive models are becoming ubiquitous across many -disciplines. As the complexity of these models grows, transparency and users' -understanding often diminish. This suggests that accurate prediction alone is -insufficient for making an AI-based solution truly useful. In the development -of healthcare systems, this introduces new issues related to accountability and -safety. Understanding how and why an AI system makes a recommendation may -require complex explanations of its inner workings and reasoning processes. -Although research on explainable AI (XAI) has significantly increased in recent -years and there is high demand for XAI in medicine, defining what constitutes a -good explanation remains ad hoc, and providing adequate explanations continues -to be challenging. To fully realize the potential of AI, it is critical to -address two fundamental questions about explanations for safety-critical AI -applications, such as health-AI: (1) What is an explanation in health-AI? and -(2) What are the attributes of a good explanation in health-AI? In this study, -we examined published literature and gathered expert opinions through a -two-round Delphi study. 
The research outputs include (1) a definition of what -constitutes an explanation in health-AI and (2) a comprehensive list of -attributes that characterize a good explanation in health-AI. +摘要:儘管大型語言模型在特定任務應用中透過微調展現出卓越的能力,但要讓這些好處擴及各種語言,對於廣泛的可及性來說至關重要。然而,有效的跨語言轉移受到跨語言 LLM 效能差距以及許多語言中微調資料的稀少性所阻礙。透過分析來自 1,000 多種語言對的 LLM 內部表示,我們發現中間層展現出最強的跨語言對齊潛力。根據這個發現,我們提出一個整合到特定任務訓練中的中間層對齊目標。我們在插槽填補、機器翻譯和結構化文字生成方面的實驗顯示,跨語言轉移持續改善,特別是對於低資源語言。此方法對於對齊語言的選擇具有穩健性,並推廣到對齊期間未曾見過的語言。此外,我們展示了單獨訓練的對齊模組可以與現有的特定任務模組合併,在不重新訓練的情況下改善跨語言能力。我們的程式碼已公開(https://github.com/dannigt/mid-align)。 -摘要:隨著越來越複雜且準確的預測模型,基於人工智慧 (AI) 解決方案的提案在許多領域中變得無處不在。隨著這些模型複雜性的增加,透明度和使用者的理解力往往會降低。這表示僅有準確的預測並不足以讓 AI 解決方案真正有用。在醫療保健系統的開發中,這引入了與問責制和安全性相關的新問題。瞭解 AI 系統如何以及為何提出建議可能需要對其內部運作和推理過程進行複雜的說明。儘管近年來對可解釋 AI (XAI) 的研究已大幅增加,且醫學領域對 XAI 有很高的需求,但定義什麼構成一個好的解釋仍是臨時性的,而提供適當的解釋仍然具有挑戰性。為了充分發揮 AI 的潛力,對於安全關鍵型 AI 應用(例如健康 AI)的解釋,探討兩個基本問題至關重要:(1) 什麼是健康 AI 中的解釋?以及 (2) 健康 AI 中一個好的解釋有哪些屬性?在本研究中,我們檢視了已發表的文獻,並透過兩輪德爾菲研究收集了專家意見。研究成果包括:(1) 健康 AI 中什麼構成解釋的定義,以及 (2) 健康 AI 中一個好解釋的屬性清單。 +##### **Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps** +2502.14829v1 by Martin Tutek, Fateme Hashemi Chaleshtori, Ana Marasović, Yonatan Belinkov -##### **Exploring the Effect of Explanation Content and Format on User Comprehension and Trust in Healthcare** -2408.17401v2 by Antonio Rago, Bence Palfi, Purin Sukpanichnant, Hannibal Nabli, Kavyesh Vivek, Olga Kostopoulou, James Kinross, Francesca Toni +When prompted to think step-by-step, language models (LMs) produce a chain of +thought (CoT), a sequence of reasoning steps that the model supposedly used to +produce its prediction. However, despite much work on CoT prompting, it is +unclear if CoT reasoning is faithful to the models' parametric beliefs. We +introduce a framework for measuring parametric faithfulness of generated +reasoning, and propose Faithfulness by Unlearning Reasoning steps (FUR), an +instance of this framework. 
FUR erases information contained in reasoning steps +from model parameters. We perform experiments unlearning CoTs of four LMs +prompted on four multi-choice question answering (MCQA) datasets. Our +experiments show that FUR is frequently able to change the underlying models' +prediction by unlearning key steps, indicating when a CoT is parametrically +faithful. Further analysis shows that CoTs generated by models post-unlearning +support different answers, hinting at a deeper effect of unlearning. +Importantly, CoT steps identified as important by FUR do not align well with +human notions of plausibility, emphasizing the need for specialized alignment -AI-driven tools for healthcare are widely acknowledged as potentially -beneficial to health practitioners and patients, e.g. the QCancer regression -tool for cancer risk prediction. However, for these tools to be trusted, they -need to be supplemented with explanations. We examine how explanations' content -and format affect user comprehension and trust when explaining QCancer's -predictions. Regarding content, we deploy SHAP and Occlusion-1. Regarding -format, we present SHAP explanations, conventionally, as charts (SC) and -Occlusion-1 explanations as charts (OC) as well as text (OT), to which their -simpler nature lends itself. We conduct experiments with two sets of -stakeholders: the general public (representing patients) and medical students -(representing healthcare practitioners). Our experiments showed higher -subjective comprehension and trust for Occlusion-1 over SHAP explanations based -on content. However, when controlling for format, only OT outperformed SC, -suggesting this trend is driven by preferences for text. Other findings -corroborated that explanation format, rather than content, is often the -critical factor. 
+摘要:当提示逐步思考时,语言模型 (LM) 会产生一系列思考 (CoT),这是模型用来产生预测的一系列推理步骤。然而,尽管在 CoT 提示上做了很多工作,但尚不清楚 CoT 推理是否符合模型的参数化信念。我们引入了一个框架来衡量生成推理的参数化保真度,并提出了通过取消学习推理步骤 (FUR) 的保真度,这是该框架的一个实例。FUR 从模型参数中擦除推理步骤中包含的信息。我们执行实验,取消学习提示在四个多项选择问答 (MCQA) 数据集上的四个 LM 的 CoT。我们的实验表明,FUR 经常能够通过取消学习关键步骤来改变底层模型的预测,表明 CoT 在参数上是保真的。进一步的分析表明,模型在取消学习后生成的 CoT 支持不同的答案,暗示取消学习具有更深层次的影响。重要的是,FUR 确定的 CoT 步骤与人类对合理性的概念不太一致,强调了专门对齐的必要性 -摘要:由 AI 驅動的醫療保健工具被廣泛認為對醫療從業者和患者有潛在好處,例如用於癌症風險預測的 QCancer 回歸工具。然而,對於這些工具,如果要讓人們信賴,就需要補充說明。我們研究了說明的內容和格式如何影響使用者在解釋 QCancer 預測時的理解和信任。關於內容,我們部署了 SHAP 和 Occlusion-1。關於格式,我們以圖表 (SC) 的形式呈現 SHAP 說明,以圖表 (OC) 和文字 (OT) 的形式呈現 Occlusion-1 說明,因為它們的性質較為簡單。我們對兩組利害關係人進行了實驗:一般民眾(代表患者)和醫學生(代表醫療從業者)。我們的實驗結果顯示,基於內容,Occlusion-1 比 SHAP 說明具有更高的主觀理解和信任。然而,在控制格式時,只有 OT 優於 SC,這表明這種趨勢是由對文字的偏好所驅動的。其他發現證實了說明格式,而不是內容,通常是關鍵因素。 +##### **Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison** +2502.14827v1 by Aiswarya Baby, Tintu Thankom Koshy -##### **A Survey for Large Language Models in Biomedicine** -2409.00133v1 by Chong Wang, Mengyao Li, Junjun He, Zhongruo Wang, Erfan Darzi, Zan Chen, Jin Ye, Tianbin Li, Yanzhou Su, Jing Ke, Kaili Qu, Shuxin Li, Yi Yu, Pietro Liò, Tianyun Wang, Yu Guang Wang, Yiqing Shen +Visual Question Answering (VQA) has emerged as a pivotal task in the +intersection of computer vision and natural language processing, requiring +models to understand and reason about visual content in response to natural +language questions. Analyzing VQA datasets is essential for developing robust +models that can handle the complexities of multimodal reasoning. Several +approaches have been developed to examine these datasets, each offering +distinct perspectives on question diversity, answer distribution, and +visual-textual correlations. Despite significant progress, existing VQA models +face challenges related to dataset bias, limited model complexity, commonsense +reasoning gaps, rigid evaluation methods, and generalization to real world +scenarios. 
This paper presents a comprehensive comparative study of five +advanced VQA models: ABC-CNN, KICNLE, Masked Vision and Language Modeling, +BLIP-2, and OFA, each employing distinct methodologies to address these +challenges. -Recent breakthroughs in large language models (LLMs) offer unprecedented -natural language understanding and generation capabilities. However, existing -surveys on LLMs in biomedicine often focus on specific applications or model -architectures, lacking a comprehensive analysis that integrates the latest -advancements across various biomedical domains. This review, based on an -analysis of 484 publications sourced from databases including PubMed, Web of -Science, and arXiv, provides an in-depth examination of the current landscape, -applications, challenges, and prospects of LLMs in biomedicine, distinguishing -itself by focusing on the practical implications of these models in real-world -biomedical contexts. Firstly, we explore the capabilities of LLMs in zero-shot -learning across a broad spectrum of biomedical tasks, including diagnostic -assistance, drug discovery, and personalized medicine, among others, with -insights drawn from 137 key studies. Then, we discuss adaptation strategies of -LLMs, including fine-tuning methods for both uni-modal and multi-modal LLMs to -enhance their performance in specialized biomedical contexts where zero-shot -fails to achieve, such as medical question answering and efficient processing -of biomedical literature. Finally, we discuss the challenges that LLMs face in -the biomedicine domain including data privacy concerns, limited model -interpretability, issues with dataset quality, and ethics due to the sensitive -nature of biomedical data, the need for highly reliable model outputs, and the -ethical implications of deploying AI in healthcare. 
To address these -challenges, we also identify future research directions of LLM in biomedicine -including federated learning methods to preserve data privacy and integrating -explainable AI methodologies to enhance the transparency of LLMs. +摘要:視覺問答 (VQA) 已成為電腦視覺與自然語言處理交會中的關鍵任務,要求模型理解和推理視覺內容以回應自然語言問題。分析 VQA 資料集對於開發健全的模型至關重要,這些模型能夠處理多模態推理的複雜性。已經開發出多種方法來檢驗這些資料集,每種方法都提供有關問題多樣性、答案分佈和視覺文本關聯性的不同觀點。儘管有顯著進展,現有的 VQA 模型仍面臨與資料集偏差、模型複雜性有限、常識推理差距、僵化的評估方法和推廣到現實世界場景相關的挑戰。本文對五個先進的 VQA 模型進行了全面的比較研究:ABC-CNN、KICNLE、Masked Vision and Language Modeling、BLIP-2 和 OFA,每個模型都採用不同的方法來應對這些挑戰。 -摘要:大型語言模型 (LLM) 的最新突破提供了前所未有的自然語言理解和生成能力。然而,現有關於生物醫學中 LLM 的調查通常專注於特定應用或模型架構,缺乏整合各種生物醫學領域最新進展的全面分析。本綜述基於對來自 PubMed、Web of Science 和 arXiv 等數據庫的 484 篇出版物的分析,深入探討了生物醫學中 LLM 的當前現況、應用、挑戰和前景,其特點是關注這些模型在現實世界生物醫學背景中的實際應用。首先,我們探討了 LLM 在廣泛的生物醫學任務中的零次學習能力,包括診斷輔助、藥物發現和個性化醫療等,並從 137 項關鍵研究中汲取見解。然後,我們討論了 LLM 的適應策略,包括單模態和多模態 LLM 的微調方法,以增強它們在零次學習無法實現的專業生物醫學背景中的性能,例如醫療問題解答和生物醫學文獻的有效處理。最後,我們討論了 LLM 在生物醫學領域面臨的挑戰,包括數據隱私問題、模型可解釋性有限、數據集質量問題以及由於生物醫學數據的敏感性、對高度可靠模型輸出的需求以及在醫療保健中部署 AI 的倫理影響而產生的倫理問題。為了應對這些挑戰,我們還確定了生物醫學中 LLM 未來的研究方向,包括用於保護數據隱私的聯合學習方法以及整合可解釋 AI 方法以增強 LLM 的透明度。 +##### **eC-Tab2Text: Aspect-Based Text Generation from e-Commerce Product Tables** +2502.14820v1 by Luis Antonio Gutiérrez Guanilo, Mir Tafseer Nayeem, Cristian López, Davood Rafiei -##### **Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis** -2408.15121v1 by Francesco Sovrano, Michael Lognoul, Giulia Vilone +Large Language Models (LLMs) have demonstrated exceptional versatility across +diverse domains, yet their application in e-commerce remains underexplored due +to a lack of domain-specific datasets. To address this gap, we introduce +eC-Tab2Text, a novel dataset designed to capture the intricacies of e-commerce, +including detailed product attributes and user-specific queries. 
Leveraging +eC-Tab2Text, we focus on text generation from product tables, enabling LLMs to +produce high-quality, attribute-specific product reviews from structured +tabular data. Fine-tuned models were rigorously evaluated using standard +Table2Text metrics, alongside correctness, faithfulness, and fluency +assessments. Our results demonstrate substantial improvements in generating +contextually accurate reviews, highlighting the transformative potential of +tailored datasets and fine-tuning methodologies in optimizing e-commerce +workflows. This work highlights the potential of LLMs in e-commerce workflows +and the essential role of domain-specific datasets in tailoring them to +industry-specific challenges. -Significant investment and development have gone into integrating Artificial -Intelligence (AI) in medical and healthcare applications, leading to advanced -control systems in medical technology. However, the opacity of AI systems -raises concerns about essential characteristics needed in such sensitive -applications, like transparency and trustworthiness. Our study addresses these -concerns by investigating a process for selecting the most adequate Explainable -AI (XAI) methods to comply with the explanation requirements of key EU -regulations in the context of smart bioelectronics for medical devices. The -adopted methodology starts with categorising smart devices by their control -mechanisms (open-loop, closed-loop, and semi-closed-loop systems) and delving -into their technology. Then, we analyse these regulations to define their -explainability requirements for the various devices and related goals. -Simultaneously, we classify XAI methods by their explanatory objectives. This -allows for matching legal explainability requirements with XAI explanatory -goals and determining the suitable XAI algorithms for achieving them. 
Our -findings provide a nuanced understanding of which XAI algorithms align better -with EU regulations for different types of medical devices. We demonstrate this -through practical case studies on different neural implants, from chronic -disease management to advanced prosthetics. This study fills a crucial gap in -aligning XAI applications in bioelectronics with stringent provisions of EU -regulations. It provides a practical framework for developers and researchers, -ensuring their AI innovations advance healthcare technology and adhere to legal -and ethical standards. +摘要:大型語言模型 (LLM) 在各種領域展現出非凡的多功能性,但由於缺乏特定領域的資料集,因此它們在電子商務中的應用仍未得到充分探索。為了解決這個差距,我們引入了 eC-Tab2Text,這是一個新穎的資料集,旨在捕捉電子商務的複雜性,包括詳細的產品屬性和使用者特定的查詢。利用 eC-Tab2Text,我們專注於從產品表格中產生文字,使 LLM 能夠從結構化的表格資料中產生高品質、特定屬性的產品評論。微調模型使用標準的 Table2Text 指標,以及正確性、忠實度和流利度評估進行嚴格評估。我們的結果證明在產生符合語境的準確評論方面有顯著的進步,突顯了客製化資料集和微調方法在最佳化電子商務工作流程中的轉型潛力。這項工作突顯了 LLM 在電子商務工作流程中的潛力,以及特定領域資料集在因應產業特定挑戰中至關重要的角色。 -摘要:人工智慧(AI)在醫療和保健應用中投入了大量的投資和開發,進而導致醫療技術中的先進控制系統。然而,AI 系統的不透明性引發了對此類敏感應用中所需基本特性的擔憂,例如透明度和可信度。我們的研究透過調查一個程序來解決這些問題,用於選擇最充分的可解釋 AI(XAI)方法,以符合歐盟法規在醫療器材的智慧型生物電子學中的說明要求。採用的方法從透過其控制機制(開迴路、閉迴路和半閉迴路系統)對智慧型裝置進行分類,並深入探討其技術開始。然後,我們分析這些法規以定義其對各種裝置和相關目標的可解釋性要求。同時,我們透過其說明目標對 XAI 方法進行分類。這允許將法律可解釋性要求與 XAI 說明目標相匹配,並確定適當的 XAI 演算法來達成它們。我們的研究結果提供了對哪些 XAI 演算法更符合歐盟法規以適用於不同類型的醫療器材的細緻理解。我們透過不同神經植入物的實際案例研究來證明這一點,從慢性疾病管理到先進的義肢。這項研究填補了將生物電子學中的 XAI 應用與歐盟法規的嚴格規定相符的重要空白。它為開發人員和研究人員提供了一個實用的架構,確保其 AI 創新能促進醫療技術並遵守法律和道德標準。 +##### **Optimizing Model Selection for Compound AI Systems** +2502.14815v1 by Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Matei Zaharia, James Zou, Ion Stoica -##### **Towards Case-based Interpretability for Medical Federated Learning** -2408.13626v1 by Laura Latorre, Liliana Petrychenko, Regina Beets-Tan, Taisiya Kopytova, Wilson Silva +Compound AI systems that combine multiple LLM calls, such as self-refine and +multi-agent-debate, achieve strong performance on many AI tasks. 
We address a +core question in optimizing compound systems: for each LLM call or module in +the system, how should one decide which LLM to use? We show that these LLM +choices have a large effect on quality, but the search space is exponential. We +propose LLMSelector, an efficient framework for model selection in compound +systems, which leverages two key empirical insights: (i) end-to-end performance +is often monotonic in how well each module performs, with all other modules +held fixed, and (ii) per-module performance can be estimated accurately by an +LLM. Building upon these insights, LLMSelector iteratively selects one module +and allocates to it the model with the highest module-wise performance, as +estimated by an LLM, until no further gain is possible. LLMSelector is +applicable to any compound system with a bounded number of modules, and its +number of API calls scales linearly with the number of modules, achieving +high-quality model allocation both empirically and theoretically. Experiments +with popular compound systems such as multi-agent debate and self-refine using +LLMs such as GPT-4o, Claude 3.5 Sonnet and Gemini 1.5 show that LLMSelector +confers 5%-70% accuracy gains compared to using the same LLM for all modules. -We explore deep generative models to generate case-based explanations in a -medical federated learning setting. Explaining AI model decisions through -case-based interpretability is paramount to increasing trust and allowing -widespread adoption of AI in clinical practice. However, medical AI training -paradigms are shifting towards federated learning settings in order to comply -with data protection regulations. In a federated scenario, past data is -inaccessible to the current user. Thus, we use a deep generative model to -generate synthetic examples that protect privacy and explain decisions. Our -proof-of-concept focuses on pleural effusion diagnosis and uses publicly -available Chest X-ray data. 
+摘要:複合式 AI 系統結合多個 LLM 呼叫,例如自我精煉和多代理辯論,在許多 AI 任務中都能獲得強大的效能。我們解決了最佳化複合式系統中的核心問題:對於系統中的每個 LLM 呼叫或模組,應該如何決定要使用哪個 LLM?我們表明這些 LLM 選擇對品質有很大的影響,但搜尋空間是呈指數增長的。我們提出 LLMSelector,一種用於複合式系統中模型選擇的有效架構,它利用了兩個主要的經驗見解:(i) 端對端效能通常會隨著每個模組執行得有多好而單調變化,而其他所有模組保持固定,以及 (ii) 每個模組的效能都可以由 LLM 精準估計。LLMSelector 建立在這些見解之上,反覆選擇一個模組,並根據 LLM 估計的模組最佳效能,將模型分配給它,直到無法再進一步提升為止。LLMSelector 適用於任何具有有限數量的模組的複合式系統,其 API 呼叫數量與模組數量成線性比例,在經驗和理論上都實現了高品質的模型配置。使用 GPT-4o、Claude 3.5 Sonnet 和 Gemini 1.5 等 LLM,對多代理辯論和自我精煉等熱門複合式系統進行的實驗表明,與對所有模組使用相同的 LLM 相比,LLMSelector 可帶來 5%-70% 的準確度提升。 -摘要:我們探索深度生成模型,在醫療聯邦學習設置中生成基於案例的說明。透過基於案例的可解釋性來解釋 AI 模型決策,對於增加信任並允許 AI 在臨床實務中廣泛採用至關重要。然而,醫療 AI 訓練範例正轉向聯邦學習設置,以符合資料保護法規。在聯邦情境中,過去的資料對目前的使用者而言是無法取得的。因此,我們使用深度生成模型來產生保護隱私和解釋決策的合成範例。我們的概念驗證著重於胸腔積液診斷,並使用公開可取得的胸部 X 光資料。 +##### **FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis** +2502.14807v1 by Fadillah Maani, Numan Saeed, Tausifa Saleem, Zaid Farooq, Hussain Alasmawi, Werner Diehl, Ameera Mohammad, Gareth Waring, Saudabi Valappi, Leanne Bricker, Mohammad Yaqub -##### **AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines** -2408.12491v1 by Douwe J. Spaanderman, Matthew Marzetti, Xinyi Wan, Andrew F. Scarsbrook, Philip Robinson, Edwin H. G. Oei, Jacob J. Visser, Robert Hemke, Kirsten van Langevelde, David F. Hanff, Geert J. L. H. van Leenders, Cornelis Verhoef, Dirk J. Gruühagen, Wiro J. Niessen, Stefan Klein, Martijn P. A. Starmans +Foundation models are becoming increasingly effective in the medical domain, +offering pre-trained models on large datasets that can be readily adapted for +downstream tasks. Despite progress, fetal ultrasound images remain a +challenging domain for foundation models due to their inherent complexity, +often requiring substantial additional training and facing limitations due to +the scarcity of paired multimodal data. 
To overcome these challenges, here we +introduce FetalCLIP, a vision-language foundation model capable of generating +universal representation of fetal ultrasound images. FetalCLIP was pre-trained +using a multimodal learning approach on a diverse dataset of 210,035 fetal +ultrasound images paired with text. This represents the largest paired dataset +of its kind used for foundation model development to date. This unique training +approach allows FetalCLIP to effectively learn the intricate anatomical +features present in fetal ultrasound images, resulting in robust +representations that can be used for a variety of downstream applications. In +extensive benchmarking across a range of key fetal ultrasound applications, +including classification, gestational age estimation, congenital heart defect +(CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all +baselines while demonstrating remarkable generalizability and strong +performance even with limited labeled data. We plan to release the FetalCLIP +model publicly for the benefit of the broader scientific community. -Soft-tissue and bone tumours (STBT) are rare, diagnostically challenging -lesions with variable clinical behaviours and treatment approaches. This -systematic review provides an overview of Artificial Intelligence (AI) methods -using radiological imaging for diagnosis and prognosis of these tumours, -highlighting challenges in clinical translation, and evaluating study alignment -with the Checklist for AI in Medical Imaging (CLAIM) and the FUTURE-AI -international consensus guidelines for trustworthy and deployable AI to promote -the clinical translation of AI methods. The review covered literature from -several bibliographic databases, including papers published before 17/07/2024. -Original research in peer-reviewed journals focused on radiology-based AI for -diagnosing or prognosing primary STBT was included. 
Exclusion criteria were -animal, cadaveric, or laboratory studies, and non-English papers. Abstracts -were screened by two of three independent reviewers for eligibility. Eligible -papers were assessed against guidelines by one of three independent reviewers. -The search identified 15,015 abstracts, from which 325 articles were included -for evaluation. Most studies performed moderately on CLAIM, averaging a score -of 28.9$\pm$7.5 out of 53, but poorly on FUTURE-AI, averaging 5.1$\pm$2.1 out -of 30. Imaging-AI tools for STBT remain at the proof-of-concept stage, -indicating significant room for improvement. Future efforts by AI developers -should focus on design (e.g. define unmet clinical need, intended clinical -setting and how AI would be integrated in clinical workflow), development (e.g. -build on previous work, explainability), evaluation (e.g. evaluating and -addressing biases, evaluating AI against best practices), and data -reproducibility and availability (making documented code and data publicly -available). Following these recommendations could improve clinical translation -of AI methods. 
+摘要:基礎模型在醫療領域正變得越來越有效, +提供在大型資料集上預先訓練的模型,可輕鬆適應 +下游任務。儘管有進展,但胎兒超音波影像仍然是 +基礎模型的挑戰領域,因為它們固有的複雜性, +通常需要大量的額外訓練,並且由於配對多模態數據的稀缺而面臨限制。為了克服這些挑戰,我們在此 +介紹 FetalCLIP,一種能夠產生 +胎兒超音波影像通用表示的視覺語言基礎模型。FetalCLIP 使用多模態學習方法在包含 210,035 張胎兒 +超音波影像與文字配對的多樣化資料集上進行預訓練。這代表迄今為止用於基礎模型開發的最大配對資料集。這種獨特的訓練 +方法使 FetalCLIP 能夠有效地學習胎兒超音波影像中存在的複雜解剖特徵,從而產生強大的 +表示,可應用於各種下游應用。在涵蓋一系列關鍵胎兒超音波應用(包括分類、胎齡估算、先天性心臟缺陷 +(CHD) 偵測和胎兒結構分割)的廣泛基準測試中,FetalCLIP 優於所有基準,同時展現出卓越的泛化能力,並且即使標記資料有限也能保持強勁的 +效能。我們計畫公開發布 FetalCLIP 模型,造福廣大的科學界。 -摘要:軟組織和骨骼腫瘤(STBT)是罕見、診斷具有挑戰性的病灶,其臨床行為和治療方法各不相同。這篇系統性回顧提供了使用放射影像進行診斷和預後的人工智慧 (AI) 方法的概觀,重點說明了臨床轉譯的挑戰,並評估研究與醫療影像 AI 核查表 (CLAIM) 和 FUTURE-AI 可信賴且可部署 AI 的國際共識準則的一致性,以促進 AI 方法的臨床轉譯。這篇回顧涵蓋了幾個書目資料庫中的文獻,包括在 2024 年 7 月 17 日之前發表的論文。納入了以放射為基礎的 AI 診斷或預後原發性 STBT 的同行評審期刊中的原始研究。排除標準是動物、屍體或實驗室研究,以及非英文論文。摘要由三位獨立審查員中的兩位篩選資格。合格的論文由三位獨立審查員中的一位根據準則進行評估。搜索識別出 15,015 篇摘要,其中 325 篇文章被納入評估。大多數研究在 CLAIM 中表現中等,平均得分為 53 分中的 28.9±7.5 分,但在 FUTURE-AI 中表現不佳,平均得分為 30 分中的 5.1±2.1 分。STBT 的影像 AI 工具仍處於概念驗證階段,表明有顯著的改進空間。AI 開發人員未來的努力應集中在設計(例如定義未滿足的臨床需求、預期的臨床環境以及 AI 如何整合到臨床工作流程中)、開發(例如建立在先前的工作、可解釋性)、評估(例如評估和解決偏差、評估 AI 與最佳實務)、以及數據可複製性和可用性(公開提供文件化的代碼和數據)。遵循這些建議可以改善 AI 方法的臨床轉譯。 +##### **From RAG to Memory: Non-Parametric Continual Learning for Large Language Models** +2502.14802v1 by Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, Yu Su -##### **Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy** -2409.00001v1 by Kimji N. Pellano, Inga Strümke, Daniel Groos, Lars Adde, Espen Alexander F. Ihlen +Our ability to continuously acquire, organize, and leverage knowledge is a +key feature of human intelligence that AI systems must approximate to unlock +their full potential. Given the challenges in continual learning with large +language models (LLMs), retrieval-augmented generation (RAG) has become the +dominant way to introduce new information. However, its reliance on vector +retrieval hinders its ability to mimic the dynamic and interconnected nature of +human long-term memory.
Recent RAG approaches augment vector embeddings with +various structures like knowledge graphs to address some of these gaps, namely +sense-making and associativity. However, their performance on more basic +factual memory tasks drops considerably below standard RAG. We address this +unintended deterioration and propose HippoRAG 2, a framework that outperforms +standard RAG comprehensively on factual, sense-making, and associative memory +tasks. HippoRAG 2 builds upon the Personalized PageRank algorithm used in +HippoRAG and enhances it with deeper passage integration and more effective +online use of an LLM. This combination pushes this RAG system closer to the +effectiveness of human long-term memory, achieving a 7% improvement in +associative memory tasks over the state-of-the-art embedding model while also +exhibiting superior factual knowledge and sense-making memory capabilities. +This work paves the way for non-parametric continual learning for LLMs. Our +code and data will be released at https://github.com/OSU-NLP-Group/HippoRAG. -Early detection of Cerebral Palsy (CP) is crucial for effective intervention -and monitoring. This paper tests the reliability and applicability of -Explainable AI (XAI) methods using a deep learning method that predicts CP by -analyzing skeletal data extracted from video recordings of infant movements. -Specifically, we use XAI evaluation metrics -- namely faithfulness and -stability -- to quantitatively assess the reliability of Class Activation -Mapping (CAM) and Gradient-weighted Class Activation Mapping (Grad-CAM) in this -specific medical application. We utilize a unique dataset of infant movements -and apply skeleton data perturbations without distorting the original dynamics -of the infant movements. Our CP prediction model utilizes an ensemble approach, -so we evaluate the XAI metrics performances for both the overall ensemble and -the individual models. 
Our findings indicate that both XAI methods effectively -identify key body points influencing CP predictions and that the explanations -are robust against minor data perturbations. Grad-CAM significantly outperforms -CAM in the RISv metric, which measures stability in terms of velocity. In -contrast, CAM performs better in the RISb metric, which relates to bone -stability, and the RRS metric, which assesses internal representation -robustness. Individual models within the ensemble show varied results, and -neither CAM nor Grad-CAM consistently outperform the other, with the ensemble -approach providing a representation of outcomes from its constituent models. +摘要:我們持續獲取、組織和利用知識的能力是人類智慧的一項關鍵特徵,而人工智慧系統必須近似於此才能發揮其全部潛力。鑑於大型語言模型 (LLM) 在持續學習上面臨的挑戰,檢索增強生成 (RAG) 已成為引入新資訊的主流方式。然而,它對向量檢索的依賴阻礙了它模擬人類長期記憶動態且相互連結的本質。最近的 RAG 方法用各種結構(如知識圖譜)增強向量嵌入,以解決其中一些差距,即意義建構和聯想性。然而,它們在更基本的事實記憶任務上的表現遠低於標準 RAG。我們解決了這種意外的惡化,並提出了 HippoRAG 2,這是一個在事實、意義建構和聯想記憶任務上全面優於標準 RAG 的框架。HippoRAG 2 建立在 HippoRAG 中使用的 Personalized PageRank 演算法之上,並透過更深入的段落整合和更有效的 LLM 線上使用來增強它。這種組合將此 RAG 系統推向更接近人類長期記憶的效能,在聯想記憶任務上比最先進的嵌入模型提升了 7%,同時也展現出優異的事實知識和意義建構記憶能力。這項工作為 LLM 的非參數持續學習鋪平了道路。我們的程式碼和資料將在 https://github.com/OSU-NLP-Group/HippoRAG 上發布。 -摘要:腦性麻痺 (CP) 的早期偵測對於有效的介入和監測至關重要。本文測試了可解釋 AI (XAI) 方法的可靠性和適用性,使用深度學習方法,透過分析從嬰兒動作影片記錄中提取的骨骼資料來預測 CP。具體來說,我們使用 XAI 評估指標(即忠實度和穩定性)來量化評估類別激活映射 (CAM) 和梯度加權類別激活映射 (Grad-CAM) 在這個特定醫療應用中的可靠性。我們利用一個獨特的嬰兒動作資料集,並應用骨骼資料擾動,而不會扭曲嬰兒動作的原始動力。我們的 CP 預測模型利用整體方法,因此我們評估了整體整體和個別模型的 XAI 指標表現。我們的研究結果表明,兩種 XAI 方法都能有效識別影響 CP 預測的關鍵身體部位,並且這些解釋對於微小的資料擾動具有魯棒性。Grad-CAM 在 RISv 指標中顯著優於 CAM,該指標衡量速度方面的穩定性。相比之下,CAM 在 RISb 指標中表現得更好,該指標與骨骼穩定性有關,而 RRS 指標則評估內部表示的魯棒性。整體中的個別模型顯示出不同的結果,CAM 和 Grad-CAM 都不一致地優於另一種,整體方法提供了其組成模型結果的表示。 +##### **A Survey on Text-Driven 360-Degree Panorama Generation** +2502.14799v1 by Hai Wang, Xiaoyu Xiang, Weihao Xia, Jing-Hao Xue -##### **MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy** -2408.11837v1 by Hanchen David Wang, Nibraas Khan, Anna Chen, Nilanjan
Sarkar, Pamela Wisniewski, Meiyi Ma +The advent of text-driven 360-degree panorama generation, enabling the +synthesis of 360-degree panoramic images directly from textual descriptions, +marks a transformative advancement in immersive visual content creation. This +innovation significantly simplifies the traditionally complex process of +producing such content. Recent progress in text-to-image diffusion models has +accelerated the rapid development in this emerging field. This survey presents +a comprehensive review of text-driven 360-degree panorama generation, offering +an in-depth analysis of state-of-the-art algorithms and their expanding +applications in 360-degree 3D scene generation. Furthermore, we critically +examine current limitations and propose promising directions for future +research. A curated project page with relevant resources and research papers is +available at https://littlewhitesea.github.io/Text-Driven-Pano-Gen/. -Recent global estimates suggest that as many as 2.41 billion individuals have -health conditions that would benefit from rehabilitation services. Home-based -Physical Therapy (PT) faces significant challenges in providing interactive -feedback and meaningful observation for therapists and patients. To fill this -gap, we present MicroXercise, which integrates micro-motion analysis with -wearable sensors, providing therapists and patients with a comprehensive -feedback interface, including video, text, and scores. Crucially, it employs -multi-dimensional Dynamic Time Warping (DTW) and attribution-based explainable -methods to analyze the existing deep learning neural networks in monitoring -exercises, focusing on a high granularity of exercise. This synergistic -approach is pivotal, providing output matching the input size to precisely -highlight critical subtleties and movements in PT, thus transforming complex AI -analysis into clear, actionable feedback. 
By highlighting these micro-motions -in different metrics, such as stability and range of motion, MicroXercise -significantly enhances the understanding and relevance of feedback for -end-users. Comparative performance metrics underscore its effectiveness over -traditional methods, such as a 39% and 42% improvement in Feature Mutual -Information (FMI) and Continuity. MicroXercise is a step ahead in home-based -physical therapy, providing a technologically advanced and intuitively helpful -solution to enhance patient care and outcomes. +摘要:文字驅動 360 度全景圖生成技術的出現,使人們能夠直接從文字描述合成 360 度全景圖像,標誌著沉浸式視覺內容創作的變革性進展。這項創新顯著簡化了傳統上複雜的此類內容製作過程。最近在文字轉圖像擴散模型方面的進展加速了這個新興領域的發展。本綜述對文字驅動 360 度全景圖生成進行了全面回顧,深入分析了最先進的演算法及其在 360 度 3D 場景生成中不斷擴展的應用。此外,我們批判性地審視了當前的限制,並提出了未來研究的有希望的方向。一個精選的專案頁面,其中包含相關資源和研究論文,可在 https://littlewhitesea.github.io/Text-Driven-Pano-Gen/ 取得。 -摘要:最近的全球估計表明,多達 24.1 億人有 -健康狀況可從復健服務中受益。居家 -物理治療 (PT) 在提供互動式 -回饋和有意義的觀察方面面臨重大挑戰,供治療師和患者使用。為了填補這 -個缺口,我們提出 MicroXercise,它將微動作分析與 -可穿戴式感測器整合在一起,為治療師和患者提供一個全面的 -回饋介面,包括影片、文字和分數。至關重要的是,它採用 -多維動態時間規整 (DTW) 和基於歸因的可解釋 -方法來分析監控運動中現有的深度學習神經網路,專注於運動的高粒度。這種協同 -方法至關重要,提供與輸入大小匹配的輸出,以精確地 -突出 PT 中關鍵的細微差別和動作,從而將複雜的 AI -分析轉換為清晰、可操作的回饋。透過在不同指標中突顯這些微動作,例如穩定性和動作範圍,MicroXercise -顯著提升最終使用者對回饋的理解和相關性。比較效能指標強調其優於 -傳統方法的有效性,例如特徵互惠資訊 (FMI) 和連續性分別提升了 39% 和 42%。MicroXercise 在居家 -物理治療方面更進一步,提供技術先進且直覺有用的 -解決方案,以提升患者照護和結果。 +##### **Rapid Word Learning Through Meta In-Context Learning** +2502.14791v1 by Wentao Wang, Guangyuan Jiang, Tal Linzen, Brenden M. Lake -##### **The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development** -2408.05239v1 by Joshua Morriss, Tod Brindle, Jessica Bah Rösman, Daniel Reibsamen, Andreas Enz +Humans can quickly learn a new word from a few illustrative examples, and +then systematically and flexibly use it in novel contexts.
Yet the abilities of +current language models for few-shot word learning, and methods for improving +these abilities, are underexplored. In this study, we introduce a novel method, +Meta-training for IN-context learNing Of Words (Minnow). This method trains +language models to generate new examples of a word's usage given a few +in-context examples, using a special placeholder token to represent the new +word. This training is repeated on many new words to develop a general +word-learning ability. We find that training models from scratch with Minnow on +human-scale child-directed language enables strong few-shot word learning, +comparable to a large language model (LLM) pre-trained on orders of magnitude +more data. Furthermore, through discriminative and generative evaluations, we +demonstrate that finetuning pre-trained LLMs with Minnow improves their ability +to discriminate between new words, identify syntactic categories of new words, +and generate reasonable new usages and definitions for new words, based on one +or a few in-context examples. These findings highlight the data efficiency of +Minnow and its potential to improve language model performance in word learning +tasks. -Systematic literature reviews are the highest quality of evidence in -research. However, the review process is hindered by significant resource and -data constraints. The Literature Review Network (LRN) is the first of its kind -explainable AI platform adhering to PRISMA 2020 standards, designed to automate -the entire literature review process. LRN was evaluated in the domain of -surgical glove practices using 3 search strings developed by experts to query -PubMed. A non-expert trained all LRN models. Performance was benchmarked -against an expert manual review. Explainability and performance metrics -assessed LRN's ability to replicate the experts' review. Concordance was -measured with the Jaccard index and confusion matrices. 
Researchers were -blinded to the other's results until study completion. Overlapping studies were -integrated into an LRN-generated systematic review. LRN models demonstrated -superior classification accuracy without expert training, achieving 84.78% and -85.71% accuracy. The highest performance model achieved high interrater -reliability (k = 0.4953) and explainability metrics, linking 'reduce', -'accident', and 'sharp' with 'double-gloving'. Another LRN model covered 91.51% -of the relevant literature despite diverging from the non-expert's judgments (k -= 0.2174), with the terms 'latex', 'double' (gloves), and 'indication'. LRN -outperformed the manual review (19,920 minutes over 11 months), reducing the -entire process to 288.6 minutes over 5 days. This study demonstrates that -explainable AI does not require expert training to successfully conduct -PRISMA-compliant systematic literature reviews like an expert. LRN summarized -the results of surgical glove studies and identified themes that were nearly -identical to the clinical researchers' findings. Explainable AI can accurately -expedite our understanding of clinical practices, potentially revolutionizing -healthcare research. 
+摘要:人類可以從幾個說明性的範例中快速學習一個新字詞,然後系統性且靈活地將其用於新的脈絡中。然而,目前語言模型在少量字詞學習中的能力,以及改善這些能力的方法,尚未得到充分探討。在這項研究中,我們引入了一種新方法,即「用於字詞情境學習的元訓練」(Minnow)。此方法訓練語言模型在給定幾個情境範例的情況下,產生字詞用法的範例,並使用特殊佔位符標記來表示新的字詞。此訓練會在許多新字詞上重複進行,以培養一般的字詞學習能力。我們發現,從頭開始使用 Minnow 在人類規模的兒童導向語言上訓練模型,可以實現強大的少量字詞學習能力,這與預先在大量資料上訓練的大型語言模型 (LLM) 相當。此外,透過區辨性和生成性評估,我們證明使用 Minnow 微調預先訓練的 LLM 可以提升其區辨新字詞、識別新字詞的句法類別,以及根據一個或幾個情境範例產生合理的新用法和定義的能力。這些發現突顯了 Minnow 的資料效率,以及它在字詞學習任務中提升語言模型效能的潛力。 -摘要:系統性文獻回顧是研究中證據品質最高的。然而,回顧過程受到顯著資源和資料限制的阻礙。文獻回顧網路 (LRN) 是第一個遵循 PRISMA 2020 標準的可解釋 AI 平台,旨在自動化整個文獻回顧過程。LRN 在外科手套實務領域中進行評估,使用專家開發的 3 個搜尋字串來查詢 PubMed。非專家訓練所有 LRN 模型。效能以專家手動回顧作為基準。可解釋性和效能指標評估 LRN 複製專家回顧的能力。一致性以 Jaccard 指數和混淆矩陣測量。研究人員在研究完成前對彼此的結果保密。重疊的研究整合到 LRN 生成的系統性回顧中。LRN 模型在沒有專家訓練的情況下展現出優異的分類準確率,達到 84.78% 和 85.71% 的準確率。效能最高的模型達到了高評分者間信賴度 (k = 0.4953) 和可解釋性指標,將「減少」、「意外」和「銳利」與「雙重戴手套」連結在一起。另一個 LRN 模型涵蓋了 91.51% 的相關文獻,儘管與非專家的判斷不同 (k = 0.2174),但包含了「乳膠」、「雙重」(手套)和「適應症」等詞彙。LRN 優於手動回顧(11 個月超過 19,920 分鐘),將整個過程縮短為 5 天超過 288.6 分鐘。這項研究顯示,可解釋的 AI 不需要專家訓練即可成功進行專家等級的 PRISMA 相容系統性文獻回顧。LRN 總結了外科手套研究的結果,並找出與臨床研究人員發現幾乎相同的主题。可解釋的 AI 可以準確地加快我們對臨床實務的理解,有潛力革新醫療保健研究。 +##### **SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features** +2502.14786v1 by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai -##### **Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns** -2408.02709v1 by Chi Him Ng +We introduce SigLIP 2, a family of new multilingual vision-language encoders +that build on the success of the original SigLIP. 
In this second iteration, we +extend the original image-text training objective with several prior, +independently developed techniques into a unified recipe -- this includes +captioning-based pretraining, self-supervised losses (self-distillation, masked +prediction) and online data curation. With these changes, SigLIP 2 models +outperform their SigLIP counterparts at all model scales in core capabilities, +including zero-shot classification, image-text retrieval, and transfer +performance when extracting visual representations for Vision-Language Models +(VLMs). Furthermore, the new training recipe leads to significant improvements +on localization and dense prediction tasks. We also train variants which +support multiple resolutions and preserve the input's native aspect ratio. +Finally, we train on a more diverse data-mixture that includes de-biasing +techniques, leading to much better multilingual understanding and improved +fairness. To allow users to trade off inference cost with performance, we +release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), +and g (1B). -This study analyzes hybrid AI systems' design patterns and their -effectiveness in clinical decision-making using the boxology framework. It -categorizes and copares various architectures combining machine learning and -rule-based reasoning to provide insights into their structural foundations and -healthcare applications. Addressing two main questions, how to categorize these -systems againts established design patterns and how to extract insights through -comparative analysis, the study uses design patterns from software engineering -to understand and optimize healthcare AI systems. Boxology helps identify -commonalities and create reusable solutions, enhancing these systems' -scalability, reliability, and performance. Five primary architectures are -examined: REML, MLRB, RBML, RMLT, and PERML. 
Each has unique strengths and -weaknesses, highlighting the need for tailored approaches in clinical tasks. -REML excels in high-accuracy prediction for datasets with limited data; MLRB in -handling large datasets and complex data integration; RBML in explainability -and trustworthiness; RMLT in managing high-dimensional data; and PERML, though -limited in analysis, shows promise in urgent care scenarios. The study -introduces four new patterns, creates five abstract categorization patterns, -and refines those five further to specific systems. These contributions enhance -Boxlogy's taxonomical organization and offer novel approaches to integrating -expert knowledge with machine learning. Boxology's structured, modular apporach -offers significant advantages in developing and analyzing hybrid AI systems, -revealing commonalities, and promoting reusable solutions. In conclusion, this -study underscores hybrid AI systems' crucial role in advancing healthcare and -Boxology's potential to drive further innovation in AI integration, ultimately -improving clinical decision support and patient outcomes. 
+摘要:我們推出了 SigLIP 2,這是一個新的多語言視覺語言編碼器系列,它建立在原始 SigLIP 的成功基礎上。在這第二個版本中,我們將原來的圖像文字訓練目標與幾個先前獨立開發的技術整合為一個統一的配方,其中包括基於標題生成的預訓練、自我監督損失(自我蒸餾、遮罩預測)和線上數據策展。有了這些改變,SigLIP 2 模型在所有模型規模上的核心能力都超越了對應的 SigLIP 模型,包括零樣本分類、圖像文字檢索,以及為視覺語言模型 (VLM) 提取視覺表示時的遷移效能。此外,新的訓練配方也大幅改善了定位和密集預測任務。我們還訓練了支援多種解析度並保留輸入原生長寬比的變體。最後,我們在一個包含去偏見技術、更為多樣化的數據組合上進行訓練,從而大幅提升多語言理解力並改善公平性。為了讓使用者在推理成本與效能之間取捨,我們發布了四種大小的模型檢查點:ViT-B (86M)、L (303M)、So400m (400M) 和 g (1B)。 -摘要:本研究使用盒子學框架分析混合人工智慧系統的設計模式及其在臨床決策中的有效性。它分類並比較結合機器學習和基於規則的推理的各種架構,以深入了解其結構基礎和醫療保健應用。針對兩個主要問題,如何根據既定的設計模式對這些系統進行分類,以及如何通過比較分析提取見解,本研究使用軟體工程中的設計模式來了解和優化醫療保健人工智慧系統。盒子學有助於識別共性並建立可重複使用的解決方案,從而增強這些系統的可擴充性、可靠性和效能。檢查了五種主要的架構:REML、MLRB、RBML、RMLT 和 PERML。每種架構都有獨特的優缺點,強調了在臨床任務中需要量身打造的方法。REML 在資料有限的資料集中表現出高精度的預測;MLRB 在處理大型資料集和複雜資料整合方面表現出色;RBML 在可解釋性和可信度方面表現出色;RMLT 在管理高維資料方面表現出色;而 PERML 儘管在分析方面有限,但在緊急照護場景中表現出潛力。本研究引入了四種新模式,建立了五種抽象分類模式,並進一步將這五種模式細化為具體的系統。這些貢獻增強了盒子學的分類組織,並提供了將專家知識與機器學習整合的新方法。盒子學的結構化、模組化方法在開發和分析混合人工智慧系統、揭示共性以及推廣可重複使用的解決方案方面具有顯著優勢。總之,本研究強調了混合人工智慧系統在推進醫療保健中的關鍵作用,以及盒子學在推動人工智慧整合進一步創新方面的潛力,最終改善臨床決策支援和患者的治療成果。 +##### **ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting** +2502.14780v1 by Abhijit Mishra, Richard Noh, Hsiang Fu, Mingda Li, Minji Kim -##### **Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability** -2408.02706v1 by Masoud Muhammed Hassan +Efficient and privacy-preserving multimodal interaction is essential as AR, +VR, and modern smartphones with powerful cameras become primary interfaces for +human-computer communication. Existing powerful large vision-language models +(VLMs) enabling multimodal interaction often rely on cloud-based processing, +raising significant concerns about (1) visual privacy by transmitting sensitive +vision data to servers, and (2) their limited real-time, on-device usability.
+This paper explores Visual Instruction Rewriting, a novel approach that +transforms multimodal instructions into text-only commands, allowing seamless +integration of lightweight on-device instruction rewriter VLMs (250M +parameters) with existing conversational AI systems, enhancing vision data +privacy. To achieve this, we present a dataset of over 39,000 examples across +14 domains and develop a compact VLM, pretrained on image captioning datasets +and fine-tuned for instruction rewriting. Experimental results, evaluated +through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic +parsing analysis, demonstrate that even a quantized version of the model +(<500MB storage footprint) can achieve effective instruction rewriting, thus +enabling privacy-focused, multimodal AI applications. -Because of its strong predictive skills, deep learning has emerged as an -essential tool in many industries, including healthcare. Traditional deep -learning models, on the other hand, frequently lack interpretability and omit -to take prediction uncertainty into account two crucial components of clinical -decision making. In order to produce explainable and uncertainty aware -predictions, this study presents a novel framework called Bayesian Kolmogorov -Arnold Networks (BKANs), which combines the expressive capacity of Kolmogorov -Arnold Networks with Bayesian inference. We employ BKANs on two medical -datasets, which are widely used benchmarks for assessing machine learning -models in medical diagnostics: the Pima Indians Diabetes dataset and the -Cleveland Heart Disease dataset. Our method provides useful insights into -prediction confidence and decision boundaries and outperforms traditional deep -learning models in terms of prediction accuracy. Moreover, BKANs' capacity to -represent aleatoric and epistemic uncertainty guarantees doctors receive more -solid and trustworthy decision support. 
Our Bayesian strategy improves the -interpretability of the model and considerably minimises overfitting, which is -important for tiny and imbalanced medical datasets, according to experimental -results. We present possible expansions to further use BKANs in more -complicated multimodal datasets and address the significance of these -discoveries for future research in building reliable AI systems for healthcare. -This work paves the way for a new paradigm in deep learning model deployment in -vital sectors where transparency and reliability are crucial. +摘要:高效且重視隱私的多模態互動至關重要,因為 AR、VR 和配備強大相機的現代智慧型手機已成為人機溝通的主要介面。現有的強大大型視覺語言模型 (VLM) 能支援多模態互動,通常仰賴雲端處理,這引發了重大的疑慮,包括:(1) 將敏感的視覺資料傳輸至伺服器,會造成視覺隱私問題,以及 (2) 其有限的即時、裝置上可用性。本文探討視覺指令改寫,這是一種新穎的方法,可將多模態指令轉換為純文字指令,讓輕量級的裝置上指令改寫 VLM (250M 參數) 與現有的對話式 AI 系統無縫整合,進而強化視覺資料的隱私。為達成此目標,我們提供一個跨越 14 個領域、超過 39,000 個範例的資料集,並開發一個精簡的 VLM,在圖片標題資料集上進行預訓練,並針對指令改寫進行微調。實驗結果透過 NLG 指標(例如 BLEU、METEOR 和 ROUGE)以及語意解析分析進行評估,證明即使是模型的量化版本(<500MB 儲存空間佔用量)也能有效執行指令改寫,進而支援注重隱私的多模態 AI 應用程式。 + +##### **Harnessing PDF Data for Improving Japanese Large Multimodal Models** +2502.14778v1 by Jeonghun Baek, Akiko Aizawa, Kiyoharu Aizawa + +Large Multimodal Models (LMMs) have demonstrated strong performance in +English, but their effectiveness in Japanese remains limited due to the lack of +high-quality training data. Current Japanese LMMs often rely on translated +English datasets, restricting their ability to capture Japan-specific cultural +knowledge. To address this, we explore the potential of Japanese PDF data as a +training resource, an area that remains largely underutilized. We introduce a +fully automated pipeline that leverages pretrained models to extract image-text +pairs from PDFs through layout analysis, OCR, and vision-language pairing, +removing the need for manual annotation. Additionally, we construct instruction +data from extracted image-text pairs to enrich the training data. 
To evaluate +the effectiveness of PDF-derived data, we train Japanese LMMs and assess their +performance on the Japanese LMM Benchmark. Our results demonstrate substantial +improvements, with performance gains ranging from 3.9% to 13.8% on Heron-Bench. +Further analysis highlights the impact of PDF-derived data on various factors, +such as model size and language models, reinforcing its value as a multimodal +resource for Japanese LMMs. We plan to make the source code and data publicly +available upon acceptance. -摘要:由於其強大的預測能力,深度學習已成為許多產業中不可或缺的工具,包括醫療保健。然而,傳統的深度學習模型通常缺乏可解釋性,並且忽略了將預測不確定性納入考量,而這兩個因素是臨床決策制定的關鍵組成部分。為了產生可解釋且具有不確定性意識的預測,本研究提出了一個名為貝氏柯爾莫哥洛夫阿諾德網路 (BKAN) 的新架構,它結合了柯爾莫哥洛夫阿諾德網路的表達能力與貝氏推論。我們在兩個醫學資料集上使用 BKAN,這些資料集是評估機器學習模型在醫學診斷中的廣泛使用基準:皮馬印第安人糖尿病資料集和克里夫蘭心臟病資料集。我們的模型提供了對預測信心和決策邊界的有益見解,並且在預測準確度方面優於傳統的深度學習模型。此外,BKAN 表現隨機和認識不確定性的能力,可確保醫生獲得更可靠且值得信賴的決策支援。根據實驗結果,我們的貝氏策略提高了模型的可解釋性,並大幅減少了過度擬合,這對於小型且不平衡的醫學資料集非常重要。我們提出了可能的擴充功能,以進一步將 BKAN 用於更複雜的多模式資料集,並探討這些發現對於未來建立可靠的醫療保健 AI 系統研究的重要性。這項工作為深度學習模型部署在透明度和可靠性至關重要的重要領域中開啟了一個新的典範。 +摘要:大型多模態模型 (LMM) 已在英語中表現出強勁的效能,但由於缺乏高品質的訓練資料,它們在日語中的效能仍然有限。目前的日語 LMM 通常依賴於翻譯後的英語資料集,限制了它們擷取特定於日本的文化知識的能力。為了解決這個問題,我們探索了日語 PDF 資料作為訓練資源的潛力,這個領域在很大程度上仍然未被充分利用。我們引入了一個全自動的管道,利用預先訓練好的模型透過版面分析、光學字元辨識和視覺語言配對從 PDF 中擷取影像文字對,消除了手動註解的需要。此外,我們從擷取的影像文字對中建構說明資料,以豐富訓練資料。為了評估 PDF 衍生資料的效能,我們訓練了日語 LMM,並在日語 LMM 基準上評估它們的效能。我們的結果證明了顯著的進步,在 Heron-Bench 上的效能提升幅度從 3.9% 到 13.8%。進一步的分析重點說明了 PDF 衍生資料對各種因素的影響,例如模型大小和語言模型,加強了其作為日語 LMM 的多模態資源的價值。我們計畫在接受後公開原始程式碼和資料。 -##### **MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI** -2407.20284v1 by Shyam Dongre, Ritesh Chandra, Sonali Agarwal +##### **Making Universal Policies Universal** +2502.14777v1 by Niklas Höpner, David Kuric, Herke van Hoof -In modern healthcare, addressing the complexities of accurate disease -prediction and personalized recommendations is both crucial and challenging. 
-This research introduces MLtoGAI, which integrates Semantic Web technology with -Machine Learning (ML) to enhance disease prediction and offer user-friendly -explanations through ChatGPT. The system comprises three key components: a -reusable disease ontology that incorporates detailed knowledge about various -diseases, a diagnostic classification model that uses patient symptoms to -detect specific diseases accurately, and the integration of Semantic Web Rule -Language (SWRL) with ontology and ChatGPT to generate clear, personalized -health advice. This approach significantly improves prediction accuracy and -ensures results that are easy to understand, addressing the complexity of -diseases and diverse symptoms. The MLtoGAI system demonstrates substantial -advancements in accuracy and user satisfaction, contributing to developing more -intelligent and accessible healthcare solutions. This innovative approach -combines the strengths of ML algorithms with the ability to provide -transparent, human-understandable explanations through ChatGPT, achieving -significant improvements in prediction accuracy and user comprehension. By -leveraging semantic technology and explainable AI, the system enhances the -accuracy of disease prediction and ensures that the recommendations are -relevant and easily understood by individual patients. Our research highlights -the potential of integrating advanced technologies to overcome existing -challenges in medical diagnostics, paving the way for future developments in -intelligent healthcare systems. Additionally, the system is validated using 200 -synthetic patient data records, ensuring robust performance and reliability. +The development of a generalist agent capable of solving a wide range of +sequential decision-making tasks remains a significant challenge. We address +this problem in a cross-agent setup where agents share the same observation +space but differ in their action spaces. 
Our approach builds on the universal +policy framework, which decouples policy learning into two stages: a +diffusion-based planner that generates observation sequences and an inverse +dynamics model that assigns actions to these plans. We propose a method for +training the planner on a joint dataset composed of trajectories from all +agents. This method offers the benefit of positive transfer by pooling data +from different agents, while the primary challenge lies in adapting shared +plans to each agent's unique constraints. We evaluate our approach on the +BabyAI environment, covering tasks of varying complexity, and demonstrate +positive transfer across agents. Additionally, we examine the planner's +generalisation ability to unseen agents and compare our method to traditional +imitation learning approaches. By training on a pooled dataset from multiple +agents, our universal policy achieves an improvement of up to $42.20\%$ in task +completion accuracy compared to a policy trained on a dataset from a single +agent. 
-摘要:在現代醫療保健中,解決準確疾病預測和個性化建議的複雜性既至關重要又具有挑戰性。本研究引入了 MLtoGAI,它將語義網路技術與機器學習 (ML) 相結合,以增強疾病預測並透過 ChatGPT 提供使用者友善的說明。該系統包含三個關鍵組成部分:一個可重複使用的疾病本体,其中包含有關各種疾病的詳細知識;一個診斷分類模型,它使用患者症狀來準確檢測特定疾病;以及語義網路規則語言 (SWRL) 與本体和 ChatGPT 的整合,以產生清晰、個性化的健康建議。這種方法顯著提高了預測準確性,並確保了易於理解的結果,解決了疾病和不同症狀的複雜性。MLtoGAI 系統展示了準確性和使用者滿意度的實質性進步,有助於開發更智慧且更易於取得的醫療保健解決方案。這種創新的方法結合了 ML 演算法的優點,以及透過 ChatGPT 提供透明且人類可以理解的說明的能力,在預測準確性和使用者理解方面取得了顯著的進步。透過利用語義技術和可解釋的 AI,該系統提高了疾病預測的準確性,並確保了建議與個別患者相關且易於理解。我們的研究強調了整合先進技術以克服醫療診斷中現有挑戰的潛力,為智慧醫療保健系統的未來發展鋪路。此外,該系統使用 200 個合成患者資料記錄進行驗證,確保了穩健的效能和可靠性。 +摘要:開發一種能夠解決廣泛順序決策任務的通才代理仍然是一項重大挑戰。我們在跨代理設置中解決這個問題,其中代理共享相同的觀察空間,但動作空間各不相同。我們的做法建立在通用策略框架之上,該框架將策略學習解耦為兩個階段:生成觀察序列的基於擴散的規劃器,以及將動作分配給這些計劃的逆動態模型。我們提出了一種在由所有代理的軌跡組成的聯合數據集上訓練規劃器的方法。這種方法的好處是透過彙總來自不同代理的數據實現正向遷移,而主要的挑戰在於讓共享計劃適應每個代理各自的約束。我們在 BabyAI 環境中評估了我們的做法,涵蓋了不同複雜程度的任務,並展示了跨代理的正向遷移。此外,我們檢查了規劃器對未見代理的泛化能力,並將我們的方法與傳統的模仿學習方法進行了比較。相較於僅以單一代理的數據集訓練的策略,我們以多個代理的彙總數據集訓練的通用策略在任務完成準確度上最多提升了 42.20%。 -##### **Introducing δ-XAI: a novel sensitivity-based method for local AI explanations** -2407.18343v2 by Alessandro De Carlo, Enea Parimbelli, Nicola Melillo, Giovanna Nicora +##### **SurveyX: Academic Survey Automation via Large Language Models** +2502.14776v1 by Xun Liang, Jiawei Yang, Yezhaohui Wang, Chen Tang, Zifan Zheng, Simin Niu, Shichao Song, Hanyu Wang, Bo Tang, Feiyu Xiong, Keming Mao, Zhiyu li -Explainable Artificial Intelligence (XAI) is central to the debate on -integrating Artificial Intelligence (AI) and Machine Learning (ML) algorithms -into clinical practice. High-performing AI/ML models, such as ensemble learners -and deep neural networks, often lack interpretability, hampering clinicians' -trust in their predictions. To address this, XAI techniques are being developed -to describe AI/ML predictions in human-understandable terms. One promising -direction is the adaptation of sensitivity analysis (SA) and global sensitivity -analysis (GSA), which inherently rank model inputs by their impact on -predictions.
Here, we introduce a novel delta-XAI method that provides local -explanations of ML model predictions by extending the delta index, a GSA -metric. The delta-XAI index assesses the impact of each feature's value on the -predicted output for individual instances in both regression and classification -problems. We formalize the delta-XAI index and provide code for its -implementation. The delta-XAI method was evaluated on simulated scenarios using -linear regression models, with Shapley values serving as a benchmark. Results -showed that the delta-XAI index is generally consistent with Shapley values, -with notable discrepancies in models with highly impactful or extreme feature -values. The delta-XAI index demonstrated higher sensitivity in detecting -dominant features and handling extreme feature values. Qualitatively, the -delta-XAI provides intuitive explanations by leveraging probability density -functions, making feature rankings clearer and more explainable for -practitioners. Overall, the delta-XAI method appears promising for robustly -obtaining local explanations of ML model predictions. Further investigations in -real-world clinical settings will be conducted to evaluate its impact on -AI-assisted clinical workflows. +Large Language Models (LLMs) have demonstrated exceptional comprehension +capabilities and a vast knowledge base, suggesting that LLMs can serve as +efficient tools for automated survey generation. However, recent research +related to automated survey generation remains constrained by some critical +limitations like finite context window, lack of in-depth content discussion, +and absence of systematic evaluation frameworks. Inspired by human writing +processes, we propose SurveyX, an efficient and organized system for automated +survey generation that decomposes the survey composing process into two phases: +the Preparation and Generation phases. 
By innovatively introducing online +reference retrieval, a pre-processing method called AttributeTree, and a +re-polishing process, SurveyX significantly enhances the efficacy of survey +composition. Experimental evaluation results show that SurveyX outperforms +existing automated survey generation systems in content quality (0.259 +improvement) and citation quality (1.76 enhancement), approaching human expert +performance across multiple evaluation dimensions. Examples of surveys +generated by SurveyX are available on www.surveyx.cn -摘要:可解釋人工智慧 (XAI) 是將人工智慧 (AI) 和機器學習 (ML) 演算法整合到臨床實務中的辯論核心。高執行效能的 AI/ML 模型,例如整體學習器和深度神經網路,通常缺乏可解釋性,阻礙臨床醫生對其預測的信任。為了解決這個問題,正在開發 XAI 技術,以人類可以理解的術語描述 AI/ML 預測。一個有希望的方向是採用敏感度分析 (SA) 和全球敏感度分析 (GSA),它們本質上會依據模型輸入對預測的影響來對其進行排名。在此,我們介紹一種新的 delta-XAI 方法,透過擴充 GSA 指標 delta 指數來提供 ML 模型預測的局部解釋。delta-XAI 指數評估每個特徵值對回歸和分類問題中個別例項的預測輸出之影響。我們將 delta-XAI 指數形式化,並提供其實作的程式碼。使用線性回歸模型對模擬情境評估 delta-XAI 方法,並以 Shapley 值作為基準。結果顯示 delta-XAI 指數通常與 Shapley 值一致,但在具有高度影響力或極端特徵值的模型中存在顯著差異。delta-XAI 指數在偵測主要特徵和處理極端特徵值方面表現出更高的敏感度。定性地來說,delta-XAI 透過利用機率密度函數提供直觀的解釋,使特徵排名更清晰且對從業人員來說更具可解釋性。總體而言,delta-XAI 方法對於穩健地取得 ML 模型預測的局部解釋似乎很有希望。將在真實世界的臨床環境中進行進一步調查,以評估其對 AI 輔助臨床工作流程的影響。 +摘要:大型語言模型 (LLM) 已展現出卓越的理解能力和廣泛的知識庫,表示 LLM 可作為自動調查生成的有用工具。然而,與自動調查生成相關的最新研究仍受到一些關鍵限制的約束,例如有限的上下文視窗、缺乏深入的內容討論以及系統評估架構的缺失。受到人類寫作過程的啟發,我們提出 SurveyX,這是一個用於自動調查生成的有效且有組織的系統,它將調查組成過程分解為兩個階段:準備和生成階段。透過創新地引入線上參考檢索、一種稱為 AttributeTree 的預處理方法和重新潤飾過程,SurveyX 大幅提升了調查組成的效能。實驗評估結果顯示,SurveyX 在內容品質(提升 0.259)和引用品質(提升 1.76)方面優於現有的自動調查生成系統,在多個評估面向中接近人類專家的表現。由 SurveyX 生成的調查範例可在 www.surveyx.cn 取得 -##### **Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population** -2407.17324v2 by Nikolaos Ntampakis, Konstantinos Diamantaras, Ioanna Chouvarda, Vasileios Argyriou, Panagiotis Sarigianndis +##### **Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning** +2502.14768v1 by Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai 
Qiu, Zhirong Wu, Chong Luo -Dementia, a debilitating neurological condition affecting millions worldwide, -presents significant diagnostic challenges. In this work, we introduce a novel -methodology for the classification of demented and non-demented elderly -patients using 3D brain Magnetic Resonance Imaging (MRI) scans. Our approach -features a unique technique for selectively processing MRI slices, focusing on -the most relevant brain regions and excluding less informative sections. This -methodology is complemented by a confidence-based classification committee -composed of three custom deep learning models: Dem3D ResNet, Dem3D CNN, and -Dem3D EfficientNet. These models work synergistically to enhance -decision-making accuracy, leveraging their collective strengths. Tested on the -Open Access Series of Imaging Studies(OASIS) dataset, our method achieved an -impressive accuracy of 94.12%, surpassing existing methodologies. Furthermore, -validation on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset -confirmed the robustness and generalizability of our approach. The use of -explainable AI (XAI) techniques and comprehensive ablation studies further -substantiate the effectiveness of our techniques, providing insights into the -decision-making process and the importance of our methodology. This research -offers a significant advancement in dementia diagnosis, providing a highly -accurate and efficient tool for clinical applications. +Inspired by the success of DeepSeek-R1, we explore the potential of +rule-based reinforcement learning (RL) in large reasoning models. To analyze +reasoning dynamics, we use synthetic logic puzzles as training data due to +their controllable complexity and straightforward answer verification. 
We make +some key technical contributions that lead to effective and stable RL training: +a system prompt that emphasizes the thinking and answering process, a stringent +format reward function that penalizes outputs for taking shortcuts, and a +straightforward training recipe that achieves stable convergence. Our 7B model +develops advanced reasoning skills-such as reflection, verification, and +summarization-that are absent from the logic corpus. Remarkably, after training +on just 5K logic problems, it demonstrates generalization abilities to the +challenging math benchmarks AIME and AMC. -摘要:失智症是一種影響全球數百萬人的衰弱性神經疾病,在診斷上具有重大挑戰。在這項工作中,我們提出了一種新的方法,用於對失智和非失智老年患者進行分類,使用 3D 大腦磁振造影 (MRI) 掃描。我們的做法採用了一種獨特技術,用於選擇性處理 MRI 切片,重點關注最相關的大腦區域,並排除信息量較少的部分。這種方法由一個基於信心的分類委員會補充,該委員會由三個自定義深度學習模型組成:Dem3D ResNet、Dem3D CNN 和 Dem3D EfficientNet。這些模型協同工作以增強決策的準確性,利用它們的集體優勢。在影像研究開放存取系列 (OASIS) 資料集上進行測試,我們的模型達到了 94.12% 的驚人準確度,超過了現有方法。此外,在阿茲海默症神經影像倡議 (ADNI) 資料集上的驗證證實了我們方法的穩健性和普遍性。可解釋 AI (XAI) 技術和全面的消融研究進一步證實了我們技術的有效性,提供了對決策過程和我們方法重要性的見解。這項研究為失智症診斷提供了重大進展,為臨床應用提供了一個高度準確且高效的工具。 +摘要:在 DeepSeek-R1 成功案例的启发下,我们探索了基于规则的强化学习 (RL) 在大型推理模型中的潜力。为了分析推理动态,我们使用合成逻辑难题作为训练数据,因为它们的可控复杂性和直接的答案验证。我们做出了一些关键的技术贡献,这些贡献导致了有效且稳定的 RL 训练:一个强调思考和回答过程的系统提示、一个严格的格式奖励函数,用于惩罚采取捷径的输出,以及一个实现稳定收敛的直接训练配方。我们的 7B 模型发展了高级推理技能,例如反射、验证和总结,这些技能在逻辑语料库中是不存在的。值得注意的是,在仅对 5K 个逻辑问题进行训练后,它展示了对具有挑战性的数学基准 AIME 和 AMC 的泛化能力。 -##### **Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition** -2408.06352v1 by Michele Fiori, Gabriele Civitarese, Claudio Bettini +##### **Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis** +2502.14767v1 by Priyanka Kargupta, Ishika Agarwal, Tal August, Jiawei Han -Recognizing daily activities with unobtrusive sensors in smart environments -enables various healthcare applications. 
Monitoring how subjects perform -activities at home and their changes over time can reveal early symptoms of -health issues, such as cognitive decline. Most approaches in this field use -deep learning models, which are often seen as black boxes mapping sensor data -to activities. However, non-expert users like clinicians need to trust and -understand these models' outputs. Thus, eXplainable AI (XAI) methods for Human -Activity Recognition have emerged to provide intuitive natural language -explanations from these models. Different XAI methods generate different -explanations, and their effectiveness is typically evaluated through user -surveys, that are often challenging in terms of costs and fairness. This paper -proposes an automatic evaluation method using Large Language Models (LLMs) to -identify, in a pool of candidates, the best XAI approach for non-expert users. -Our preliminary results suggest that LLM evaluation aligns with user surveys. +With the exponential growth of research facilitated by modern technology and +improved accessibility, scientific discoveries have become increasingly +fragmented within and across fields. This makes it challenging to assess the +significance, novelty, incremental findings, and equivalent ideas between +related works, particularly those from different research communities. Large +language models (LLMs) have recently demonstrated strong quantitative and +qualitative reasoning abilities, and multi-agent LLM debates have shown promise +in handling complex reasoning tasks by exploring diverse perspectives and +reasoning paths. Inspired by this, we introduce Tree-of-Debate (ToD), a +framework which converts scientific papers into LLM personas that debate their +respective novelties. To emphasize structured, critical reasoning rather than +focusing solely on outcomes, ToD dynamically constructs a debate tree, enabling +fine-grained analysis of independent novelty arguments within scholarly +articles. 
Through experiments on scientific literature across various domains, +evaluated by expert researchers, we demonstrate that ToD generates informative +arguments, effectively contrasts papers, and supports researchers in their +literature review. -摘要:藉由智慧環境中不引人注目的感測器辨識日常活動,能啟用各種醫療保健應用。監控受試者在家中如何執行活動,以及其隨著時間的變化,可以揭示健康問題的早期症狀,例如認知能力下降。此領域中的大多數方法都使用深度學習模型,這些模型通常被視為將感測器資料對應至活動的黑盒子。然而,非專家使用者(例如臨床醫師)需要信任並了解這些模型的輸出。因此,人類活動辨識的可解釋 AI (XAI) 方法應運而生,以提供來自這些模型的直覺自然語言說明。不同的 XAI 方法會產生不同的說明,而其有效性通常透過使用者調查來評估,這在成本和公平性方面通常具有挑戰性。本文提出使用大型語言模型 (LLM) 的自動評估方法,以在候選者中找出最適合非專家使用者的 XAI 方法。我們的初步結果表明,LLM 評估與使用者調查一致。 +摘要:隨著現代科技促進的研究呈指數成長,加上可近性的提升,科學發現已在各領域內外變得越來越分散。這使得評估相關作品之間的重要性、新穎性、漸進式發現和等價概念變得具有挑戰性,特別是來自不同研究社群的作品。大型語言模型 (LLM) 近期已展現出強大的量化和質化推理能力,而多重代理 LLM 辯論已在處理複雜推理任務方面展現出潛力,方法是探索不同的觀點和推理路徑。受到此啟發,我們引入了辯論樹 (ToD),這是一個將科學論文轉換為 LLM 人格的架構,這些人格會辯論各自的新穎性。為了強調結構化、批判性推理,而非僅專注於結果,ToD 會動態建構一個辯論樹,讓使用者能夠深入分析學術文章中獨立的新穎性論點。透過在不同領域的科學文獻上進行實驗,並由專家研究員進行評估,我們證明了 ToD 能產生有見地的論點、有效對比論文,並在研究人員的文獻回顧中提供協助。 -##### **Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions** -2408.03335v1 by Naseem Khan, Kashif Ahmad, Aref Al Tamimi, Mohammed M. Alani, Amine Bermak, Issa Khalil +##### **Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning** +2502.14765v1 by Juraj Vladika, Ivana Hacajová, Florian Matthes -Industry 5.0, which focuses on human and Artificial Intelligence (AI) -collaboration for performing different tasks in manufacturing, involves a -higher number of robots, Internet of Things (IoTs) devices and -interconnections, Augmented/Virtual Reality (AR), and other smart devices. The -huge involvement of these devices and interconnection in various critical -areas, such as economy, health, education and defense systems, poses several -types of potential security flaws. 
AI itself has been proven a very effective -and powerful tool in different areas of cybersecurity, such as intrusion -detection, malware detection, and phishing detection, among others. Just as in -many application areas, cybersecurity professionals were reluctant to accept -black-box ML solutions for cybersecurity applications. This reluctance pushed -forward the adoption of eXplainable Artificial Intelligence (XAI) as a tool -that helps explain how decisions are made in ML-based systems. In this survey, -we present a comprehensive study of different XAI-based intrusion detection -systems for industry 5.0, and we also examine the impact of explainability and -interpretability on Cybersecurity practices through the lens of Adversarial -XIDS (Adv-XIDS) approaches. Furthermore, we analyze the possible opportunities -and challenges in XAI cybersecurity systems for industry 5.0 that elicit future -research toward XAI-based solutions to be adopted by high-stakes industry 5.0 -applications. We believe this rigorous analysis will establish a foundational -framework for subsequent research endeavors within the specified domain. +Fact verification (FV) aims to assess the veracity of a claim based on +relevant evidence. The traditional approach for automated FV includes a +three-part pipeline relying on short evidence snippets and encoder-only +inference models. More recent approaches leverage the multi-turn nature of LLMs +to address FV as a step-by-step problem where questions inquiring additional +context are generated and answered until there is enough information to make a +decision. This iterative method makes the verification process rational and +explainable. While these methods have been tested for encyclopedic claims, +exploration on domain-specific and realistic claims is missing. 
In this work, +we apply an iterative FV system on three medical fact-checking datasets and +evaluate it with multiple settings, including different LLMs, external web +search, and structured reasoning using logic predicates. We demonstrate +improvements in the final performance over traditional approaches and the high +potential of step-by-step FV systems for domain-specific claims. -摘要:工業 5.0 著重於人類與人工智慧 (AI) 合作執行製造中的不同任務,涉及更多機器人、物聯網 (IoT) 裝置和互連、擴增/虛擬實境 (AR) 和其他智慧裝置。這些裝置和互連在經濟、醫療保健、教育和國防系統等各種關鍵領域的廣泛參與,引發了多種類型的潛在安全漏洞。AI 本身已被證明是網路安全不同領域中非常有效且強大的工具,例如入侵偵測、惡意軟體偵測和網路釣魚偵測等。就像在許多應用領域一樣,網路安全專業人員不願意接受黑盒 ML 解決方案來應用於網路安全。這種不願意促使可解釋人工智慧 (XAI) 作為一種工具被採用,有助於說明在基於 ML 的系統中如何做出決策。在這項調查中,我們對工業 5.0 的不同基於 XAI 的入侵偵測系統進行了全面的研究,並且我們也透過對抗式 XIDS (Adv-XIDS) 方法的觀點來探討可解釋性和可詮釋性對網路安全實務的影響。此外,我們分析了工業 5.0 的 XAI 網路安全系統中可能存在的機會和挑戰,引發了未來針對 XAI 基礎解決方案的研究,以供高風險的工業 5.0 應用採用。我們相信這項嚴謹的分析將為指定領域內的後續研究工作建立基礎架構。 +摘要:事實驗證 (FV) 旨在根據相關證據評估主張的真實性。自動化 FV 的傳統方法包括依賴於短證據片段和僅編碼器推論模型的三部分管道。最近的方法利用 LLM 的多輪特性,將 FV 視為一個逐步問題,其中會產生問題來詢問額外背景並回答,直到有足夠的資訊可以做出決定。這種迭代方法使驗證過程合理且可解釋。雖然這些方法已針對百科全書式主張進行測試,但缺乏對特定領域和現實主張的探討。在這項工作中,我們在三個醫學事實查核資料集上應用了一個迭代 FV 系統,並使用多種設定對其進行評估,包括不同的 LLM、外部網路搜尋和使用邏輯謂詞的結構化推理。我們展示了傳統方法的最終效能改進,以及逐步 FV 系統對特定領域主張的高潛力。 -##### **A Comparative Study on Automatic Coding of Medical Letters with Explainability** -2407.13638v1 by Jamie Glen, Lifeng Han, Paul Rayson, Goran Nenadic +##### **EquivaMap: Leveraging LLMs for Automatic Equivalence Checking of Optimization Formulations** +2502.14760v1 by Haotian Zhai, Connor Lawless, Ellen Vitercik, Liu Leqi -This study aims to explore the implementation of Natural Language Processing -(NLP) and machine learning (ML) techniques to automate the coding of medical -letters with visualised explainability and light-weighted local computer -settings. Currently in clinical settings, coding is a manual process that -involves assigning codes to each condition, procedure, and medication in a -patient's paperwork (e.g., 56265001 heart disease using SNOMED CT code). 
There -are preliminary research on automatic coding in this field using -state-of-the-art ML models; however, due to the complexity and size of the -models, the real-world deployment is not achieved. To further facilitate the -possibility of automatic coding practice, we explore some solutions in a local -computer setting; in addition, we explore the function of explainability for -transparency of AI models. We used the publicly available MIMIC-III database -and the HAN/HLAN network models for ICD code prediction purposes. We also -experimented with the mapping between ICD and SNOMED CT knowledge bases. In our -experiments, the models provided useful information for 97.98\% of codes. The -result of this investigation can shed some light on implementing automatic -clinical coding in practice, such as in hospital settings, on the local -computers used by clinicians , project page -\url{https://github.com/Glenj01/Medical-Coding}. +A fundamental problem in combinatorial optimization is identifying equivalent +formulations, which can lead to more efficient solution strategies and deeper +insights into a problem's computational complexity. The need to automatically +identify equivalence between problem formulations has grown as optimization +copilots--systems that generate problem formulations from natural language +descriptions--have proliferated. However, existing approaches to checking +formulation equivalence lack grounding, relying on simple heuristics which are +insufficient for rigorous validation. Inspired by Karp reductions, in this work +we introduce quasi-Karp equivalence, a formal criterion for determining when +two optimization formulations are equivalent based on the existence of a +mapping between their decision variables. We propose EquivaMap, a framework +that leverages large language models to automatically discover such mappings, +enabling scalable and reliable equivalence verification. 
To evaluate our +approach, we construct the first open-source dataset of equivalent optimization +formulations, generated by applying transformations such as adding slack +variables or valid inequalities to existing formulations. Empirically, +EquivaMap significantly outperforms existing methods, achieving substantial +improvements in correctly identifying formulation equivalence. -摘要:本研究旨在探討將自然語言處理 (NLP) 和機器學習 (ML) 技術實作於醫療信函編碼自動化,並具備視覺化說明能力和輕量化的本地電腦設定。目前在臨床環境中,編碼是一種手動流程,涉及為病患文件中的每項病症、程序和藥物指派代碼 (例如,使用 SNOMED CT 代碼 56265001 表示心臟病)。此領域有使用最新 ML 模型進行自動編碼的初步研究;然而,由於模型的複雜性和大小,並未實現實際部署。為了進一步促進自動編碼實務的可能性,我們在本地電腦設定中探討了一些解決方案;此外,我們探討了說明功能在 AI 模型透明度中的功能。我們使用公開的 MIMIC-III 資料庫和 HAN/HLAN 網路模型進行 ICD 代碼預測。我們還試驗了 ICD 和 SNOMED CT 知識庫之間的對應。在我們的實驗中,這些模型提供了 97.98% 代碼的有用資訊。這項調查結果可以為實務中的自動臨床編碼實作提供一些見解,例如在醫院環境中,由臨床醫生使用的本地電腦,專案頁面 \url{https://github.com/Glenj01/Medical-Coding}。 +摘要:組合優化中的基本問題在於識別等效公式,這可能導致更有效的解決策略,並更深入地了解問題的計算複雜性。隨著優化輔助系統(從自然語言描述中產生問題公式的系統)的普及,自動識別問題公式之間等價性的需求也隨之增加。然而,現有的公式等價性檢查方法缺乏依據,依賴於簡單的啟發法,而這對於嚴格驗證來說是不夠的。受 Karp 遞減啟發,我們在這項工作中引入了準 Karp 等價性,這是一個正式標準,用於根據決策變數之間的映射存在性來確定兩個優化公式何時等效。我們提出了 EquivaMap,一個利用大型語言模型自動發現此類映射的框架,實現可擴充且可靠的等價性驗證。為了評估我們的做法,我們構建了第一個等效優化公式的開源資料集,該資料集是通過對現有公式套用轉換(例如添加鬆弛變數或有效不等式)產生的。根據經驗,EquivaMap 明顯優於現有方法,在正確識別公式等價性方面取得了顯著進展。 -##### **Explainable AI for Enhancing Efficiency of DL-based Channel Estimation** -2407.07009v1 by Abdul Karim Gizzini, Yahia Medjahdi, Ali J. Ghandour, Laurent Clavier +##### **On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation Systems** +2502.14759v1 by Juraj Vladika, Florian Matthes -The support of artificial intelligence (AI) based decision-making is a key -element in future 6G networks, where the concept of native AI will be -introduced. Moreover, AI is widely employed in different critical applications -such as autonomous driving and medical diagnosis. In such applications, using -AI as black-box models is risky and challenging. 
Hence, it is crucial to -understand and trust the decisions taken by these models. Tackling this issue -can be achieved by developing explainable AI (XAI) schemes that aim to explain -the logic behind the black-box model behavior, and thus, ensure its efficient -and safe deployment. Recently, we proposed a novel perturbation-based XAI-CHEST -framework that is oriented toward channel estimation in wireless -communications. The core idea of the XAI-CHEST framework is to identify the -relevant model inputs by inducing high noise on the irrelevant ones. This -manuscript provides the detailed theoretical foundations of the XAI-CHEST -framework. In particular, we derive the analytical expressions of the XAI-CHEST -loss functions and the noise threshold fine-tuning optimization problem. Hence -the designed XAI-CHEST delivers a smart input feature selection methodology -that can further improve the overall performance while optimizing the -architecture of the employed model. Simulation results show that the XAI-CHEST -framework provides valid interpretations, where it offers an improved bit error -rate performance while reducing the required computational complexity in -comparison to the classical DL-based channel estimation. +Retrieval-augmented generation (RAG) has emerged as an approach to augment +large language models (LLMs) by reducing their reliance on static knowledge and +improving answer factuality. RAG retrieves relevant context snippets and +generates an answer based on them. Despite its increasing industrial adoption, +systematic exploration of RAG components is lacking, particularly regarding the +ideal size of provided context, and the choice of base LLM and retrieval +method. To help guide development of robust RAG systems, we evaluate various +context sizes, BM25 and semantic search as retrievers, and eight base LLMs. 
+Moving away from the usual RAG evaluation with short answers, we explore the +more challenging long-form question answering in two domains, where a good +answer has to utilize the entire context. Our findings indicate that final QA +performance improves steadily with up to 15 snippets but stagnates or declines +beyond that. Finally, we show that different general-purpose LLMs excel in the +biomedical domain than the encyclopedic one, and that open-domain evidence +retrieval in large corpora is challenging. -摘要:人工智能 (AI) 支持的決策制定是未來 6G 網路中的關鍵元素,其中將引入原生 AI 的概念。此外,AI 廣泛用於不同的關鍵應用中,例如自動駕駛和醫療診斷。在這些應用中,使用 AI 作為黑盒模型是有風險且具有挑戰性的。因此,理解和信任這些模型做出的決策至關重要。解決此問題的方法是開發可解釋 AI (XAI) 架構,旨在解釋黑盒模型行為背後的邏輯,從而確保其有效且安全的部署。最近,我們提出了一個新的基於擾動的 XAI-CHEST 框架,該框架面向無線通信中的信道估計。XAI-CHEST 框架的核心思想是通過在無關輸入上引入高噪聲來識別相關模型輸入。這份手稿提供了 XAI-CHEST 框架的詳細理論基礎。特別是,我們推導了 XAI-CHEST 損失函數和噪聲閾值微調優化問題的解析表達式。因此,設計的 XAI-CHEST 提供了一種智能輸入特徵選擇方法,可以在優化所用模型的架構的同時進一步提高整體性能。模擬結果表明,XAI-CHEST 框架提供了有效的解釋,在降低所需的計算複雜度的同時,提供了改進的比特錯誤率性能,而這與基於傳統 DL 的信道估計相比。 +摘要:檢索增強生成 (RAG) 已成為一種方法,可透過減少大型語言模型 (LLM) 對靜態知識的依賴,並改善答案的真實性,來增強大型語言模型 (LLM)。RAG 會擷取相關的內容片段,並根據這些片段產生答案。儘管其產業採用率不斷提高,但缺乏對 RAG 組成的系統性探討,特別是在提供的內容的理想大小,以及基礎 LLM 和檢索方法的選擇方面。為了協助引導穩健 RAG 系統的開發,我們評估了各種內容大小、BM25 和語意搜尋作為檢索器,以及八個基礎 LLM。我們不再使用簡短答案進行常見的 RAG 評估,而是探討在兩個領域中更具挑戰性的長篇問答,其中一個好的答案必須利用整個內容。我們的研究結果指出,最終的問答效能會隨著多達 15 個片段而穩定提升,但在超過這個數量後就會停滯或下降。最後,我們表明不同的通用 LLM 在生物醫學領域比百科全書領域更為出色,而且在大型語料庫中進行開放領域證據檢索具有挑戰性。 -##### **Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification** -2407.05440v2 by P. N. 
Karthikayan, Yoga Sri Varshan V, Hitesh Gupta Kattamuri, Umarani Jayaraman +##### **MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders** +2502.14753v1 by Maya Varma, Ashwin Kumar, Rogier van der Sluijs, Sophie Ostmeier, Louis Blankemeier, Pierre Chambon, Christian Bluethgen, Jip Prince, Curtis Langlotz, Akshay Chaudhari -This paper presents dilated Residual Network (ResNet) models for disease -classification from retinal fundus images. Dilated convolution filters are used -to replace normal convolution filters in the higher layers of the ResNet model -(dilated ResNet) in order to improve the receptive field compared to the normal -ResNet model for disease classification. This study introduces -computer-assisted diagnostic tools that employ deep learning, enhanced with -explainable AI techniques. These techniques aim to make the tool's -decision-making process transparent, thereby enabling medical professionals to -understand and trust the AI's diagnostic decision. They are particularly -relevant in today's healthcare landscape, where there is a growing demand for -transparency in AI applications to ensure their reliability and ethical use. -The dilated ResNet is used as a replacement for the normal ResNet to enhance -the classification accuracy of retinal eye diseases and reduce the required -computing time. The dataset used in this work is the Ocular Disease Intelligent -Recognition (ODIR) dataset which is a structured ophthalmic database with eight -classes covering most of the common retinal eye diseases. The evaluation -metrics used in this work include precision, recall, accuracy, and F1 score. In -this work, a comparative study has been made between normal ResNet models and -dilated ResNet models on five variants namely ResNet-18, ResNet-34, ResNet-50, -ResNet-101, and ResNet-152. 
The dilated ResNet model shows promising results as -compared to normal ResNet with an average F1 score of 0.71, 0.70, 0.69, 0.67, -and 0.70 respectively for the above respective variants in ODIR multiclass -disease classification. +Medical images are acquired at high resolutions with large fields of view in +order to capture fine-grained features necessary for clinical decision-making. +Consequently, training deep learning models on medical images can incur large +computational costs. In this work, we address the challenge of downsizing +medical images in order to improve downstream computational efficiency while +preserving clinically-relevant features. We introduce MedVAE, a family of six +large-scale 2D and 3D autoencoders capable of encoding medical images as +downsized latent representations and decoding latent representations back to +high-resolution images. We train MedVAE autoencoders using a novel two-stage +training approach with 1,052,730 medical images. Across diverse tasks obtained +from 20 medical image datasets, we demonstrate that (1) utilizing MedVAE latent +representations in place of high-resolution images when training downstream +models can lead to efficiency benefits (up to 70x improvement in throughput) +while simultaneously preserving clinically-relevant features and (2) MedVAE can +decode latent representations back to high-resolution images with high +fidelity. Our work demonstrates that large-scale, generalizable autoencoders +can help address critical efficiency challenges in the medical domain. Our code +is available at https://github.com/StanfordMIMI/MedVAE. 
-摘要:这篇论文提出了用于从视网膜眼底图像进行疾病分类的扩张残差网络 (ResNet) 模型。扩张卷积滤波器用于替换 ResNet 模型较高层中的正常卷积滤波器(扩张 ResNet),以改善感知场,从而针对疾病分类对正常 ResNet 模型进行改进。本研究引入了采用深度学习的计算机辅助诊断工具,并通过可解释的 AI 技术进行了增强。这些技术旨在使该工具的决策过程透明化,从而使医学专业人士能够理解和信任 AI 的诊断决策。它们与当今的医疗保健领域尤为相关,在该领域,对 AI 应用的透明度需求不断增长,以确保其可靠性和合乎道德的使用。扩张 ResNet 用作正常 ResNet 的替代品,以提高视网膜眼部疾病的分类准确性并减少所需的计算时间。本工作中使用的数据集是眼科疾病智能识别 (ODIR) 数据集,这是一个结构化的眼科数据库,包含八类涵盖大多数常见视网膜眼部疾病。本工作中使用的评估指标包括精确度、召回率、准确度和 F1 得分。在这项工作中,对 ResNet-18、ResNet-34、ResNet-50、ResNet-101 和 ResNet-152 五个变体的正常 ResNet 模型和扩张 ResNet 模型进行了比较研究。与正常 ResNet 相比,扩张 ResNet 模型显示出有希望的结果,在 ODIR 多类疾病分类中,上述各个变体的平均 F1 得分为 0.71、0.70、0.69、0.67 和 0.70。 +摘要:医学影像以高解析度和广阔的视野获取,以便捕捉临床决策所需的细微特征。因此,在医学影像上训练深度学习模型可能会产生巨大的计算成本。在这项工作中,我们解决了缩小医学影像以提高下游计算效率同时保留临床相关特征的挑战。我们介绍了 MedVAE,这是一个由六个大型 2D 和 3D 自动编码器组成的系列,能够将医学影像编码为缩小的潜在表示,并将潜在表示解码回高分辨率影像。我们使用一种新颖的两阶段训练方法,利用 1,052,730 张医学影像来训练 MedVAE 自动编码器。在从 20 个医学影像数据集获得的不同任务中,我们证明了 (1) 在训练下游模型时,利用 MedVAE 潜在表示代替高分辨率影像可以带来效率优势(吞吐量提高高达 70 倍),同时保留临床相关特征;(2) MedVAE 可以将潜在表示解码回高分辨率影像,且保真度高。我们的工作表明,大规模、可推广的自动编码器可以帮助解决医学领域的重大效率挑战。我们的代码可在 https://github.com/StanfordMIMI/MedVAE 获得。 -##### **A Survey on Trustworthiness in Foundation Models for Medical Image Analysis** -2407.15851v2 by Congzhen Shi, Ryan Rezai, Jiaxi Yang, Qi Dou, Xiaoxiao Li +##### **TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators** +2502.14752v1 by Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, Zhiyuan Liu, Maosong Sun -The rapid advancement of foundation models in medical imaging represents a -significant leap toward enhancing diagnostic accuracy and personalized -treatment. However, the deployment of foundation models in healthcare -necessitates a rigorous examination of their trustworthiness, encompassing -privacy, robustness, reliability, explainability, and fairness. 
The current -body of survey literature on foundation models in medical imaging reveals -considerable gaps, particularly in the area of trustworthiness. Additionally, -existing surveys on the trustworthiness of foundation models do not adequately -address their specific variations and applications within the medical imaging -domain. This survey aims to fill that gap by presenting a novel taxonomy of -foundation models used in medical imaging and analyzing the key motivations for -ensuring their trustworthiness. We review current research on foundation models -in major medical imaging applications, focusing on segmentation, medical report -generation, medical question and answering (Q\&A), and disease diagnosis. These -areas are highlighted because they have seen a relatively mature and -substantial number of foundation models compared to other applications. We -focus on literature that discusses trustworthiness in medical image analysis -manuscripts. We explore the complex challenges of building trustworthy -foundation models for each application, summarizing current concerns and -strategies for enhancing trustworthiness. Furthermore, we examine the potential -of these models to revolutionize patient care. Our analysis underscores the -imperative for advancing towards trustworthy AI in medical image analysis, -advocating for a balanced approach that fosters innovation while ensuring -ethical and equitable healthcare delivery. +Triton, a high-level Python-like language designed for building efficient GPU +kernels, is widely adopted in deep learning frameworks due to its portability, +flexibility, and accessibility. However, programming and parallel optimization +still require considerable trial and error from Triton developers. 
Despite +advances in large language models (LLMs) for conventional code generation, +these models struggle to generate accurate, performance-optimized Triton code, +as they lack awareness of its specifications and the complexities of GPU +programming. More critically, there is an urgent need for systematic +evaluations tailored to Triton. In this work, we introduce TritonBench, the +first comprehensive benchmark for Triton operator generation. TritonBench +features two evaluation channels: a curated set of 184 real-world operators +from GitHub and a collection of operators aligned with PyTorch interfaces. +Unlike conventional code benchmarks prioritizing functional correctness, +TritonBench also profiles efficiency performance on widely deployed GPUs +aligned with industry applications. Our study reveals that current +state-of-the-art code LLMs struggle to generate efficient Triton operators, +highlighting a significant gap in high-performance code generation. TritonBench +will be available at https://github.com/thunlp/TritonBench. 
-摘要:基礎模型在醫學影像方面的快速進展,代表著在加強診斷準確性和個人化治療方面邁出一大步。然而,基礎模型在醫療保健中的部署需要對其可信度進行嚴格的審查,包括隱私、穩健性、可靠性、可解釋性和公平性。目前關於醫學影像中基礎模型的調查文獻中顯示出相當大的差距,特別是在可信度方面。此外,現有關於基礎模型可信度的調查並未充分解決其在醫學影像領域中的特定變化和應用。本調查旨在通過提出醫學影像中使用的基礎模型的新分類法並分析確保其可信度的關鍵動機,來填補這一空白。我們回顧了基礎模型在主要醫學影像應用中的當前研究,重點關注分割、醫療報告生成、醫療問題和回答 (Q&A) 以及疾病診斷。這些領域之所以被強調,是因為與其他應用相比,它們已經看到相對成熟且大量的基礎模型。我們專注於探討醫學影像分析手稿中可信度的文獻。我們探討了為每個應用構建可信基礎模型的複雜挑戰,總結了當前關注點和增強可信度的策略。此外,我們探討了這些模型在革新患者護理方面的潛力。我們的分析強調了在醫學影像分析中朝著可信賴的人工智慧邁進的必要性,並倡導一種平衡的方法,既能促進創新,又能確保道德和公平的醫療保健服務。 +摘要:Triton 是一種高階的類 Python 語言,專門用於建構高效的 GPU 核心,由於其可移植性、靈活性及可存取性,已廣泛採用於深度學習框架中。然而,編程和並行最佳化仍需要 Triton 開發人員進行大量的試驗和錯誤。儘管大型語言模型 (LLM) 在傳統程式碼產生方面取得了進展,但這些模型在產生準確且效能最佳化的 Triton 程式碼時仍面臨困難,因為它們缺乏對其規格和 GPU 編程複雜性的認識。更重要的是,迫切需要針對 Triton 量身打造的系統性評估。在這項工作中,我們介紹 TritonBench,這是第一個針對 Triton 算子產生進行全面評比的基準。TritonBench 具有兩個評估管道:一組來自 GitHub 的 184 個真實世界算子,以及一組與 PyTorch 介面對齊的算子。與優先考慮功能正確性的傳統程式碼基準不同,TritonBench 還剖析了與產業應用對齊的廣泛部署 GPU 上的效能表現。我們的研究表明,目前最先進的程式碼 LLM 難以產生高效的 Triton 算子,突顯了高性能程式碼產生中的重大差距。TritonBench 將在 https://github.com/thunlp/TritonBench 提供。 -##### **The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data** -2407.06206v1 by Ximing Wen, Rosina O. Weber, Anik Sen, Darryl Hannan, Steven C. Nesbit, Vincent Chan, Alberto Goffi, Michael Morris, John C. Hunninghake, Nicholas E. Villalobos, Edward Kim, Christopher J. MacLellan +##### **Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of LLMs** +2502.14748v1 by Zongxia Li, Lorena Calvo-Bartolomé, Alexander Hoyle, Paiheng Xu, Alden Dima, Juan Francisco Fung, Jordan Boyd-Graber -Point-of-Care Ultrasound (POCUS) is the practice of clinicians conducting and -interpreting ultrasound scans right at the patient's bedside. However, the -expertise needed to interpret these images is considerable and may not always -be present in emergency situations. This reality makes algorithms such as -machine learning classifiers extremely valuable to augment human decisions. 
-POCUS devices are becoming available at a reasonable cost in the size of a -mobile phone. The challenge of turning POCUS devices into life-saving tools is -that interpretation of ultrasound images requires specialist training and -experience. Unfortunately, the difficulty to obtain positive training images -represents an important obstacle to building efficient and accurate -classifiers. Hence, the problem we try to investigate is how to explore -strategies to increase accuracy of classifiers trained with scarce data. We -hypothesize that training with a few data instances may not suffice for -classifiers to generalize causing them to overfit. Our approach uses an -Explainable AI-Augmented approach to help the algorithm learn more from less -and potentially help the classifier better generalize. +A common use of NLP is to facilitate the understanding of large document +collections, with a shift from using traditional topic models to Large Language +Models. Yet the effectiveness of using LLM for large corpus understanding in +real-world applications remains under-explored. This study measures the +knowledge users acquire with unsupervised, supervised LLM-based exploratory +approaches or traditional topic models on two datasets. While LLM-based methods +generate more human-readable topics and show higher average win probabilities +than traditional models for data exploration, they produce overly generic +topics for domain-specific datasets that do not easily allow users to learn +much about the documents. Adding human supervision to the LLM generation +process improves data exploration by mitigating hallucination and +over-genericity but requires greater human effort. In contrast, traditional +models like Latent Dirichlet Allocation (LDA) remain effective for exploration +but are less user-friendly. 
We show that LLMs struggle to describe the haystack +of large corpora without human help, particularly domain-specific data, and +face scaling and hallucination limitations due to context length constraints. +Dataset available at https://huggingface.co/datasets/zli12321/Bills. -摘要:床邊超音波 (POCUS) 是臨床醫師在患者床邊進行和解讀超音波掃描的實務。然而,解讀這些影像所需的專業知識相當可觀,而且在緊急情況下可能並非隨時具備。這種現實情況使得機器學習分類器等演算法對於加強人類決策變得極為有價值。POCUS 裝置正以合理成本推出,尺寸為手機大小。將 POCUS 裝置轉變為救生工具的挑戰在於,解讀超音波影像需要專門訓練和經驗。不幸的是,取得正向訓練影像的困難度代表著建置有效率且準確的分類器的一大障礙。因此,我們嘗試探討的問題是如何探索策略,以提高使用稀疏資料訓練的分類器的準確度。我們假設使用少數資料實例進行訓練可能不足以讓分類器概括,導致它們過度擬合。我們的做法使用可解釋 AI 增強方法,以協助演算法從較少的資料中學習更多,並潛在協助分類器更好地概括。 +摘要:NLP 的常見用途是促進對大型文件集合的理解,從使用傳統主題模型轉向大型語言模型。然而,在現實世界的應用中使用 LLM 了解大型語料庫的有效性仍未得到充分探索。本研究衡量了使用者在兩個資料集上使用無監督、監督的基於 LLM 的探索性方法或傳統主題模型獲得的知識。雖然基於 LLM 的方法會產生更多人類可讀的主題,並且顯示出比傳統模型更高的平均獲勝機率,但它們會為特定領域的資料集產生過於通用的主題,而這些主題不容易讓使用者對文件有深入了解。在 LLM 生成過程中加入人類監督可透過減輕幻覺和過度泛化來改善資料探索,但需要更多的人力。相反地,傳統模型(如潛在狄利克雷配置 (LDA))仍然有效於探索,但使用者友善度較低。我們表明,LLM 難以在沒有人類幫助的情況下描述大型語料庫的乾草堆,特別是特定領域的資料,並且會因上下文長度限制而面臨擴充性和幻覺限制。資料集可於 https://huggingface.co/datasets/zli12321/Bills 取得。 -##### **Can GPT-4 Help Detect Quit Vaping Intentions?
An Exploration of Automatic Data Annotation Approach** -2407.00167v1 by Sai Krishna Revanth Vuruma, Dezhi Wu, Saborny Sen Gupta, Lucas Aust, Valerie Lookingbill, Wyatt Bellamy, Yang Ren, Erin Kasson, Li-Shiun Chen, Patricia Cavazos-Rehg, Dian Hu, Ming Huang +##### **HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States** +2502.14744v1 by Yilei Jiang, Xinyan Gao, Tianshuo Peng, Yingshui Tan, Xiaoyong Zhu, Bo Zheng, Xiangyu Yue -In recent years, the United States has witnessed a significant surge in the -popularity of vaping or e-cigarette use, leading to a notable rise in cases of -e-cigarette and vaping use-associated lung injury (EVALI) that caused -hospitalizations and fatalities during the EVALI outbreak in 2019, highlighting -the urgency to comprehend vaping behaviors and develop effective strategies for -cessation. Due to the ubiquity of social media platforms, over 4.7 billion -users worldwide use them for connectivity, communications, news, and -entertainment with a significant portion of the discourse related to health, -thereby establishing social media data as an invaluable organic data resource -for public health research. In this study, we extracted a sample dataset from -one vaping sub-community on Reddit to analyze users' quit-vaping intentions. -Leveraging OpenAI's latest large language model GPT-4 for sentence-level quit -vaping intention detection, this study compares the outcomes of this model -against layman and clinical expert annotations. Using different prompting -strategies such as zero-shot, one-shot, few-shot and chain-of-thought -prompting, we developed 8 prompts with varying levels of detail to explain the -task to GPT-4 and also evaluated the performance of the strategies against each -other. These preliminary findings emphasize the potential of GPT-4 in social -media data analysis, especially in identifying users' subtle intentions that -may elude human detection. 
+The integration of additional modalities increases the susceptibility of +large vision-language models (LVLMs) to safety risks, such as jailbreak +attacks, compared to their language-only counterparts. While existing research +primarily focuses on post-hoc alignment techniques, the underlying safety +mechanisms within LVLMs remain largely unexplored. In this work, we +investigate whether LVLMs inherently encode safety-relevant signals within +their internal activations during inference. Our findings reveal that LVLMs +exhibit distinct activation patterns when processing unsafe prompts, which can +be leveraged to detect and mitigate adversarial inputs without requiring +extensive fine-tuning. Building on this insight, we introduce HiddenDetect, a +novel tuning-free framework that harnesses internal model activations to +enhance safety. Experimental results show that HiddenDetect surpasses +state-of-the-art methods in detecting jailbreak attacks against LVLMs. By +utilizing intrinsic safety-aware patterns, our method provides an efficient and +scalable solution for strengthening LVLM robustness against multimodal threats. +Our code will be released publicly at +https://github.com/leigest519/HiddenDetect.
-摘要:近年來,美國見證了電子煙或電子香菸使用率大幅激增,導致電子煙和電子煙使用相關肺損傷 (EVALI) 病例顯著增加,在 2019 年 EVALI 爆發期間造成住院和死亡,凸顯了理解電子煙行為和制定有效戒菸策略的迫切性。由於社群媒體平台的普及,全球超過 47 億使用者使用它們進行連結、溝通、新聞和娛樂,其中很大一部分與健康相關,因此將社群媒體資料建立為公共衛生研究中無價的有機資料資源。在本研究中,我們從 Reddit 上一個電子煙子社群中提取一個範例資料集,以分析使用者的戒電子煙意圖。利用 OpenAI 最新的大型語言模型 GPT-4 進行句子層級的戒電子煙意圖偵測,本研究比較了此模型的結果與外行人和臨床專家註解。使用不同的提示策略,例如零次學習、一次學習、少次學習和思考鏈提示,我們開發了 8 個提示,詳細程度不同,向 GPT-4 解釋任務,並評估這些策略彼此之間的效能。這些初步發現強調了 GPT-4 在社群媒體資料分析中的潛力,特別是在識別人類偵測可能無法察覺的使用者微妙意圖方面。 +摘要:整合其他模态会增加大型视觉语言模型 (LVLMs) 对安全风险的敏感性,例如越狱攻击,与仅语言的对应模型相比。虽然现有的研究主要集中于事后对齐技术,但 LVLMs 内部的基本安全机制在很大程度上仍未得到探索。在这项工作中,我们调查了 LVLMs 在推理过程中是否在其内部激活中固有地编码了与安全相关的信号。我们的研究结果表明,LVLMs 在处理不安全提示时表现出不同的激活模式,这可以用来检测和缓解对抗性输入,而无需进行广泛的微调。基于这一见解,我们引入了 HiddenDetect,这是一个新颖的无调优框架,利用内部模型激活来增强安全性。实验结果表明,{HiddenDetect} 在检测针对 LVLMs 的越狱攻击方面超越了最先进的方法。通过利用内在的安全感知模式,我们的方法为加强 LVLM 对多模态威胁的鲁棒性提供了一种高效且可扩展的解决方案。我们的代码将在 https://github.com/leigest519/HiddenDetect 公开发布。 -##### **Towards Compositional Interpretability for XAI** -2406.17583v1 by Sean Tull, Robin Lorenz, Stephen Clark, Ilyas Khan, Bob Coecke +##### **Multi-Agent Coordination across Diverse Applications: A Survey** +2502.14743v1 by Lijun Sun, Yijun Yang, Qiqi Duan, Yuhui Shi, Chao Lyu, Yu-Cheng Chang, Chin-Teng Lin, Yang Shen -Artificial intelligence (AI) is currently based largely on black-box machine -learning models which lack interpretability. The field of eXplainable AI (XAI) -strives to address this major concern, being critical in high-stakes areas such -as the finance, legal and health sectors. - We present an approach to defining AI models and their interpretability based -on category theory. For this we employ the notion of a compositional model, -which sees a model in terms of formal string diagrams which capture its -abstract structure together with its concrete implementation. This -comprehensive view incorporates deterministic, probabilistic and quantum -models. 
We compare a wide range of AI models as compositional models, including -linear and rule-based models, (recurrent) neural networks, transformers, VAEs, -and causal and DisCoCirc models. - Next we give a definition of interpretation of a model in terms of its -compositional structure, demonstrating how to analyse the interpretability of a -model, and using this to clarify common themes in XAI. We find that what makes -the standard 'intrinsically interpretable' models so transparent is brought out -most clearly diagrammatically. This leads us to the more general notion of -compositionally-interpretable (CI) models, which additionally include, for -instance, causal, conceptual space, and DisCoCirc models. - We next demonstrate the explainability benefits of CI models. Firstly, their -compositional structure may allow the computation of other quantities of -interest, and may facilitate inference from the model to the modelled -phenomenon by matching its structure. Secondly, they allow for diagrammatic -explanations for their behaviour, based on influence constraints, diagram -surgery and rewrite explanations. Finally, we discuss many future directions -for the approach, raising the question of how to learn such meaningfully -structured models in practice. +Multi-agent coordination studies the underlying mechanism enabling the +trending spread of diverse multi-agent systems (MAS) and has received +increasing attention, driven by the expansion of emerging applications and +rapid AI advances. This survey outlines the current state of coordination +research across applications through a unified understanding that answers four +fundamental coordination questions: (1) what is coordination; (2) why +coordination; (3) who to coordinate with; and (4) how to coordinate. Our +purpose is to explore existing ideas and expertise in coordination and their +connections across diverse applications, while identifying and highlighting +emerging and promising research directions. 
First, general coordination +problems that are essential to varied applications are identified and analyzed. +Second, a number of MAS applications are surveyed, ranging from widely studied +domains, e.g., search and rescue, warehouse automation and logistics, and +transportation systems, to emerging fields including humanoid and +anthropomorphic robots, satellite systems, and large language models (LLMs). +Finally, open challenges about the scalability, heterogeneity, and learning +mechanisms of MAS are analyzed and discussed. In particular, we identify the +hybridization of hierarchical and decentralized coordination, human-MAS +coordination, and LLM-based MAS as promising future directions. -摘要:人工智慧(AI)目前在很大程度上依賴於缺乏可解釋性的黑盒機器學習模型。可解釋性人工智慧(XAI)領域致力於解決這個主要問題,這在金融、法律和健康等高風險領域至關重要。 -我們提出了一種基於範疇論定義 AI 模型及其可解釋性的方法。為此,我們採用組合模型的概念,它以形式弦圖的形式看待模型,這些弦圖捕獲了模型的抽象結構及其具體實現。這種綜合觀點包含了確定性、概率性和量子模型。我們將各種 AI 模型作為組合模型進行比較,包括線性和基於規則的模型、(遞迴)神經網路、Transformer、VAE,以及因果和 DisCoCirc 模型。 -接下來,我們根據模型的組合結構給出模型解釋的定義,展示如何分析模型的可解釋性,並使用它來澄清 XAI 中的常見主題。我們發現,讓標準的「內在可解釋」模型如此透明的原因在圖表中表現得最為清楚。這引導我們得出更一般的組合可解釋(CI)模型概念,它另外還包括因果、概念空間和 DisCoCirc 模型。 -接下來,我們展示了 CI 模型的可解釋性優勢。首先,它們的組合結構允許計算其他感興趣的量,並可能通過匹配模型的結構來促進從模型到被建模現象的推理。其次,它們允許對其行為進行圖解說明,這些說明基於影響約束、圖解手術和重寫說明。最後,我們討論了這種方法的許多未來方向,提出了如何在實踐中學習這種有意義的結構化模型的問題。 +摘要:多智能體協調研究探討了促成各種多智能體系統 (MAS) 流行擴散的底層機制,並隨著新興應用擴展和 AI 快速進展而受到越來越多的關注。這項調查透過統一的理解來概述協調研究的現狀,回答了四個基本的協調問題:(1) 什麼是協調;(2) 為什麼協調;(3) 與誰協調;以及 (4) 如何協調。我們的目的是探索協調中現有的想法和專業知識,以及它們在不同應用中的關聯,同時找出並強調新興且有前景的研究方向。首先,找出並分析了對各種應用至關重要的協調問題。其次,調查了許多 MAS 應用,範圍從廣泛研究的領域(例如搜尋和救援、倉庫自動化和物流,以及運輸系統),到新興領域,包括人形機器人和擬人機器人、衛星系統和大語言模型 (LLM)。最後,分析並討論了有關 MAS 的可擴充性、異質性和學習機制的開放挑戰。特別是,我們將分層協調和分散式協調、人類-MAS 協調和基於 LLM 的 MAS 的混合視為有前景的未來方向。 -##### **Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods** -2406.12142v2 by Vincent Olesen, Nina Weng, Aasa Feragen, Eike Petersen +##### **YOLOv12: A Breakdown of the Key Architectural Features** +2502.14740v1 by Mujadded Al Rabbani Alif, Muhammad 
Hussain -Machine learning models have achieved high overall accuracy in medical image -analysis. However, performance disparities on specific patient groups pose -challenges to their clinical utility, safety, and fairness. This can affect -known patient groups - such as those based on sex, age, or disease subtype - as -well as previously unknown and unlabeled groups. Furthermore, the root cause of -such observed performance disparities is often challenging to uncover, -hindering mitigation efforts. In this paper, to address these issues, we -leverage Slice Discovery Methods (SDMs) to identify interpretable -underperforming subsets of data and formulate hypotheses regarding the cause of -observed performance disparities. We introduce a novel SDM and apply it in a -case study on the classification of pneumothorax and atelectasis from chest -x-rays. Our study demonstrates the effectiveness of SDMs in hypothesis -formulation and yields an explanation of previously observed but unexplained -performance disparities between male and female patients in widely used chest -X-ray datasets and models. Our findings indicate shortcut learning in both -classification tasks, through the presence of chest drains and ECG wires, -respectively. Sex-based differences in the prevalence of these shortcut -features appear to cause the observed classification performance gap, -representing a previously underappreciated interaction between shortcut -learning and model fairness analyses. +This paper presents an architectural analysis of YOLOv12, a significant +advancement in single-stage, real-time object detection building upon the +strengths of its predecessors while introducing key improvements. The model +incorporates an optimised backbone (R-ELAN), 7x7 separable convolutions, and +FlashAttention-driven area-based attention, improving feature extraction, +enhancing efficiency, and yielding robust detections.
With multiple model variants, +similar to its predecessors, YOLOv12 offers scalable solutions for both +latency-sensitive and high-accuracy applications. Experimental results manifest +consistent gains in mean average precision (mAP) and inference speed, making +YOLOv12 a compelling choice for applications in autonomous systems, security, +and real-time analytics. By achieving an optimal balance between computational +efficiency and performance, YOLOv12 sets a new benchmark for real-time computer +vision, facilitating deployment across diverse hardware platforms, from edge +devices to high-performance clusters. -摘要:機器學習模型在醫學影像分析中已達到整體高準確度。然而,特定患者群體的效能差異對其臨床效用、安全性與公平性構成挑戰。這可能會影響已知的患者群體(例如基於性別、年齡或疾病亞型)以及先前未知且未標籤的群體。此外,此類觀察到的效能差異的根本原因通常難以發現,阻礙了緩解措施。在本文中,為了解決這些問題,我們利用切片發現方法 (SDM) 來識別可解釋的資料效能不佳子集,並針對觀察到的效能差異原因制定假設。我們引入一種新的 SDM,並在胸部 X 光片中肺炎和肺不張分類的案例研究中應用它。我們的研究證明了 SDM 在假設制定中的有效性,並對廣泛使用的胸部 X 光片資料集和模型中先前觀察到但無法解釋的男性和女性患者之間的效能差異提供了解釋。我們的發現表明,在分類任務中,透過胸腔引流管和心電圖導線的存在,存在捷徑學習。這些捷徑特徵的盛行率存在基於性別的差異,似乎會導致觀察到的分類效能差距,這代表捷徑學習和模型公平性分析之間先前未受到重視的交互作用。 +摘要:本文提出 YOLOv12 的架構分析,這是在單階段即時物件偵測領域的重大進展,它建立在前任的優勢之上,同時引入了關鍵改進。該模型結合了最佳化的主幹 (R-ELAN)、7x7 可分離卷積和 FlashAttention 驅動的基於區域的注意力,改進了特徵提取、增強了效率和穩健的偵測。與其前身類似,YOLOv12 具有多種模型變體,為低延遲敏感型和高準確度應用程式提供了可擴充的解決方案。實驗結果顯示在平均準確度 (mAP) 和推論速度方面都有顯著的提升,這使得 YOLOv12 成為自動化系統、安全性和即時分析應用程式的理想選擇。透過在運算效率和效能之間取得最佳平衡,YOLOv12 為即時電腦視覺樹立了新的基準,促進了在各種硬體平台(從邊緣裝置到高性能叢集)上的部署。 -##### **Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health** -2406.07114v2 by Fatemeh Ebrahimzadeh, Ramin Safa +##### **SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines** +2502.14739v1 by M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixing Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming Xu, Zhenzhu Yang, Zekun 
Moore Wang, Junting Zhou, Yuelin Bai, Xingyuan Bu, Chenglin Cai, Liang Chen, Yifan Chen, Chengtuo Cheng, Tianhao Cheng, Keyi Ding, Siming Huang, Yun Huang, Yaoru Li, Yizhe Li, Zhaoqun Li, Tianhao Liang, Chengdong Lin, Hongquan Lin, Yinghao Ma, Zhongyuan Peng, Zifan Peng, Qige Qi, Shi Qiu, Xingwei Qu, Yizhou Tan, Zili Wang, Chenqing Wang, Hao Wang, Yiya Wang, Yubo Wang, Jiajun Xu, Kexin Yang, Ruibin Yuan, Yuanhao Yue, Tianyang Zhan, Chun Zhang, Jingyang Zhang, Xiyue Zhang, Xingjian Zhang, Yue Zhang, Yongchi Zhao, Xiangyu Zheng, Chenghua Zhong, Yang Gao, Zhoujun Li, Dayiheng Liu, Qian Liu, Tianyu Liu, Shiwen Ni, Junran Peng, Yujia Qin, Wenbo Su, Guoyin Wang, Shi Wang, Jian Yang, Min Yang, Meng Cao, Xiang Yue, Zhaoxiang Zhang, Wangchunshu Zhou, Jiaheng Liu, Qunshu Lin, Wenhao Huang, Ge Zhang -The concept of Metaverse has attracted a lot of attention in various fields -and one of its important applications is health and treatment. The Metaverse -has enormous potential to transform healthcare by changing patient care, -medical education, and the way teaching/learning and research are done. The -purpose of this research is to provide an introduction to the basic concepts -and fundamental technologies of the Metaverse. This paper examines the pros and -cons of the Metaverse in healthcare context and analyzes its potential from the -technology and AI perspective. In particular, the role of machine learning -methods is discussed; We will explain how machine learning algorithms can be -applied to the Metaverse generated data to gain better insights in healthcare -applications. Additionally, we examine the future visions of the Metaverse in -health delivery, by examining emerging technologies such as blockchain and also -addressing privacy concerns. The findings of this study contribute to a deeper -understanding of the applications of Metaverse in healthcare and its potential -to revolutionize the delivery of medical services. 
+Large language models (LLMs) have demonstrated remarkable proficiency in +mainstream academic disciplines such as mathematics, physics, and computer +science. However, human knowledge encompasses over 200 specialized disciplines, +far exceeding the scope of existing benchmarks. The capabilities of LLMs in +many of these specialized fields-particularly in light industry, agriculture, +and service-oriented disciplines-remain inadequately evaluated. To address this +gap, we present SuperGPQA, a comprehensive benchmark that evaluates +graduate-level knowledge and reasoning capabilities across 285 disciplines. Our +benchmark employs a novel Human-LLM collaborative filtering mechanism to +eliminate trivial or ambiguous questions through iterative refinement based on +both LLM responses and expert feedback. Our experimental results reveal +significant room for improvement in the performance of current state-of-the-art +LLMs across diverse knowledge domains (e.g., the reasoning-focused model +DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting +the considerable gap between current model capabilities and artificial general +intelligence. Additionally, we present comprehensive insights from our +management of a large-scale annotation process, involving over 80 expert +annotators and an interactive Human-LLM collaborative system, offering valuable +methodological guidance for future research initiatives of comparable scope. 
-摘要:元宇宙的概念在各個領域都備受關注,其重要應用之一便是醫療保健。元宇宙有巨大的潛力透過改變病患照護、醫學教育,以及教學/學習和研究的方式來轉型醫療保健。本研究的目的是提供元宇宙基本概念和基礎技術的介紹。本文探討了元宇宙在醫療保健背景下的優缺點,並從技術和 AI 的角度分析其潛力。特別是,討論了機器學習方法的角色;我們將說明如何將機器學習演算法應用於元宇宙產生的資料,以獲得醫療保健應用方面的更佳見解。此外,我們透過探討區塊鏈等新興技術,並解決隱私問題,來探討元宇宙在醫療保健方面的未來願景。本研究的發現有助於更深入地了解元宇宙在醫療保健中的應用,以及其在醫療服務提供方面發揮革命性變革的潛力。 +摘要:大型語言模型 (LLM) 已展現出在主流學術領域(如數學、物理和電腦科學)的卓越能力。然而,人類知識包含超過 200 個專業領域,遠遠超過現有基準的範圍。LLM 在許多這些專業領域(特別是在輕工業、農業和服務導向領域)的能力仍未得到充分評估。為了解決這個差距,我們提出了 SuperGPQA,這是一個綜合基準,用於評估 285 個領域的研究生級知識和推理能力。我們的基準採用新穎的人類-LLM 協同過濾機制,透過基於 LLM 回應和專家回饋的迭代改進,來消除瑣碎或模稜兩可的問題。我們的實驗結果顯示,當前最先進的 LLM 在不同知識領域的表現仍有很大的改進空間(例如,以推理為重點的模型 DeepSeek-R1 在 SuperGPQA 上達到了 61.82% 的最高準確度),突顯了當前模型能力與人工通用智慧之間的巨大差距。此外,我們從管理大型註釋過程(涉及 80 多位專家註釋者和一個互動式人類-LLM 協作系統)中提出了全面的見解,為未來具有可比規模的研究計畫提供了寶貴的方法論指導。 -##### **AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI** -2406.06728v2 by K M Tawsik Jawad, Anusha Verma, Fathi Amsaad, Lamia Ashraf +##### **EAGER-LLM: Enhancing Large Language Models as Recommenders through Exogenous Behavior-Semantic Integration** +2502.14735v1 by Minjie Hong, Yan Xia, Zehan Wang, Jieming Zhu, Ye Wang, Sihang Cai, Xiaoda Yang, Quanyu Dai, Zhenhua Dong, Zhimeng Zhang, Zhou Zhao -Chronic Kidney Disease (CKD) is one of the widespread Chronic diseases with -no known ultimo cure and high morbidity. Research demonstrates that progressive -Chronic Kidney Disease (CKD) is a heterogeneous disorder that significantly -impacts kidney structure and functions, eventually leading to kidney failure. -With the progression of time, chronic kidney disease has moved from a -life-threatening disease affecting few people to a common disorder of varying -severity. The goal of this research is to visualize dominating features, -feature scores, and values exhibited for early prognosis and detection of CKD -using ensemble learning and explainable AI. 
For that, an AI-driven predictive -analytics approach is proposed to aid clinical practitioners in prescribing -lifestyle modifications for individual patients to reduce the rate of -progression of this disease. Our dataset is collected on body vitals from -individuals with CKD and healthy subjects to develop our proposed AI-driven -solution accurately. In this regard, blood and urine test results are provided, -and ensemble tree-based machine-learning models are applied to predict unseen -cases of CKD. Our research findings are validated after lengthy consultations -with nephrologists. Our experiments and interpretation results are compared -with existing explainable AI applications in various healthcare domains, -including CKD. The comparison shows that our developed AI models, particularly -the Random Forest model, have identified more features as significant -contributors than XgBoost. Interpretability (I), which measures the ratio of -important to masked features, indicates that our XgBoost model achieved a -higher score, specifically a Fidelity of 98\%, in this metric and naturally in -the FII index compared to competing models. +Large language models (LLMs) are increasingly leveraged as foundational +backbones in the development of advanced recommender systems, offering enhanced +capabilities through their extensive knowledge and reasoning. Existing +LLM-based recommender systems (RSs) often face challenges due to the +significant differences between the linguistic semantics of pre-trained LLMs +and the collaborative semantics essential for RSs. These systems use +pre-trained linguistic semantics but learn collaborative semantics from scratch +via the LLM backbone. However, LLMs are not designed for recommendations, +leading to inefficient collaborative learning, weak result correlations, and +poor integration of traditional RS features.
To address these challenges, we +propose EAGER-LLM, a decoder-only LLM-based generative recommendation framework +that integrates endogenous and exogenous behavioral and semantic information in +a non-intrusive manner. Specifically, we propose 1) dual-source knowledge-rich +item indices that integrate indexing sequences for exogenous signals, enabling +efficient link-wide processing; 2) non-invasive multiscale alignment +reconstruction tasks that guide the model toward a deeper understanding of both +collaborative and semantic signals; 3) an annealing adapter designed to finely +balance the model's recommendation performance with its comprehension +capabilities. We demonstrate EAGER-LLM's effectiveness through rigorous testing +on three public benchmarks. -摘要:慢性腎臟病 (CKD) 是一種廣泛的慢性疾病,目前尚未找到最終的治療方法,且發病率很高。研究表明,進行性慢性腎臟病 (CKD) 是一種異質性疾病,會顯著影響腎臟結構和功能,最終導致腎衰竭。隨著時間的推移,慢性腎臟病已從影響少數人的致命疾病演變成一種嚴重程度不一的常見疾病。本研究的目標是使用整體學習和可解釋的 AI 來視覺化支配性特徵、特徵分數和值,以進行 CKD 的早期預後和檢測。為此,提出了一種 AI 驅動的預測分析方法,以幫助臨床醫生為個別患者開具生活方式的修改建議,以降低此疾病的進展速度。我們的數據集是從 CKD 患者和健康受試者的身體生命徵象中收集的,以準確開發我們提出的 AI 驅動的解決方案。在這方面,提供了血液和尿液檢測結果,並應用基於集成樹的機器學習模型來預測未見的 CKD 病例。我們的研究結果在與腎臟科醫師進行長時間諮詢後得到驗證。我們的實驗和解釋結果與各種醫療保健領域中現有的可解釋 AI 應用進行了比較,包括 CKD。比較表明,我們開發的 AI 模型,特別是隨機森林模型,已經確定了比 XgBoost 更多的特徵作為顯著的貢獻者。可解釋性 (I) 衡量重要特徵與被遮蔽特徵的比率,表明我們的 XgBoost 模型在此指標中取得了更高的分數,特別是 98% 的保真度,並且在 FII 指數中自然高於競爭模型。 +摘要:大型語言模型(LLM)正日益被用作先進推薦系統開發中的基礎主幹,透過其廣泛的知識和推理能力提供增強功能。現有的基於 LLM 的推薦系統(RS)通常會因為預先訓練的 LLM 語言語義與 RS 必備的協作語義之間的顯著差異而面臨挑戰。這些系統使用預先訓練的語言語義,但透過 LLM 主幹從頭學習協作語義。然而,LLM 並非專為推薦而設計,導致協作學習效率低落、結果關聯性薄弱,以及與傳統 RS 功能整合不佳。為了應對這些挑戰,我們提出 EAGER-LLM,這是一種僅解碼器、基於 LLM 的生成推薦架構,能以非侵入性方式整合內生和外生行為和語義資訊。具體來說,我們提出 1) 雙來源、知識豐富的項目索引,它整合了外生訊號的索引序列,實現了高效的鏈路廣泛處理;2) 非侵入式多尺度對齊重建任務引導模型更深入地理解協作和語義訊號;3) 退火適配器旨在精細地平衡模型的推薦效能與其理解能力。我們透過在三個公共基準上的嚴格測試證明了 EAGER-LLM 的有效性。 -##### **Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook** -2406.05984v1 by Yusif Ibrahimov, Tarique Anwar, Tommy Yuan +##### **Sentence Smith: Formally Controllable Text Transformation and its
Application to Evaluation of Text Embedding Models** +2502.14734v1 by Hongji Li, Andrianos Michail, Reto Gubelmann, Simon Clematide, Juri Opitz -Mental health constitutes a complex and pervasive global challenge, affecting -millions of lives and often leading to severe consequences. In this paper, we -conduct a thorough survey to explore the intersection of data science, -artificial intelligence, and mental healthcare, focusing on the recent -developments of mental disorder detection through online social media (OSM). A -significant portion of the population actively engages in OSM platforms, -creating a vast repository of personal data that holds immense potential for -mental health analytics. The paper navigates through traditional diagnostic -methods, state-of-the-art data- and AI-driven research studies, and the -emergence of explainable AI (XAI) models for mental healthcare. We review -state-of-the-art machine learning methods, particularly those based on modern -deep learning, while emphasising the need for explainability in healthcare AI -models. The experimental design section provides insights into prevalent -practices, including available datasets and evaluation approaches. We also -identify key issues and challenges in the field and propose promising future -research directions. As mental health decisions demand transparency, -interpretability, and ethical considerations, this paper contributes to the -ongoing discourse on advancing XAI in mental healthcare through social media. -The comprehensive overview presented here aims to guide researchers, -practitioners, and policymakers in developing the area of mental disorder -detection. +We propose the Sentence Smith framework that enables controlled and specified +manipulation of text meaning. It consists of three main steps: 1. Parsing a +sentence into a semantic graph, 2. Applying human-designed semantic +manipulation rules, and 3. Generating text from the manipulated graph. A final +filtering step (4.) 
ensures the validity of the applied transformation. To +demonstrate the utility of Sentence Smith in an application study, we use it to +generate hard negative pairs that challenge text embedding models. Since the +controllable generation makes it possible to clearly isolate different types of +semantic shifts, we can gain deeper insights into the specific strengths and +weaknesses of widely used text embedding models, also addressing an issue in +current benchmarking where linguistic phenomena remain opaque. Human validation +confirms that the generations produced by Sentence Smith are highly accurate. -摘要:心理健康構成了一項複雜且普遍的全球挑戰,影響了數百萬人的生活,並經常導致嚴重的後果。在本文中,我們進行了一項徹底的調查,以探索數據科學、人工智慧和心理保健的交集,重點關注通過線上社交媒體 (OSM) 進行心理疾病檢測的最新發展。很大一部分人口積極參與 OSM 平台,創造了一個龐大的人員資料庫,對心理健康分析具有巨大的潛力。本文探討了傳統的診斷方法、最先進的資料和 AI 驅動的研究,以及心理保健中可解釋 AI (XAI) 模型的出現。我們回顧了最先進的機器學習方法,特別是那些基於現代深度學習的方法,同時強調了醫療保健 AI 模型中可解釋性的必要性。實驗設計部分提供了對普遍做法的見解,包括可用的資料集和評估方法。我們還找出該領域的主要問題和挑戰,並提出了有希望的未來研究方向。由於心理健康決策需要透明度、可解釋性和道德考量,本文有助於推進心理保健中透過社交媒體推進 XAI 的持續討論。這裡提出的全面概述旨在引導研究人員、從業人員和政策制定者發展心理疾病檢測領域。 +摘要:我們提出 Sentence Smith 框架,它能控制並指定文本含義的處理。它包含三個主要步驟:1. 將句子解析成語義圖形,2. 套用人為設計的語義處理規則,3. 從處理過的圖形生成文本。最後的過濾步驟 (4.) 確保套用轉換的有效性。為了在應用研究中展示 Sentence Smith 的效用,我們使用它來產生挑戰文本嵌入模型的困難負面對。由於可控生成能清楚地隔離不同類型的語義轉移,我們能更深入地了解廣泛使用的文本嵌入模型的具體優點和缺點,同時也解決了語言現象在當前基準測試中仍然不透明的問題。人為驗證確認 Sentence Smith 產生的生成高度準確。 -##### **Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance** -2406.05746v1 by Zhan Zhang, Qin Zhang, Yang Jiao, Lin Lu, Lin Ma, Aihua Liu, Xiao Liu, Juan Zhao, Yajun Xue, Bing Wei, Mingxia Zhang, Ru Gao, Hong Zhao, Jie Lu, Fan Li, Yang Zhang, Yiming Wang, Lei Zhang, Fengwei Tian, Jie Hu, Xin Gou +##### **WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models** +2502.14727v1 by Yifu Chen, Shengpeng Ji, Haoxiao Wang, Ziqing Wang, Siyu Chen, Jinzheng He, Jin Xu, Zhou Zhao -AI-aided clinical diagnosis is desired in medical care. 
Existing deep -learning models lack explainability and mainly focus on image analysis. The -recently developed Dynamic Uncertain Causality Graph (DUCG) approach is -causality-driven, explainable, and invariant across different application -scenarios, without problems of data collection, labeling, fitting, privacy, -bias, generalization, high cost and high energy consumption. Through close -collaboration between clinical experts and DUCG technicians, 46 DUCG models -covering 54 chief complaints were constructed. Over 1,000 diseases can be -diagnosed without triage. Before being applied in real-world, the 46 DUCG -models were retrospectively verified by third-party hospitals. The verified -diagnostic precisions were no less than 95%, in which the diagnostic precision -for every disease including uncommon ones was no less than 80%. After -verifications, the 46 DUCG models were applied in the real-world in China. Over -one million real diagnosis cases have been performed, with only 17 incorrect -diagnoses identified. Due to DUCG's transparency, the mistakes causing the -incorrect diagnoses were found and corrected. The diagnostic abilities of the -clinicians who applied DUCG frequently were improved significantly. Following -the introduction to the earlier presented DUCG methodology, the recommendation -algorithm for potential medical checks is presented and the key idea of DUCG is -extracted. +Retrieval Augmented Generation (RAG) has gained widespread adoption owing to +its capacity to empower large language models (LLMs) to integrate external +knowledge. However, existing RAG frameworks are primarily designed for +text-based LLMs and rely on Automatic Speech Recognition to process speech +input, which discards crucial audio information, risks transcription errors, +and increases computational overhead. Therefore, we introduce WavRAG, the first +retrieval augmented generation framework with native, end-to-end audio support. 
+WavRAG offers two key features: 1) Bypassing ASR, WavRAG directly processes raw +audio for both embedding and retrieval. 2) WavRAG integrates audio and text +into a unified knowledge representation. Specifically, we propose the +WavRetriever to facilitate the retrieval from a text-audio hybrid knowledge +base, and further enhance the in-context capabilities of spoken dialogue models +through the integration of chain-of-thought reasoning. In comparison to +state-of-the-art ASR-Text RAG pipelines, WavRAG achieves comparable retrieval +performance while delivering a 10x acceleration. Furthermore, WavRAG's unique +text-audio hybrid retrieval capability extends the boundaries of RAG to the +audio modality. + +摘要:檢索增強生成 (RAG) 因其賦能大型語言模型 (LLM) 整合外部知識的能力而獲得廣泛採用。然而,現有的 RAG 框架主要設計用於基於文字的 LLM,並依賴自動語音辨識處理語音輸入,這會捨棄重要的音訊資訊、有轉錄錯誤的風險,並增加運算負擔。因此,我們引入了 WavRAG,這是第一個具備原生端對端音訊支援的檢索增強生成框架。WavRAG 提供兩個主要功能:1) 繞過 ASR,WavRAG 直接處理原始音訊以進行嵌入和檢索。2) WavRAG 將音訊和文字整合到統一的知識表示中。具體來說,我們提出了 WavRetriever 以利於從文字音訊混合知識庫中進行檢索,並透過整合思考鏈推理進一步增強對話模型的語境能力。與最先進的 ASR 文字 RAG 管線相比,WavRAG 達到了相當的檢索效能,同時提供了 10 倍的加速。此外,WavRAG 獨特的文字音訊混合檢索能力將 RAG 的界線延伸到音訊模式。 + +##### **Entity Framing and Role Portrayal in the News** +2502.14718v1 by Tarek Mahmoud, Zhuohan Xie, Dimitar Dimitrov, Nikolaos Nikolaidis, Purificação Silvano, Roman Yangarber, Shivam Sharma, Elisa Sartori, Nicolas Stefanovitch, Giovanni Da San Martino, Jakub Piskorski, Preslav Nakov + +We introduce a novel multilingual hierarchical corpus annotated for entity +framing and role portrayal in news articles. The dataset uses a unique taxonomy +inspired by storytelling elements, comprising 22 fine-grained roles, or +archetypes, nested within three main categories: protagonist, antagonist, and +innocent. Each archetype is carefully defined, capturing nuanced portrayals of +entities such as guardian, martyr, and underdog for protagonists; tyrant, +deceiver, and bigot for antagonists; and victim, scapegoat, and exploited for +innocents. 
The dataset includes 1,378 recent news articles in five languages +(Bulgarian, English, Hindi, European Portuguese, and Russian) focusing on two +critical domains of global significance: the Ukraine-Russia War and Climate +Change. Over 5,800 entity mentions have been annotated with role labels. This +dataset serves as a valuable resource for research into role portrayal and has +broader implications for news analysis. We describe the characteristics of the +dataset and the annotation process, and we report evaluation results on +fine-tuned state-of-the-art multilingual transformers and hierarchical +zero-shot learning using LLMs at the level of a document, a paragraph, and a +sentence. -摘要:醫療照護中需要 AI 輔助的臨床診斷。現有的深度學習模型缺乏可解釋性,並且主要專注於影像分析。最近開發的動態不確定因果關係圖 (DUCG) 方法是因果驅動的、可解釋的,並且在不同的應用場景中是不變的,沒有資料收集、標記、擬合、隱私、偏見、概化、高成本和高能耗的問題。通過臨床專家和 DUCG 技術人員之間的密切合作,構建了涵蓋 54 個主訴的 46 個 DUCG 模型。可以在沒有分流的情況下診斷出 1,000 多種疾病。在應用於實際世界之前,46 個 DUCG 模型已由第三方醫院回溯性驗證。驗證的診斷精度不低於 95%,其中包括罕見疾病在內的每種疾病的診斷精度不低於 80%。驗證後,46 個 DUCG 模型已在中國實際應用。已經執行了超過一百萬個真實診斷案例,僅發現 17 個不正確的診斷。由於 DUCG 的透明性,發現並糾正了導致不正確診斷的錯誤。頻繁應用 DUCG 的臨床醫生的診斷能力得到了顯著提高。在介紹了前面提出的 DUCG 方法論之後,提出了潛在健康檢查的推薦演算法,並提取了 DUCG 的關鍵思想。 +摘要:我們引進一個新穎的多語言層級語料庫,其中註解了新聞文章中的實體框架和角色描繪。此資料集使用了一個獨特的分類法,其靈感來自講故事元素,包含 22 個細緻的角色或原型,嵌套在三個主要類別中:主角、對手和無辜者。每個原型都經過仔細定義,捕捉了實體的細微描繪,例如主角的監護人、烈士和弱者;對手的暴君、欺騙者和偏執狂;以及無辜者的受害者、替罪羊和被剝削者。該資料集包括五種語言(保加利亞語、英語、印地語、歐洲葡萄牙語和俄語)中的 1,378 篇近期新聞文章,重點關注兩個具有全球意義的關鍵領域:烏克蘭-俄羅斯戰爭和氣候變遷。超過 5,800 個實體提及已註解為角色標籤。此資料集作為角色描繪研究的寶貴資源,並對新聞分析有更廣泛的影響。我們描述了資料集的特徵和註解過程,並報告了對使用 LLM 在文件、段落和句子層級進行微調的最新多語言轉換器和層級零次學習的評估結果。 -##### **Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability** -2406.12897v1 by Faseela Abdullakutty, Younes Akbari, Somaya Al-Maadeed, Ahmed Bouridane, Rifat Hamoudi +##### **From Knowledge Generation to Knowledge Verification: Examining the BioMedical Generative Capabilities of ChatGPT** +2502.14714v1 by Ahmed Abdeen Hamed, Byung Suk Lee -It is imperative that breast cancer is detected precisely and 
timely to -improve patient outcomes. Diagnostic methodologies have traditionally relied on -unimodal approaches; however, medical data analytics is integrating diverse -data sources beyond conventional imaging. Using multi-modal techniques, -integrating both image and non-image data, marks a transformative advancement -in breast cancer diagnosis. The purpose of this review is to explore the -burgeoning field of multimodal techniques, particularly the fusion of -histopathology images with non-image data. Further, Explainable AI (XAI) will -be used to elucidate the decision-making processes of complex algorithms, -emphasizing the necessity of explainability in diagnostic processes. This -review utilizes multi-modal data and emphasizes explainability to enhance -diagnostic accuracy, clinician confidence, and patient engagement, ultimately -fostering more personalized treatment strategies for breast cancer, while also -identifying research gaps in multi-modality and explainability, guiding future -studies, and contributing to the strategic direction of the field. +The generative capabilities of LLM models present opportunities in +accelerating tasks and concerns with the authenticity of the knowledge it +produces. To address the concerns, we present a computational approach that +systematically evaluates the factual accuracy of biomedical knowledge that an +LLM model has been prompted to generate. Our approach encompasses two +processes: the generation of disease-centric associations and the verification +of them using the semantic knowledge of the biomedical ontologies. Using +ChatGPT as the select LLM model, we designed a set of prompt-engineering +processes to generate linkages between diseases, drugs, symptoms, and genes to +establish grounds for assessments. Experimental results demonstrate high +accuracy in identifying disease terms (88%-97%), drug names (90%-91%), and +genetic information (88%-98%). 
The symptom term identification accuracy was +notably lower (49%-61%), as verified against the DOID, ChEBI, SYMPTOM, and GO +ontologies accordingly. The verification of associations reveals literature +coverage rates of (89%-91%) among disease-drug and disease-gene associations. +The low identification accuracy for symptom terms also contributed to the +verification of symptom-related associations (49%-62%). -摘要:精確且及時地偵測乳癌對於改善患者預後至關重要。診斷方法傳統上依賴於單一模式方法;然而,醫療資料分析正在整合超越傳統影像的各種資料來源。使用整合影像和非影像資料的多模式技術,標誌著乳癌診斷的變革性進展。本篇綜述的目的是探討多模式技術的新興領域,特別是將組織病理學影像與非影像資料融合。此外,可解釋人工智慧 (XAI) 將用於闡明複雜演算法的決策過程,強調診斷過程中可解釋性的必要性。本綜述利用多模式資料並強調可解釋性,以提高診斷準確性、臨床醫師的信心和患者參與度,最終促進乳癌更個人化的治療策略,同時也找出多模式和可解釋性的研究差距,引導未來的研究,並為該領域的策略方向做出貢獻。 +摘要:LLM 模型的生成能力為加速任務和對其產生的知識真實性的疑慮提供了機會。為了解決這些疑慮,我們提出了計算方法,系統性評估 LLM 模型受提示而產生的生物醫學知識的事實準確性。我們的做法包括兩個過程:生成以疾病為中心的關聯,並使用生物醫學本体的語義知識驗證它們。使用 ChatGPT 作為選定的 LLM 模型,我們設計了一組提示工程流程,以生成疾病、藥物、症狀和基因之間的關聯,作為評估的依據。實驗結果證明在識別疾病術語 (88%-97%)、藥物名稱 (90%-91%) 和遺傳資訊 (88%-98%) 方面具有很高的準確性。症狀術語識別準確性顯著較低 (49%-61%),並根據 DOID、ChEBI、SYMPTOM 和 GO 本体進行驗證。關聯驗證顯示疾病-藥物和疾病-基因關聯的文獻覆蓋率為 (89%-91%)。症狀術語的低識別準確性也影響了症狀相關關聯的驗證 (49%-62%)。 -##### **Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection** -2406.16908v3 by Dinuka Sandun Udayantha, Kavindu Weerasinghe, Nima Wickramasinghe, Akila Abeyratne, Kithmin Wickremasinghe, Jithangi Wanigasinghe, Anjula De Silva, Chamira U. S. Edussooriya +##### **Data-Efficient Pretraining with Group-Level Data Influence Modeling** +2502.14709v1 by Zichun Yu, Fei Peng, Jie Lei, Arnold Overwijk, Wen-tau Yih, Chenyan Xiong -The neonatal period is the most vulnerable time for the development of -seizures. Seizures in the immature brain lead to detrimental consequences, -therefore require early diagnosis. The gold-standard for neonatal seizure -detection currently relies on continuous video-EEG monitoring; which involves -recording multi-channel electroencephalogram (EEG) alongside real-time video -monitoring within a neonatal intensive care unit (NICU). 
However, video-EEG -monitoring technology requires clinical expertise and is often limited to -technologically advanced and resourceful settings. Cost-effective new -techniques could help the medical fraternity make an accurate diagnosis and -advocate treatment without delay. In this work, a novel explainable deep -learning model to automate the neonatal seizure detection process with a -reduced EEG montage is proposed, which employs convolutional nets, graph -attention layers, and fully connected layers. Beyond its ability to detect -seizures in real-time with a reduced montage, this model offers the unique -advantage of real-time interpretability. By evaluating the performance on the -Zenodo dataset with 10-fold cross-validation, the presented model achieves an -absolute improvement of 8.31% and 42.86% in area under curve (AUC) and recall, -respectively. +Data-efficient pretraining has shown tremendous potential to elevate scaling +laws. This paper argues that effective pretraining data should be curated at +the group level, treating a set of data points as a whole rather than as +independent contributors. To achieve that, we propose Group-Level Data +Influence Modeling (Group-MATES), a novel data-efficient pretraining method +that captures and optimizes group-level data utility. Specifically, Group-MATES +collects oracle group-level influences by locally probing the pretraining model +with data sets. It then fine-tunes a relational data influence model to +approximate oracles as relationship-weighted aggregations of individual +influences. The fine-tuned model selects the data subset by maximizing its +group-level influence prediction, with influence-aware clustering to enable +efficient inference. Experiments on the DCLM benchmark demonstrate that +Group-MATES achieves a 10% relative core score improvement on 22 downstream +tasks over DCLM-Baseline and 5% over individual-influence-based methods, +establishing a new state-of-the-art. 
Further analyses highlight the +effectiveness of relational data influence models in capturing intricate +interactions between data points. -摘要:新生兒期是大腦發育最脆弱的時期,容易出現癲癇發作。大腦發育不成熟時出現癲癇發作會造成不良後果,因此需要及早診斷。目前新生兒癲癇發作的黃金標準依賴於連續的視訊腦電圖 (EEG) 監測;其中包括在新生兒加護病房 (NICU) 內同時進行多頻道腦電圖 (EEG) 記錄和即時視訊監控。然而,視訊腦電圖監控技術需要臨床專業知識,而且通常僅限於技術先進且資源豐富的環境。具成本效益的新技術可以幫助醫療界準確診斷並立即提倡治療。在這項工作中,提出了一個新穎的可解釋深度學習模型,以自動化新生兒癲癇發作偵測過程,並採用減少的腦電圖裝置,其中採用了卷積神經網路、圖形注意力層和全連接層。除了能夠使用減少的裝置即時偵測癲癇發作外,此模型還提供了即時可解釋性的獨特優勢。透過在 Zenodo 資料集上使用 10 倍交叉驗證評估效能,所提出的模型在曲線下面積 (AUC) 和召回率方面分別達到了 8.31% 和 42.86% 的絕對改善。 +摘要:資料有效的預訓練已展現出提升規模化定律的巨大潛力。本文認為,有效的預訓練資料應在群組層級中進行策展,將資料點集合視為一個整體,而非獨立的貢獻者。為達成此目的,我們提出群組層級資料影響建模(Group-MATES),這是一種新穎的資料有效預訓練方法,可擷取和最佳化群組層級資料效用。具體而言,Group-MATES 透過使用資料集在區域探測預訓練模型,收集神諭群組層級影響。接著,微調關係資料影響模型,以關係加權聚合個別影響來近似神諭。微調模型透過最大化其群組層級影響預測,選取資料子集,並透過考量影響的群集,啟用有效率的推論。在 DCLM 基準上的實驗證明,與 DCLM-Baseline 相比,Group-MATES 在 22 個下游任務上達成 10% 的相對核心分數提升,並比基於個別影響的方法高出 5%,建立了新的技術水準。進一步的分析強調了關係資料影響模型在擷取資料點之間的複雜互動上的有效性。 -##### **Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques** -2406.00532v1 by Samita Bai, Sidra Nasir, Rizwan Ahmed Khan, Sheeraz Arif, Alexandre Meyer, Hubert Konik +##### **Human Misperception of Generative-AI Alignment: A Laboratory Experiment** +2502.14708v1 by Kevin He, Ran Shorrer, Mengjia Xia -Breast cancer (BC) stands as one of the most common malignancies affecting -women worldwide, necessitating advancements in diagnostic methodologies for -better clinical outcomes. This article provides a comprehensive exploration of -the application of Explainable Artificial Intelligence (XAI) techniques in the -detection and diagnosis of breast cancer. As Artificial Intelligence (AI) -technologies continue to permeate the healthcare sector, particularly in -oncology, the need for transparent and interpretable models becomes imperative -to enhance clinical decision-making and patient care. 
This review discusses the -integration of various XAI approaches, such as SHAP, LIME, Grad-CAM, and -others, with machine learning and deep learning models utilized in breast -cancer detection and classification. By investigating the modalities of breast -cancer datasets, including mammograms, ultrasounds and their processing with -AI, the paper highlights how XAI can lead to more accurate diagnoses and -personalized treatment plans. It also examines the challenges in implementing -these techniques and the importance of developing standardized metrics for -evaluating XAI's effectiveness in clinical settings. Through detailed analysis -and discussion, this article aims to highlight the potential of XAI in bridging -the gap between complex AI models and practical healthcare applications, -thereby fostering trust and understanding among medical professionals and -improving patient outcomes. +We conduct an incentivized laboratory experiment to study people's perception +of generative artificial intelligence (GenAI) alignment in the context of +economic decision-making. Using a panel of economic problems spanning the +domains of risk, time preference, social preference, and strategic +interactions, we ask human subjects to make choices for themselves and to +predict the choices made by GenAI on behalf of a human user. We find that +people overestimate the degree of alignment between GenAI's choices and human +choices. In every problem, human subjects' average prediction about GenAI's +choice is substantially closer to the average human-subject choice than it is +to the GenAI choice. At the individual level, different subjects' predictions +about GenAI's choice in a given problem are highly correlated with their own +choices in the same problem. We explore the implications of people +overestimating GenAI alignment in a simple theoretical model. 
-摘要:乳癌 (BC) 是影響全球女性最常見的惡性腫瘤之一,因此需要進步的診斷方法,以改善臨床結果。本文全面探討了可解釋人工智慧 (XAI) 技術在乳癌偵測和診斷中的應用。隨著人工智慧 (AI) 技術持續滲透醫療保健領域,特別是在腫瘤學中,透明且可解釋的模型需求變得勢在必行,以增強臨床決策制定和患者照護。此篇評論探討了各種 XAI 方法的整合,例如 SHAP、LIME、Grad-CAM 等,以及用於乳癌偵測和分類的機器學習和深度學習模型。透過探討乳癌資料集的模式,包括乳房攝影、超音波及其在 AI 中的處理,本文重點說明 XAI 如何能導致更準確的診斷和個人化治療計畫。它也探討了實施這些技術的挑戰,以及制定標準化評量指標以評估 XAI 在臨床環境中的有效性的重要性。透過詳細的分析和討論,本文旨在強調 XAI 在縮小複雜 AI 模型與實務醫療保健應用之間差距的潛力,進而促進醫療專業人員之間的信任與理解,並改善患者的結果。 +摘要:我們進行一項誘因實驗室實驗,以研究人們對生成式人工智慧 (GenAI) 在經濟決策制定中的對齊認知。使用涵蓋風險、時間偏好、社會偏好和策略性互動領域的經濟問題小組,我們要求受試者為自己做出選擇,並預測 GenAI 代表人類使用者做出的選擇。我們發現人們高估了 GenAI 選擇和人類選擇之間的對齊程度。在每個問題中,受試者對 GenAI 選擇的平均預測都比對 GenAI 選擇的預測更接近於平均人類受試者選擇。在個人層面上,不同受試者對特定問題中 GenAI 選擇的預測與他們在同一個問題中的選擇高度相關。我們在一個簡單的理論模型中探討了人們高估 GenAI 對齊的影響。 -##### **Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition** -2406.01624v2 by Alaa Nfissi, Wassim Bouachir, Nizar Bouguila, Brian Mishara +##### **Not All Data are Good Labels: On the Self-supervised Labeling for Time Series Forecasting** +2502.14704v1 by Yuxuan Yang, Dalin Zhang, Yuxuan Liang, Hua Lu, Huan Li, Gang Chen -Speech emotion recognition (SER) has gained significant attention due to its -several application fields, such as mental health, education, and -human-computer interaction. However, the accuracy of SER systems is hindered by -high-dimensional feature sets that may contain irrelevant and redundant -information. To overcome this challenge, this study proposes an iterative -feature boosting approach for SER that emphasizes feature relevance and -explainability to enhance machine learning model performance. Our approach -involves meticulous feature selection and analysis to build efficient SER -systems. In addressing our main problem through model explainability, we employ -a feature evaluation loop with Shapley values to iteratively refine feature -sets. This process strikes a balance between model performance and -transparency, which enables a comprehensive understanding of the model's -predictions. 
The proposed approach offers several advantages, including the -identification and removal of irrelevant and redundant features, leading to a -more effective model. Additionally, it promotes explainability, facilitating -comprehension of the model's predictions and the identification of crucial -features for emotion determination. The effectiveness of the proposed method is -validated on the SER benchmarks of the Toronto emotional speech set (TESS), -Berlin Database of Emotional Speech (EMO-DB), Ryerson Audio-Visual Database of -Emotional Speech and Song (RAVDESS), and Surrey Audio-Visual Expressed Emotion -(SAVEE) datasets, outperforming state-of-the-art methods. To the best of our -knowledge, this is the first work to incorporate model explainability into an -SER framework. The source code of this paper is publicly available via this -https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition. +Time Series Forecasting (TSF) is a crucial task in various domains, yet +existing TSF models rely heavily on high-quality data and insufficiently +exploit all available data. This paper explores a novel self-supervised +approach to re-label time series datasets by inherently constructing candidate +datasets. During the optimization of a simple reconstruction network, +intermediates are used as pseudo labels in a self-supervised paradigm, +improving generalization for any predictor. We introduce the Self-Correction +with Adaptive Mask (SCAM), which discards overfitted components and selectively +replaces them with pseudo labels generated from reconstructions. Additionally, +we incorporate Spectral Norm Regularization (SNR) to further suppress +overfitting from a loss landscape perspective. Our experiments on eleven +real-world datasets demonstrate that SCAM consistently improves the performance +of various backbone models. 
This work offers a new perspective on constructing +datasets and enhancing the generalization of TSF models through self-supervised +learning. -摘要:語音情緒辨識 (SER) 由於其在心理健康、教育和人機互動等多個應用領域而備受關注。然而,SER 系統的準確性受到高維特徵集的阻礙,這些特徵集可能包含不相關和冗餘的資訊。為了克服這個挑戰,本研究提出了一種用於 SER 的迭代特徵提升方法,該方法強調特徵相關性和可解釋性,以增強機器學習模型的效能。我們的做法涉及仔細的特徵選擇和分析,以建立高效的 SER 系統。為了透過模型可解釋性解決我們的核心問題,我們採用了具有 Shapley 值的特徵評估迴圈,以反覆改善特徵集。這個過程在模型效能和透明度之間取得平衡,這使得我們能夠全面了解模型的預測。所提出的方法提供了多項優點,包括識別和移除不相關和冗餘的特徵,從而建立更有效的模型。此外,它促進了可解釋性,有助於理解模型的預測以及識別情緒決定的關鍵特徵。所提出的方法的有效性已在多倫多情緒語音集 (TESS)、柏林情緒語音資料庫 (EMO-DB)、賴爾森音訊視覺情緒語音和歌曲資料庫 (RAVDESS) 和薩里音訊視覺表達情緒 (SAVEE) 資料集的 SER 基準上得到驗證,其效能優於現有方法。據我們所知,這是第一個將模型可解釋性納入 SER 架構的研究。本文的原始碼可透過此連結公開取得:https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition。 +摘要:時間序列預測 (TSF) 在各個領域中都是一項重要的任務,但現有的 TSF 模型極度依賴高品質的資料,且無法充分利用所有可用的資料。本文探討了一種新穎的自監督方法,藉由內建地建構候選資料集來重新標記時間序列資料集。在最佳化一個簡單的重建網路過程中,中間產物會在自監督範例中作為偽標籤,進而改善任何預測器的概化能力。我們引入了帶有自適應遮罩 (SCAM) 的自我修正,它會捨棄過度擬合的組成,並選擇性地以從重建產生的偽標籤取代它們。此外,我們納入了頻譜範數正規化 (SNR) 來進一步抑制從損失景觀觀點來看產生的過度擬合。我們在 11 個真實世界的資料集上進行的實驗,證明 SCAM 持續改善各種主幹模型的效能。這項工作提供了建構資料集和透過自監督學習來提升 TSF 模型概化能力的新觀點。 -##### **The Explanation Necessity for Healthcare AI** -2406.00216v1 by Michail Mamalakis, Héloïse de Vareilles, Graham Murray, Pietro Lio, John Suckling +##### **I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search** +2502.14693v1 by Zujie Liang, Feng Wei, Wujiang Xu, Lin Chen, Yuxi Qian, Xinhui Wu -Explainability is often critical to the acceptable implementation of -artificial intelligence (AI). Nowhere is this more important than healthcare -where decision-making directly impacts patients and trust in AI systems is -essential. This trust is often built on the explanations and interpretations -the AI provides. Despite significant advancements in AI interpretability, there -remains the need for clear guidelines on when and to what extent explanations -are necessary in the medical context. 
We propose a novel categorization system -with four distinct classes of explanation necessity, guiding the level of -explanation required: patient or sample (local) level, cohort or dataset -(global) level, or both levels. We introduce a mathematical formulation that -distinguishes these categories and offers a practical framework for researchers -to determine the necessity and depth of explanations required in medical AI -applications. Three key factors are considered: the robustness of the -evaluation protocol, the variability of expert observations, and the -representation dimensionality of the application. In this perspective, we -address the question: When does an AI medical application need to be explained, -and at what level of detail? +Recent advancements in large language models (LLMs) have shown remarkable +potential in automating machine learning tasks. However, existing LLM-based +agents often struggle with low-diversity and suboptimal code generation. While +recent work has introduced Monte Carlo Tree Search (MCTS) to address these +issues, limitations persist in the quality and diversity of thoughts generated, +as well as in the scalar value feedback mechanisms used for node selection. In +this study, we introduce Introspective Monte Carlo Tree Search (I-MCTS), a +novel approach that iteratively expands tree nodes through an introspective +process that meticulously analyzes solutions and results from parent and +sibling nodes. This facilitates a continuous refinement of the node in the +search tree, thereby enhancing the overall decision-making process. Furthermore, +we integrate a Large Language Model (LLM)-based value model to facilitate +direct evaluation of each node's solution prior to conducting comprehensive +computational rollouts. A hybrid rewarding mechanism is implemented to +seamlessly transition the Q-value from LLM-estimated scores to actual +performance scores.
This allows higher-quality nodes to be traversed +earlier. Applied to the various ML tasks, our approach demonstrates a 6\% +absolute improvement in performance compared to the strong open-source AutoML +agents, showcasing its effectiveness in enhancing agentic AutoML systems. -摘要:可解释性通常对于人工智能 (AI) 的可接受实施至关重要。在医疗保健领域,这一点尤为重要,因为决策直接影响患者,并且对 AI 系统的信任至关重要。这种信任通常建立在 AI 提供的解释和诠释之上。尽管 AI 可解释性取得了重大进展,但仍然需要明确的指导方针,说明在医疗环境中何时以及在多大程度上需要解释。我们提出了一种新颖的分类系统,该系统具有四种不同的解释必要性类别,指导所需的解释级别:患者或样本(局部)级别、队列或数据集(全局)级别,或两个级别。我们引入了一个数学公式,该公式区分了这些类别,并为研究人员提供了一个实用框架,以确定医疗 AI 应用中所需的解释的必要性和深度。考虑了三个关键因素:评估协议的稳健性、专家观察的可变性以及应用程序的表示维数。从这个角度来看,我们解决了这个问题:AI 医疗应用何时需要解释,以及需要解释到何种程度? +摘要:大型語言模型 (LLM) 的最新進展已展現出自動化機器學習任務的顯著潛力。然而,現有的基於 LLM 的代理通常會遇到低多樣性和次優代碼生成的問題。雖然最近的工作已引入蒙地卡羅樹搜尋 (MCTS) 來解決這些問題,但仍存在於所產生想法的品質和多樣性,以及用於節點選擇的標量值回饋機制中。在本研究中,我們介紹了內省蒙地卡羅樹搜尋 (I-MCTS),這是一種透過內省過程反覆擴展樹節點的新方法,該過程會細緻地分析來自父節點和同層節點的解決方案和結果。這有助於持續改善搜尋樹中的節點,進而增強整體決策制定過程。此外,我們整合了一個基於大型語言模型 (LLM) 的值模型,以便在進行全面運算展開之前直接評估每個節點的解決方案。實作了一種混合獎勵機制,以無縫地將 Q 值從 LLM 估計分數轉換為實際效能分數。這允許較高品質的節點更早被遍歷。應用於各種 ML 任務,我們的做法展示出比強大的開源 AutoML 代理高出 6% 的絕對效能提升,證明了其在增強代理式 AutoML 系統方面的有效性。 -##### **Interdisciplinary Expertise to Advance Equitable Explainable AI** -2406.18563v1 by Chloe R. Bennett, Heather Cole-Lewis, Stephanie Farquhar, Naama Haamel, Boris Babenko, Oran Lang, Mat Fleck, Ilana Traynis, Charles Lau, Ivor Horn, Courtney Lyles +##### **Bridging the Gap: Transforming Natural Language Questions into SQL Queries via Abstract Query Pattern and Contextual Schema Markup** +2502.14682v1 by Yonghui Kong, Hongbing Hu, Dan Zhang, Siyuan Chai, Fan Zhang, Wei Wang -The field of artificial intelligence (AI) is rapidly influencing health and -healthcare, but bias and poor performance persists for populations who face -widespread structural oppression. Previous work has clearly outlined the need -for more rigorous attention to data representativeness and model performance to -advance equity and reduce bias.
However, there is an opportunity to also -improve the explainability of AI by leveraging best practices of social -epidemiology and health equity to help us develop hypotheses for associations -found. In this paper, we focus on explainable AI (XAI) and describe a framework -for interdisciplinary expert panel review to discuss and critically assess AI -model explanations from multiple perspectives and identify areas of bias and -directions for future research. We emphasize the importance of the -interdisciplinary expert panel to produce more accurate, equitable -interpretations which are historically and contextually informed. -Interdisciplinary panel discussions can help reduce bias, identify potential -confounders, and identify opportunities for additional research where there are -gaps in the literature. In turn, these insights can suggest opportunities for -AI model improvement. +Large language models have demonstrated excellent performance in many tasks, +including Text-to-SQL, due to their powerful in-context learning capabilities. +They are becoming the mainstream approach for Text-to-SQL. However, these +methods still have a significant gap compared to human performance, especially +on complex questions. As the complexity of questions increases, the gap between +questions and SQLs increases. We identify two important gaps: the structural +mapping gap and the lexical mapping gap. To tackle these two gaps, we propose +PAS-SQL, an efficient SQL generation pipeline based on LLMs, which alleviates +gaps through Abstract Query Pattern (AQP) and Contextual Schema Markup (CSM). +AQP aims to obtain the structural pattern of the question by removing +database-related information, which enables us to find structurally similar +demonstrations. CSM aims to associate database-related text span in the +question with specific tables or columns in the database, which alleviates the +lexical mapping gap. 
Experimental results on the Spider and BIRD datasets +demonstrate the effectiveness of our proposed method. Specifically, PAS-SQL + +GPT-4o sets a new state-of-the-art on the Spider benchmark with an execution +accuracy of 87.9\%, and achieves leading results on the BIRD dataset with an +execution accuracy of 64.67\%. + +摘要:大型語言模型在許多任務中表現出色,包括文字轉 SQL,這歸功於它們強大的情境學習能力。它們正成為文字轉 SQL 的主流方法。然而,這些方法與人類的表現仍有顯著差距,特別是在複雜的問題上。隨著問題的複雜性增加,問題和 SQL 之間的差距也隨之增加。我們找出兩個重要的差距:結構對應差距和詞彙對應差距。為了解決這兩個差距,我們提出 PAS-SQL,一種基於 LLM 的高效 SQL 產生管道,它透過抽象查詢模式 (AQP) 和情境架構標記 (CSM) 來縮小差距。AQP 旨在透過移除與資料庫相關的資訊來取得問題的結構模式,這使我們能夠找到結構上相似的範例。CSM 旨在將問題中與資料庫相關的文字範圍與資料庫中的特定表格或欄位關聯起來,這可以縮小詞彙對應差距。在 Spider 和 BIRD 資料集上的實驗結果證明了我們所提出的方法的有效性。具體來說,PAS-SQL + GPT-4o 在 Spider 基準測試中設定了一個新的技術水準,執行準確度為 87.9%,並在 BIRD 資料集上取得領先的結果,執行準確度為 64.67%。 + +##### **How to Get Your LLM to Generate Challenging Problems for Evaluation** +2502.14678v1 by Arkil Patel, Siva Reddy, Dzmitry Bahdanau + +The pace of evolution of Large Language Models (LLMs) necessitates new +approaches for rigorous and comprehensive evaluation. Traditional human +annotation is increasingly impracticable due to the complexities and costs +involved in generating high-quality, challenging problems. In this work, we +introduce CHASE, a unified framework to synthetically generate challenging +problems using LLMs without human involvement. For a given task, our approach +builds a hard problem in a bottom-up manner from simpler components. Moreover, +our framework decomposes the generation process into independently verifiable +sub-tasks, thereby ensuring a high level of quality and correctness. We +implement CHASE to create evaluation benchmarks across three diverse domains: +(1) document-based question answering, (2) repository-level code completion, +and (3) math reasoning. 
The performance of state-of-the-art LLMs on these +synthetic benchmarks lies in the range of 40-60% accuracy, thereby +demonstrating the effectiveness of our framework at generating challenging +problems. We publicly release our benchmarks and code. -摘要:人工智慧 (AI) 領域正快速影響著健康與醫療保健,但對於面臨廣泛結構性壓迫的人群來說,偏見和不良表現依然存在。先前的研究已清楚說明,需要更嚴格地注意資料代表性和模型效能,以促進公平性並減少偏見。然而,我們有機會透過運用社會流行病學和健康公平的最佳實務,來改善 AI 的可解釋性,以幫助我們針對發現的關聯性,發展假設。在本文中,我們專注於可解釋 AI (XAI),並描述一個跨領域專家小組審查架構,以從多重觀點討論和批判性評估 AI 模型的解釋,並找出偏見領域和未來研究的方向。我們強調跨領域專家小組對於產生更準確、公平的詮釋至關重要,而這些詮釋是根據歷史和脈絡而來的。跨領域小組討論有助於減少偏見、找出潛在的混淆因素,並在文獻中有缺口時找出額外研究的機會。反過來,這些見解可以建議 AI 模型改進的機會。 +摘要:大型語言模型 (LLM) 的演化速度需要新的方法來進行嚴謹且全面的評估。由於產生高品質、具挑戰性的問題所涉及的複雜性和成本,傳統的人工標註正變得越來越不可行。在這項工作中,我們介紹了 CHASE,一個統一的框架,用於使用 LLM 合成產生具有挑戰性的問題,而無需人工參與。對於給定的任務,我們的做法是以自下而上的方式從更簡單的組成部分來建立一個困難的問題。此外,我們的框架將生成過程分解為獨立可驗證的子任務,從而確保高品質和正確性。我們實作 CHASE 來建立三個不同領域的評估基準:(1) 基於文件的問答、(2) 儲存庫層級的程式碼完成,以及 (3) 數學推理。最先進的 LLM 在這些合成基準上的效能落在 40-60% 的準確度範圍內,從而證明了我們的框架在產生具有挑戰性的問題上的有效性。我們公開發布我們的基準和程式碼。 -##### **"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts** -2407.11978v1 by Hubert D. Zając, Jorge M. N. Ribeiro, Silvia Ingala, Simona Gentile, Ruth Wanjohi, Samuel N. Gitau, Jonathan F. Carlsen, Michael B. Nielsen, Tariq O. Andersen +##### **Data-Constrained Synthesis of Training Data for De-Identification** +2502.14677v1 by Thomas Vakili, Aron Henriksson, Hercules Dalianis -Artificial Intelligence (AI) repeatedly match or outperform radiologists in -lab experiments. However, real-world implementations of radiological AI-based -systems are found to provide little to no clinical value. This paper explores -how to design AI for clinical usefulness in different contexts. We conducted 19 -design sessions and design interventions with 13 radiologists from 7 clinical -sites in Denmark and Kenya, based on three iterations of a functional AI-based -prototype. Ten sociotechnical dependencies were identified as crucial for the -design of AI in radiology. 
We conceptualised four technical dimensions that -must be configured to the intended clinical context of use: AI functionality, -AI medical focus, AI decision threshold, and AI Explainability. We present four -design recommendations on how to address dependencies pertaining to the medical -knowledge, clinic type, user expertise level, patient context, and user -situation that condition the configuration of these technical dimensions. +Many sensitive domains -- such as the clinical domain -- lack widely +available datasets due to privacy risks. The increasing generative capabilities +of large language models (LLMs) have made synthetic datasets a viable path +forward. In this study, we domain-adapt LLMs to the clinical domain and +generate synthetic clinical texts that are machine-annotated with tags for +personally identifiable information using capable encoder-based NER models. The +synthetic corpora are then used to train synthetic NER models. The results show +that training NER models using synthetic corpora incurs only a small drop in +predictive performance. The limits of this process are investigated in a +systematic ablation study -- using both Swedish and Spanish data. Our analysis +shows that smaller datasets can be sufficient for domain-adapting LLMs for data +synthesis. Instead, the effectiveness of this process is almost entirely +contingent on the performance of the machine-annotating NER models trained +using the original data. 
-摘要:人工智慧(AI)在實驗室實驗中不斷地與放射科醫師匹敵或表現得更出色。然而,發現放射科 AI 為基礎系統的實際執行幾乎沒有提供臨床價值。本文探討如何為 AI 設計在不同情境中臨床上的效用。我們根據功能性 AI 為基礎原型的三次迭代,在丹麥和肯亞的 7 個臨床場域與 13 位放射科醫師進行了 19 次設計會議和設計介入。十個社會技術依賴關係被認為對於放射科中 AI 的設計至關重要。我們概念化了四個技術面向,必須根據預期的臨床使用情境進行設定:AI 功能、AI 醫療重點、AI 決策門檻,以及 AI 可解釋性。我們提出四項設計建議,說明如何處理與醫療知識、診所類型、使用者專業知識等級、患者情境,以及影響這些技術面向設定的使用者情境相關的依賴關係。 +摘要:許多敏感領域(例如臨床領域)由於隱私風險而缺乏廣泛可用的資料集。大型語言模型 (LLM) 不斷增強的生成能力已使合成資料集成為可行的途徑。在這項研究中,我們將領域適應 LLM 應用於臨床領域,並生成使用具備編碼器功能的 NER 模型以個人可識別資訊標籤進行機器標註的合成臨床文本。然後使用合成語料庫來訓練合成 NER 模型。結果顯示,使用合成語料庫訓練 NER 模型僅會導致預測效能略微下降。在系統消融研究中調查此程序的限制,同時使用瑞典語和西班牙語資料。我們的分析顯示,較小的資料集足以用於領域適應 LLM 以進行資料合成。相反地,此程序的有效性幾乎完全取決於使用原始資料訓練的機器標註 NER 模型的效能。 -##### **Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making** -2405.16424v1 by Min Hun Lee, Silvana Xin Yi Choo, Shamala D/O Thilarajah +##### **BP-SGCN: Behavioral Pseudo-Label Informed Sparse Graph Convolution Network for Pedestrian and Heterogeneous Trajectory Prediction** +2502.14676v1 by Ruochen Li, Stamos Katsigiannis, Tae-Kyun Kim, Hubert P. H. Shum -With advanced AI/ML, there has been growing research on explainable AI (XAI) -and studies on how humans interact with AI and XAI for effective human-AI -collaborative decision-making. However, we still have a lack of understanding -of how AI systems and XAI should be first presented to users without technical -backgrounds. In this paper, we present the findings of semi-structured -interviews with health professionals (n=12) and students (n=4) majoring in -medicine and health to study how to improve onboarding with AI and XAI. For the -interviews, we built upon human-AI interaction guidelines to create onboarding -materials of an AI system for stroke rehabilitation assessment and AI -explanations and introduce them to the participants. 
Our findings reveal that -beyond presenting traditional performance metrics on AI, participants desired -benchmark information, the practical benefits of AI, and interaction trials to -better contextualize AI performance, and refine the objectives and performance -of AI. Based on these findings, we highlight directions for improving -onboarding with AI and XAI and human-AI collaborative decision-making. +Trajectory prediction allows better decision-making in applications of +autonomous vehicles or surveillance by predicting the short-term future +movement of traffic agents. It is classified into pedestrian or heterogeneous +trajectory prediction. The former exploits the relatively consistent behavior +of pedestrians, but is limited in real-world scenarios with heterogeneous +traffic agents such as cyclists and vehicles. The latter typically relies on +extra class label information to distinguish the heterogeneous agents, but such +labels are costly to annotate and cannot be generalized to represent different +behaviors within the same class of agents. In this work, we introduce the +behavioral pseudo-labels that effectively capture the behavior distributions of +pedestrians and heterogeneous agents solely based on their motion features, +significantly improving the accuracy of trajectory prediction. To implement the +framework, we propose the Behavioral Pseudo-Label Informed Sparse Graph +Convolution Network (BP-SGCN) that learns pseudo-labels and informs to a +trajectory predictor. For optimization, we propose a cascaded training scheme, +in which we first learn the pseudo-labels in an unsupervised manner, and then +perform end-to-end fine-tuning on the labels in the direction of increasing the +trajectory prediction accuracy. Experiments show that our pseudo-labels +effectively model different behavior clusters and improve trajectory +prediction. 
Our proposed BP-SGCN outperforms existing methods using both +pedestrian (ETH/UCY, pedestrian-only SDD) and heterogeneous agent datasets +(SDD, Argoverse 1). -摘要:隨著先進的 AI/ML,對可解釋 AI (XAI) 的研究不斷增加,以及關於人類如何與 AI 和 XAI 互動以進行有效的人工智慧協作決策制定。然而,我們仍然缺乏對 AI 系統和 XAI 應如何首先呈現給沒有技術背景的用戶的了解。在本文中,我們展示了與醫療專業人員 (n=12) 和主修醫學和健康的學生 (n=4) 進行半結構化訪談的結果,以研究如何改善 AI 和 XAI 的入門。對於訪談,我們建立在人機互動準則之上,為中風康復評估和 AI 解釋的 AI 系統創建入門材料,並將它們介紹給參與者。我們的研究結果表明,除了呈現傳統的 AI 性能指標外,參與者還希望基准信息、AI 的實際好處以及交互試驗,以更好地將 AI 性能情境化,並完善 AI 的目標和性能。根據這些發現,我們強調了改進 AI 和 XAI 以及人機協作決策制定的入門方向。 +摘要:軌跡預測允許在自動駕駛車輛或監視應用中做出更好的決策,藉由預測交通代理的短期未來移動。它被分類為行人或異質軌跡預測。前者利用行人相對一致的行為,但受限於與自行車騎士和車輛等異質交通代理的真實世界場景。後者通常依賴額外的類別標籤資訊來區分異質代理,但此類標籤的註解成本很高,且無法概括為表示同一類別代理中的不同行為。在這項工作中,我們引入了行為偽標籤,它僅根據行人和異質代理的運動特徵有效捕捉行為分佈,顯著提升軌跡預測的準確度。為實作架構,我們提出了行為偽標籤告知稀疏圖形卷積網路 (BP-SGCN),它學習偽標籤並告知軌跡預測器。針對最佳化,我們提出了一種串聯訓練方案,其中我們首先以非監督的方式學習偽標籤,然後在標籤上執行端到端微調,朝著提升軌跡預測準確度的方向進行。實驗顯示我們的偽標籤有效建模不同的行為叢集,並提升軌跡預測。我們提出的 BP-SGCN 使用行人 (ETH/UCY,僅限行人的 SDD) 和異質代理資料集 (SDD,Argoverse 1) 都優於現有方法。 -##### **Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach** -2405.17502v1 by Ziming Liu, Longjian Liu, Robert E. Heidel, Xiaopeng Zhao +##### **Explanations of Deep Language Models Explain Language Representations in the Brain** +2502.14671v1 by Maryam Rahimi, Yadollah Yaghoobzadeh, Mohammad Reza Daliri -This article uses machine learning (ML) and explainable artificial -intelligence (XAI) techniques to investigate the relationship between -nutritional status and mortality rates associated with Alzheimers disease (AD). -The Third National Health and Nutrition Examination Survey (NHANES III) -database is employed for analysis. The random forest model is selected as the -base model for XAI analysis, and the Shapley Additive Explanations (SHAP) -method is used to assess feature importance. The results highlight significant -nutritional factors such as serum vitamin B12 and glycated hemoglobin. 
The -study demonstrates the effectiveness of random forests in predicting AD -mortality compared to other diseases. This research provides insights into the -impact of nutrition on AD and contributes to a deeper understanding of disease -progression. +Recent advances in artificial intelligence have given rise to large language +models (LLMs) that not only achieve human-like performance but also share +computational principles with the brain's language processing mechanisms. While +previous research has primarily focused on aligning LLMs' internal +representations with neural activity, we introduce a novel approach that +leverages explainable AI (XAI) methods to forge deeper connections between the +two domains. Using attribution methods, we quantified how preceding words +contribute to an LLM's next-word predictions and employed these explanations to +predict fMRI recordings from participants listening to the same narratives. Our +findings demonstrate that attribution methods robustly predict brain activity +across the language network, surpassing traditional internal representations in +early language areas. This alignment is hierarchical: early-layer explanations +correspond to the initial stages of language processing in the brain, while +later layers align with more advanced stages. Moreover, the layers more +influential on LLM next-word prediction (those with higher +attribution scores) exhibited stronger alignment with neural +activity. This work establishes a bidirectional bridge between AI and +neuroscience. First, we demonstrate that attribution methods offer a powerful +lens for investigating the neural mechanisms of language comprehension, +revealing how meaning emerges from preceding context. Second, we propose using +brain alignment as a metric to evaluate the validity of attribution methods, +providing a framework for assessing their biological plausibility. 
-摘要:本文使用機器學習 (ML) 和可解釋人工智慧 (XAI) 技術來探討營養狀況與阿茲海默症 (AD) 相關的死亡率之間的關係。採用第三次全國健康與營養檢查調查 (NHANES III) 資料庫進行分析。選擇隨機森林模型作為 XAI 分析的基礎模型,並使用 Shapley Additive Explanations (SHAP) 方法來評估特徵重要性。結果突顯了重要的營養因素,例如血清維生素 B12 和糖化血紅蛋白。該研究證明了隨機森林在預測 AD 死亡率方面相較於其他疾病的有效性。本研究提供了營養對 AD 的影響的見解,並有助於更深入地了解疾病的進展。 +摘要:最近的人工智能的進展產生了大型語言模型 (LLM),它不僅達到類似人類的表現,還與大腦的語言處理機制共享計算原理。雖然先前的研究主要集中於將 LLM 的內部表徵與神經活動對齊,但我們引入了一種新穎的方法,該方法利用可解釋 AI (XAI) 方法在兩個域之間建立更深層的聯繫。使用歸因方法,我們量化了前一個單詞如何促成 LLM 的下一個單詞預測,並利用這些解釋來預測參與者在聆聽相同敘述時的大腦功能性磁共振造影 (fMRI) 記錄。我們的發現表明,歸因方法可以穩健地預測整個語言網路中的大腦活動,超越了早期語言區域中的傳統內部表徵。這種對齊是分層的:早期層次解釋對應於大腦中語言處理的初始階段,而後續層次則與更進階的階段對齊。此外,對 LLM 下一個單詞預測影響力較大的層次(即歸因分數較高的層次)表現出與神經活動更強的對齊。這項工作在 AI 與神經科學之間建立了一個雙向橋樑。首先,我們證明歸因方法提供了一個強大的視角,用於研究語言理解的神經機制,揭示意義如何從先前的脈絡中產生。其次,我們建議使用大腦對齊作為評估歸因方法有效性的指標,提供了一個評估其生物學合理性的框架。 -##### **Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone** -2407.11974v1 by Catalina Gomez, Ruolin Wang, Katharina Breininger, Corinne Casey, Chris Bradley, Mitchell Pavlak, Alex Pham, Jithin Yohannan, Mathias Unberath +##### **AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO** +2502.14669v1 by Alan Dao, Dinh Bach Vu -Primary care providers are vital for initial triage and referrals to -specialty care. In glaucoma, asymptomatic and fast progression can lead to -vision loss, necessitating timely referrals to specialists. However, primary -eye care providers may not identify urgent cases, potentially delaying care. -Artificial Intelligence (AI) offering explanations could enhance their referral -decisions. We investigate how various AI explanations help providers -distinguish between patients needing immediate or non-urgent specialist -referrals. We built explainable AI algorithms to predict glaucoma surgery needs -from routine eyecare data as a proxy for identifying high-risk patients. 
We -incorporated intrinsic and post-hoc explainability and conducted an online -study with optometrists to assess human-AI team performance, measuring referral -accuracy and analyzing interactions with AI, including agreement rates, task -time, and user experience perceptions. AI support enhanced referral accuracy -among 87 participants (59.9%/50.8% with/without AI), though Human-AI teams -underperformed compared to AI alone. Participants believed they included AI -advice more when using the intrinsic model, and perceived it more useful and -promising. Without explanations, deviations from AI recommendations increased. -AI support did not increase workload, confidence, and trust, but reduced -challenges. On a separate test set, our black-box and intrinsic models achieved -an accuracy of 77% and 71%, respectively, in predicting surgical outcomes. We -identify opportunities of human-AI teaming for glaucoma management in primary -eye care, noting that while AI enhances referral accuracy, it also shows a -performance gap compared to AI alone, even with explanations. Human involvement -remains essential in medical decision making, underscoring the need for future -research to optimize collaboration, ensuring positive experiences and safe AI -use. +Large Language Models (LLMs) have demonstrated impressive capabilities in +language processing, yet they often struggle with tasks requiring genuine +visual spatial reasoning. In this paper, we introduce a novel two-stage +training framework designed to equip standard LLMs with visual reasoning +abilities for maze navigation. First, we leverage Supervised Fine Tuning (SFT) +on a curated dataset of tokenized maze representations to teach the model to +predict step-by-step movement commands. Next, we apply Group Relative Policy +Optimization (GRPO)-a technique used in DeepSeekR1-with a carefully crafted +reward function to refine the model's sequential decision-making and encourage +emergent chain-of-thought behaviors. 
Experimental results on synthetically +generated mazes show that while a baseline model fails to navigate the maze, +the SFT-trained model achieves 86% accuracy, and further GRPO fine-tuning +boosts accuracy to 93%. Qualitative analyses reveal that GRPO fosters more +robust and self-corrective reasoning, highlighting the potential of our +approach to bridge the gap between language models and visual spatial tasks. +These findings offer promising implications for applications in robotics, +autonomous navigation, and other domains that require integrated visual and +sequential reasoning. -摘要:初級保健提供者對於最初的分流和轉診到專科照護至關重要。在青光眼的情況下,無症狀且快速惡化可能導致視力喪失,因此需要及時轉診給專家。然而,初級眼科保健提供者可能無法識別緊急情況,可能會延誤照護。提供解釋的人工智慧 (AI) 可以加強他們的轉診決策。我們研究各種 AI 解釋如何幫助提供者區分需要立即或非緊急專科轉診的患者。我們建立了解釋性 AI 演算法,以從例行眼科護理資料預測青光眼手術需求,作為識別高風險患者的代理。我們納入了內在和事後解釋性,並與驗光師進行了一項線上研究,以評估人機團隊的表現,衡量轉診準確度並分析與 AI 的互動,包括同意率、任務時間和使用者體驗感知。在 87 名參與者中,AI 支援提高了轉診準確度(使用 AI/未使用的比例為 59.9%/50.8%),儘管人機團隊的表現不如單獨使用 AI。參與者認為他們在使用內在模型時更多地納入了 AI 建議,並認為它更有用且更有希望。沒有解釋,AI 建議的偏差會增加。AI 支援並未增加工作量、信心和信任,但減少了挑戰。在一個單獨的測試集中,我們的黑盒子和內在模型在預測手術結果方面分別達到了 77% 和 71% 的準確度。我們找出在初級眼科保健中,人機團隊合作管理青光眼的機會,並注意到雖然 AI 提高了轉診準確度,但即使有解釋,它也顯示出與單獨使用 AI 相比的效能差距。人類參與在醫療決策中仍然至關重要,這強調了未來研究優化協作、確保正面經驗和安全使用 AI 的必要性。 +摘要:大型語言模型(LLM)在語言處理方面展現出令人印象深刻的能力,但它們經常難以應付需要真正視覺空間推理的任務。在本文中,我們介紹了一種新穎的兩階段訓練架構,旨在為標準 LLM 提供迷宮導航的視覺推理能力。首先,我們在標記化迷宮表示的策展資料集上利用監督微調(SFT)來教導模型預測逐步移動指令。接下來,我們使用 DeepSeekR1 中使用的技術,即群體相對策略最佳化(GRPO),並搭配精心設計的獎勵函數來優化模型的順序決策制定,並鼓勵出現連貫的思考行為。在合成產生的迷宮上進行的實驗結果顯示,雖然基準模型無法導航迷宮,但經過 SFT 訓練的模型達到 86% 的準確度,而進一步的 GRPO 微調將準確度提升至 93%。定性分析顯示,GRPO 促進更強健且自我修正的推理,凸顯了我們的方法在彌合語言模型與視覺空間任務之間差距的潛力。這些發現為機器人、自主導航和其他需要整合視覺和順序推理的領域的應用提供了有希望的啟示。 -##### **Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery** -2406.18552v1 by Yingying Fang, Zihao Jin, Xiaodan Xing, Simon Walsh, Guang Yang +##### **InstructAgent: Building User Controllable Recommender via LLM Agent** +2502.14662v1 by Wujiang Xu, Yunxiao Shi, Zujie Liang, Xuying Ning, Kai Mei, Kun Wang, Xi Zhu, Min Xu, Yongfeng 
Zhang -In medical imaging, particularly in early disease detection and prognosis -tasks, discerning the rationale behind an AI model's predictions is crucial for -evaluating the reliability of its decisions. Conventional explanation methods -face challenges in identifying discernible decisive features in medical image -classifications, where discriminative features are subtle or not immediately -apparent. To bridge this gap, we propose an explainable model that is equipped -with both decision reasoning and feature identification capabilities. Our -approach not only detects influential image patterns but also uncovers the -decisive features that drive the model's final predictions. By implementing our -method, we can efficiently identify and visualise class-specific features -leveraged by the data-driven model, providing insights into the decision-making -processes of deep learning models. We validated our model in the demanding -realm of medical prognosis task, demonstrating its efficacy and potential in -enhancing the reliability of AI in healthcare and in discovering new knowledge -in diseases where prognostic understanding is limited. +Traditional recommender systems usually take the user-platform paradigm, +where users are directly exposed under the control of the platform's +recommendation algorithms. However, the defect of recommendation algorithms may +put users in very vulnerable positions under this paradigm. First, many +sophisticated models are often designed with commercial objectives in mind, +focusing on the platform's benefits, which may hinder their ability to protect +and capture users' true interests. Second, these models are typically optimized +using data from all users, which may overlook individual user's preferences. 
+Due to these shortcomings, users may experience several disadvantages under the +traditional user-platform direct exposure paradigm, such as lack of control +over the recommender system, potential manipulation by the platform, echo +chamber effects, or lack of personalization for less active users due to the +dominance of active users during collaborative learning. Therefore, there is an +urgent need to develop a new paradigm to protect user interests and alleviate +these issues. Recently, some researchers have introduced LLM agents to simulate +user behaviors, but these approaches primarily aim to optimize platform-side +performance, leaving core issues in recommender systems unresolved. To address +these limitations, we propose a new user-agent-platform paradigm, where an agent +serves as a protective shield between the user and the recommender system and +enables indirect exposure. To this end, we first construct four recommendation +datasets, denoted as $\dataset$, along with user instructions for each record. 
-摘要:在醫學影像中,特別是在早期疾病檢測和預後任務中,辨別 AI 模型預測背後的原理對於評估其決策的可靠性至關重要。傳統的解釋方法在識別醫學影像分類中可識別的決定性特徵時面臨挑戰,其中區別性特徵很微妙或並不明顯。為了彌合這一差距,我們提出了一個可解釋的模型,該模型具備決策推理和特徵識別能力。我們的做法不僅檢測有影響力的影像模式,還揭示了推動模型最終預測的決定性特徵。通過實施我們的模型,我們可以有效識別和視覺化由數據驅動模型利用的類特定特徵,從而深入了解深度學習模型的決策過程。我們在要求嚴格的醫學預後任務領域驗證了我們的模型,展示了其在提高 AI 在醫療保健中的可靠性和發現預後理解受限疾病的新知識方面的功效和潛力。 +摘要:傳統推薦系統通常採用使用者-平台範例, +其中使用者直接暴露在平台推薦演算法的控制之下。然而,推薦演算法的缺陷可能會讓使用者在這個範例中處於非常脆弱的位置。首先,許多精密的模型通常在設計時就考慮到商業目標,專注於平台的利益,這可能會阻礙它們保護和掌握使用者真正興趣的能力。其次,這些模型通常使用所有使用者的資料進行最佳化,這可能會忽略個別使用者的偏好。由於這些缺點,使用者可能會在傳統使用者-平台直接暴露範例中遇到一些缺點,例如缺乏對推薦系統的控制、平台的潛在操縱、同溫層效應,或由於活躍使用者在協作學習中的主導地位而缺乏針對較不活躍使用者的個人化。因此,迫切需要開發一種新的範例來保護使用者利益並緩解這些問題。最近,一些研究人員引入了 LLM 代理程式來模擬使用者行為,這些方法主要旨在最佳化平台端的效能,而未解決推薦系統中的核心問題。為了解決這些限制,我們提出了一種新的使用者-代理程式-平台範例,其中代理程式作為使用者和推薦系統之間的保護盾,實現間接暴露。為此,我們首先構建了四個推薦資料集,表示為 $\dataset$,以及每條記錄的使用者說明。 -##### **The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach** -2405.13099v1 by Mohsen Jozani, Jason A. Williams, Ahmed Aleroud, Sarbottam Bhagat +##### **Edit Once, Update Everywhere: A Simple Framework for Cross-Lingual Knowledge Synchronization in LLMs** +2502.14645v1 by Yuchen Wu, Liang Ding, Li Shen, Dacheng Tao -This study explores the relationship between informational support seeking -questions, responses, and helpfulness ratings in online health communities. We -created a labeled data set of question-response pairs and developed multimodal -machine learning and deep learning models to reliably predict informational -support questions and responses. We employed explainable AI to reveal the -emotions embedded in informational support exchanges, demonstrating the -importance of emotion in providing informational support. This complex -interplay between emotional and informational support has not been previously -researched. The study refines social support theory and lays the groundwork for -the development of user decision aids. Further implications are discussed. 
+Knowledge editing allows for efficient adaptation of large language models +(LLMs) to new information or corrections without requiring full retraining. +However, prior methods typically focus on either single-language editing or +basic multilingual editing, failing to achieve true cross-linguistic knowledge +synchronization. To address this, we present a simple and practical +state-of-the-art (SOTA) recipe Cross-Lingual Knowledge Democracy Edit (X-KDE), +designed to propagate knowledge from a dominant language to other languages +effectively. Our X-KDE comprises two stages: (i) Cross-lingual Edition +Instruction Tuning (XE-IT), which fine-tunes the model on a curated parallel +dataset to modify in-scope knowledge while preserving unrelated information, +and (ii) Target-language Preference Optimization (TL-PO), which applies +advanced optimization techniques to ensure consistency across languages, +fostering the transfer of updates. Additionally, we contribute a high-quality, +cross-lingual dataset, specifically designed to enhance knowledge transfer +across languages. Extensive experiments on the Bi-ZsRE and MzsRE benchmarks +show that X-KDE significantly enhances cross-lingual performance, achieving an +average improvement of +8.19%, while maintaining high accuracy in monolingual +settings. 
+ +摘要:知識編輯允許大語言模型 (LLM) 有效地適應新資訊或修正,而無需進行完整的再訓練。 +然而,先前的做法通常專注於單一語言編輯或基本的語音編輯,未能實現真正的跨語言知識同步。為了解決這個問題,我們提出了一個簡單且實用的最先進 (SOTA) 配方,即跨語言知識民主編輯 (X-KDE),旨在有效地從主導語言傳播知識到其他語言。我們的 X-KDE 包含兩個階段:(i) 跨語言版本指令調整 (XE-IT),它微調模型,在經過整理的平行資料集上修改範圍內的知識,同時保留不相關的資訊,以及 (ii) 目標語言偏好最佳化 (TL-PO),它應用先進的最佳化技術,以確保跨語言的一致性,促進更新的傳輸。此外,我們貢獻了一個高品質的跨語言資料集,特別設計用於增強跨語言的知識傳輸。在 Bi-ZsRE 和 MzsRE 基準上的廣泛實驗表明,X-KDE 大幅提升了跨語言效能,在單語言設定中維持高準確度的同時,平均提升了 +8.19%。 -摘要:本研究探討線上健康社群中尋求資訊支持的問題、回應,以及有幫助的評分之間的關係。我們建立了一組標記的問答配對資料集,並開發了多模態機器學習和深度學習模型,以可靠地預測資訊支持問題和回應。我們採用可解釋的 AI 來揭示資訊支持交流中蘊含的情緒,證明情緒在提供資訊支持中的重要性。這種情緒支持和資訊支持之間的複雜交互作用以前並未被研究過。本研究改進了社會支持理論,並為使用者決策輔助工具的開發奠定了基礎。討論了進一步的影響。 +##### **LIFT: Improving Long Context Understanding of Large Language Models through Long Input Fine-Tuning** +2502.14644v1 by Yansheng Mao, Yufei Xu, Jiaqi Li, Fanxu Meng, Haotong Yang, Zilong Zheng, Xiyuan Wang, Muhan Zhang -##### **ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education** -2405.10645v1 by Harris Bin Munawar, Nikolaos Misirlis +Long context understanding remains challenging for large language models due +to their limited context windows. This paper presents Long Input Fine-Tuning +(LIFT), a novel framework for long-context modeling that can improve the +long-context performance of arbitrary (short-context) LLMs by dynamically +adapting model parameters based on the long input. Importantly, LIFT, rather +than endlessly extending the context window size to accommodate increasingly +longer inputs in context, chooses to store and absorb the long input in +parameter. By fine-tuning the long input into model parameters, LIFT allows +short-context LLMs to answer questions even when the required information is +not provided in the context during inference. Furthermore, to enhance LIFT +performance while maintaining the original in-context learning (ICL) +capabilities, we introduce Gated Memory, a specialized attention adapter that +automatically balances long input memorization and ICL. 
We provide a +comprehensive analysis of the strengths and limitations of LIFT on long context +understanding, offering valuable directions for future research. -In the era of exponential technology growth, one unexpected guest has claimed -a seat in classrooms worldwide, Artificial Intelligence. Generative AI, such as -ChatGPT, promises a revolution in education, yet it arrives with a double-edged -sword. Its potential for personalized learning is offset by issues of cheating, -inaccuracies, and educators struggling to incorporate it effectively into their -lesson design. We are standing on the brink of this educational frontier, and -it is clear that we need to navigate this terrain with a lot of care. This is a -major challenge that could undermine the integrity and value of our educational -process. So, how can we turn these challenges into opportunities? When used -inappropriately, AI tools can become the perfect tool for the cut copy paste -mentality, and quickly begin to corrode critical thinking, creativity, and deep -understanding, the most important skills in our rapidly changing world. -Teachers feel that they are not equipped to leverage this technology, widening -the digital divide among educators and institutions. Addressing these concerns -calls for an in depth research approach. We will employ empirical research, -drawing on the Technology Acceptance Model, to assess the attitudes toward -generative AI among educators and students. Understanding their perceptions, -usage patterns, and hurdles is the first crucial step in creating an effective -solution. 
The present study will be used as a process manual for future -researchers to apply, running their own data, based on the steps explained here +摘要:由於大型語言模型的上下文視窗有限,因此對於它們而言,長語境理解仍然具有挑戰性。本文提出了長輸入微調 (LIFT),這是一個用於長語境建模的新穎架構,它可以通過根據長輸入動態調整模型參數來改善任意(短語境)LLM 的長語境效能。重要的是,LIFT 沒有無限擴充上下文視窗大小以容納語境中越來越長的輸入,而是選擇將長輸入儲存在參數中並吸收它。通過將長輸入微調到模型參數中,LIFT 允許短語境 LLM 回答問題,即使在推理期間語境中沒有提供所需資訊也是如此。此外,為了在保持原始語境中學習 (ICL) 能力的同時增強 LIFT 效能,我們引入了閘控記憶體,這是一個自動平衡長輸入記憶和 ICL 的特殊注意力適配器。我們對 LIFT 在長語境理解方面的優缺點進行了全面的分析,為未來的研究提供了有價值的方向。 -摘要:在科技飛速發展的時代,一位意外的訪客已在全球教室中佔有一席之地,那就是人工智慧。生成式 AI,例如 ChatGPT,承諾在教育領域掀起一場革命,但它卻是一把雙面刃。它在個人化學習方面的潛力,卻因作弊、不準確以及教育工作者難以將其有效融入教學設計等問題而抵銷。我們正站在這教育前沿的邊緣,顯然我們需要非常小心地探索這片領域。這是一個重大的挑戰,可能會損害我們教育過程的完整性和價值。那麼,我們如何將這些挑戰轉化為機遇?當不適當地使用時,AI 工具可能會成為複製貼上心態的完美工具,並迅速腐蝕批判性思維、創造力和深入理解,這些都是我們快速變化的世界中最重要的技能。教師們覺得他們沒有能力利用這項技術,這擴大了教育工作者和機構之間的數位鴻溝。解決這些問題需要深入的研究方法。我們將採用實證研究,借鑑技術接受模型,來評估教育工作者和學生對生成式 AI 的態度。了解他們的看法、使用模式和障礙是創造有效解決方案的第一個關鍵步驟。本研究將作為未來研究人員應用的流程手冊,根據此處說明的步驟運行他們自己的數據 +##### **Length-Controlled Margin-Based Preference Optimization without Reference Model** +2502.14643v1 by Gengxu Li, Tingyu Xia, Yi Chang, Yuan Wu -##### **Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data** -2405.07590v1 by Camelia Oprea, Mike Grüne, Mateusz Buglowski, Lena Olivier, Thorsten Orlikowsky, Stefan Kowalewski, Mark Schoberer, André Stollenwerk +Direct Preference Optimization (DPO) is a widely adopted offline algorithm +for preference-based reinforcement learning from human feedback (RLHF), +designed to improve training simplicity and stability by redefining reward +functions. However, DPO is hindered by several limitations, including length +bias, memory inefficiency, and probability degradation. To address these +challenges, we propose Length-Controlled Margin-Based Preference Optimization +(LMPO), a more efficient and robust alternative. 
LMPO introduces a uniform +reference model as an upper bound for the DPO loss, enabling a more accurate +approximation of the original optimization objective. Additionally, an average +log-probability optimization strategy is employed to minimize discrepancies +between training and inference phases. A key innovation of LMPO lies in its +Length-Controlled Margin-Based loss function, integrated within the +Bradley-Terry framework. This loss function regulates response length while +simultaneously widening the margin between preferred and rejected outputs. By +doing so, it mitigates probability degradation for both accepted and discarded +responses, addressing a significant limitation of existing methods. We evaluate +LMPO against state-of-the-art preference optimization techniques on two +open-ended large language models, Mistral and LLaMA3, across six conditional +benchmarks. Our experimental results demonstrate that LMPO effectively controls +response length, reduces probability degradation, and outperforms existing +approaches. The code is available at \url{https://github.com/gengxuli/LMPO}. -With the digitalization of health care systems, artificial intelligence -becomes more present in medicine. Especially machine learning shows great -potential for complex tasks such as time series classification, usually at the -cost of transparency and comprehensibility. This leads to a lack of trust by -humans and thus hinders its active usage. Explainable artificial intelligence -tries to close this gap by providing insight into the decision-making process, -the actual usefulness of its different methods is however unclear. This paper -proposes a user study based evaluation of the explanation method Grad-CAM with -application to a neural network for the classification of breaths in time -series neonatal ventilation data. 
We present the perceived usefulness of the -explainability method by different stakeholders, exposing the difficulty to -achieve actual transparency and the wish for more in-depth explanations by many -of the participants. +摘要:直接偏好優化 (DPO) 是一種廣泛採用的離線演算法,用於從人類回饋 (RLHF) 中進行基於偏好的強化學習,旨在透過重新定義獎勵函數來提升訓練的簡潔性和穩定性。然而,DPO 受到若干限制的阻礙,包括長度偏差、記憶體效率低下和機率下降。為了解決這些挑戰,我們提出長度控制邊際偏好優化 (LMPO),一種更有效率且穩健的替代方案。LMPO 引入統一參考模型作為 DPO 損失的上限,能夠更準確地近似原始最佳化目標。此外,採用平均對數機率最佳化策略來最小化訓練和推論階段之間的差異。LMPO 的一項關鍵創新在於其長度控制邊際損失函數,整合在 Bradley-Terry 架構中。此損失函數調節回應長度,同時擴大偏好和拒絕輸出之間的邊際。藉由這麼做,它減輕了已接受和已捨棄回應的機率下降,解決了現有方法的重大限制。我們在兩個開放式大型語言模型 Mistral 和 LLaMA3 上,針對六個條件基準,評估 LMPO 與最先進的偏好優化技術。我們的實驗結果證明,LMPO 有效控制回應長度,減少機率下降,並優於現有方法。程式碼可在 \url{https://github.com/gengxuli/LMPO} 取得。 -摘要:隨著醫療保健系統的數位化,人工智慧在醫學領域中變得更加普及。特別是機器學習在時間序列分類等複雜任務中展現出極大的潛力,但通常是以透明度和可理解性為代價。這導致人類缺乏信任,從而阻礙了其積極使用。可解釋的人工智慧試圖通過提供對決策過程的洞察來彌補這一差距,但其不同方法的實際效用尚不清楚。本文提出了一個基於使用者研究的評估,其中包含了 Grad-CAM 解釋方法,並將其應用於神經網路以分類時間序列新生兒呼吸數據中的呼吸。我們展示了不同利益相關者對可解釋性方法的感知效用,揭示了實現實際透明度的難度,以及許多參與者希望獲得更深入的解釋。 +##### **How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation** +2502.14642v1 by Rui Li, Heming Xia, Xinfeng Yuan, Qingxiu Dong, Lei Sha, Wenjie Li, Zhifang Sui -##### **XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare** -2405.06270v3 by Fatemeh Nazary, Yashar Deldjoo, Tommaso Di Noia, Eugenio di Sciascio +Recently, LLMs have garnered increasing attention across academic disciplines +for their potential as human digital twins, virtual proxies designed to +replicate individuals and autonomously perform tasks such as decision-making, +problem-solving, and reasoning on their behalf. However, current evaluations of +LLMs primarily emphasize dialogue simulation while overlooking human behavior +simulation, which is crucial for digital twins. To address this gap, we +introduce BehaviorChain, the first benchmark for evaluating LLMs' ability to +simulate continuous human behavior. 
BehaviorChain comprises diverse, +high-quality, persona-based behavior chains, totaling 15,846 distinct behaviors +across 1,001 unique personas, each with detailed history and profile metadata. +For evaluation, we integrate persona metadata into LLMs and employ them to +iteratively infer contextually appropriate behaviors within dynamic scenarios +provided by BehaviorChain. Comprehensive evaluation results demonstrated that +even state-of-the-art models struggle with accurately simulating continuous +human behavior. -The integration of Large Language Models (LLMs) into healthcare diagnostics -offers a promising avenue for clinical decision-making. This study outlines the -development of a novel method for zero-shot/few-shot in-context learning (ICL) -by integrating medical domain knowledge using a multi-layered structured -prompt. We also explore the efficacy of two communication styles between the -user and LLMs: the Numerical Conversational (NC) style, which processes data -incrementally, and the Natural Language Single-Turn (NL-ST) style, which -employs long narrative prompts. - Our study systematically evaluates the diagnostic accuracy and risk factors, -including gender bias and false negative rates, using a dataset of 920 patient -records in various few-shot scenarios. Results indicate that traditional -clinical machine learning (ML) models generally outperform LLMs in zero-shot -and few-shot settings. However, the performance gap narrows significantly when -employing few-shot examples alongside effective explainable AI (XAI) methods as -sources of domain knowledge. Moreover, with sufficient time and an increased -number of examples, the conversational style (NC) nearly matches the -performance of ML models. Most notably, LLMs demonstrate comparable or superior -cost-sensitive accuracy relative to ML models. 
- This research confirms that, with appropriate domain knowledge and tailored -communication strategies, LLMs can significantly enhance diagnostic processes. -The findings highlight the importance of optimizing the number of training -examples and communication styles to improve accuracy and reduce biases in LLM -applications. +摘要:最近,LLM 在各個學科中備受關注,因為它們具有作為人類數位雙胞胎的潛力,也就是虛擬代理人,旨在複製個人並自主執行任務,例如代表他們進行決策、解決問題和推理。然而,LLM 目前的評估主要強調對話模擬,同時忽視了人類行為模擬,這對數位雙胞胎至關重要。為了解決這個差距,我們引入了 BehaviorChain,這是第一個用於評估 LLM 模擬連續人類行為能力的基準。BehaviorChain 包含多樣化、高品質、基於角色的行為鏈,總共涵蓋 1,001 個獨特角色的 15,846 種不同行為,每個角色都有詳細的歷史和個人資料元數據。在評估中,我們將角色元數據整合到 LLM 中,並使用它們在 BehaviorChain 提供的動態場景中反覆推斷出在情境中適當的行為。全面的評估結果表明,即使是最先進的模型在準確模擬連續人類行為方面也存在困難。 -摘要:大型語言模型 (LLM) 與醫療診斷整合 -為臨床決策提供了一個有前景的途徑。本研究概述了一種新穎方法的開發,用於零次學習/少量學習情境學習 (ICL),方法是使用多層結構化提示整合醫療領域知識。我們還探討了使用者與 LLM 之間兩種溝通方式的功效:數值對話 (NC) 方式,它會逐步處理資料,以及自然語言單回合 (NL-ST) 方式,它會使用長篇敘事提示。 -我們的研究系統性地評估了診斷準確性和風險因子,包括性別偏見和假陰性率,使用了一個包含 920 個患者記錄的資料集,採用各種少量學習情境。結果表明,傳統的臨床機器學習 (ML) 模型通常在零次學習和少量學習設定中表現優於 LLM。然而,當使用少量學習範例以及有效的可解釋 AI (XAI) 方法作為領域知識來源時,效能差距會顯著縮小。此外,隨著時間充足和範例數量增加,對話方式 (NC) 幾乎可以媲美 ML 模型的效能。最值得注意的是,LLM 相對於 ML 模型展現出相當或更佳的成本敏感準確度。 -本研究證實,透過適當的領域知識和量身打造的溝通策略,LLM 可以顯著增強診斷程序。這些發現突顯了最佳化訓練範例數量和溝通方式的重要性,以提高準確度並減少 LLM 應用中的偏差。 +##### **NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization** +2502.14638v1 by Zheyuan Zhang, Runze Li, Tasnim Kabir, Jordan Boyd-Graber -##### **To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems** -2405.05766v1 by Miquel Miró-Nicolau, Gabriel Moyà-Alcover, Antoni Jaume-i-Capó, Manuel González-Hidalgo, Maria Gemma Sempere Campello, Juan Antonio Palmer Sancho +Image geo-localization is the task of predicting the specific location of an +image and requires complex reasoning across visual, geographical, and cultural +contexts. While prior Vision Language Models (VLMs) have the best accuracy at +this task, there is a dearth of high-quality datasets and models for analytical +reasoning. 
We first create NaviClues, a high-quality dataset derived from +GeoGuessr, a popular geography game, to supply examples of expert reasoning +from language. Using this dataset, we present Navig, a comprehensive image +geo-localization framework integrating global and fine-grained image +information. By reasoning with language, Navig reduces the average distance +error by 14% compared to previous state-of-the-art models while requiring fewer +than 1000 training samples. Our dataset and code are available at +https://github.com/SparrowZheyuan18/Navig/. -The increasing reliance on Deep Learning models, combined with their inherent -lack of transparency, has spurred the development of a novel field of study -known as eXplainable AI (XAI) methods. These methods seek to enhance the trust -of end-users in automated systems by providing insights into the rationale -behind their decisions. This paper presents a novel approach for measuring user -trust in XAI systems, allowing their refinement. Our proposed metric combines -both performance metrics and trust indicators from an objective perspective. To -validate this novel methodology, we conducted a case study in a realistic -medical scenario: the usage of XAI system for the detection of pneumonia from -x-ray images. 
+摘要:影像地理定位是預測影像特定位置的任務,需要跨視覺、地理和文化脈絡進行複雜的推理。雖然先前的視覺語言模型 (VLM) 在此任務中擁有最佳準確度,但缺乏高品質的資料集和分析推理模型。我們首先建立 NaviClues,這是一個源自 GeoGuessr 的高品質資料集,GeoGuessr 是一款流行的地理遊戲,可提供來自語言的專家推理範例。使用此資料集,我們提出 Navig,這是一個綜合性的影像地理定位架構,整合了全球和細緻的影像資訊。透過語言推理,Navig 將平均距離誤差減少了 14%,與先前的最先進模型相比,同時只需要不到 1000 個訓練樣本。我們的資料集和程式碼可在 https://github.com/SparrowZheyuan18/Navig/ 取得。 -摘要:隨著對深度學習模型依賴性的增加,加上其固有的透明度不足,促使一個新的研究領域發展,稱為可解釋 AI (XAI) 方法。這些方法旨在透過深入了解決策背後的原理,來提升最終使用者對自動化系統的信賴。本文提出了一種衡量使用者對 XAI 系統信賴度的新穎方法,允許對其進行改進。我們提出的指標結合了客觀觀點下的效能指標和信賴指標。為了驗證這個新穎的方法,我們在一個真實的醫療場景中進行了一個案例研究:使用 XAI 系統從 X 光影像中偵測肺炎。 +##### **ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation** +2502.14637v1 by Angxiao Yue, Zichong Wang, Hongteng Xu -##### **Region-specific Risk Quantification for Interpretable Prognosis of COVID-19** -2405.02815v1 by Zhusi Zhong, Jie Li, Zhuoqi Ma, Scott Collins, Harrison Bai, Paul Zhang, Terrance Healey, Xinbo Gao, Michael K. Atalay, Zhicheng Jiao +Protein backbone generation plays a central role in de novo protein design +and is significant for many biological and medical applications. Although +diffusion and flow-based generative models provide potential solutions to this +challenging task, they often generate proteins with undesired designability and +suffer computational inefficiency. In this study, we propose a novel rectified +quaternion flow (ReQFlow) matching method for fast and high-quality protein +backbone generation. In particular, our method generates a local translation +and a 3D rotation from random noise for each residue in a protein chain, which +represents each 3D rotation as a unit quaternion and constructs its flow by +spherical linear interpolation (SLERP) in an exponential format. We train the +model by quaternion flow (QFlow) matching with guaranteed numerical stability +and rectify the QFlow model to accelerate its inference and improve the +designability of generated protein backbones, leading to the proposed ReQFlow +model. 
Experiments show that ReQFlow achieves state-of-the-art performance in +protein backbone generation while requiring much fewer sampling steps and +significantly less inference time (e.g., being 37x faster than RFDiffusion and +62x faster than Genie2 when generating a backbone of length 300), demonstrating +its effectiveness and efficiency. The code is available at +https://github.com/AngxiaoYue/ReQFlow. -The COVID-19 pandemic has strained global public health, necessitating -accurate diagnosis and intervention to control disease spread and reduce -mortality rates. This paper introduces an interpretable deep survival -prediction model designed specifically for improved understanding and trust in -COVID-19 prognosis using chest X-ray (CXR) images. By integrating a large-scale -pretrained image encoder, Risk-specific Grad-CAM, and anatomical region -detection techniques, our approach produces regional interpretable outcomes -that effectively capture essential disease features while focusing on rare but -critical abnormal regions. Our model's predictive results provide enhanced -clarity and transparency through risk area localization, enabling clinicians to -make informed decisions regarding COVID-19 diagnosis with better understanding -of prognostic insights. We evaluate the proposed method on a multi-center -survival dataset and demonstrate its effectiveness via quantitative and -qualitative assessments, achieving superior C-indexes (0.764 and 0.727) and -time-dependent AUCs (0.799 and 0.691). These results suggest that our -explainable deep survival prediction model surpasses traditional survival -analysis methods in risk prediction, improving interpretability for clinical -decision making and enhancing AI system trustworthiness. 
+摘要:蛋白骨架生成在從頭蛋白質設計中扮演核心角色,且對於許多生物和醫學應用來說意義重大。儘管擴散和基於流的生成模型提供了解決此項挑戰性任務的潛在方案,但它們經常生成具有不受歡迎的可設計性的蛋白質,且遭受運算效率不彰之苦。在本研究中,我們提出了一種新穎的修正四元數流 (ReQFlow) 匹配方法,用於快速且高品質的蛋白質骨架生成。特別是,我們的模型會為蛋白質鏈中的每個殘基從隨機雜訊中生成一個局部平移和一個 3D 旋轉,將每個 3D 旋轉表示為單位四元數,並以指數格式透過球面線性插值 (SLERP) 建構其流。我們透過四元數流 (QFlow) 匹配訓練模型,並保證數值穩定性,並修正 QFlow 模型以加速其推論並改善生成蛋白質骨架的可設計性,進而提出建議的 ReQFlow 模型。實驗顯示,ReQFlow 在蛋白質骨架生成中達成最先進的效能,同時所需採樣步驟少得多,且推論時間大幅減少(例如,在生成長度為 300 的骨架時比 RFDiffusion 快 37 倍,比 Genie2 快 62 倍),證明其有效性和效率。程式碼可在 https://github.com/AngxiaoYue/ReQFlow 取得。 -摘要:COVID-19 疫情對全球公共衛生造成壓力,必須進行準確的診斷和干預,以控制疾病傳播並降低死亡率。本文介紹了一個可解釋的深度生存預測模型,專門設計用於透過胸部 X 光 (CXR) 影像改善對 COVID-19 預後的理解和信賴。透過整合大規模預訓練影像編碼器、風險特定 Grad-CAM 和解剖區域偵測技術,我們的做法產生區域可解釋的結果,有效捕捉必要的疾病特徵,同時專注於罕見但關鍵的異常區域。我們的模型預測結果透過風險區域定位提供增強的清晰度和透明度,讓臨床醫生能夠在更了解預後見解的情況下,就 COVID-19 診斷做出明智的決策。我們在多中心生存資料集上評估所提出的方法,並透過量化和質化評估證明其有效性,達到優異的 C 指數(0.764 和 0.727)和時間相關 AUC(0.799 和 0.691)。這些結果表明,我們可解釋的深度生存預測模型在風險預測方面超越傳統的生存分析方法,提升臨床決策的解釋性,並增強 AI 系統的信賴度。 +##### **PEARL: Towards Permutation-Resilient LLMs** +2502.14628v1 by Liang Chen, Li Shen, Yang Deng, Xiaoyan Zhao, Bin Liang, Kam-Fai Wong -##### **Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics** -2405.02334v2 by Francesco Prinzi, Carmelo Militello, Calogero Zarcaro, Tommaso Vincenzo Bartolotta, Salvatore Gaglio, Salvatore Vitabile +The in-context learning (ICL) capability of large language models (LLMs) +enables them to perform challenging tasks using provided demonstrations. +However, ICL is highly sensitive to the ordering of demonstrations, leading to +instability in predictions. This paper shows that this vulnerability can be +exploited to design a natural attack - difficult for model providers to detect +- that achieves nearly 80% success rate on LLaMA-3 by simply permuting the +demonstrations. 
Existing mitigation methods primarily rely on post-processing +and fail to enhance the model's inherent robustness to input permutations, +raising concerns about safety and reliability of LLMs. To address this issue, +we propose Permutation-resilient learning (PEARL), a novel framework based on +distributionally robust optimization (DRO), which optimizes model performance +against the worst-case input permutation. Specifically, PEARL consists of a +permutation-proposal network (P-Net) and the LLM. The P-Net generates the most +challenging permutations by treating it as an optimal transport problem, which +is solved using an entropy-constrained Sinkhorn algorithm. Through minimax +optimization, the P-Net and the LLM iteratively optimize against each other, +progressively improving the LLM's robustness. Experiments on synthetic +pre-training and real-world instruction tuning tasks demonstrate that PEARL +effectively mitigates permutation attacks and enhances performance. Notably, +despite being trained on fewer shots and shorter contexts, PEARL achieves +performance gains of up to 40% when scaled to many-shot and long-context +scenarios, highlighting its efficiency and generalization capabilities. -In recent years, machine learning-based clinical decision support systems -(CDSS) have played a key role in the analysis of several medical conditions. -Despite their promising capabilities, the lack of transparency in AI models -poses significant challenges, particularly in medical contexts where -reliability is a mandatory aspect. However, it appears that explainability is -inversely proportional to accuracy. For this reason, achieving transparency -without compromising predictive accuracy remains a key challenge. This paper -presents a novel method, namely Rad4XCNN, to enhance the predictive power of -CNN-derived features with the inherent interpretability of radiomic features. 
-Rad4XCNN diverges from conventional methods based on saliency maps, by -associating intelligible meaning to CNN-derived features by means of Radiomics, -offering new perspectives on explanation methods beyond visualization maps. -Using a breast cancer classification task as a case study, we evaluated -Rad4XCNN on ultrasound imaging datasets, including an online dataset and two -in-house datasets for internal and external validation. Some key results are: -i) CNN-derived features guarantee more robust accuracy when compared against -ViT-derived and radiomic features; ii) conventional visualization map methods -for explanation present several pitfalls; iii) Rad4XCNN does not sacrifice -model accuracy for their explainability; iv) Rad4XCNN provides a global -explanation enabling the physician to extract global insights and findings. Our -method can mitigate some concerns related to the explainability-accuracy -trade-off. This study highlighted the importance of proposing new methods for -model explanation without affecting their accuracy. +摘要:大型語言模型 (LLM) 的語境學習 (ICL) 能力使其能夠透過提供的示範來執行具有挑戰性的任務。然而,ICL 對示範的排序非常敏感,導致預測不穩定。本文顯示,可以利用此漏洞來設計一種自然攻擊,讓模型提供者難以偵測,透過簡單地排列示範,在 LLaMA-3 上達到近 80% 的成功率。現有的緩解方法主要依賴後處理,且無法增強模型對輸入排列的固有穩健性,引發了對 LLM 的安全性與可靠性的疑慮。為了解決此問題,我們提出了一種基於分配穩健最佳化 (DRO) 的新型架構,稱為排列彈性學習 (PEARL),它針對最差情況的輸入排列來最佳化模型效能。具體來說,PEARL 包含排列建議網路 (P-Net) 和 LLM。P-Net 將其視為最優傳輸問題來產生最具挑戰性的排列,並使用熵約束 Sinkhorn 演算法來解決。透過極小極大最佳化,P-Net 和 LLM 迭代地相互最佳化,逐步改善 LLM 的穩健性。在合成預訓練和真實世界指令調整任務上的實驗證明,PEARL 有效地減輕了排列攻擊並增強了效能。值得注意的是,儘管在較少的次數和較短的語境中進行訓練,但 PEARL 在擴展到多重次數和長語境場景時仍可獲得高達 40% 的效能提升,突顯了其效率和泛化能力。 + +##### **ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors** +2502.14627v1 by Yuguo Yin, Yuxin Xie, Wenyuan Yang, Dongchao Yang, Jinghan Ru, Xianwei Zhuang, Liming Liang, Yuexian Zou + +Multilingual audio-text retrieval (ML-ATR) is a challenging task that aims to +retrieve audio clips or multilingual texts from databases. 
However, existing +ML-ATR schemes suffer from inconsistencies for instance similarity matching +across languages. We theoretically analyze the inconsistency in terms of both +multilingual modal alignment direction error and weight error, and propose the +theoretical weight error upper bound for quantifying the inconsistency. Based +on the analysis of the weight error upper bound, we find that the inconsistency +problem stems from the data distribution error caused by random sampling of +languages. We propose a consistent ML-ATR scheme using 1-to-k contrastive +learning and audio-English co-anchor contrastive learning, aiming to mitigate +the negative impact of data distribution error on recall and consistency in +ML-ATR. Experimental results on the translated AudioCaps and Clotho datasets +show that our scheme achieves state-of-the-art performance on recall and +consistency metrics for eight mainstream languages, including English. Our code +will be available at https://github.com/ATRI-ACL/ATRI-ACL. 
-摘要:近年来,基于机器学习的临床决策支持系统 (CDSS) 在多种疾病的分析中扮演了关键角色。尽管它们具有广阔的前景,但 AI 模型缺乏透明度,尤其在医疗领域,可靠性是强制性方面,这带来了重大挑战。然而,解释性似乎与准确性成反比。因此,在不影响预测准确性的情况下实现透明度仍然是一个关键挑战。本文提出了一种新方法,即 Rad4XCNN,以通过放射组学的内在可解释性来增强 CNN 衍生特征的预测能力。Rad4XCNN 通过放射组学将可理解的含义与 CNN 衍生特征关联起来,从而偏离了基于显着性图的传统方法,为超越可视化图的解释方法提供了新的视角。使用乳腺癌分类任务作为案例研究,我们在超声成像数据集上评估了 Rad4XCNN,包括一个在线数据集和两个用于内部和外部验证的内部数据集。一些关键结果是:i) 与 ViT 衍生和放射组学特征相比,CNN 衍生特征保证了更稳健的准确性;ii) 用于解释的传统可视化图方法存在一些缺陷;iii) Rad4XCNN 不会为了可解释性而牺牲模型准确性;iv) Rad4XCNN 提供全局解释,使医生能够提取全局见解和发现。我们的方法可以减轻一些与可解释性-准确性权衡相关的担忧。本研究强调了提出新方法来解释模型而不影响其准确性的重要性。 +摘要:多模態多語言音訊文字檢索 (ML-ATR) 是一項具有挑戰性的任務,旨在從資料庫中檢索音訊片段或多語言文字。然而,現有的 ML-ATR 架構存在不一致的情況,例如跨語言的相似性比對。我們在理論上分析了不一致性,包括多模態多語言對齊方向誤差和權重誤差,並提出理論權重誤差上限以量化不一致性。根據權重誤差上限的分析,我們發現不一致性問題源於由語言隨機取樣造成的資料分佈誤差。我們提出一個一致的 ML-ATR 架構,採用 1 對 k 對比學習和音訊-英語共同錨點對比學習,旨在減輕資料分佈誤差對 ML-ATR 中召回率和一致性的負面影響。在已翻譯的 AudioCaps 和 Clotho 資料集上的實驗結果顯示,我們的架構在包括英語在內的八種主流語言的召回率和一致性指標上達到了最先進的效能。我們的程式碼將在 https://github.com/ATRI-ACL/ATRI-ACL 中提供。 -##### **Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability** -2404.16957v1 by Yunfei Ge, Quanyan Zhu +##### **Multi-Record Web Page Information Extraction From News Websites** +2502.14625v1 by Alexander Kustenkov, Maksim Varlamov, Alexander Yatskov -The pervasive integration of Artificial Intelligence (AI) has introduced -complex challenges in the responsibility and accountability in the event of -incidents involving AI-enabled systems. The interconnectivity of these systems, -ethical concerns of AI-induced incidents, coupled with uncertainties in AI -technology and the absence of corresponding regulations, have made traditional -responsibility attribution challenging. To this end, this work proposes a -Computational Reflective Equilibrium (CRE) approach to establish a coherent and -ethically acceptable responsibility attribution framework for all stakeholders. 
-The computational approach provides a structured analysis that overcomes the -limitations of conceptual approaches in dealing with dynamic and multifaceted -scenarios, showcasing the framework's explainability, coherence, and adaptivity -properties in the responsibility attribution process. We examine the pivotal -role of the initial activation level associated with claims in equilibrium -computation. Using an AI-assisted medical decision-support system as a case -study, we illustrate how different initializations lead to diverse -responsibility distributions. The framework offers valuable insights into -accountability in AI-induced incidents, facilitating the development of a -sustainable and resilient system through continuous monitoring, revision, and -reflection. +In this paper, we focused on the problem of extracting information from web +pages containing many records, a task of growing importance in the era of +massive web data. Recently, the development of neural network methods has +improved the quality of information extraction from web pages. Nevertheless, +most of the research and datasets are aimed at studying detailed pages. This +has left multi-record "list pages" relatively understudied, despite their +widespread presence and practical significance. + To address this gap, we created a large-scale, open-access dataset +specifically designed for list pages. This is the first dataset for this task +in the Russian language. Our dataset contains 13,120 web pages with news lists, +significantly exceeding existing datasets in both scale and complexity. Our +dataset contains attributes of various types, including optional and +multi-valued, providing a realistic representation of real-world list pages. +These features make our dataset a valuable resource for studying information +extraction from pages containing many records. + Furthermore, we proposed our own multi-stage information extraction methods. 
+In this work, we explore and demonstrate several strategies for applying +MarkupLM to the specific challenges of multi-record web pages. Our experiments +validate the advantages of our methods. + By releasing our dataset to the public, we aim to advance the field of +information extraction from multi-record pages. -摘要:隨著人工智慧 (AI) 的普及整合,在涉及 AI 驅動系統的事故中,責任和義務歸屬產生了複雜的挑戰。這些系統的互連性、AI 引發事故的倫理問題,加上 AI 技術的不確定性和缺乏相應法規,使得傳統責任歸屬面臨挑戰。為此,本研究提出了一種計算反思均衡 (CRE) 方法,以建立一個連貫且在倫理上可接受的責任歸屬架構,適用於所有利害關係人。計算方法提供了結構化的分析,克服了概念方法在處理動態且多面向情境時的限制,展示了該架構在責任歸屬過程中具備的可解釋性、連貫性和適應性。我們探討了與均衡計算中索賠相關的初始啟動層級的關鍵作用。我們以 AI 輔助醫療決策支援系統為案例研究,說明不同的初始化如何導致不同的責任分配。該架構提供了對 AI 引發事故中問責制的寶貴見解,透過持續監控、修訂和反思,促進了永續且有韌性的系統發展。 +摘要:在本文中,我們專注於從包含大量記錄的網頁中提取資訊的問題,這項任務在海量網路資料的時代中越來越重要。最近,神經網路方法的發展已改善從網頁中提取資訊的品質。儘管如此,大多數的研究和資料集都旨在研究詳細的網頁。儘管多記錄「清單網頁」廣泛存在且具有實用意義,但它們相對來說研究較少。 +為了解決這個差距,我們建立了一個專門針對清單網頁設計的大規模、開放存取的資料集。這是俄語中第一個針對此任務的資料集。我們的資料集包含 13,120 個包含新聞清單的網頁,在規模和複雜度上都遠遠超過現有的資料集。我們的資料集包含各種類型的屬性,包括可選和多值,提供真實世界清單網頁的實際表示。這些特點使我們的資料集成為研究從包含大量記錄的網頁中提取資訊的寶貴資源。 +此外,我們提出了我們自己的多階段資訊提取方法。在這項工作中,我們探討並展示了將 MarkupLM 應用於多記錄網頁特定挑戰的幾種策略。我們的實驗驗證了我們方法的優點。 +透過向公眾發布我們的資料集,我們旨在推進從多記錄網頁中提取資訊的領域。 -##### **Explainable AI for Fair Sepsis Mortality Predictive Model** -2404.13139v1 by Chia-Hsuan Chang, Xiaoyang Wang, Christopher C. Yang +##### **Exploring RWKV for Sentence Embeddings: Layer-wise Analysis and Baseline Comparison for Semantic Similarity** +2502.14620v1 by Xinghan Pan -Artificial intelligence supports healthcare professionals with predictive -modeling, greatly transforming clinical decision-making. This study addresses -the crucial need for fairness and explainability in AI applications within -healthcare to ensure equitable outcomes across diverse patient demographics. By -focusing on the predictive modeling of sepsis-related mortality, we propose a -method that learns a performance-optimized predictive model and then employs -the transfer learning process to produce a model with better fairness. 
Our -method also introduces a novel permutation-based feature importance algorithm -aiming at elucidating the contribution of each feature in enhancing fairness on -predictions. Unlike existing explainability methods concentrating on explaining -feature contribution to predictive performance, our proposed method uniquely -bridges the gap in understanding how each feature contributes to fairness. This -advancement is pivotal, given sepsis's significant mortality rate and its role -in one-third of hospital deaths. Our method not only aids in identifying and -mitigating biases within the predictive model but also fosters trust among -healthcare stakeholders by improving the transparency and fairness of model -predictions, thereby contributing to more equitable and trustworthy healthcare -delivery. +This paper investigates the efficacy of RWKV, a novel language model +architecture known for its linear attention mechanism, for generating sentence +embeddings in a zero-shot setting. I conduct a layer-wise analysis to evaluate +the semantic similarity captured by embeddings from different hidden layers of +a pre-trained RWKV model. The performance is assessed on the Microsoft Research +Paraphrase Corpus (MRPC) dataset using Spearman correlation and compared +against a GloVe-based baseline. My results indicate that while RWKV embeddings +capture some semantic relatedness, they underperform compared to the GloVe +baseline in terms of Spearman correlation. I also analyze the inference time +and GPU memory usage, highlighting the computational trade-offs associated with +RWKV embeddings. The findings suggest that while RWKV offers potential +advantages in terms of linear scaling, its zero-shot sentence embedding quality +for semantic similarity tasks requires further investigation and potential +task-specific fine-tuning to match or exceed simpler baselines. 
-摘要:人工智慧透過預測模型協助醫療專業人員,大幅轉變了臨床決策制定。本研究探討了在醫療保健中使用人工智慧應用程式時公平性和可解釋性的關鍵需求,以確保在不同的患者人口統計資料中獲得公平的結果。透過專注於敗血症相關死亡率的預測模型,我們提出了一種方法,該方法會學習一個效能最佳化的預測模型,然後採用轉移學習過程來產生一個具有更好公平性的模型。我們的模型還引入了一種新穎的基於排列的特徵重要性演算法,旨在闡明每個特徵在增強預測公平性方面的貢獻。與現有的可解釋性方法專注於解釋特徵對預測效能的貢獻不同,我們提出的方法獨特地彌補了理解每個特徵如何有助於公平性的差距。這項進展至關重要,因為敗血症的死亡率很高,且在三分之一的醫院死亡中扮演著角色。我們的模型不僅有助於識別和減輕預測模型中的偏差,還能透過提高模型預測的透明度和公平性來培養醫療保健利益相關者之間的信任,進而有助於提供更公平且值得信賴的醫療保健服務。 +摘要:本文探討 RWKV 的效能,這是一種以線性注意力機制聞名的語言模型架構,可用於在零次學習設定中產生句子嵌入。我進行逐層分析,以評估預先訓練的 RWKV 模型中不同隱藏層的嵌入所擷取的語義相似性。效能評估使用 Microsoft Research Paraphrase Corpus (MRPC) 資料集,採用 Spearman 相關係數,並與基於 GloVe 的基準進行比較。我的結果顯示,雖然 RWKV 嵌入可以擷取一些語義相關性,但與 GloVe 基準相比,在 Spearman 相關係數方面表現不佳。我也分析了推論時間和 GPU 記憶體使用量,強調與 RWKV 嵌入相關的運算折衷。這些發現表明,雖然 RWKV 在線性縮放方面具有潛在優勢,但其在語義相似性任務中的零次學習句子嵌入品質需要進一步探討,並需要潛在的特定任務微調,才能達到或超越較簡單的基準。 -##### **Multi Class Depression Detection Through Tweets using Artificial Intelligence** -2404.13104v1 by Muhammad Osama Nusrat, Waseem Shahzad, Saad Ahmed Jamal +##### **Reward Models Identify Consistency, Not Causality** +2502.14619v1 by Yuhui Xu, Hanze Dong, Lei Wang, Caiming Xiong, Junnan Li -Depression is a significant issue nowadays. As per the World Health -Organization (WHO), in 2023, over 280 million individuals are grappling with -depression. This is a huge number; if not taken seriously, these numbers will -increase rapidly. About 4.89 billion individuals are social media users. People -express their feelings and emotions on platforms like Twitter, Facebook, -Reddit, Instagram, etc. These platforms contain valuable information which can -be used for research purposes. Considerable research has been conducted across -various social media platforms. However, certain limitations persist in these -endeavors. Particularly, previous studies were only focused on detecting -depression and the intensity of depression in tweets. Also, there existed -inaccuracies in dataset labeling. 
In this research work, five types of -depression (Bipolar, major, psychotic, atypical, and postpartum) were predicted -using tweets from the Twitter database based on lexicon labeling. Explainable -AI was used to provide reasoning by highlighting the parts of tweets that -represent type of depression. Bidirectional Encoder Representations from -Transformers (BERT) was used for feature extraction and training. Machine -learning and deep learning methodologies were used to train the model. The BERT -model presented the most promising results, achieving an overall accuracy of -0.96. +Reward models (RMs) play a crucial role in aligning large language models +(LLMs) with human preferences and enhancing reasoning quality. Traditionally, +RMs are trained to rank candidate outputs based on their correctness and +coherence. However, in this work, we present several surprising findings that +challenge common assumptions about RM behavior. Our analysis reveals that +state-of-the-art reward models prioritize structural consistency over causal +correctness. Specifically, removing the problem statement has minimal impact on +reward scores, whereas altering numerical values or disrupting the reasoning +flow significantly affects RM outputs. Furthermore, RMs exhibit a strong +dependence on complete reasoning trajectories: truncated or incomplete steps +lead to significant variations in reward assignments, indicating that RMs +primarily rely on learned reasoning patterns rather than explicit problem +comprehension. These findings hold across multiple architectures, datasets, and +tasks, leading to three key insights: (1) RMs primarily assess coherence rather +than true reasoning quality; (2) The role of explicit problem comprehension in +reward assignment is overstated; (3) Current RMs may be more effective at +ranking responses than verifying logical validity.
Our results suggest a +fundamental limitation in existing reward modeling approaches, emphasizing the +need for a shift toward causality-aware reward models that go beyond +consistency-driven evaluation. -摘要:現今,憂鬱症是一個重要的議題。根據世界衛生組織 (WHO) 的資料,在 2023 年,超過 2.8 億人正在與憂鬱症搏鬥。這是一個龐大的數字;如果不認真看待,這些數字將會快速增加。大約有 48.9 億人是社群媒體使用者。人們在 Twitter、Facebook、Reddit、Instagram 等平台上表達自己的感受和情緒。這些平台包含有價值的資訊,可用於研究目的。已經在各種社群媒體平台上進行了大量的研究。然而,這些努力仍存在某些限制。特別是,先前的研究僅專注於偵測推文中的憂鬱症和憂鬱症的強度。此外,資料集標籤中存在不準確的情況。在這項研究工作中,使用基於詞彙標籤的 Twitter 資料庫中的推文預測了五種類型的憂鬱症(雙極型、重度、精神病型、非典型和產後)。可解釋的 AI 用於透過強調代表憂鬱症類型的推文部分來提供推理。從 Transformers(BERT)中提取的雙向編碼器表示用於特徵提取和訓練。機器學習和深度學習方法用於訓練模型。BERT 模型呈現出最有希望的結果,達到 0.96 的整體準確度。 +摘要:獎勵模型 (RM) 在將大型語言模型 (LLM) 與人類偏好對齊並提升推理品質方面扮演至關重要的角色。傳統上,RM 會訓練來根據候選輸出的正確性和一致性進行排名。然而,在這項工作中,我們提出幾個令人驚訝的發現,挑戰了關於 RM 行為的常見假設。我們的分析顯示,最先進的獎勵模型優先考慮結構一致性,而不是因果正確性。具體來說,移除問題陳述對獎勵分數的影響很小,而改變數值或中斷推理流程則會顯著影響 RM 輸出。此外,RM 表現出對完整推理軌跡的強烈依賴性,截斷或不完整的步驟會導致獎勵分配產生重大變化,這表示 RM 主要依賴於學習到的推理模式,而不是明確的問題理解。這些發現適用於多種架構、資料集和任務,得出三個關鍵見解:(1) RM 主要評估一致性,而不是真正的推理品質;(2) 在獎勵分配中,明確問題理解的角色被誇大了;(3) 目前的 RM 在排名回應方面可能比驗證邏輯有效性更有效。我們的結果表明現有獎勵建模方法存在根本限制,強調需要轉向因果感知獎勵模型,超越以一致性為導向的評估。 -##### **COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images** -2404.12832v2 by Dmytro Shvetsov, Joonas Ariva, Marharyta Domnich, Raul Vicente, Dmytro Fishman +##### **FIND: Fine-grained Information Density Guided Adaptive Retrieval-Augmented Generation for Disease Diagnosis** +2502.14614v1 by Mingyi Jia, Junwen Duan, Yan Song, Jianxin Wang -Deep learning is dramatically transforming the field of medical imaging and -radiology, enabling the identification of pathologies in medical images, -including computed tomography (CT) and X-ray scans. However, the performance of -deep learning models, particularly in segmentation tasks, is often limited by -the need for extensive annotated datasets. 
To address this challenge, the -capabilities of weakly supervised semantic segmentation are explored through -the lens of Explainable AI and the generation of counterfactual explanations. -The scope of this research is development of a novel counterfactual inpainting -approach (COIN) that flips the predicted classification label from abnormal to -normal by using a generative model. For instance, if the classifier deems an -input medical image X as abnormal, indicating the presence of a pathology, the -generative model aims to inpaint the abnormal region, thus reversing the -classifier's original prediction label. The approach enables us to produce -precise segmentations for pathologies without depending on pre-existing -segmentation masks. Crucially, image-level labels are utilized, which are -substantially easier to acquire than creating detailed segmentation masks. The -effectiveness of the method is demonstrated by segmenting synthetic targets and -actual kidney tumors from CT images acquired from Tartu University Hospital in -Estonia. The findings indicate that COIN greatly surpasses established -attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an -alternative counterfactual explanation method introduced by Singla et al. This -evidence suggests that COIN is a promising approach for semantic segmentation -of tumors in CT images, and presents a step forward in making deep learning -applications more accessible and effective in healthcare, where annotated data -is scarce. +Retrieval-Augmented Large Language Models (LLMs), which integrate external +knowledge into LLMs, have shown remarkable performance in various medical +domains, including clinical diagnosis. However, existing RAG methods struggle +to effectively assess task difficulty to make retrieval decisions, thereby +failing to meet the clinical requirements for balancing efficiency and +accuracy. 
So in this paper, we propose FIND (\textbf{F}ine-grained +\textbf{In}formation \textbf{D}ensity Guided Adaptive RAG), a novel framework +that improves the reliability of RAG in disease diagnosis scenarios. FIND +incorporates a fine-grained adaptive control module to determine whether +retrieval is necessary based on the information density of the input. By +optimizing the retrieval process and implementing a knowledge filtering module, +FIND ensures that the retrieval is better suited to clinical scenarios. +Experiments on three Chinese electronic medical record datasets demonstrate +that FIND significantly outperforms various baseline methods, highlighting its +effectiveness in clinical diagnosis tasks. -摘要:深度学习正大幅轉變醫學影像和放射線學領域,能辨識醫學影像中的病理,包括電腦斷層掃描 (CT) 和 X 光掃描。然而,深度學習模型的效能,特別是在分割任務中,常常受到廣泛註解資料集需求的限制。為了應對此挑戰,透過可解釋 AI 和反事實解釋的產生,探索弱監督語意分割的能力。本研究的範圍是開發一種新的反事實內插方法 (COIN),該方法使用生成模型將預測的分類標籤從異常翻轉為正常。例如,如果分類器將輸入的醫學影像 X 視為異常,表示存在病理,則生成模型旨在內插異常區域,從而逆轉分類器的原始預測標籤。此方法使我們能夠產生病理的精確分割,而無需依賴於預先存在的分割遮罩。至關重要的是,利用影像層級標籤,這比建立詳細的分割遮罩容易取得。該方法的有效性透過分割合成目標和從愛沙尼亞塔爾圖大學醫院取得的 CT 影像中的實際腎臟腫瘤來證明。研究結果表明,COIN 遠遠超過已建立的歸因方法,例如 RISE、ScoreCAM 和 LayerCAM,以及 Singla 等人提出的另一種反事實解釋方法。此證據表明,COIN 是一種很有前途的 CT 影像中腫瘤語意分割方法,並在醫療保健中讓深度學習應用更易於取得和更有效率邁進一步,其中註解資料很稀少。 +摘要:檢索增強大型語言模型 (LLM),將外部知識整合至 LLM,已於各種醫療領域展現出卓越效能,包括臨床診斷。然而,現有的 RAG 方法難以有效評估任務難度以做出檢索決策,因此無法滿足平衡效率和精確度的臨床需求。因此,我們在本文中提出 FIND(**F**ine-grained **In**formation **D**ensity Guided Adaptive RAG),一種新穎架構,可提升 RAG 在疾病診斷場景中的可靠性。FIND 整合一個細緻化的自適應控制模組,根據輸入的資訊密度判斷是否需要檢索。透過最佳化檢索程序並實作一個知識過濾模組,FIND 確保檢索更適合臨床場景。在三個中文電子病歷資料集上的實驗顯示,FIND 明顯優於各種基線方法,突顯其在臨床診斷任務中的有效性。 -##### **Hybrid Intelligence for Digital Humanities** -2406.15374v1 by Victor de Boer, Lise Stork +##### **Behavioral Analysis of Information Salience in Large Language Models** +2502.14613v1 by Jan Trienes, Jörg Schlötterer, Junyi Jessy Li, Christin Seifert -In this paper, we explore the synergies between Digital Humanities (DH) as a -discipline and Hybrid Intelligence (HI) as a research paradigm. 
In DH research, -the use of digital methods and specifically that of Artificial Intelligence is -subject to a set of requirements and constraints. We argue that these are -well-supported by the capabilities and goals of HI. Our contribution includes -the identification of five such DH requirements: Successful AI systems need to -be able to 1) collaborate with the (human) scholar; 2) support data criticism; -3) support tool criticism; 4) be aware of and cater to various perspectives and -5) support distant and close reading. We take the CARE principles of Hybrid -Intelligence (collaborative, adaptive, responsible and explainable) as -theoretical framework and map these to the DH requirements. In this mapping, we -include example research projects. We finally address how insights from DH can -be applied to HI and discuss open challenges for the combination of the two -disciplines. +Large Language Models (LLMs) excel at text summarization, a task that +requires models to select content based on its importance. However, the exact +notion of salience that LLMs have internalized remains unclear. To bridge this +gap, we introduce an explainable framework to systematically derive and +investigate information salience in LLMs through their summarization behavior. +Using length-controlled summarization as a behavioral probe into the content +selection process, and tracing the answerability of Questions Under Discussion +throughout, we derive a proxy for how models prioritize information. Our +experiments on 13 models across four datasets reveal that LLMs have a nuanced, +hierarchical notion of salience, generally consistent across model families and +sizes. While models show highly consistent behavior and hence salience +patterns, this notion of salience cannot be accessed through introspection, and +only weakly correlates with human perceptions of information salience. 
-摘要:在本文中,我們探討數位人文學科 (DH) 作為一門學科與混合智能 (HI) 作為一個研究典範之間的協同作用。在 DH 研究中,數位方法的使用,特別是人工智慧的使用,受到一系列要求和限制。我們認為這些要求和限制獲得 HI 的能力和目標的充分支持。我們的貢獻包括找出五個這樣的 DH 要求:成功的 AI 系統需要能夠 1) 與(人類)學者合作;2) 支援資料批評;3) 支援工具批評;4) 察覺並迎合各種觀點;5) 支援遠距和近距離閱讀。我們將混合智能的 CARE 原則(協作、適應、負責和可解釋)作為理論架構,並將這些原則對應到 DH 要求。在此對應中,我們納入範例研究專案。最後,我們探討如何將 DH 的見解應用於 HI,並討論結合這兩個學科的開放挑戰。 +摘要:大型語言模型 (LLM) 在文字摘要方面表現出色,這項任務需要模型根據重要性來選擇內容。然而,LLM 內化的顯著性準確概念仍不清楚。為了彌補這個差距,我們引入了一個可解釋的架構,透過摘要行為系統性地推導和調查 LLM 中的資訊顯著性。使用長度控制摘要作為行為探測來探討內容選擇過程,並追蹤討論中問題的可回答性,我們推導出一個模型優先處理資訊的方式代理。我們針對四個資料集中的 13 個模型進行的實驗揭示,LLM 具有細緻入微、階層式的顯著性概念,通常在模型系列和大小之間保持一致。雖然模型表現出高度一致的行為,因此具有顯著性模式,但這個顯著性概念無法透過內省來存取,而且與人類對資訊顯著性的認知僅有微弱相關性。 -##### **Ethical Framework for Responsible Foundational Models in Medical Imaging** -2406.11868v1 by Abhijit Das, Debesh Jha, Jasmer Sanjotra, Onkar Susladkar, Suramyaa Sarkar, Ashish Rauniyar, Nikhil Tomar, Vanshali Sharma, Ulas Bagci +##### **A Theory for Conditional Generative Modeling on Multiple Data Sources** +2502.14583v1 by Rongzhen Wang, Yan Zhang, Chenyu Zheng, Chongxuan Li, Guoqiang Wu -Foundational models (FMs) have tremendous potential to revolutionize medical -imaging. However, their deployment in real-world clinical settings demands -extensive ethical considerations. This paper aims to highlight the ethical -concerns related to FMs and propose a framework to guide their responsible -development and implementation within medicine. We meticulously examine ethical -issues such as privacy of patient data, bias mitigation, algorithmic -transparency, explainability and accountability. The proposed framework is -designed to prioritize patient welfare, mitigate potential risks, and foster -trust in AI-assisted healthcare. +The success of large generative models has driven a paradigm shift, +leveraging massive multi-source data to enhance model capabilities. However, +the interaction among these sources remains theoretically underexplored. 
This +paper takes the first step toward a rigorous analysis of multi-source training +in conditional generative modeling, where each condition represents a distinct +data source. Specifically, we establish a general distribution estimation error +bound in average total variation distance for conditional maximum likelihood +estimation based on the bracketing number. Our result shows that when source +distributions share certain similarities and the model is expressive enough, +multi-source training guarantees a sharper bound than single-source training. +We further instantiate the general theory on conditional Gaussian estimation +and deep generative models including autoregressive and flexible energy-based +models, by characterizing their bracketing numbers. The results highlight that +the number of sources and similarity among source distributions improve the +advantage of multi-source training. Simulations and real-world experiments +validate our theory. Code is available at: +https://github.com/ML-GSAI/Multi-Source-GM.
-摘要:基礎模型 (FM) 具有徹底改變醫學影像的巨大潛力。然而,它們在現實世界臨床環境中的部署需要廣泛的倫理考量。本文旨在強調與 FM 相關的倫理問題,並提出一個框架來指導它們在醫學中的負責任開發和實施。我們仔細審查了倫理問題,例如患者數據隱私、偏差緩解、演算法透明度、可解釋性和問責制。所提出的框架旨在優先考慮患者福利、減輕潛在風險,並培養對 AI 輔助醫療保健的信任。 +摘要:大型生成模型的成功推動了範例轉移,利用大量多來源資料來增強模型功能。然而,這些來源之間的互動在理論上仍未得到充分探討。本文踏出了嚴謹分析條件生成模型中多來源訓練的第一步,其中每個條件代表一個不同的資料來源。具體來說,我們建立了一個基於括號數的條件最大似然估計的平均總變異距離中的通用分佈估計誤差界限。我們的結果表明,當來源分佈具有一定的相似性且模型具有足夠的表達力時,多來源訓練保證了比單來源訓練更嚴格的界限。我們進一步在條件高斯估計和深度生成模型(包括自迴歸和靈活的基於能量的模型)上例證了通用理論,通過表徵它們的括號數。結果強調了來源數和來源分佈之間的相似性提高了多來源訓練的優勢。模擬和真實世界的實驗驗證了我們的理論。程式碼可在以下網址取得:\url{https://github.com/ML-GSAI/Multi-Source-GM}。 -##### **Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis** -2404.07239v1 by Milad Yousefi, Shadi Farabi Maleki, Ali Jafarizadeh, Mahya Ahmadpour Youshanlui, Aida Jafari, Siamak Pedrammehr, Roohallah Alizadehsani, Ryszard Tadeusiewicz, Pawel Plawiak +##### **A Statistical Case Against Empirical Human-AI Alignment** +2502.14581v1 by Julian Rodemann, Esteban Garces Arias, Christoph Luther, Christoph Jansen, Thomas Augustin -Thyroid cancer is an increasing global health concern that requires advanced -diagnostic methods. The application of AI and radiomics to thyroid cancer -diagnosis is examined in this review. A review of multiple databases was -conducted in compliance with PRISMA guidelines until October 2023. A -combination of keywords led to the discovery of an English academic publication -on thyroid cancer and related subjects. 267 papers were returned from the -original search after 109 duplicates were removed. Relevant studies were -selected according to predetermined criteria after 124 articles were eliminated -based on an examination of their abstract and title. After the comprehensive -analysis, an additional six studies were excluded. Among the 28 included -studies, radiomics analysis, which incorporates ultrasound (US) images, -demonstrated its effectiveness in diagnosing thyroid cancer. 
Various results -were noted, some of the studies presenting new strategies that outperformed the -status quo. The literature has emphasized various challenges faced by AI -models, including interpretability issues, dataset constraints, and operator -dependence. The synthesized findings of the 28 included studies mentioned the -need for standardization efforts and prospective multicenter studies to address -these concerns. Furthermore, approaches to overcome these obstacles were -identified, such as advances in explainable AI technology and personalized -medicine techniques. The review focuses on how AI and radiomics could transform -the diagnosis and treatment of thyroid cancer. Despite challenges, future -research on multidisciplinary cooperation, clinical applicability validation, -and algorithm improvement holds the potential to improve patient outcomes and -diagnostic precision in the treatment of thyroid cancer. +Empirical human-AI alignment aims to make AI systems act in line with +observed human behavior. While noble in its goals, we argue that empirical +alignment can inadvertently introduce statistical biases that warrant caution. +This position paper thus advocates against naive empirical alignment, offering +prescriptive alignment and a posteriori empirical alignment as alternatives. We +substantiate our principled argument by tangible examples like human-centric +decoding of language models. 
-摘要:甲狀腺癌是一種日益嚴重的全球健康問題,需要先進的診斷方法。本篇評論探討了人工智能與放射特徵分析在甲狀腺癌診斷中的應用。在符合 PRISMA 指南的情況下,對多個資料庫進行了回顧,直到 2023 年 10 月。通過結合關鍵字,發現了一篇關於甲狀腺癌和相關主題的英文學術出版物。在移除 109 篇重複文獻後,原始搜尋共回傳 267 篇論文。在根據預先確定的標準,淘汰了 124 篇文章的摘要和標題後,選出了相關研究。在進行全面分析後,額外排除了六項研究。在納入的 28 項研究中,結合超音波 (US) 影像的放射特徵分析,證明了其在診斷甲狀腺癌方面的有效性。研究結果不一,有些研究提出了優於現狀的新策略。文獻強調了人工智能模型面臨的各種挑戰,包括可解釋性問題、資料集限制和操作員依賴性。28 項納入研究的綜合發現提到,需要標準化工作和前瞻性多中心研究來解決這些問題。此外,還確定了克服這些障礙的方法,例如可解釋人工智能技術和個人化醫療技術的進步。本篇評論重點探討了人工智能和放射特徵分析如何轉變甲狀腺癌的診斷和治療。儘管存在挑戰,但未來對多學科合作、臨床適用性驗證和演算法改進的研究,仍有潛力改善甲狀腺癌治療中的患者預後和診斷精準度。 +摘要:經驗主義的人工智慧校準旨在使人工智慧系統根據觀察到的人類行為採取行動。儘管目標崇高,我們認為經驗主義校準可能會無意中引入需要謹慎對待的統計偏差。因此,本立場文件主張反對天真的經驗主義校準,提供規範性校準和後驗經驗主義校準作為替代方案。我們以具體的例子(例如以人為中心的語言模型解碼)來證明我們的原則性論點。 -##### **Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI** -2404.04686v1 by Taminul Islam, Md. Alif Sheakh, Mst. Sazia Tahosin, Most. Hasna Hena, Shopnil Akash, Yousef A. Bin Jardan, Gezahign Fentahun Wondmie, Hiba-Allah Nafidi, Mohammed Bourhia +##### **ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification** +2502.14565v1 by Hyunseok Lee, Seunghyuk Oh, Jaehyung Kim, Jinwoo Shin, Jihoon Tack -Breast cancer has rapidly increased in prevalence in recent years, making it -one of the leading causes of mortality worldwide. Among all cancers, it is by -far the most common. Diagnosing this illness manually requires significant time -and expertise. Since detecting breast cancer is a time-consuming process, -preventing its further spread can be aided by creating machine-based forecasts. -Machine learning and Explainable AI are crucial in classification as they not -only provide accurate predictions but also offer insights into how the model -arrives at its decisions, aiding in the understanding and trustworthiness of -the classification results. 
In this study, we evaluate and compare the -classification accuracy, precision, recall, and F-1 scores of five different -machine learning methods using a primary dataset (500 patients from Dhaka -Medical College Hospital). Five different supervised machine learning -techniques, including decision tree, random forest, logistic regression, naive -bayes, and XGBoost, have been used to achieve optimal results on our dataset. -Additionally, this study applied SHAP analysis to the XGBoost model to -interpret the model's predictions and understand the impact of each feature on -the model's output. We compared the accuracy with which several algorithms -classified the data, as well as contrasted with other literature in this field. -After final evaluation, this study found that XGBoost achieved the best model -accuracy, which is 97%. +Self-awareness, i.e., the ability to assess and correct one's own generation, +is a fundamental aspect of human intelligence, making its replication in large +language models (LLMs) an important yet challenging task. Previous works tackle +this by employing extensive reinforcement learning or rather relying on large +external verifiers. In this work, we propose Refine via Intrinsic +Self-Verification (ReVISE), an efficient and effective framework that enables +LLMs to self-correct their outputs through self-verification. The core idea of +ReVISE is to enable LLMs to verify their reasoning processes and continually +rethink reasoning trajectories based on its verification. We introduce a +structured curriculum based upon online preference learning to implement this +efficiently. Specifically, as ReVISE involves two challenging tasks (i.e., +self-verification and reasoning correction), we tackle each task sequentially +using curriculum learning, collecting both failed and successful reasoning +paths to construct preference pairs for efficient training. 
During inference, +our approach enjoys natural test-time scaling by integrating self-verification +and correction capabilities, further enhanced by our proposed confidence-aware +decoding mechanism. Our experiments on various reasoning tasks demonstrate that +ReVISE achieves efficient self-correction and significantly improves reasoning +performance. -摘要:近年來,乳癌的盛行率迅速增加,使其成為全球主要的死亡原因之一。在所有癌症中,乳癌迄今為止是最常見的。手動診斷此疾病需要大量的時間和專業知識。由於乳癌的檢測過程耗時,因此透過建立機器學習模型來預測,有助於防止其進一步擴散。機器學習和可解釋 AI 在分類中至關重要,因為它們不僅可以提供準確的預測,還可以深入了解模型如何做出決策,有助於理解和信賴分類結果。在此研究中,我們評估並比較了五種不同的機器學習方法的分類準確度、精確度、召回率和 F1 分數,使用了一個主要的資料集(達卡醫學院醫院的 500 名患者)。五種不同的監督式機器學習技術,包括決策樹、隨機森林、邏輯迴歸、朴素貝氏和 XGBoost,已用於在我們的資料集上取得最佳結果。此外,本研究將 SHAP 分析應用於 XGBoost 模型,以解釋模型的預測並了解每個特徵對模型輸出的影響。我們比較了幾種演算法對資料進行分類的準確度,並與該領域的其他文獻進行對比。在最後評估後,本研究發現 XGBoost 達到了最佳的模型準確度,為 97%。 +摘要:自我覺察,亦即評估和修正自身產出的能力,是人類智慧的基本面向,使其能在大型語言模型 (LLM) 中複製,是一項重要且具挑戰性的任務。先前的研究透過採用廣泛的強化學習或依賴大型外部驗證器來解決這個問題。在這項研究中,我們提出透過內在自我驗證 (ReVISE) 進行精煉,一個有效率且有效的架構,使 LLM 能透過自我驗證來自我修正其產出。ReVISE 的核心概念是讓 LLM 能驗證其推理過程,並根據驗證結果持續重新思考推理軌跡。我們導入一個建構於線上偏好學習的結構化課程,以有效率地實作這項功能。具體來說,由於 ReVISE 涉及兩項具有挑戰性的任務(即自我驗證和推理修正),我們使用課程學習循序漸進地處理每一項任務,收集失敗和成功的推理路徑,以建構偏好對,進行有效率的訓練。在推論期間,我們的作法透過整合自我驗證和修正功能,享有自然的測試時間擴充,並進一步透過我們提出的具備信心感知的解碼機制進行強化。我們在各種推理任務上的實驗顯示,ReVISE 達到有效率的自我修正,並顯著提升推理效能。 -##### **Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI** -2404.03892v3 by Maryam Ahmed, Tooba Bibi, Rizwan Ahmed Khan, Sidra Nasir +##### **Plan-over-Graph: Towards Parallelable LLM Agent Schedule** +2502.14563v1 by Shiqi Zhang, Xinbei Ma, Zouying Cao, Zhuosheng Zhang, Hai Zhao -The Deep learning (DL) models for diagnosing breast cancer from mammographic -images often operate as "black boxes", making it difficult for healthcare -professionals to trust and understand their decision-making processes. 
The -study presents an integrated framework combining Convolutional Neural Networks -(CNNs) and Explainable Artificial Intelligence (XAI) for the enhanced diagnosis -of breast cancer using the CBIS-DDSM dataset. The methodology encompasses an -elaborate data preprocessing pipeline and advanced data augmentation techniques -to counteract dataset limitations and transfer learning using pre-trained -networks such as VGG-16, Inception-V3 and ResNet was employed. A focal point of -our study is the evaluation of XAI's effectiveness in interpreting model -predictions, highlighted by utilizing the Hausdorff measure to assess the -alignment between AI-generated explanations and expert annotations -quantitatively. This approach is critical for XAI in promoting trustworthiness -and ethical fairness in AI-assisted diagnostics. The findings from our research -illustrate the effective collaboration between CNNs and XAI in advancing -diagnostic methods for breast cancer, thereby facilitating a more seamless -integration of advanced AI technologies within clinical settings. By enhancing -the interpretability of AI driven decisions, this work lays the groundwork for -improved collaboration between AI systems and medical practitioners, ultimately -enriching patient care. Furthermore, the implications of our research extended -well beyond the current methodologies. It encourages further research into how -to combine multimodal data and improve AI explanations to meet the needs of -clinical practice. +Large Language Models (LLMs) have demonstrated exceptional abilities in +reasoning for task planning. However, challenges remain under-explored for +parallel schedules. This paper introduces a novel paradigm, plan-over-graph, in +which the model first decomposes a real-life textual task into executable +subtasks and constructs an abstract task graph. The model then understands this +task graph as input and generates a plan for parallel execution. 
To enhance the +planning capability of complex, scalable graphs, we design an automated and +controllable pipeline to generate synthetic graphs and propose a two-stage +training scheme. Experimental results show that our plan-over-graph method +significantly improves task performance on both API-based LLMs and trainable +open-sourced LLMs. By normalizing complex tasks as graphs, our method naturally +supports parallel execution, demonstrating global efficiency. The code and data +are available at https://github.com/zsq259/Plan-over-Graph. -摘要:深度學習 (DL) 用於從乳房攝影術影像診斷乳癌的模型通常以「黑盒子」方式運作,這使得醫療保健專業人員難以信任和理解其決策過程。本研究提出一個整合架構,結合卷積神經網路 (CNN) 和可解釋人工智慧 (XAI),以使用 CBIS-DDSM 資料集增強乳癌的診斷。方法包含一個精細的資料前處理管線和進階資料擴充技術,以對抗資料集限制,並採用預先訓練的網路(例如 VGG-16、Inception-V3 和 ResNet)進行遷移學習。我們研究的重點是評估 XAI 在解釋模型預測中的有效性,重點利用豪斯多夫測度量化評估 AI 生成的解釋和專家註解之間的一致性。這種方法對於 XAI 在促進 AI 輔助診斷中的可信度和倫理公平性至關重要。我們研究的發現說明了 CNN 和 XAI 在推進乳癌診斷方法中的有效協作,從而促進了先進 AI 技術在臨床環境中的更順暢整合。透過增強 AI 驅動決策的可解釋性,這項工作為 AI 系統和醫療從業人員之間的改善協作奠定了基礎,最終豐富了患者照護。此外,我們研究的影響遠遠超出了目前的技術。它鼓勵進一步研究如何結合多模式資料並改善 AI 解釋,以滿足臨床實務的需求。 +摘要:大型語言模型 (LLM) 已展現出在任務規劃推理方面的非凡能力。然而,對於並行時程表的挑戰仍未充分探討。本文介紹了一個新穎的範例,即圖形規劃,其中模型首先將現實生活中的文字任務分解為可執行的子任務,並建構一個抽象任務圖。然後,模型將此任務圖理解為輸入,並產生一個並行執行的計畫。為了增強複雜、可擴充圖形的規劃能力,我們設計了一個自動化且可控的管道來產生合成圖形,並提出了一個兩階段訓練方案。實驗結果表明,我們的圖形規劃方法顯著提升了基於 API 的 LLM 和可訓練的開源 LLM 的任務效能。透過將複雜任務標準化為圖形,我們的模型自然支援並行執行,展現出整體效率。程式碼和資料可在 https://github.com/zsq259/Plan-over-Graph 取得。 -##### **Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives** -2404.00320v2 by Xingrui Gu, Zhixuan Wang, Irisa Jin, Zekun Wu +##### **Can LLMs Predict Citation Intent? An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs** +2502.14561v1 by Paris Koloveas, Serafeim Chatzopoulos, Thanasis Vergoulis, Christos Tryfonopoulos -This research presents a novel multimodal data fusion methodology for pain -behavior recognition, integrating statistical correlation analysis with -human-centered insights. 
Our approach introduces two key innovations: 1) -integrating data-driven statistical relevance weights into the fusion strategy -to effectively utilize complementary information from heterogeneous modalities, -and 2) incorporating human-centric movement characteristics into multimodal -representation learning for detailed modeling of pain behaviors. Validated -across various deep learning architectures, our method demonstrates superior -performance and broad applicability. We propose a customizable framework that -aligns each modality with a suitable classifier based on statistical -significance, advancing personalized and effective multimodal fusion. -Furthermore, our methodology provides explainable analysis of multimodal data, -contributing to interpretable and explainable AI in healthcare. By highlighting -the importance of data diversity and modality-specific representations, we -enhance traditional fusion techniques and set new standards for recognizing -complex pain behaviors. Our findings have significant implications for -promoting patient-centered healthcare interventions and supporting explainable -clinical decision-making. +This work investigates the ability of open Large Language Models (LLMs) to +predict citation intent through in-context learning and fine-tuning. Unlike +traditional approaches that rely on pre-trained models like SciBERT, which +require extensive domain-specific pretraining and specialized architectures, we +demonstrate that general-purpose LLMs can be adapted to this task with minimal +task-specific data. We evaluate twelve model variations across five prominent +open LLM families using zero, one, few, and many-shot prompting to assess +performance across scenarios. Our experimental study identifies the +top-performing model through extensive experimentation of in-context +learning-related parameters, which we fine-tune to further enhance task +performance. 
The results highlight the strengths and limitations of LLMs in +recognizing citation intents, providing valuable insights for model selection +and prompt engineering. Additionally, we make our end-to-end evaluation +framework and models openly available for future use. -摘要:本研究提出了一種創新的多模態數據融合方法,用於疼痛行為識別,將統計相關分析與以人為中心的見解相結合。我們的做法引入了兩項關鍵創新:1) 將數據驅動的統計相關權重整合到融合策略中,以有效利用來自異質模態的補充信息,以及 2) 將以人為中心的運動特徵納入多模態表示學習中,以詳細建模疼痛行為。我們的模型在各種深度學習架構中得到驗證,展示了卓越的性能和廣泛的適用性。我們提出了一個可自定義的框架,根據統計顯著性將每個模態與合適的分類器對齊,推進個性化和有效的多模態融合。此外,我們的模型提供對多模態數據的可解釋分析,有助於醫療保健中的可解釋和可解釋 AI。通過強調數據多樣性和模態特定表示的重要性,我們增強了傳統的融合技術,並為識別複雜的疼痛行為設定了新的標準。我們的發現對促進以患者為中心的醫療保健干預和支持可解釋的臨床決策制定具有重要意義。 +摘要:本研究探討開放式大型語言模型 (LLM) 透過情境學習和微調來預測引文意圖的能力。與依賴於預訓練模型(例如 SciBERT)的傳統方法不同,後者需要廣泛的特定領域預訓練和專業架構,我們證明了通用 LLM 可以使用最少的特定任務數據來適應此任務。我們使用零次、一次、少次和多次提示評估五個著名的開放式 LLM 家族中的十二個模型變體,以評估不同場景的效能。我們的實驗研究透過廣泛的實驗來識別情境學習相關參數中效能最佳的模型,我們微調這些參數以進一步增強任務效能。結果突顯了 LLM 在識別引文意圖方面的優點和限制,為模型選擇和提示工程提供了有價值的見解。此外,我們將端到端評估架構和模型公開供未來使用。 -##### **Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach** -2403.17873v1 by Andrea Ferrario, Alberto Termine, Alessandro Facchini +##### **Less is More: Improving LLM Alignment via Preference Data Selection** +2502.14560v1 by Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, Xiangnan He -Human-centered explainable AI (HCXAI) advocates for the integration of social -aspects into AI explanations. Central to the HCXAI discourse is the Social -Transparency (ST) framework, which aims to make the socio-organizational -context of AI systems accessible to their users. In this work, we suggest -extending the ST framework to address the risks of social misattributions in -Large Language Models (LLMs), particularly in sensitive areas like mental -health. 
In fact LLMs, which are remarkably capable of simulating roles and -personas, may lead to mismatches between designers' intentions and users' -perceptions of social attributes, risking to promote emotional manipulation and -dangerous behaviors, cases of epistemic injustice, and unwarranted trust. To -address these issues, we propose enhancing the ST framework with a fifth -'W-question' to clarify the specific social attributions assigned to LLMs by -its designers and users. This addition aims to bridge the gap between LLM -capabilities and user perceptions, promoting the ethically responsible -development and use of LLM-based technology. +Direct Preference Optimization (DPO) has emerged as a promising approach for +aligning large language models with human preferences. While prior work mainly +extends DPO from the aspect of the objective function, we instead improve DPO +from the largely overlooked but critical aspect of data selection. +Specifically, we address the issue of parameter shrinkage caused by noisy data +by proposing a novel margin-maximization principle for dataset curation in DPO +training. To accurately estimate margins for data selection, we propose a +dual-margin guided approach that considers both external reward margins and +implicit DPO reward margins. Extensive experiments demonstrate that our method +reduces computational cost dramatically while improving performance. +Remarkably, by using just 10\% of the Ultrafeedback dataset, our approach +achieves 3\% to 8\% improvements across various Llama and Mistral series models +on the AlpacaEval 2.0 benchmark. Furthermore, our approach seamlessly extends +to iterative DPO, yielding a roughly 3\% improvement with 25\% online data, +while further reducing training time. These results highlight the potential of +data selection strategies for advancing preference optimization. 
-摘要:以人为本的可解释 AI (HCXAI) 倡导将社会层面整合到 AI 解释中。HCXAI 话语的核心是社会透明度 (ST) 框架,其目标是让 AI 系统的社会组织背景对用户来说是可理解的。在这项工作中,我们建议扩展 ST 框架以解决大型语言模型 (LLM) 中社会错误归因的风险,尤其是在心理健康等敏感领域。事实上,LLM 能够出色地模拟角色和人格,这可能导致设计者的意图和用户对社会属性的认知之间出现错配,从而有风险促进情绪操纵和危险行为、认知不公正和不合理的信任。为了解决这些问题,我们建议用第五个“W 问题”来增强 ST 框架,以明确设计者和用户赋予 LLM 的具体社会属性。此补充旨在弥合 LLM 能力和用户认知之间的差距,促进基于 LLM 的技术在道德上负责任地开发和使用。 +摘要:直接偏好最佳化 (DPO) 已成為一種有希望的方法,可將大型語言模型與人類偏好保持一致。雖然先前的研究主要從目標函數的角度延伸 DPO,但我們反而從資料選擇這個極易被忽略但至關重要的角度改進 DPO。 +具體來說,我們透過提出一個用於 DPO 訓練中資料集整理的新邊際最大化原則,來解決由雜訊資料造成的參數收縮問題。為了準確估計資料選擇的邊際,我們提出一個雙邊際引導方法,它同時考慮外部獎勵邊際和隱含 DPO 獎勵邊際。大規模的實驗證明,我們的這種方法大幅降低了運算成本,同時改善了效能。 +值得注意的是,我們的這種方法僅使用 Ultrafeedback 資料集的 10%,便在 AlpacaEval 2.0 基準上,在各種 Llama 和 Mistral 系列模型中取得了 3% 到 8% 的改進。此外,我們的這種方法可以無縫地延伸到迭代 DPO,在使用 25% 線上資料的情況下產生了大約 3% 的改進,同時進一步減少了訓練時間。這些結果突顯了資料選擇策略在推進偏好最佳化方面的潛力。 -##### **Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification** -2403.18871v1 by Han Yuan, Chuan Hong, Pengtao Jiang, Gangming Zhao, Nguyen Tuan Anh Tran, Xinxing Xu, Yet Yen Yan, Nan Liu +##### **FUIA: Model Inversion Attack against Federated Unlearning** +2502.14558v1 by Lei Zhou, Youwen Zhu -Background: Pneumothorax is an acute thoracic disease caused by abnormal air -collection between the lungs and chest wall. To address the opaqueness often -associated with deep learning (DL) models, explainable artificial intelligence -(XAI) methods have been introduced to outline regions related to pneumothorax -diagnoses made by DL models. However, these explanations sometimes diverge from -actual lesion areas, highlighting the need for further improvement. Method: We -propose a template-guided approach to incorporate the clinical knowledge of -pneumothorax into model explanations generated by XAI methods, thereby -enhancing the quality of these explanations. Utilizing one lesion delineation -created by radiologists, our approach first generates a template that -represents potential areas of pneumothorax occurrence. 
This template is then -superimposed on model explanations to filter out extraneous explanations that -fall outside the template's boundaries. To validate its efficacy, we carried -out a comparative analysis of three XAI methods with and without our template -guidance when explaining two DL models in two real-world datasets. Results: The -proposed approach consistently improved baseline XAI methods across twelve -benchmark scenarios built on three XAI methods, two DL models, and two -datasets. The average incremental percentages, calculated by the performance -improvements over the baseline performance, were 97.8% in Intersection over -Union (IoU) and 94.1% in Dice Similarity Coefficient (DSC) when comparing model -explanations and ground-truth lesion areas. Conclusions: In the context of -pneumothorax diagnoses, we proposed a template-guided approach for improving AI -explanations. We anticipate that our template guidance will forge a fresh -approach to elucidating AI models by integrating clinical domain expertise. +With the introduction of regulations related to the ``right to be forgotten", +federated learning (FL) is facing new privacy compliance challenges. To address +these challenges, researchers have proposed federated unlearning (FU). However, +existing FU research has primarily focused on improving the efficiency of +unlearning, with less attention paid to the potential privacy vulnerabilities +inherent in these methods. To address this gap, we draw inspiration from +gradient inversion attacks in FL and propose the federated unlearning inversion +attack (FUIA). The FUIA is specifically designed for the three types of FU +(sample unlearning, client unlearning, and class unlearning), aiming to provide +a comprehensive analysis of the privacy leakage risks associated with FU. 
In +FUIA, the server acts as an honest-but-curious attacker, recording and +exploiting the model differences before and after unlearning to expose the +features and labels of forgotten data. FUIA significantly leaks the privacy of +forgotten data and can target all types of FU. This attack contradicts the goal +of FU to eliminate specific data influence, instead exploiting its +vulnerabilities to recover forgotten data and expose its privacy flaws. +Extensive experimental results show that FUIA can effectively reveal the +private information of forgotten data. To mitigate this privacy leakage, we +also explore two potential defense methods, although these come at the cost of +reduced unlearning effectiveness and the usability of the unlearned model. -摘要:背景:氣胸是一種因肺部與胸壁之間異常集氣所引起的急性胸腔疾病。為了解決深度學習(DL)模型經常伴隨的不透明性,可解釋人工智慧(XAI)方法已被引入,用於概述與 DL 模型做出的氣胸診斷相關的區域。然而,這些解釋有時會與實際病灶區域有所出入,突顯出進一步改進的必要性。方法:我們提出了一種模板引導式方法,將氣胸的臨床知識納入 XAI 方法產生的模型解釋中,從而提升這些解釋的品質。利用放射科醫師建立的病灶描繪,我們的做法首先產生一個模板,用於表示氣胸可能發生的區域。然後將此模板疊加在模型解釋上,以篩選出超出模板邊界的無關解釋。為了驗證其效力,我們對三種 XAI 方法進行了比較分析,在兩個真實世界資料集中解釋兩個 DL 模型時,分別採用和不採用我們的模板引導。結果:所提出的方法在建立於三種 XAI 方法、兩個 DL 模型和兩個資料集的十二種基準情境中,始終改善了基準 XAI 方法。在比較模型解釋和真實病灶區域時,透過基準效能的效能改進計算出的平均增量百分比為交集比(IoU)的 97.8% 和骰子相似性係數(DSC)的 94.1%。結論:在氣胸診斷的背景下,我們提出了一種模板引導式方法,用於改善 AI 解釋。我們預期我們的模板引導將透過整合臨床領域專業知識,為闡明 AI 模型建立一種新方法。 +摘要:隨著「被遺忘權」相關法規的推出, +聯盟學習 (FL) 面臨新的隱私合規挑戰。為了應對 +這些挑戰,研究人員提出了聯盟取消學習 (FU)。然而, +現有的 FU 研究主要集中在提高取消學習的效率,較少關注這些方法中固有的潛在隱私漏洞。為了解決這個差距,我們從 +FL 中的梯度反演攻擊中汲取靈感,並提出聯盟取消學習反演 +攻擊 (FUIA)。FUIA 專門設計用於三種類型的 FU +(樣本取消學習、客戶端取消學習和類別取消學習),旨在提供 +對與 FU 相關的隱私洩露風險的全面分析。在 +FUIA 中,伺服器充當誠實但好奇的攻擊者,記錄並 +利用取消學習前後的模型差異來揭露遺忘資料的功能和標籤。FUIA 大幅洩露遺忘資料的隱私,並且可以針對所有類型的 FU。此攻擊與 FU 消除特定資料影響的目標相矛盾,而是利用其 +漏洞來恢復遺忘資料並揭露其隱私缺陷。廣泛的實驗結果表明 FUIA 可以有效揭露遺忘資料的私人資訊。為了減輕這種隱私洩露,我們 +還探索了兩種潛在的防禦方法,儘管這些方法以降低取消學習的有效性和已取消學習模型的可用性為代價。 -##### **Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures** -2403.01580v1 by Séamus Lankford +##### **Multiscale Byte 
Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling** +2502.14553v1 by Eric Egli, Matteo Manica, Jannis Born -In the current machine translation (MT) landscape, the Transformer -architecture stands out as the gold standard, especially for high-resource -language pairs. This research delves into its efficacy for low-resource -language pairs including both the English$\leftrightarrow$Irish and -English$\leftrightarrow$Marathi language pairs. Notably, the study identifies -the optimal hyperparameters and subword model type to significantly improve the -translation quality of Transformer models for low-resource language pairs. - The scarcity of parallel datasets for low-resource languages can hinder MT -development. To address this, gaHealth was developed, the first bilingual -corpus of health data for the Irish language. Focusing on the health domain, -models developed using this in-domain dataset exhibited very significant -improvements in BLEU score when compared with models from the LoResMT2021 -Shared Task. A subsequent human evaluation using the multidimensional quality -metrics error taxonomy showcased the superior performance of the Transformer -system in reducing both accuracy and fluency errors compared to an RNN-based -counterpart. - Furthermore, this thesis introduces adaptNMT and adaptMLLM, two open-source -applications streamlined for the development, fine-tuning, and deployment of -neural machine translation models. These tools considerably simplify the setup -and evaluation process, making MT more accessible to both developers and -translators. Notably, adaptNMT, grounded in the OpenNMT ecosystem, promotes -eco-friendly natural language processing research by highlighting the -environmental footprint of model development. 
Fine-tuning of MLLMs by adaptMLLM -demonstrated advancements in translation performance for two low-resource -language pairs: English$\leftrightarrow$Irish and -English$\leftrightarrow$Marathi, compared to baselines from the LoResMT2021 -Shared Task. +Bytes form the basis of the digital world and thus are a promising building +block for multimodal foundation models. Recently, Byte Language Models (BLMs) +have emerged to overcome tokenization, yet the excessive length of bytestreams +requires new architectural paradigms. Therefore, we present the Multiscale Byte +Language Model (MBLM), a model-agnostic hierarchical decoder stack that allows +training with context windows of $5$M bytes on single GPU in full model +precision. We thoroughly examine MBLM's performance with Transformer and Mamba +blocks on both unimodal and multimodal tasks. Our experiments demonstrate that +hybrid architectures are efficient in handling extremely long byte sequences +during training while achieving near-linear generational efficiency. To the +best of our knowledge, we present the first evaluation of BLMs on visual Q\&A +tasks and find that, despite serializing images and the absence of an encoder, +a MBLM with pure next token prediction can match custom CNN-LSTM architectures +with designated classification heads. We show that MBLMs exhibit strong +adaptability in integrating diverse data representations, including pixel and +image filestream bytes, underlining their potential toward omnimodal foundation +models. 
Source code is publicly available at: +https://github.com/ai4sd/multiscale-byte-lm -摘要:在當前機器翻譯 (MT) 領域中,Transformer 架構脫穎而出,成為黃金標準,特別是對於高資源語言對。本研究探討其對低資源語言對的效能,包括英語↔愛爾蘭語和英語↔馬拉地語語言對。值得注意的是,本研究識別出最佳超參數和子詞模型類型,以顯著提高 Transformer 模型對低資源語言對的翻譯品質。 -低資源語言的平行資料集的稀缺會阻礙 MT 的發展。為了解決這個問題,開發了 gaHealth,這是愛爾蘭語的第一個雙語健康資料語料庫。專注於健康領域,使用此域內資料集開發的模型在 BLEU 得分方面表現出非常顯著的進步,與 LoResMT2021 共享任務中的模型相比。隨後使用多維品質指標錯誤分類法進行的人工評估顯示,與基於 RNN 的對應模型相比,Transformer 系統在減少準確性和流暢性錯誤方面表現出優異的性能。 -此外,本論文介紹了 adaptNMT 和 adaptMLLM,這兩個開源應用程式簡化了神經機器翻譯模型的開發、微調和部署。這些工具大幅簡化了設定和評估流程,讓 MT 更容易讓開發人員和翻譯人員使用。值得注意的是,adaptNMT 以 OpenNMT 生態系統為基礎,通過強調模型開發的環境足跡來促進生態友好的自然語言處理研究。與 LoResMT2021 共享任務中的基準相比,adaptMLLM 對 MLLM 的微調證明了英語↔愛爾蘭語和英語↔馬拉地語這兩個低資源語言對的翻譯性能進步。 +摘要:位元組構成數位世界的基礎,因此是多模態基礎模型的一個有前途的建構模組。最近,位元組語言模型 (BLM) 已應運而生,以克服標記化,但位元組串流的過長需要新的架構範例。因此,我們提出多尺度位元組語言模型 (MBLM),這是一個與模型無關的分層解碼器堆疊,允許在單一 GPU 上以完整的模型精度訓練 500 萬位元組的內容視窗。我們徹底檢驗了 MBLM 在單模態和多模態任務上使用 Transformer 和 Mamba 區塊的效能。我們的實驗證明,混合架構在處理訓練期間極長的位元組序列時很有效率,同時達到近乎線性的生成效率。據我們所知,我們提出在視覺問答任務上對 BLM 的首次評估,並發現,儘管序列化影像且沒有編碼器,但具有純粹下一個標記預測的 MBLM 可以匹配具有指定分類標頭的客製化 CNN-LSTM 架構。我們表明,MBLM 在整合各種資料表示形式方面表現出強大的適應性,包括像素和影像檔案串流位元組,強調它們朝向全模態基礎模型的潛力。原始碼已公開於: +https://github.com/ai4sd/multiscale-byte-lm -##### **Cause and Effect: Can Large Language Models Truly Understand Causality?** -2402.18139v3 by Swagata Ashwani, Kshiteesh Hegde, Nishith Reddy Mannuru, Mayank Jindal, Dushyant Singh Sengar, Krishna Chaitanya Rao Kathala, Dishant Banga, Vinija Jain, Aman Chadha +##### **Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks** +2502.14546v1 by Maya Bechler-Speicher, Ben Finkelshtein, Fabrizio Frasca, Luis Müller, Jan Tönshoff, Antoine Siraudin, Viktor Zaverkin, Michael M. Bronstein, Mathias Niepert, Bryan Perozzi, Mikhail Galkin, Christopher Morris -With the rise of Large Language Models(LLMs), it has become crucial to -understand their capabilities and limitations in deciphering and explaining the -complex web of causal relationships that language entails. 
Current methods use -either explicit or implicit causal reasoning, yet there is a strong need for a -unified approach combining both to tackle a wide array of causal relationships -more effectively. This research proposes a novel architecture called Context -Aware Reasoning Enhancement with Counterfactual Analysis(CARE CA) framework to -enhance causal reasoning and explainability. The proposed framework -incorporates an explicit causal detection module with ConceptNet and -counterfactual statements, as well as implicit causal detection through LLMs. -Our framework goes one step further with a layer of counterfactual explanations -to accentuate LLMs understanding of causality. The knowledge from ConceptNet -enhances the performance of multiple causal reasoning tasks such as causal -discovery, causal identification and counterfactual reasoning. The -counterfactual sentences add explicit knowledge of the not caused by scenarios. -By combining these powerful modules, our model aims to provide a deeper -understanding of causal relationships, enabling enhanced interpretability. -Evaluation of benchmark datasets shows improved performance across all metrics, -such as accuracy, precision, recall, and F1 scores. We also introduce -CausalNet, a new dataset accompanied by our code, to facilitate further -research in this domain. +While machine learning on graphs has demonstrated promise in drug design and +molecular property prediction, significant benchmarking challenges hinder its +further progress and relevance. Current benchmarking practices often lack focus +on transformative, real-world applications, favoring narrow domains like +two-dimensional molecular graphs over broader, impactful areas such as +combinatorial optimization, relational databases, or chip design. Additionally, +many benchmark datasets poorly represent the underlying data, leading to +inadequate abstractions and misaligned use cases. 
Fragmented evaluations and an +excessive focus on accuracy further exacerbate these issues, incentivizing +overfitting rather than fostering generalizable insights. These limitations +have prevented the development of truly useful graph foundation models. This +position paper calls for a paradigm shift toward more meaningful benchmarks, +rigorous evaluation protocols, and stronger collaboration with domain experts +to drive impactful and reliable advances in graph learning research, unlocking +the potential of graph learning. -摘要:隨著大型語言模型 (LLM) 的興起,了解它們在解碼和解釋語言所蘊含的複雜因果關係網路中的能力和限制變得至關重要。目前的技術使用明確或隱含的因果推理,但強烈需要一種統一的方法,結合兩者以更有效地處理廣泛的因果關係。本研究提出了一種稱為情境感知推理增強與反事實分析 (CARE CA) 框架的新架構,以增強因果推理和可解釋性。提出的框架結合了使用 ConceptNet 和反事實陳述的明確因果檢測模組,以及透過 LLM 進行的隱含因果檢測。我們的框架更進一步,加入一層反事實解釋,以強調 LLM 對因果關係的理解。來自 ConceptNet 的知識增強了多項因果推理任務的執行,例如因果發現、因果識別和反事實推理。反事實句加入了未由情境造成的明確知識。透過結合這些強大的模組,我們的模型旨在提供對因果關係更深入的理解,實現增強的可解釋性。基準資料集的評估顯示在所有指標(例如準確度、精確度、召回率和 F1 分數)上都有所提升。我們還引入了 CausalNet,一個新的資料集,並附上了我們的程式碼,以促進在這個領域的進一步研究。 +摘要:儘管圖形上的機器學習在藥物設計和分子屬性預測方面已展現潛力,但顯著的基準挑戰阻礙了其進一步進展和相關性。目前的基準實務往往缺乏對轉型性、真實世界應用的關注,偏好於狹窄的領域,例如二維分子圖形,而不是組合最佳化、關係資料庫或晶片設計等更廣泛、更有影響力的領域。此外,許多基準資料集無法充分表示基礎資料,導致抽象化不充分和使用案例錯位。支離破碎的評估和過度關注準確性進一步加劇了這些問題,激勵過度擬合,而不是培養可概括的見解。這些限制阻礙了真正有用的圖形基礎模型的開發。這篇立場文件呼籲將範例轉變為更有意義的基準、嚴格的評估協定,以及與領域專家的更強大合作,以推動圖形學習研究中具有影響力和可靠性的進展,釋放圖形學習的潛力。 -##### **Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina** -2402.18600v1 by Yasin Sadeghi Bazargani, Majid Mirzaei, Navid Sobhi, Mirsaeed Abdollahi, Ali Jafarizadeh, Siamak Pedrammehr, Roohallah Alizadehsani, Ru San Tan, Sheikh Mohammed Shariful Islam, U. Rajendra Acharya +##### **LLM-based User Profile Management for Recommender System** +2502.14541v1 by Seunghwan Bang, Hwanjun Song -Diabetes mellitus (DM) predisposes patients to vascular complications. -Retinal images and vasculature reflect the body's micro- and macrovascular -health. 
They can be used to diagnose DM complications, including diabetic -retinopathy (DR), neuropathy, nephropathy, and atherosclerotic cardiovascular -disease, as well as forecast the risk of cardiovascular events. Artificial -intelligence (AI)-enabled systems developed for high-throughput detection of DR -using digitized retinal images have become clinically adopted. Beyond DR -screening, AI integration also holds immense potential to address challenges -associated with the holistic care of the patient with DM. In this work, we aim -to comprehensively review the literature for studies on AI applications based -on retinal images related to DM diagnosis, prognostication, and management. We -will describe the findings of holistic AI-assisted diabetes care, including but -not limited to DR screening, and discuss barriers to implementing such systems, -including issues concerning ethics, data privacy, equitable access, and -explainability. With the ability to evaluate the patient's health status vis a -vis DM complication as well as risk prognostication of future cardiovascular -complications, AI-assisted retinal image analysis has the potential to become a -central tool for modern personalized medicine in patients with DM. +The rapid advancement of Large Language Models (LLMs) has opened new +opportunities in recommender systems by enabling zero-shot recommendation +without conventional training. Despite their potential, most existing works +rely solely on users' purchase histories, leaving significant room for +improvement by incorporating user-generated textual data, such as reviews and +product descriptions. Addressing this gap, we propose PURE, a novel LLM-based +recommendation framework that builds and maintains evolving user profiles by +systematically extracting and summarizing key information from user reviews. 
+PURE consists of three core components: a Review Extractor for identifying user +preferences and key product features, a Profile Updater for refining and +updating user profiles, and a Recommender for generating personalized +recommendations using the most current profile. To evaluate PURE, we introduce +a continuous sequential recommendation task that reflects real-world scenarios +by adding reviews over time and updating predictions incrementally. Our +experimental results on Amazon datasets demonstrate that PURE outperforms +existing LLM-based methods, effectively leveraging long-term user information +while managing token limitations. -摘要:糖尿病(DM)使患者容易出現血管併發症。 -視網膜影像和血管反映身體的微血管和巨血管健康狀況。它們可用於診斷糖尿病併發症,包括糖尿病視網膜病變(DR)、神經病變、腎病和動脈粥樣硬化性心血管疾病,以及預測心血管事件的風險。為使用數位化視網膜影像進行高通量 DR 檢測而開發的人工智慧(AI)啟用系統已在臨床採用。除了 DR 篩檢外,AI 整合也具有巨大的潛力來應對與糖尿病患者整體照護相關的挑戰。在這項工作中,我們旨在全面回顧基於視網膜影像的 AI 應用相關研究的文獻,這些研究與糖尿病的診斷、預後和管理有關。我們將描述整體 AI 輔助糖尿病照護的發現,包括但不限於 DR 篩檢,並討論實施此類系統的障礙,包括與倫理、資料隱私、公平存取和可解釋性有關的問題。透過評估患者的健康狀況,同時考量糖尿病併發症以及未來心血管併發症的風險預後,AI 輔助視網膜影像分析有潛力成為糖尿病患者現代化個人化醫療的中心工具。 +摘要:大型語言模型 (LLM) 的快速進步為推薦系統開啟了新的機會,它能實現零次學習推薦,而無需傳統訓練。儘管有潛力,但現有的大部分工作僅依賴於使用者的購買記錄,透過納入使用者產生的文字資料,例如評論和產品說明,仍有很大的改進空間。針對此差距,我們提出 PURE,一個新穎的基於 LLM 的推薦架構,透過系統性地從使用者評論中提取和總結關鍵資訊,建立並維護不斷演進的使用者檔案。PURE 由三個核心組成部分組成:一個評論萃取器,用於識別使用者的喜好和產品主要功能;一個檔案更新器,用於精煉和更新使用者檔案;一個推薦器,用於使用最新的檔案產生個人化推薦。為了評估 PURE,我們引入一個連續順序推薦任務,透過隨著時間新增評論和遞增更新預測,反映真實世界的場景。我們在 Amazon 資料集上的實驗結果證明,PURE 優於現有的基於 LLM 的方法,在管理符號限制的同時,有效地利用長期使用者資訊。 +##### **LoRA-GGPO: Mitigating Double Descent in LoRA Fine-Tuning via Gradient-Guided Perturbation Optimization** +2502.14538v1 by Yupeng Chang, Chenlu Guo, Yi Chang, Yuan Wu -### Medical -|Publish Date|Title|Authors|Homepage|Code| -| :---: | :---: | :---: | :---: | :---: | -|**2025-02-20**|**FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis**|Fadillah Maani et.al.|[2502.14807v1](http://arxiv.org/abs/2502.14807v1)|null| -|**2025-02-20**|**Step-by-Step Fact Verification System for Medical Claims with 
Explainable Reasoning**|Juraj Vladika et.al.|[2502.14765v1](http://arxiv.org/abs/2502.14765v1)|null| -|**2025-02-20**|**MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders**|Maya Varma et.al.|[2502.14753v1](http://arxiv.org/abs/2502.14753v1)|null| -|**2025-02-20**|**Data-Constrained Synthesis of Training Data for De-Identification**|Thomas Vakili et.al.|[2502.14677v1](http://arxiv.org/abs/2502.14677v1)|null| -|**2025-02-20**|**ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation**|Angxiao Yue et.al.|[2502.14637v1](http://arxiv.org/abs/2502.14637v1)|[link](https://github.com/AngxiaoYue/ReQFlow)| -|**2025-02-20**|**MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models**|Shrey Pandit et.al.|[2502.14302v1](http://arxiv.org/abs/2502.14302v1)|null| -|**2025-02-20**|**EyeBench: A Call for More Rigorous Evaluation of Retinal Image Enhancement**|Wenhui Zhu et.al.|[2502.14260v1](http://arxiv.org/abs/2502.14260v1)|null| -|**2025-02-19**|**Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning**|Cole Gawin et.al.|[2502.14086v1](http://arxiv.org/abs/2502.14086v1)|null| -|**2025-02-19**|**Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging**|Shansong Wang et.al.|[2502.14064v1](http://arxiv.org/abs/2502.14064v1)|null| -|**2025-02-19**|**VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare**|Anudeex Shetty et.al.|[2502.13775v1](http://arxiv.org/abs/2502.13775v1)|null| -|**2025-02-19**|**PeerQA: A Scientific Question Answering Dataset from Peer Reviews**|Tim Baumgärtner et.al.|[2502.13668v1](http://arxiv.org/abs/2502.13668v1)|[link](https://github.com/ukplab/peerqa)| -|**2025-02-19**|**Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs**|Yushi Feng 
et.al.|[2502.13555v1](http://arxiv.org/abs/2502.13555v1)|[link](https://github.com/ys-feng/DemoGraph)| -|**2025-02-19**|**MobileViM: A Light-weight and Dimension-independent Vision Mamba for 3D Medical Image Analysis**|Wei Dai et.al.|[2502.13524v1](http://arxiv.org/abs/2502.13524v1)|[link](https://github.com/anthonyweidai/MobileViM_3D)| -|**2025-02-19**|**Unlocking Multimodal Integration in EHRs: A Prompt Learning Framework for Language and Time Series Fusion**|Shuai Niu et.al.|[2502.13509v1](http://arxiv.org/abs/2502.13509v1)|null| -|**2025-02-19**|**Towards a perturbation-based explanation for medical AI as differentiable programs**|Takeshi Abe et.al.|[2502.14001v1](http://arxiv.org/abs/2502.14001v1)|null| -|**2025-02-19**|**RGAR: Recurrence Generation-augmented Retrieval for Factual-aware Medical Question Answering**|Sichu Liang et.al.|[2502.13361v1](http://arxiv.org/abs/2502.13361v1)|null| -|**2025-02-18**|**Adjust for Trust: Mitigating Trust-Induced Inappropriate Reliance on AI Assistance**|Tejas Srinivasan et.al.|[2502.13321v1](http://arxiv.org/abs/2502.13321v1)|null| -|**2025-02-18**|**Prediction of Clinical Complication Onset using Neural Point Processes**|Sachini Weerasekara et.al.|[2502.13290v1](http://arxiv.org/abs/2502.13290v1)|null| -|**2025-02-18**|**SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?**|Yucheng Shi et.al.|[2502.13233v1](http://arxiv.org/abs/2502.13233v1)|null| -|**2025-02-18**|**Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions**|Taedong Yun et.al.|[2502.13135v1](http://arxiv.org/abs/2502.13135v1)|null| -|**2025-02-18**|**Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization**|Priyaranjan Pattnayak et.al.|[2502.13108v1](http://arxiv.org/abs/2502.13108v1)|null| -|**2025-02-18**|**Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved 
Text-Based Learning for LGE Detection**|Athira J Jacob et.al.|[2502.12948v1](http://arxiv.org/abs/2502.12948v1)|null| -|**2025-02-18**|**Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models**|Rubing Li et.al.|[2502.12825v2](http://arxiv.org/abs/2502.12825v2)|null| -|**2025-02-18**|**LLM Safety for Children**|Prasanjit Rath et.al.|[2502.12552v1](http://arxiv.org/abs/2502.12552v1)|[link](https://github.com/Avenge-PRC777/LLM-Safety-For-Children-Code)| -|**2025-02-17**|**Classifiers of Data Sharing Statements in Clinical Trial Records**|Saber Jelodari Mamaghani et.al.|[2502.12362v1](http://arxiv.org/abs/2502.12362v1)|null| -|**2025-02-17**|**Relational Norms for Human-AI Cooperation**|Brian D. Earp et.al.|[2502.12102v1](http://arxiv.org/abs/2502.12102v1)|null| -|**2025-02-17**|**Deep Spatio-Temporal Neural Network for Air Quality Reanalysis**|Ammar Kheder et.al.|[2502.11941v1](http://arxiv.org/abs/2502.11941v1)|[link](https://github.com/AmmarKheder/AQ-Net)| -|**2025-02-17**|**Proactive Depot Discovery: A Generative Framework for Flexible Location-Routing**|Site Qu et.al.|[2502.11715v1](http://arxiv.org/abs/2502.11715v1)|null| -|**2025-02-17**|**LLM Agents Making Agent Tools**|Georg Wölflein et.al.|[2502.11705v1](http://arxiv.org/abs/2502.11705v1)|null| -|**2025-02-17**|**MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression**|Linjie Mu et.al.|[2502.11651v1](http://arxiv.org/abs/2502.11651v1)|[link](https://github.com/linjiemu/mmxu)| -|**2025-02-17**|**A Survey of Personalized Large Language Models: Progress and Future Directions**|Jiahong Liu et.al.|[2502.11528v1](http://arxiv.org/abs/2502.11528v1)|null| -|**2025-02-17**|**Variable-frame CNNLSTM for Breast Nodule Classification using Ultrasound Videos**|Xiangxiang Cui et.al.|[2502.11481v1](http://arxiv.org/abs/2502.11481v1)|null| -|**2025-02-17**|**Leveraging Labelled Data Knowledge: A Cooperative Rectification Learning 
Network for Semi-supervised 3D Medical Image Segmentation**|Yanyan Wang et.al.|[2502.11456v1](http://arxiv.org/abs/2502.11456v1)|[link](https://github.com/Yaan-Wang/CRLN)| -|**2025-02-16**|**A Survey of LLM-based Agents in Medicine: How far are we from Baymax?**|Wenxuan Wang et.al.|[2502.11211v1](http://arxiv.org/abs/2502.11211v1)|null| -|**2025-02-16**|**RT-DEMT: A hybrid real-time acupoint detection model combining mamba and transformer**|Shilong Yang et.al.|[2502.11179v1](http://arxiv.org/abs/2502.11179v1)|[link](https://github.com/sohyu1/rt-demt)| -|**2025-02-16**|**Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications**|Alexandru Lecu et.al.|[2502.11108v1](http://arxiv.org/abs/2502.11108v1)|[link](https://github.com/alexlecu/llmkgraph)| -|**2025-02-16**|**Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration**|Xianbing Zhao et.al.|[2502.12204v1](http://arxiv.org/abs/2502.12204v1)|null| -|**2025-02-16**|**CL-MFAP: A Contrastive Learning-Based Multimodal Foundation Model for Molecular Property Prediction and Antibiotic Screening**|Gen Zhou et.al.|[2502.11001v1](http://arxiv.org/abs/2502.11001v1)|[link](https://github.com/clmfap/clmfap)| -|**2025-02-15**|**Automatic Quality Assessment of First Trimester Crown-Rump-Length Ultrasound Images**|Sevim Cengiz et.al.|[2502.10908v1](http://arxiv.org/abs/2502.10908v1)|null| -|**2025-02-15**|**Breaking Down the Hierarchy: A New Approach to Leukemia Classification**|Ibraheem Hamdi et.al.|[2502.10899v1](http://arxiv.org/abs/2502.10899v1)|null| -|**2025-02-15**|**An Empirical Analysis of Uncertainty in Large Language Model Evaluations**|Qiujie Xie et.al.|[2502.10709v1](http://arxiv.org/abs/2502.10709v1)|[link](https://github.com/hasakixie123/llm-evaluator-uncertainty)| -|**2025-02-15**|**Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model**|Jiarui Jin 
et.al.|[2502.10707v1](http://arxiv.org/abs/2502.10707v1)|[link](https://github.com/pkudigitalhealth/heartlang)| -|**2025-02-15**|**Self-Explaining Hypergraph Neural Networks for Diagnosis Prediction**|Leisheng Yu et.al.|[2502.10689v1](http://arxiv.org/abs/2502.10689v1)|null| -|**2025-02-15**|**ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis**|Xueshen Li et.al.|[2502.10620v1](http://arxiv.org/abs/2502.10620v1)|null| -|**2025-02-15**|**Optimizing CNN Architectures for Advanced Thoracic Disease Classification**|Tejas Mirthipati et.al.|[2502.10614v1](http://arxiv.org/abs/2502.10614v1)|null| -|**2025-02-14**|**PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation**|Faruk Ahmed et.al.|[2502.10536v1](http://arxiv.org/abs/2502.10536v1)|null| -|**2025-02-14**|**Tempo: Helping Data Scientists and Domain Experts Collaboratively Specify Predictive Modeling Tasks**|Venkatesh Sivaraman et.al.|[2502.10526v2](http://arxiv.org/abs/2502.10526v2)|null| -|**2025-02-14**|**A Robust Attack: Displacement Backdoor Attack**|Yong Li et.al.|[2502.10490v1](http://arxiv.org/abs/2502.10490v1)|null| -|**2025-02-14**|**3D ReX: Causal Explanations in 3D Neuroimaging Classification**|Melane Navaratnarajah et.al.|[2502.12181v1](http://arxiv.org/abs/2502.12181v1)|null| -|**2025-02-14**|**Analyzing Patient Daily Movement Behavior Dynamics Using Two-Stage Encoding Model**|Jin Cui et.al.|[2502.09947v1](http://arxiv.org/abs/2502.09947v1)|null| -|**2025-02-14**|**TransGUNet: Transformer Meets Graph-based Skip Connection for Medical Image Segmentation**|Ju-Hyeon Nam et.al.|[2502.09931v1](http://arxiv.org/abs/2502.09931v1)|null| -|**2025-02-14**|**Video2Policy: Scaling up Manipulation Tasks in Simulation through Internet Videos**|Weirui Ye et.al.|[2502.09886v1](http://arxiv.org/abs/2502.09886v1)|null| -|**2025-02-14**|**HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation 
via Heterogeneous Knowledge Adaptation**|Tianwei Lin et.al.|[2502.09838v2](http://arxiv.org/abs/2502.09838v2)|[link](https://github.com/dcdmllm/healthgpt)| -|**2025-02-13**|**Incentivize without Bonus: Provably Efficient Model-based Online Multi-agent RL for Markov Games**|Tong Yang et.al.|[2502.09780v1](http://arxiv.org/abs/2502.09780v1)|null| -|**2025-02-13**|**The AI-Therapist Duo: Exploring the Potential of Human-AI Collaboration in Personalized Art Therapy for PICS Intervention**|Bereket A. Yilma et.al.|[2502.09757v1](http://arxiv.org/abs/2502.09757v1)|null| -|**2025-02-13**|**A CNN Approach to Automated Detection and Classification of Brain Tumors**|Md. Zahid Hasan et.al.|[2502.09731v1](http://arxiv.org/abs/2502.09731v1)|null| -|**2025-02-13**|**Evaluating GPT's Capability in Identifying Stages of Cognitive Impairment from Electronic Health Data**|Yu Leng et.al.|[2502.09715v1](http://arxiv.org/abs/2502.09715v1)|null| -|**2025-02-13**|**Metamorphic Testing for Pose Estimation Systems**|Matias Duran et.al.|[2502.09460v1](http://arxiv.org/abs/2502.09460v1)|null| -|**2025-02-13**|**Towards Virtual Clinical Trials of Radiology AI with Conditional Generative Modeling**|Benjamin D. 
Killeen et.al.|[2502.09688v1](http://arxiv.org/abs/2502.09688v1)|null| -|**2025-02-13**|**Mind What You Ask For: Emotional and Rational Faces of Persuasion by Large Language Models**|Wiktoria Mieleszczenko-Kowszewicz et.al.|[2502.09687v1](http://arxiv.org/abs/2502.09687v1)|null| -|**2025-02-13**|**The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**|Danni Feng et.al.|[2502.09247v1](http://arxiv.org/abs/2502.09247v1)|null| -|**2025-02-13**|**From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**|Lukas Buess et.al.|[2502.09242v1](http://arxiv.org/abs/2502.09242v1)|null| -|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null| -|**2025-02-13**|**Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**|Sanskar Sehgal et.al.|[2502.09204v1](http://arxiv.org/abs/2502.09204v1)|null| -|**2025-02-13**|**Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**|Jin Cui et.al.|[2502.09173v1](http://arxiv.org/abs/2502.09173v1)|null| -|**2025-02-13**|**TastepepAI, An artificial intelligence platform for taste peptide de novo design**|Jianda Yue et.al.|[2502.12167v1](http://arxiv.org/abs/2502.12167v1)|null| -|**2025-02-12**|**HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**|Valentina Vadori et.al.|[2502.08754v1](http://arxiv.org/abs/2502.08754v1)|[link](https://github.com/Vadori/CytoArk)| -|**2025-02-12**|**Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion**|Lemuel Puglisi et.al.|[2502.08560v1](http://arxiv.org/abs/2502.08560v1)|[link](https://github.com/lemuelpuglisi/brlp)| 
-|**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null| -|**2025-02-12**|**EEG Artifact Detection and Correction with Deep Autoencoders**|David Aquilué-Llorens et.al.|[2502.08686v1](http://arxiv.org/abs/2502.08686v1)|null| -|**2025-02-12**|**SycEval: Evaluating LLM Sycophancy**|Aaron Fanous et.al.|[2502.08177v1](http://arxiv.org/abs/2502.08177v1)|null| -|**2025-02-12**|**Cancer Vaccine Adjuvant Name Recognition from Biomedical Literature using Large Language Models**|Hasin Rehana et.al.|[2502.09659v1](http://arxiv.org/abs/2502.09659v1)|null| -|**2025-02-11**|**Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?**|Hye Sun Yun et.al.|[2502.07963v1](http://arxiv.org/abs/2502.07963v1)|null| -|**2025-02-11**|**An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating**|Mohammad Ali Labbaf Khaniki et.al.|[2502.07755v1](http://arxiv.org/abs/2502.07755v1)|null| -|**2025-02-11**|**Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension**|Wenbo Gong et.al.|[2502.07752v2](http://arxiv.org/abs/2502.07752v2)|null| -|**2025-02-11**|**The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation**|Raman Dutt et.al.|[2502.07516v2](http://arxiv.org/abs/2502.07516v2)|[link](https://github.com/Raman1121/diffusion_memorization)| -|**2025-02-11**|**KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level**|Ruining Deng et.al.|[2502.07288v1](http://arxiv.org/abs/2502.07288v1)|[link](https://github.com/agaldran/kpis)| -|**2025-02-11**|**Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**|Jiaying Lu et.al.|[2502.07158v2](http://arxiv.org/abs/2502.07158v2)|null| -|**2025-02-11**|**Explaining 
3D Computed Tomography Classifiers with Counterfactuals**|Joseph Paul Cohen et.al.|[2502.07156v1](http://arxiv.org/abs/2502.07156v1)|[link](https://github.com/ieee8023/ct-counterfactuals)| -|**2025-02-10**|**Interactive Data Harmonization with LLM Agents**|Aécio Santos et.al.|[2502.07132v1](http://arxiv.org/abs/2502.07132v1)|null| -|**2025-02-10**|**Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML**|Mohammad Amir Salari et.al.|[2502.07026v1](http://arxiv.org/abs/2502.07026v1)|null| -|**2025-02-10**|**AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements**|Adriana Eufrosiana Bora et.al.|[2502.07022v1](http://arxiv.org/abs/2502.07022v1)|null| -|**2025-02-10**|**Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium**|Amin Adibi et.al.|[2502.06693v1](http://arxiv.org/abs/2502.06693v1)|null| -|**2025-02-10**|**Automatic Evaluation of Healthcare LLMs Beyond Question-Answering**|Anna Arias-Duart et.al.|[2502.06666v1](http://arxiv.org/abs/2502.06666v1)|null| -|**2025-02-10**|**Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging**|Mohammed Abdul Hafeez Khan et.al.|[2502.06632v1](http://arxiv.org/abs/2502.06632v1)|null| -|**2025-02-10**|**Illegal Waste Detection in Remote Sensing Images: A Case Study**|Federico Gibellini et.al.|[2502.06607v2](http://arxiv.org/abs/2502.06607v2)|null| -|**2025-02-10**|**FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model**|Anna Tegon et.al.|[2502.06438v1](http://arxiv.org/abs/2502.06438v1)|null| -|**2025-02-10**|**Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?**|Qingshan Hou et.al.|[2502.06289v1](http://arxiv.org/abs/2502.06289v1)|null| -|**2025-02-10**|**Integrating Sequence and Image Modeling in Irregular Medical Time Series Through 
Self-Supervised Learning**|Liuqing Chen et.al.|[2502.06134v1](http://arxiv.org/abs/2502.06134v1)|null| -|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null| -|**2025-02-10**|**Can ChatGPT Diagnose Alzheimer's Disease?**|Quoc-Toan Nguyen et.al.|[2502.06907v1](http://arxiv.org/abs/2502.06907v1)|null| -|**2025-02-09**|**Protecting Intellectual Property of EEG-based Neural Networks with Watermarking**|Ahmed Abdelaziz et.al.|[2502.05931v1](http://arxiv.org/abs/2502.05931v1)|[link](https://github.com/Prog-Jacob/watermarking-eeg-models)| -|**2025-02-09**|**Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models**|Shiyu Teng et.al.|[2502.05879v1](http://arxiv.org/abs/2502.05879v1)|null| -|**2025-02-09**|**LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison**|Gabriele De Vito et.al.|[2502.06890v1](http://arxiv.org/abs/2502.06890v1)|null| -|**2025-02-09**|**Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)**|Lokesh Koli et.al.|[2502.07815v1](http://arxiv.org/abs/2502.07815v1)|null| -|**2025-02-09**|**WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch**|Ying Lei et.al.|[2502.05783v1](http://arxiv.org/abs/2502.05783v1)|null| -|**2025-02-09**|**RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care**|Ziqi Yang et.al.|[2502.05740v1](http://arxiv.org/abs/2502.05740v1)|null| -|**2025-02-08**|**4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis**|An Zhao et.al.|[2502.05713v1](http://arxiv.org/abs/2502.05713v1)|null| -|**2025-02-08**|**KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy**|Hyunjong Kim 
et.al.|[2502.05651v1](http://arxiv.org/abs/2502.05651v1)|null| -|**2025-02-08**|**ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports**|Aynur Guluzade et.al.|[2502.05638v1](http://arxiv.org/abs/2502.05638v1)|[link](https://gitlab.cc-asp.fraunhofer.de/health-open/elmtex)| +Large Language Models (LLMs) have achieved remarkable success in natural +language processing, but their full fine-tuning remains resource-intensive. +Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation +(LoRA), have emerged as a practical solution by approximating parameter updates +with low-rank matrices. However, LoRA often exhibits a "double descent" +phenomenon during fine-tuning, where model performance degrades due to +overfitting and limited expressiveness caused by low-rank constraints. To +address this issue, we propose LoRA-GGPO (Gradient-Guided Perturbation +Optimization), a novel method that leverages gradient and weight norms to +generate targeted perturbations. By optimizing the sharpness of the loss +landscape, LoRA-GGPO guides the model toward flatter minima, mitigating the +double descent problem and improving generalization. Extensive experiments on +natural language understanding (NLU) and generation (NLG) tasks demonstrate +that LoRA-GGPO outperforms LoRA and its state-of-the-art variants. Furthermore, +extended experiments specifically designed to analyze the double descent +phenomenon confirm that LoRA-GGPO effectively alleviates this issue, producing +more robust and generalizable models. Our work provides a robust and efficient +solution for fine-tuning LLMs, with broad applicability in real-world +scenarios. The code is available at https://github.com/llm172/LoRA-GGPO. 
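
The LoRA-GGPO abstract above describes LoRA as approximating parameter updates with low-rank matrices. A minimal sketch of that idea follows, assuming the common LoRA convention of adding `(alpha / r) * B @ A` to a frozen weight `W`. The helper names, matrix shapes, and values are hypothetical and chosen only for illustration. The gradient-guided perturbation step of LoRA-GGPO is not shown, because the abstract does not specify it.

```python
# Illustrative sketch only; this is not the LoRA-GGPO authors' code.
# All names, shapes and values below are hypothetical.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)  # A has shape (r, d_in); B has shape (d_out, r)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_parameter_count(d_out, d_in, r):
    """Compare the trainable parameters of full fine-tuning and a rank-r adapter."""
    return {"full": d_out * d_in, "lora": r * (d_out + d_in)}


# Frozen 3x3 identity weight combined with a rank-1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[1.0], [0.0], [2.0]]  # shape (3, 1)
A = [[0.5, 0.0, 1.0]]      # shape (1, 3)
W_eff = lora_effective_weight(W, B, A, alpha=2.0)
counts = trainable_parameter_count(4096, 4096, 8)
```

Only `A` and `B` would be trained in this setup while `W` stays frozen, which is why the adapter's parameter count grows with `r * (d_out + d_in)` rather than `d_out * d_in`.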
-#### Abstracts -##### **FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis** -2502.14807v1 by Fadillah Maani, Numan Saeed, Tausifa Saleem, Zaid Farooq, Hussain Alasmawi, Werner Diehl, Ameera Mohammad, Gareth Waring, Saudabi Valappi, Leanne Bricker, Mohammad Yaqub +摘要:大型語言模型 (LLM) 在自然語言處理方面取得了顯著的成功,但它們的完全微調仍然需要大量資源。參數高效微調 (PEFT) 方法(例如低秩適應 (LoRA))已成為一種實用的解決方案,它通過低秩矩陣近似參數更新。然而,LoRA 在微調過程中經常表現出「雙重下降」現象,其中模型性能會因過度擬合和低秩約束導致的表達能力有限而下降。為了解決這個問題,我們提出了 LoRA-GGPO(梯度引導擾動優化),這是一種利用梯度和權重範數來產生目標擾動的新方法。通過優化損失函數曲面的陡度,LoRA-GGPO 引導模型朝向更平坦的最小值,從而減輕雙重下降問題並改善泛化能力。在自然語言理解 (NLU) 和生成 (NLG) 任務中進行的廣泛實驗表明,LoRA-GGPO 優於 LoRA 及其最先進的變體。此外,專門設計用於分析雙重下降現象的延伸實驗證實,LoRA-GGPO 有效地緩解了這個問題,產生了更強大且更具泛化能力的模型。我們的研究為微調 LLM 提供了一個強大且高效的解決方案,在現實世界場景中具有廣泛的適用性。代碼可在 https://github.com/llm172/LoRA-GGPO 獲得。 -Foundation models are becoming increasingly effective in the medical domain, -offering pre-trained models on large datasets that can be readily adapted for -downstream tasks. Despite progress, fetal ultrasound images remain a -challenging domain for foundation models due to their inherent complexity, -often requiring substantial additional training and facing limitations due to -the scarcity of paired multimodal data. To overcome these challenges, here we -introduce FetalCLIP, a vision-language foundation model capable of generating -universal representation of fetal ultrasound images. FetalCLIP was pre-trained -using a multimodal learning approach on a diverse dataset of 210,035 fetal -ultrasound images paired with text. This represents the largest paired dataset -of its kind used for foundation model development to date. This unique training -approach allows FetalCLIP to effectively learn the intricate anatomical -features present in fetal ultrasound images, resulting in robust -representations that can be used for a variety of downstream applications. 
In -extensive benchmarking across a range of key fetal ultrasound applications, -including classification, gestational age estimation, congenital heart defect -(CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all -baselines while demonstrating remarkable generalizability and strong -performance even with limited labeled data. We plan to release the FetalCLIP -model publicly for the benefit of the broader scientific community. +##### **CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models** +2502.14529v1 by Zhenhong Zhou, Zherui Li, Jie Zhang, Yuanhe Zhang, Kun Wang, Yang Liu, Qing Guo -摘要:基礎模型在醫療領域正變得越來越有效, -提供在大型資料集上預先訓練的模型,可輕鬆適應 -下游任務。儘管有進展,但胎兒超音波影像仍然是 -基礎模型的挑戰領域,因為它們固有的複雜性, -通常需要大量的額外訓練,並且由於配對多模態數據的稀缺而面臨限制。為了克服這些挑戰,我們在此 -介紹 FetalCLIP,一種能夠產生 -胎兒超音波影像通用表示的視覺語言基礎模型。FetalCLIP 使用多模態學習方法在包含 210,035 張胎兒 -超音波影像與文字配對的多樣化資料集上進行預訓練。這代表迄今為止用於基礎模型開發的最大配對資料集。這種獨特的訓練 -方法使 FetalCLIP 能夠有效地學習胎兒超音波影像中存在的複雜解剖特徵,從而產生強大的 -表示,可應用於各種下游應用。在涵蓋一系列關鍵胎兒超音波應用(包括分類、胎齡估算、先天性心臟缺陷 -(CHD) 偵測和胎兒結構分割)的廣泛基準測試中,FetalCLIP 在展現出卓越的泛化能力和強勁的 -效能,即使標記資料有限,也優於所有基準。我們計畫公開發布 FetalCLIP 模型,造福廣大的科學界。 +Large Language Model-based Multi-Agent Systems (LLM-MASs) have demonstrated +remarkable real-world capabilities, effectively collaborating to complete +complex tasks. While these systems are designed with safety mechanisms, such as +rejecting harmful instructions through alignment, their security remains +largely unexplored. This gap leaves LLM-MASs vulnerable to targeted +disruptions. In this paper, we introduce Contagious Recursive Blocking Attacks +(Corba), a novel and simple yet highly effective attack that disrupts +interactions between agents within an LLM-MAS. Corba leverages two key +properties: its contagious nature allows it to propagate across arbitrary +network topologies, while its recursive property enables sustained depletion of +computational resources. 
Notably, these blocking attacks often involve +seemingly benign instructions, making them particularly challenging to mitigate +using conventional alignment methods. We evaluate Corba on two widely-used +LLM-MASs, namely, AutoGen and Camel across various topologies and commercial +models. Additionally, we conduct more extensive experiments in open-ended +interactive LLM-MASs, demonstrating the effectiveness of Corba in complex +topology structures and open-source models. Our code is available at: +https://github.com/zhrli324/Corba. -##### **Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning** -2502.14765v1 by Juraj Vladika, Ivana Hacajová, Florian Matthes +摘要:基於大型語言模型的多主體系統(LLM-MAS)已展現出卓越的真實世界能力,有效地協作以完成複雜任務。儘管這些系統設計有安全機制,例如透過對齊拒絕有害指令,但其安全性仍未得到充分探討。此一缺口讓 LLM-MAS 易受針對性的破壞。在本文中,我們介紹了傳染性遞迴封鎖攻擊(Corba),這是一種新穎且簡單但極為有效的攻擊,會破壞 LLM-MAS 中主體之間的互動。Corba 利用了兩個關鍵特性:其傳染性使其能夠在任意網路拓撲中傳播,而其遞迴特性則能持續耗盡運算資源。值得注意的是,這些封鎖攻擊通常涉及看似良性的指令,這使得使用傳統對齊方法來減輕攻擊特別具有挑戰性。我們在兩個廣泛使用的 LLM-MAS,即 AutoGen 和 Camel 上評估了 Corba,涵蓋了各種拓撲和商業模型。此外,我們在開放式互動 LLM-MAS 中進行了更廣泛的實驗,證明了 Corba 在複雜拓撲結構和開源模型中的有效性。我們的程式碼可在以下網址取得:https://github.com/zhrli324/Corba。 -Fact verification (FV) aims to assess the veracity of a claim based on -relevant evidence. The traditional approach for automated FV includes a -three-part pipeline relying on short evidence snippets and encoder-only -inference models. More recent approaches leverage the multi-turn nature of LLMs -to address FV as a step-by-step problem where questions inquiring additional -context are generated and answered until there is enough information to make a -decision. This iterative method makes the verification process rational and -explainable. While these methods have been tested for encyclopedic claims, -exploration on domain-specific and realistic claims is missing. 
In this work, -we apply an iterative FV system on three medical fact-checking datasets and -evaluate it with multiple settings, including different LLMs, external web -search, and structured reasoning using logic predicates. We demonstrate -improvements in the final performance over traditional approaches and the high -potential of step-by-step FV systems for domain-specific claims. +##### **Small Graph Is All You Need: DeepStateGNN for Scalable Traffic Forecasting** +2502.14525v1 by Yannick Wölker, Arash Hajisafi, Cyrus Shahabi, Matthias Renz -摘要:事實驗證 (FV) 旨在根據相關證據評估主張的真實性。自動化 FV 的傳統方法包括依賴於短證據片段和僅編碼器推論模型的三部分管道。最近的方法利用 LLM 的多輪特性,將 FV 視為一個逐步問題,其中會產生問題來詢問額外背景並回答,直到有足夠的資訊可以做出決定。這種迭代方法使驗證過程合理且可解釋。雖然這些方法已針對百科全書式主張進行測試,但缺乏對特定領域和現實主張的探討。在這項工作中,我們在三個醫學事實查核資料集上應用了一個迭代 FV 系統,並使用多種設定對其進行評估,包括不同的 LLM、外部網路搜尋和使用邏輯謂詞的結構化推理。我們展示了傳統方法的最終效能改進,以及逐步 FV 系統對特定領域主張的高潛力。 +We propose a novel Graph Neural Network (GNN) model, named DeepStateGNN, for +analyzing traffic data, demonstrating its efficacy in two critical tasks: +forecasting and reconstruction. Unlike typical GNN methods that treat each +traffic sensor as an individual graph node, DeepStateGNN clusters sensors into +higher-level graph nodes, dubbed Deep State Nodes, based on various similarity +criteria, resulting in a fixed number of nodes in a Deep State graph. The term +"Deep State" nodes is a play on words, referencing hidden networks of power +that, like these nodes, secretly govern traffic independently of visible +sensors. These Deep State Nodes are defined by several similarity factors, +including spatial proximity (e.g., sensors located nearby in the road network), +functional similarity (e.g., sensors on similar types of freeways), and +behavioral similarity under specific conditions (e.g., traffic behavior during +rain). This clustering approach allows for dynamic and adaptive node grouping, +as sensors can belong to multiple clusters and clusters may evolve over time. 
+Our experimental results show that DeepStateGNN offers superior scalability and +faster training, while also delivering more accurate results than competitors. +It effectively handles large-scale sensor networks, outperforming other methods +in both traffic forecasting and reconstruction accuracy. -##### **MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders** -2502.14753v1 by Maya Varma, Ashwin Kumar, Rogier van der Sluijs, Sophie Ostmeier, Louis Blankemeier, Pierre Chambon, Christian Bluethgen, Jip Prince, Curtis Langlotz, Akshay Chaudhari +摘要:我們提出一個名為 DeepStateGNN 的新穎圖形神經網路 (GNN) 模型,用於分析交通數據,並展示其在兩個關鍵任務中的效能:預測和重建。與將每個交通感測器視為個別圖形節點的典型 GNN 方法不同,DeepStateGNN 會根據各種相似性準則將感測器群集到較高層級的圖形節點中,稱為 Deep State 節點,這會在 Deep State 圖形中產生固定數量的節點。「Deep State」節點這個術語是文字遊戲,指的是隱藏的權力網路,就像這些節點一樣,秘密地獨立於可見感測器管理交通。這些 Deep State 節點由幾個相似性因素定義,包括空間接近性(例如,位於道路網路中附近的感測器)、功能相似性(例如,位於類似類型高速公路上的感測器)以及特定條件下的行為相似性(例如,雨中的交通行為)。這種群集方法允許動態和自適應節點分組,因為感測器可以屬於多個群集,而且群集可能會隨著時間演變。我們的實驗結果顯示,DeepStateGNN 提供了卓越的可擴充性和更快的訓練速度,同時也比競爭對手提供了更準確的結果。它有效地處理了大規模感測器網路,在交通預測和重建準確度方面都優於其他方法。 -Medical images are acquired at high resolutions with large fields of view in -order to capture fine-grained features necessary for clinical decision-making. -Consequently, training deep learning models on medical images can incur large -computational costs. In this work, we address the challenge of downsizing -medical images in order to improve downstream computational efficiency while -preserving clinically-relevant features. We introduce MedVAE, a family of six -large-scale 2D and 3D autoencoders capable of encoding medical images as -downsized latent representations and decoding latent representations back to -high-resolution images. We train MedVAE autoencoders using a novel two-stage -training approach with 1,052,730 medical images. 
Across diverse tasks obtained -from 20 medical image datasets, we demonstrate that (1) utilizing MedVAE latent -representations in place of high-resolution images when training downstream -models can lead to efficiency benefits (up to 70x improvement in throughput) -while simultaneously preserving clinically-relevant features and (2) MedVAE can -decode latent representations back to high-resolution images with high -fidelity. Our work demonstrates that large-scale, generalizable autoencoders -can help address critical efficiency challenges in the medical domain. Our code -is available at https://github.com/StanfordMIMI/MedVAE. +##### **Generative adversarial networks vs large language models: a comparative study on synthetic tabular data generation** +2502.14523v1 by Austin A. Barr, Robert Rozman, Eddie Guo -摘要:医学影像以高解析度和广阔的视野获取,以便捕捉临床决策所需的细微特征。因此,在医学影像上训练深度学习模型可能会产生巨大的计算成本。在这项工作中,我们解决了缩小医学影像以提高下游计算效率同时保留临床相关特征的挑战。我们介绍了 MedVAE,这是一个由六个大型 2D 和 3D 自动编码器组成的系列,能够将医学影像编码为缩小的潜在表示,并将潜在表示解码回高分辨率影像。我们使用一种新颖的两阶段训练方法,利用 1,052,730 张医学影像来训练 MedVAE 自动编码器。在从 20 个医学影像数据集获得的不同任务中,我们证明了 (1) 在训练下游模型时,利用 MedVAE 潜在表示代替高分辨率影像可以带来效率优势(吞吐量提高高达 70 倍),同时保留临床相关特征;(2) MedVAE 可以将潜在表示解码回高分辨率影像,且保真度高。我们的工作表明,大规模、可推广的自动编码器可以帮助解决医学领域的重大效率挑战。我们的代码可在 https://github.com/StanfordMIMI/MedVAE 获得。 +We propose a new framework for zero-shot generation of synthetic tabular +data. Using the large language model (LLM) GPT-4o and plain-language prompting, +we demonstrate the ability to generate high-fidelity tabular data without +task-specific fine-tuning or access to real-world data (RWD) for pre-training. +To benchmark GPT-4o, we compared the fidelity and privacy of LLM-generated +synthetic data against data generated with the conditional tabular generative +adversarial network (CTGAN), across three open-access datasets: Iris, Fish +Measurements, and Real Estate Valuation. 
Despite the zero-shot approach, GPT-4o +outperformed CTGAN in preserving means, 95% confidence intervals, bivariate +correlations, and data privacy of RWD, even at amplified sample sizes. Notably, +correlations between parameters were consistently preserved with appropriate +direction and strength. However, refinement is necessary to better retain +distributional characteristics. These findings highlight the potential of LLMs +in tabular data synthesis, offering an accessible alternative to generative +adversarial networks and variational autoencoders. -##### **Data-Constrained Synthesis of Training Data for De-Identification** -2502.14677v1 by Thomas Vakili, Aron Henriksson, Hercules Dalianis +摘要:我們提出一個新的架構,用於合成表格資料的零次學習產生。利用大型語言模型 (LLM) GPT-4o 和自然語言提示,我們證明了在沒有特定任務微調或取得真實世界資料 (RWD) 進行預訓練的情況下,產生高保真表格資料的能力。為了對 GPT-4o 進行基準測試,我們比較了 LLM 生成的合成資料與使用條件表格生成對抗網路 (CTGAN) 生成的資料在保真度和隱私性方面的表現,比較對象是三個開放取用的資料集:鳶尾花、魚類測量和房地產估價。儘管採用零次學習方法,GPT-4o 在保留平均值、95% 信賴區間、二元關聯和 RWD 的資料隱私方面都優於 CTGAN,即使在擴增的樣本大小下也是如此。值得注意的是,參數之間的關聯始終保持適當的方向和強度。然而,需要進行改進以更好地保留分佈特徵。這些發現突顯了 LLM 在表格資料合成中的潛力,為生成對抗網路和變異自動編碼器提供了可行的替代方案。 -Many sensitive domains -- such as the clinical domain -- lack widely -available datasets due to privacy risks. The increasing generative capabilities -of large language models (LLMs) have made synthetic datasets a viable path -forward. In this study, we domain-adapt LLMs to the clinical domain and -generate synthetic clinical texts that are machine-annotated with tags for -personally identifiable information using capable encoder-based NER models. The -synthetic corpora are then used to train synthetic NER models. The results show -that training NER models using synthetic corpora incurs only a small drop in -predictive performance. The limits of this process are investigated in a -systematic ablation study -- using both Swedish and Spanish data. Our analysis -shows that smaller datasets can be sufficient for domain-adapting LLMs for data -synthesis. 
Instead, the effectiveness of this process is almost entirely -contingent on the performance of the machine-annotating NER models trained -using the original data. +##### **MultiSlav: Using Cross-Lingual Knowledge Transfer to Combat the Curse of Multilinguality** +2502.14509v1 by Artur Kot, Mikołaj Koszowski, Wojciech Chojnowski, Mieszko Rutkowski, Artur Nowakowski, Kamil Guttmann, Mikołaj Pokrywka -摘要:許多敏感領域(例如臨床領域)由於隱私風險而缺乏廣泛可用的資料集。大型語言模型 (LLM) 不斷增強的生成能力已使合成資料集成為可行的途徑。在這項研究中,我們將領域適應 LLM 應用於臨床領域,並生成使用具備編碼器功能的 NER 模型以個人可識別資訊標籤進行機器標註的合成臨床文本。然後使用合成語料庫來訓練合成 NER 模型。結果顯示,使用合成語料庫訓練 NER 模型僅會導致預測效能略微下降。在系統消融研究中調查此程序的限制,同時使用瑞典語和西班牙語資料。我們的分析顯示,較小的資料集足以用於領域適應 LLM 以進行資料合成。相反地,此程序的有效性幾乎完全取決於使用原始資料訓練的機器標註 NER 模型的效能。 +Does multilingual Neural Machine Translation (NMT) lead to The Curse of the +Multlinguality or provides the Cross-lingual Knowledge Transfer within a +language family? In this study, we explore multiple approaches for extending +the available data-regime in NMT and we prove cross-lingual benefits even in +0-shot translation regime for low-resource languages. With this paper, we +provide state-of-the-art open-source NMT models for translating between +selected Slavic languages. We released our models on the HuggingFace Hub +(https://hf.co/collections/allegro/multislav-6793d6b6419e5963e759a683) under +the CC BY 4.0 license. Slavic language family comprises morphologically rich +Central and Eastern European languages. Although counting hundreds of millions +of native speakers, Slavic Neural Machine Translation is under-studied in our +opinion. Recently, most NMT research focuses either on: high-resource languages +like English, Spanish, and German - in WMT23 General Translation Task 7 out of +8 task directions are from or to English; massively multilingual models +covering multiple language groups; or evaluation techniques. 
-##### **ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation** -2502.14637v1 by Angxiao Yue, Zichong Wang, Hongteng Xu +摘要:多語言神經機器翻譯 (NMT) 是否會導致多語言的詛咒,或在語言家族中提供跨語言知識轉移?在這項研究中,我們探討了多種擴展 NMT 中可用資料範圍的方法,並證明了即使在低資源語言的零次學習翻譯中也有跨語言的優點。透過這篇論文,我們提供了最先進的開源 NMT 模型,用於翻譯選定的斯拉夫語。我們在 HuggingFace Hub (https://hf.co/collections/allegro/multislav-6793d6b6419e5963e759a683) 下根據 CC BY 4.0 授權發布我們的模型。斯拉夫語系包含形態豐富的中歐和東歐語言。儘管擁有數億母語人士,但我們認為斯拉夫神經機器翻譯的研究不足。最近,大多數 NMT 研究都專注於:高資源語言,例如英語、西班牙語和德語 - 在 WMT23 一般翻譯任務中,8 個任務方向中有 7 個來自英語或翻譯成英語;涵蓋多個語言群組的大規模多語言模型;或評估技術。 -Protein backbone generation plays a central role in de novo protein design -and is significant for many biological and medical applications. Although -diffusion and flow-based generative models provide potential solutions to this -challenging task, they often generate proteins with undesired designability and -suffer computational inefficiency. In this study, we propose a novel rectified -quaternion flow (ReQFlow) matching method for fast and high-quality protein -backbone generation. In particular, our method generates a local translation -and a 3D rotation from random noise for each residue in a protein chain, which -represents each 3D rotation as a unit quaternion and constructs its flow by -spherical linear interpolation (SLERP) in an exponential format. We train the -model by quaternion flow (QFlow) matching with guaranteed numerical stability -and rectify the QFlow model to accelerate its inference and improve the -designability of generated protein backbones, leading to the proposed ReQFlow -model. Experiments show that ReQFlow achieves state-of-the-art performance in -protein backbone generation while requiring much fewer sampling steps and -significantly less inference time (e.g., being 37x faster than RFDiffusion and -62x faster than Genie2 when generating a backbone of length 300), demonstrating -its effectiveness and efficiency. 
The code is available at -https://github.com/AngxiaoYue/ReQFlow. +##### **Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases** +2502.14507v1 by Rena Gao, Xuetong Wu, Tatsuki Kuribayashi, Mingrui Ye, Siya Qi, Carsten Roever, Yuanxing Liu, Zheng Yuan, Jey Han Lau + +This study evaluates Large Language Models' (LLMs) ability to simulate +non-native-like English use observed in human second language (L2) learners +interfered with by their native first language (L1). In dialogue-based +interviews, we prompt LLMs to mimic L2 English learners with specific L1s +(e.g., Japanese, Thai, Urdu) across seven languages, comparing their outputs to +real L2 learner data. Our analysis examines L1-driven linguistic biases, such +as reference word usage and avoidance behaviors, using information-theoretic +and distributional density measures. Results show that modern LLMs (e.g., +Qwen2.5, LLAMA3.3, DeepseekV3, GPT-4o) replicate L1-dependent patterns observed +in human L2 data, with distinct influences from various languages (e.g., +Japanese, Korean, and Mandarin significantly affect tense agreement, and Urdu +influences noun-verb collocations). Our results reveal the potential of LLMs +for L2 dialogue generation and evaluation for future educational applications. + +摘要:本研究評估大型語言模型 (LLM) 模擬非母語英語使用者的能力,這些使用者會受到母語 (L1) 干擾,而母語是第二語言 (L2) 學習者。在基於對話的訪談中,我們提示 LLM 模仿具有特定 L1(例如日語、泰語、烏爾都語)的 L2 英語學習者,並比較七種語言的輸出與真實的 L2 學習者資料。我們的分析使用資訊理論和分佈密度測量來檢視 L1 驅動的語言偏差,例如參考詞使用和避免行為。結果顯示,現代 LLM(例如 Qwen2.5、LLAMA3.3、DeepseekV3、GPT-4o)複製了在人類 L2 資料中觀察到的 L1 相依模式,並受到各種語言的明顯影響(例如,日語、韓語和普通話顯著影響時態一致性,而烏爾都語影響名詞動詞搭配)。我們的結果揭示了 LLM 在 L2 對話產生和評估方面的潛力,可供未來教育應用使用。 + +##### **PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models** +2502.14504v1 by Yu Meng, Kaiyuan Li, Chenran Huang, Chen Gao, Xinlei Chen, Yong Li, Xiaoping Zhang + +Large Vision-Language Models (LVLMs) have demonstrated remarkable +capabilities across a range of multimodal tasks. 
However, their inference +efficiency is constrained by the large number of visual tokens processed during +decoding. To address this challenge, we propose Per-Layer Per-Head Vision Token +Pruning (PLPHP), a two-level fine-grained pruning method including Layer-Level +Retention Rate Allocation and Head-Level Vision Token Pruning. Motivated by the +Vision Token Re-attention phenomenon across decoder layers, we dynamically +adjust token retention rates layer by layer. Layers that exhibit stronger +attention to visual information preserve more vision tokens, while layers with +lower vision attention are aggressively pruned. Furthermore, PLPHP applies +pruning at the attention head level, enabling different heads within the same +layer to independently retain critical context. Experiments on multiple +benchmarks demonstrate that PLPHP delivers an 18% faster decoding speed and +reduces the Key-Value Cache (KV Cache) size by over 50%, all at the cost of +0.46% average performance drop, while also achieving notable performance +improvements in multi-image tasks. These results highlight the effectiveness of +fine-grained token pruning and contribute to advancing the efficiency and +scalability of LVLMs. Our source code will be made publicly available. 
+ +摘要:大型視覺語言模型 (LVLMs) 已在各種多模態任務中展現出非凡的能力。然而,其推理效率受到解碼過程中處理的大量視覺符號的限制。為了應對這一挑戰,我們提出逐層逐頭視覺符號剪枝 (PLPHP),這是一種包括層級保留率分配和頭級視覺符號剪枝的兩級細粒度剪枝方法。受解碼器層中視覺符號重新關注現象的啟發,我們動態地逐層調整符號保留率。對視覺資訊表現出更強關注力的層保留更多視覺符號,而視覺關注力較低的層則被積極剪枝。此外,PLPHP 在關注頭級別應用剪枝,使同一層中的不同頭部可以獨立保留關鍵上下文。在多個基準測試上的實驗表明,PLPHP 的解碼速度提高了 18%,且將鍵值快取 (KV 快取) 大小減少了 50% 以上,而代價僅為平均效能下降 0.46%,同時還在多影像任務中實現了顯著的效能提升。這些結果突顯了細粒度符號剪枝的有效性,並有助於提升 LVLMs 的效率和可擴充性。我們的原始碼將公開提供。 + +##### **How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?** +2502.14502v1 by Sergey Pletenev, Maria Marina, Daniil Moskovskiy, Vasily Konovalov, Pavel Braslavski, Alexander Panchenko, Mikhail Salnikov + +The performance of Large Language Models (LLMs) on many tasks is greatly +limited by the knowledge learned during pre-training and stored in the model's +parameters. Low-rank adaptation (LoRA) is a popular and efficient training +technique for updating or domain-specific adaptation of LLMs. In this study, we +investigate how new facts can be incorporated into the LLM using LoRA without +compromising the previously learned knowledge. We fine-tuned +Llama-3.1-8B-instruct using LoRA with varying amounts of new knowledge. Our +experiments have shown that the best results are obtained when the training +data contains a mixture of known and new facts. However, this approach is still +potentially harmful because the model's performance on external +question-answering benchmarks declines after such fine-tuning. When the +training data is biased towards certain entities, the model tends to regress to +few overrepresented answers. In addition, we found that the model becomes more +confident and refuses to provide an answer in only few cases. These findings +highlight the potential pitfalls of LoRA-based LLM updates and underscore the +importance of training data composition and tuning parameters to balance new +knowledge integration and general model capabilities. 
-摘要:蛋白骨架生成在從頭蛋白質設計中扮演核心角色,且對於許多生物和醫學應用來說意義重大。儘管擴散和基於流的生成模型提供了解決此項挑戰性任務的潛在方案,但它們經常生成具有不受歡迎的可設計性的蛋白質,且遭受運算效率不彰之苦。在本研究中,我們提出了一種新穎的修正四元數流 (ReQFlow) 匹配方法,用於快速且高品質的蛋白質骨架生成。特別是,我們的模型會為蛋白質鏈中的每個殘基從隨機雜訊中生成一個局部平移和一個 3D 旋轉,將每個 3D 旋轉表示為單位四元數,並以指數格式透過球面線性插值 (SLERP) 建構其流。我們透過四元數流 (QFlow) 匹配訓練模型,並保證數值穩定性,並修正 QFlow 模型以加速其推論並改善生成蛋白質骨架的可設計性,進而提出建議的 ReQFlow 模型。實驗顯示,ReQFlow 在蛋白質骨架生成中達成最先進的效能,同時所需採樣步驟少得多,且推論時間大幅減少(例如,在生成長度為 300 的骨架時比 RFDiffusion 快 37 倍,比 Genie2 快 62 倍),證明其有效性和效率。程式碼可在 https://github.com/AngxiaoYue/ReQFlow 取得。 +摘要:大型語言模型 (LLM) 在許多任務上的表現受到預訓練期間學到的知識和儲存在模型參數中的知識的極大限制。低階適應 (LoRA) 是一種流行且有效的訓練技術,用於更新或 LLM 的特定領域適應。在這項研究中,我們探討如何使用 LoRA 將新事實納入 LLM,同時不損害先前學到的知識。我們使用不同數量的知識微調 Llama-3.1-8B-instruct。我們的實驗表明,當訓練資料包含已知和新事實的混合時,會獲得最佳結果。然而,這種方法仍然具有潛在的危害性,因為模型在外部問答基準上的表現會在這種微調後下降。當訓練資料偏向於某些實體時,模型傾向於回歸到少數過度表示的答案。此外,我們發現模型變得更有信心,並且在極少數情況下拒絕提供答案。這些發現突顯了基於 LoRA 的 LLM 更新的潛在缺點,並強調了訓練資料組成和調整參數以平衡新知識整合和一般模型能力的重要性。 -##### **MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models** -2502.14302v1 by Shrey Pandit, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, Ying Ding +##### **Towards a Perspectivist Turn in Argument Quality Assessment** +2502.14501v1 by Julia Romberg, Maximilian Maurer, Henning Wachsmuth, Gabriella Lapesa -Advancements in Large Language Models (LLMs) and their increasing use in -medical question-answering necessitate rigorous evaluation of their -reliability. A critical challenge lies in hallucination, where models generate -plausible yet factually incorrect outputs. In the medical domain, this poses -serious risks to patient safety and clinical decision-making. To address this, -we introduce MedHallu, the first benchmark specifically designed for medical -hallucination detection. MedHallu comprises 10,000 high-quality question-answer -pairs derived from PubMedQA, with hallucinated answers systematically generated -through a controlled pipeline. 
Our experiments show that state-of-the-art LLMs, -including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, -struggle with this binary hallucination detection task, with the best model -achieving an F1 score as low as 0.625 for detecting "hard" category -hallucinations. Using bidirectional entailment clustering, we show that -harder-to-detect hallucinations are semantically closer to ground truth. -Through experiments, we also show incorporating domain-specific knowledge and -introducing a "not sure" category as one of the answer categories improves the -precision and F1 scores by up to 38% relative to baselines. +The assessment of argument quality depends on well-established logical, +rhetorical, and dialectical properties that are unavoidably subjective: +multiple valid assessments may exist, there is no unequivocal ground truth. +This aligns with recent paths in machine learning, which embrace the +co-existence of different perspectives. However, this potential remains largely +unexplored in NLP research on argument quality. One crucial reason seems to be +the yet unexplored availability of suitable datasets. We fill this gap by +conducting a systematic review of argument quality datasets. We assign them to +a multi-layered categorization targeting two aspects: (a) What has been +annotated: we collect the quality dimensions covered in datasets and +consolidate them in an overarching taxonomy, increasing dataset comparability +and interoperability. (b) Who annotated: we survey what information is given +about annotators, enabling perspectivist research and grounding our +recommendations for future actions. To this end, we discuss datasets suitable +for developing perspectivist models (i.e., those containing individual, +non-aggregated annotations), and we showcase the importance of a controlled +selection of annotators in a pilot study. 
-摘要:大型語言模型 (LLM) 的進步及其在醫療問答中的使用日益增加,因此需要嚴格評估其可靠性。一個關鍵的挑戰在於幻覺,模型會產生看似合理但事實上不正確的輸出。在醫療領域,這對患者安全和臨床決策構成嚴重風險。為了解決此問題,我們推出了 MedHallu,這是第一個專門設計用於檢測醫療幻覺的基準。MedHallu 包含 10,000 個從 PubMedQA 衍生的高品質問答對,並透過受控管道系統性地產生幻覺答案。我們的實驗顯示,包括 GPT-4o、Llama-3.1 和經過醫學微調的 UltraMedical 在內的最新 LLM 難以執行這個二元幻覺檢測任務,最佳模型在檢測「困難」類別幻覺時達到的 F1 分數低至 0.625。使用雙向蘊涵聚類,我們表明較難檢測的幻覺在語義上更接近真實。透過實驗,我們還表明,納入特定領域的知識並將「不確定」類別作為其中一個答案類別,可以將精確度和 F1 分數相對於基線提高多達 38%。 +摘要:論證品質的評估取決於根深蒂固的邏輯、修辭和辯證屬性,這些屬性難免具有主觀性:可能存在多種有效的評估,沒有明確的真實依據。這與機器學習中最近的途徑一致,這些途徑接受了不同觀點的共存。然而,這種潛力在論證品質的 NLP 研究中仍然很大程度上未被探索。一個關鍵原因似乎是尚未探索合適的資料集的可用性。我們通過對論證品質資料集進行系統性回顧來填補這一空白。我們將它們分配到一個多層次分類,針對兩個方面:(a) 已註釋的內容:我們收集資料集中涵蓋的品質維度,並將它們整合到一個總體分類法中,提高資料集的可比性和互操作性。(b) 誰做了註釋:我們調查了關於註釋者的哪些資訊,使觀點主義研究成為可能,並為我們對未來行動的建議奠定基礎。為此,我們討論了適合開發觀點主義模型的資料集(即那些包含個別、非聚合註釋的資料集),並在試驗研究中展示了受控選擇註釋者的重要性。 -##### **EyeBench: A Call for More Rigorous Evaluation of Retinal Image Enhancement** -2502.14260v1 by Wenhui Zhu, Xuanzhao Dong, Xin Li, Yujian Xiong, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Zhangsihao Yang, Yi Su, Oana Dumitrascu, Yalin Wang +##### **MLGym: A New Framework and Benchmark for Advancing AI Research Agents** +2502.14499v1 by Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, Roberta Raileanu -Over the past decade, generative models have achieved significant success in -enhancement fundus images.However, the evaluation of these models still -presents a considerable challenge. A comprehensive evaluation benchmark for -fundus image enhancement is indispensable for three main reasons: 1) The -existing denoising metrics (e.g., PSNR, SSIM) are hardly to extend to -downstream real-world clinical research (e.g., Vessel morphology consistency). 
-2) There is a lack of comprehensive evaluation for both paired and unpaired -enhancement methods, along with the need for expert protocols to accurately -assess clinical value. 3) An ideal evaluation system should provide insights to -inform future developments of fundus image enhancement. To this end, we propose -a novel comprehensive benchmark, EyeBench, to provide insights that align -enhancement models with clinical needs, offering a foundation for future work -to improve the clinical relevance and applicability of generative models for -fundus image enhancement. EyeBench has three appealing properties: 1) -multi-dimensional clinical alignment downstream evaluation: In addition to -evaluating the enhancement task, we provide several clinically significant -downstream tasks for fundus images, including vessel segmentation, DR grading, -denoising generalization, and lesion segmentation. 2) Medical expert-guided -evaluation design: We introduce a novel dataset that promote comprehensive and -fair comparisons between paired and unpaired methods and includes a manual -evaluation protocol by medical experts. 3) Valuable insights: Our benchmark -study provides a comprehensive and rigorous evaluation of existing methods -across different downstream tasks, assisting medical experts in making informed -choices. Additionally, we offer further analysis of the challenges faced by -existing methods. The code is available at -\url{https://github.com/Retinal-Research/EyeBench} +We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for +evaluating and developing LLM agents on AI research tasks. This is the first +Gym environment for machine learning (ML) tasks, enabling research on +reinforcement learning (RL) algorithms for training such agents. MLGym-bench +consists of 13 diverse and open-ended AI research tasks from diverse domains +such as computer vision, natural language processing, reinforcement learning, +and game theory. 
Solving these tasks requires real-world AI research skills +such as generating new ideas and hypotheses, creating and processing data, +implementing ML methods, training models, running experiments, analyzing the +results, and iterating through this process to improve on a given task. We +evaluate a number of frontier large language models (LLMs) on our benchmarks +such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 +Pro. Our MLGym framework makes it easy to add new tasks, integrate and evaluate +models or agents, generate synthetic data at scale, as well as develop new +learning algorithms for training agents on AI research tasks. We find that +current frontier models can improve on the given baselines, usually by finding +better hyperparameters, but do not generate novel hypotheses, algorithms, +architectures, or substantial improvements. We open-source our framework and +benchmark to facilitate future research in advancing the AI research +capabilities of LLM agents. 
-摘要:在過去的十年中,生成模型在增強眼底影像方面取得了顯著的成功。然而,這些模型的評估仍然是一個相當大的挑戰。一個全面的眼底影像增強評估基準對於三個主要原因是不可或缺的:1) 現有的去噪指標(例如 PSNR、SSIM)很難擴展到下游的真實世界臨床研究(例如血管形態一致性)。2) 缺乏對配對和非配對增強方法的全面評估,以及需要專家協議來準確評估臨床價值。3) 一個理想的評估系統應該提供見解,以告知眼底影像增強的未來發展。為此,我們提出了一個新的綜合基準 EyeBench,以提供見解,將增強模型與臨床需求相結合,為未來的研究奠定基礎,以提高生成模型在眼底影像增強方面的臨床相關性和適用性。EyeBench 有三個吸引人的特性:1) 多維臨床對齊下游評估:除了評估增強任務外,我們還為眼底影像提供了幾個臨床上重要的下游任務,包括血管分割、DR 分級、去噪泛化和病灶分割。2) 醫學專家指導的評估設計:我們引入了一個新的數據集,以促進對配對和非配對方法的全面和公平比較,並包括由醫學專家進行的手動評估協議。3) 有價值的見解:我們的基準研究提供了對現有方法在不同下游任務中的全面且嚴格的評估,協助醫學專家做出明智的選擇。此外,我們還進一步分析了現有方法面臨的挑戰。程式碼可在 \url{https://github.com/Retinal-Research/EyeBench} 獲得 +摘要:我們推出 Meta MLGym 和 MLGym-Bench,一個用於評估和開發 AI 研究任務中 LLM 代理的新架構和基準。這是第一個用於機器學習 (ML) 任務的 Gym 環境,可針對訓練此類代理的強化學習 (RL) 演算法進行研究。MLGym-bench 包含 13 項來自不同領域的開放式 AI 研究任務,例如電腦視覺、自然語言處理、強化學習和博弈論。解決這些任務需要實際的 AI 研究技能,例如產生新想法和假設、建立和處理資料、實作 ML 方法、訓練模型、執行實驗、分析結果,並透過此流程反覆運算來改善特定任務。我們在基準上評估許多前沿大型語言模型 (LLM),例如 Claude-3.5-Sonnet、Llama-3.1 405B、GPT-4o、o1-preview 和 Gemini-1.5 Pro。我們的 MLGym 架構讓新增任務、整合和評估模型或代理、大規模產生合成資料,以及開發新的學習演算法以訓練 AI 研究任務中的代理變得容易。我們發現目前的邊界模型可以改善既定的基準,通常是透過尋找更好的超參數,但不會產生新穎的假設、演算法、架構或實質性的改進。我們開放原始碼架構和基準,以促進未來在提升 LLM 代理的 AI 研究能力方面的研究。 -##### **Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning** -2502.14086v1 by Cole Gawin, Yidan Sun, Mayank Kejriwal +##### **Stories that (are) Move(d by) Markets: A Causal Exploration of Market Shocks and Semantic Shifts across Different Partisan Groups** +2502.14497v1 by Felix Drinkall, Stefan Zohren, Michael McMahon, Janet B. Pierrehumbert -Large language models (LLMs) have achieved remarkable performance in -generating human-like text and solving reasoning tasks of moderate complexity, -such as question-answering and mathematical problem-solving. However, their -capabilities in tasks requiring deeper cognitive skills, such as common-sense -understanding and abstract reasoning, remain under-explored. 
In this paper, we -systematically evaluate abstract common-sense reasoning in LLMs using the -ConceptNet knowledge graph. We propose two prompting approaches: instruct -prompting, where models predict plausible semantic relationships based on -provided definitions, and few-shot prompting, where models identify relations -using examples as guidance. Our experiments with the gpt-4o-mini model show -that in instruct prompting, consistent performance is obtained when ranking -multiple relations but with substantial decline when the model is restricted to -predicting only one relation. In few-shot prompting, the model's accuracy -improves significantly when selecting from five relations rather than the full -set, although with notable bias toward certain relations. These results suggest -significant gaps still, even in commercially used LLMs' abstract common-sense -reasoning abilities, compared to human-level understanding. However, the -findings also highlight the promise of careful prompt engineering, based on -selective retrieval, for obtaining better performance. +Macroeconomic fluctuations and the narratives that shape them form a mutually +reinforcing cycle: public discourse can spur behavioural changes leading to +economic shifts, which then result in changes in the stories that propagate. We +show that shifts in semantic embedding space can be causally linked to +financial market shocks -- deviations from the expected market behaviour. +Furthermore, we show how partisanship can influence the predictive power of +text for market fluctuations and shape reactions to those same shocks. We also +provide some evidence that text-based signals are particularly salient during +unexpected events such as COVID-19, highlighting the value of language data as +an exogenous variable in economic forecasting. Our findings underscore the +bidirectional relationship between news outlets and market shocks, offering a +novel empirical approach to studying their effect on each other. 
-摘要:大型語言模型 (LLM) 在生成類人文本和解決中等複雜度推理任務方面取得了顯著的成果,例如問答和數學問題解決。然而,它們在需要更深層認知技能的任務中的能力,例如常識理解和抽象推理,仍然處於探索不足的階段。在本文中,我們使用 ConceptNet 知識圖系統地評估了 LLM 中的抽象常識推理。我們提出了兩種提示方法:指導提示,其中模型根據提供的定義預測合理的語義關係,以及少次提示,其中模型使用示例作為指導來識別關係。我們使用 gpt-4o-mini 模型進行的實驗表明,在指導提示中,在對多個關係進行排名時獲得了一致的性能,但在模型僅限於預測一個關係時大幅下降。在少次提示中,模型在從五個關係中選擇而不是從完整集合中選擇時,其準確性顯著提高,儘管對某些關係存在顯著偏差。這些結果表明,與人類層面的理解相比,即使在商業使用的 LLM 中,抽象常識推理能力仍然存在顯著差距。然而,這些發現也強調了基於選擇性檢索的仔細提示工程的希望,以獲得更好的性能。 +摘要:宏觀經濟波動與形塑它們的敘事形成一個相互強化的循環:公共論述可能激發導致經濟變化的行為改變,進而導致宣傳故事的改變。我們表明,語義嵌入空間的轉變可能與金融市場震盪(與預期的市場行為的偏差)有因果關係。此外,我們展示了黨派立場如何影響文字對市場波動的預測能力,以及如何形塑對這些震盪的反應。我們還提供了一些證據,證明在 COVID-19 等意外事件期間,基於文字的信號特別顯著,突顯了語言資料在經濟預測中作為外生變數的價值。我們的研究結果強調了新聞媒體與市場震盪之間的雙向關係,提供了一種研究它們對彼此影響的新穎實證方法。 -##### **Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging** -2502.14064v1 by Shansong Wang, Mojtaba Safari, Qiang Li, Chih-Wei Chang, Richard LJ Qiu, Justin Roper, David S. Yu, Xiaofeng Yang +##### **Enhancing Language Multi-Agent Learning with Multi-Agent Credit Re-Assignment for Interactive Environment Generalization** +2502.14496v1 by Zhitao He, Zijun Liu, Peng Li, May Fung, Ming Yan, Ji Zhang, Fei Huang, Yang Liu -Vision foundation models (VFMs) are pre-trained on extensive image datasets -to learn general representations for diverse types of data. These models can -subsequently be fine-tuned for specific downstream tasks, significantly -boosting performance across a broad range of applications. However, existing -vision foundation models that claim to be applicable to various radiology tasks -are mostly pre-trained on 3D computed tomography (CT), which benefits from the -availability of extensive 3D CT databases. Significant differences between CT -and magnetic resonance imaging (MRI) in imaging principles, signal -characteristics, and data distribution may hinder their practical performance -and versatility in MRI-specific applications. Here, we propose Triad, a vision -foundation model for 3D MRI. 
Triad adopts a widely used autoencoder -architecture to learn robust representations from 131,170 3D MRI volumes and -uses organ-independent imaging descriptions to constrain the semantic -distribution of the visual modality. The above pre-training dataset is called -Triad-131K, which is currently the largest 3D MRI pre-training dataset. We -evaluate Triad across three tasks, namely, organ/tumor segmentation, -organ/cancer classification, and medical image registration, in two data -modalities (within-domain and out-of-domain) settings using 25 downstream -datasets. By initializing models with Triad's pre-trained weights, nnUNet-Triad -improves segmentation performance by 6.88% compared to nnUNet-Scratch across 17 -datasets. Swin-B-Triad achieves a 3.97% improvement over Swin-B-Scratch in -classification tasks across five datasets. SwinUNETR-Triad improves by 4.00% -compared to SwinUNETR-Scratch in registration tasks across two datasets. Our -study demonstrates that pre-training can maximize performance when the data -modalities and organs of upstream and downstream tasks are consistent. +LLM-based agents have made significant advancements in interactive +environments, such as mobile operations and web browsing, and other domains +beyond computer using. Current multi-agent systems universally excel in +performance, compared to single agents, but struggle with generalization across +environments due to predefined roles and inadequate strategies for generalizing +language agents. The challenge of achieving both strong performance and good +generalization has hindered the progress of multi-agent systems for interactive +environments. 
To address these issues, we propose CollabUIAgents, a multi-agent +reinforcement learning framework with a novel multi-agent credit re-assignment +(CR) strategy, assigning process rewards with LLMs rather than +environment-specific rewards and learning with synthesized preference data, in +order to foster generalizable, collaborative behaviors among the role-free +agents' policies. Empirical results show that our framework improves both +performance and cross-environment generalizability of multi-agent systems. +Moreover, our 7B-parameter system achieves results on par with or exceed strong +closed-source models, and the LLM that guides the CR. We also provide insights +in using granular CR rewards effectively for environment generalization, and +accommodating trained LLMs in multi-agent systems. -摘要:視覺基礎模型 (VFM) 在廣泛的影像資料集上進行預訓練,以學習各種資料類型的通用表示。這些模型隨後可以針對特定的下游任務進行微調,大幅提升各種應用程式的效能。然而,現有的視覺基礎模型聲稱適用於各種放射學任務,但大多是針對 3D 電腦斷層攝影 (CT) 進行預訓練,這得利於廣泛的 3D CT 資料庫。CT 和磁振造影 (MRI) 在影像原理、訊號特性和資料分佈上的顯著差異,可能會阻礙其在 MRI 特定應用中的實際效能和多功能性。在此,我們提出 Triad,一個適用於 3D MRI 的視覺基礎模型。Triad 採用廣泛使用的自動編碼器架構,從 131,170 個 3D MRI 體積中學習穩健的表示,並使用與器官無關的影像描述來約束視覺模式的語義分佈。上述預訓練資料集稱為 Triad-131K,目前是最大的 3D MRI 預訓練資料集。我們在三個任務中評估 Triad,即器官/腫瘤分割、器官/癌症分類和醫學影像配準,在兩個資料模式(域內和域外)設定中使用 25 個下游資料集。透過使用 Triad 的預訓練權重初始化模型,nnUNet-Triad 在 17 個資料集中的分割效能比 nnUNet-Scratch 提升了 6.88%。Swin-B-Triad 在五個資料集的分類任務中,比 Swin-B-Scratch 提升了 3.97%。SwinUNETR-Triad 在兩個資料集的配準任務中,比 SwinUNETR-Scratch 提升了 4.00%。我們的研究證明,當上游和下游任務的資料模式和器官一致時,預訓練可以最大化效能。 +摘要:基於 LLM 的代理在互動式環境中取得重大進展,例如行動運算和網頁瀏覽,以及電腦使用以外的其他領域。與單一代理相比,目前的 Multi-Agent 系統在效能上普遍表現出色,但由於預先定義的角色和不適當的語言代理概化策略,導致難以跨環境概化。在互動式環境中,同時達成強大效能和良好概化的挑戰,阻礙了 Multi-Agent 系統的進展。為了解決這些問題,我們提出 CollabUIAgents,這是一個 Multi-Agent 強化學習架構,具備創新的 Multi-Agent 信用重新分配 (CR) 策略,使用 LLM 而不是特定於環境的獎勵來分配程序獎勵,並透過綜合偏好資料進行學習,以促進無角色代理政策之間可概化的協作行為。經驗結果顯示,我們的架構同時改善了 Multi-Agent 系統的效能和跨環境概化能力。此外,我們的 7B 參數系統在效能上與強大的閉源模型和引導 CR 的 LLM 相當或超越它們。我們也提供見解,說明如何有效地使用細粒化的 CR 獎勵來進行環境概化,以及如何在 Multi-Agent 系統中容納受過訓練的 LLM。 -##### **VITAL: A New Dataset for 
Benchmarking Pluralistic Alignment in Healthcare** -2502.13775v1 by Anudeex Shetty, Amin Beheshti, Mark Dras, Usman Naseem +##### **StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following** +2502.14494v1 by Jinnan Li, Jinzhe Li, Yue Wang, Yi Chang, Yuan Wu -Alignment techniques have become central to ensuring that Large Language -Models (LLMs) generate outputs consistent with human values. However, existing -alignment paradigms often model an averaged or monolithic preference, failing -to account for the diversity of perspectives across cultures, demographics, and -communities. This limitation is particularly critical in health-related -scenarios, where plurality is essential due to the influence of culture, -religion, personal values, and conflicting opinions. Despite progress in -pluralistic alignment, no prior work has focused on health, likely due to the -unavailability of publicly available datasets. To address this gap, we -introduce VITAL, a new benchmark dataset comprising 13.1K value-laden -situations and 5.4K multiple-choice questions focused on health, designed to -assess and benchmark pluralistic alignment methodologies. Through extensive -evaluation of eight LLMs of varying sizes, we demonstrate that existing -pluralistic alignment techniques fall short in effectively accommodating -diverse healthcare beliefs, underscoring the need for tailored AI alignment in -specific domains. This work highlights the limitations of current approaches -and lays the groundwork for developing health-specific alignment solutions. +Multi-turn instruction following capability constitutes a core competency of +large language models (LLMs) in real-world applications. Existing evaluation +benchmarks predominantly focus on fine-grained constraint satisfaction and +domain-specific capability assessment, yet overlook the crucial structural +dependency between dialogue turns that distinguishes multi-turn from +single-turn interactions. 
This structural dependency not only reflects user +intent but also establishes a second dimension for instruction following +evaluation beyond constraint satisfaction. To address this gap, we propose +StructFlowBench, a multi-turn instruction following benchmark with structural +flow modeling. The benchmark innovatively defines a structural flow framework +comprising six fundamental inter-turn relationships, which not only introduces +novel structural constraints for model evaluation but also serves as generation +parameters for creating customized dialogue flows tailored to specific +scenarios. Adopting established LLM-based automatic evaluation methodologies, +we conduct systematic evaluations of 13 leading open-source and closed-source +LLMs. Experimental results reveal significant deficiencies in current models' +comprehension of multi-turn dialogue structures. The code is available at +\url{https://github.com/MLGroupJLU/StructFlowBench}. -摘要:對齊技術已成為確保大型語言模型 (LLM) 產生與人類價值觀一致的輸出的核心。然而,現有的對齊範例通常會建模平均或單一的偏好,無法考量跨文化、人口統計和社群的不同觀點。此限制在與健康相關的場景中特別重要,因為在這種場景中,由於文化、宗教、個人價值觀和相互衝突的意見的影響,多元性是必要的。儘管多元對齊已取得進展,但沒有任何先前的工作專注於健康,這可能是因為缺乏公開可用的資料集。為了解決此差距,我們引入了 VITAL,這是一個新的基準資料集,包含 13.1K 個價值觀念的情境和 5.4K 個選擇題,專注於健康,旨在評估和基準多元對齊方法。透過對八個不同規模的 LLM 進行廣泛評估,我們證明現有的多元對齊技術無法有效適應不同的醫療保健信念,這強調了在特定領域中需要量身打造的 AI 對齊。這項工作突顯了當前方法的限制,並為開發特定於健康的對齊解決方案奠定了基礎。 +摘要:多輪指令遵循能力構成大型語言模型 (LLM) 在現實世界應用中的核心能力。現有的評估基準主要專注於細粒度的約束滿足和特定領域的能力評估,卻忽略了多輪與單輪互動之間區別的關鍵結構依賴性。這種結構依賴性不僅反映了使用者的意圖,也為指令遵循評估建立了超越約束滿足的第二個維度。為了解決這個差距,我們提出了 StructFlowBench,一個具有結構流建模的多輪指令遵循基準。該基準創新地定義了一個結構流框架,包含六個基本的回合間關係,這不僅引入了模型評估的新結構約束,還可用作生成參數,用於創建針對特定場景定制的對話流。採用已建立的基於 LLM 的自動評估方法,我們對 13 個領先的開源和閉源 LLM 進行了系統評估。實驗結果揭示了當前模型在理解多輪對話結構方面存在顯著缺陷。程式碼可在 \url{https://github.com/MLGroupJLU/StructFlowBench} 取得。 -##### **PeerQA: A Scientific Question Answering Dataset from Peer Reviews** -2502.13668v1 by Tim Baumgärtner, Ted Briscoe, Iryna Gurevych +##### **Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk** +2502.14491v1 by 
Elija Perrier -We present PeerQA, a real-world, scientific, document-level Question -Answering (QA) dataset. PeerQA questions have been sourced from peer reviews, -which contain questions that reviewers raised while thoroughly examining the -scientific article. Answers have been annotated by the original authors of each -paper. The dataset contains 579 QA pairs from 208 academic articles, with a -majority from ML and NLP, as well as a subset of other scientific communities -like Geoscience and Public Health. PeerQA supports three critical tasks for -developing practical QA systems: Evidence retrieval, unanswerable question -classification, and answer generation. We provide a detailed analysis of the -collected dataset and conduct experiments establishing baseline systems for all -three tasks. Our experiments and analyses reveal the need for -decontextualization in document-level retrieval, where we find that even simple -decontextualization approaches consistently improve retrieval performance -across architectures. On answer generation, PeerQA serves as a challenging -benchmark for long-context modeling, as the papers have an average size of 12k -tokens. Our code and data is available at https://github.com/UKPLab/peerqa. +Evaluating AI safety requires statistically rigorous methods and risk metrics +for understanding how the use of AI affects aggregated risk. However, much AI +safety literature focuses upon risks arising from AI models in isolation, +lacking consideration of how modular use of AI affects risk distribution of +workflow components or overall risk metrics. There is also a lack of +statistical grounding enabling sensitisation of risk models in the presence of +absence of AI to estimate causal contributions of AI. This is in part due to +the dearth of AI impact data upon which to fit distributions. In this work, we +address these gaps in two ways. 
First, we demonstrate how scenario modelling +(grounded in established statistical techniques such as Markov chains, copulas +and Monte Carlo simulation) can be used to model AI risk holistically. Second, +we show how lookalike distributions from phenomena analogous to AI can be used +to estimate AI impacts in the absence of directly observable data. We +demonstrate the utility of our methods for benchmarking cumulative AI risk via +risk analysis of logistic scenario simulations. -摘要:我們提出 PeerQA,一個真實世界、科學的、文件層級的問答 (QA) 資料集。PeerQA 問題來自於同行評審,其中包含審查者在徹底審查科學文章時提出的問題。答案是由每篇論文的原始作者註解的。此資料集包含來自 208 篇學術文章的 579 個 QA 對,其中大部分來自 ML 和 NLP,以及其他科學社群(例如地球科學和公共衛生)的子集。PeerQA 支援開發實用 QA 系統的三項重要任務:證據檢索、無解答問題分類和答案產生。我們提供收集到的資料集的詳細分析,並進行實驗,為所有三項任務建立基準系統。我們的實驗和分析揭示了在文件層級檢索中去脈絡化的必要性,我們發現即使是簡單的去脈絡化方法也能持續改善跨架構的檢索效能。在答案產生方面,PeerQA 是一個用於長脈絡建模的具挑戰性基準,因為論文的平均大小為 12k 個符號。我們的程式碼和資料可於 https://github.com/UKPLab/peerqa 取得。 +摘要:評估 AI 安全性需要嚴格的統計方法和風險指標,以了解 AI 的使用如何影響累積風險。然而,許多 AI 安全性文獻著重於 AI 模型孤立產生的風險,缺乏考量 AI 的模組化使用如何影響工作流程組件的風險分佈或整體風險指標。在有或沒有 AI 的情況下,統計基礎也缺乏讓風險模型敏感化的能力,以估計 AI 的因果關係貢獻。這部分是因為缺乏 AI 影響資料來擬合分佈。在這項研究中,我們以兩種方式解決這些差距。首先,我們展示情境建模(建立在已建立的統計技術上,例如馬可夫鏈、copula 和蒙地卡羅模擬)如何用於整體建模 AI 風險。其次,我們展示如何使用類似於 AI 現象的相似分佈來估計在沒有直接可觀察資料的情況下 AI 的影響。我們透過後勤情境模擬的風險分析,展示了我們的方法對於評量累積 AI 風險的效用。 -##### **Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs** -2502.13555v1 by Yushi Feng, Tsai Hor Chan, Guosheng Yin, Lequan Yu +##### **Temporal Misalignment and Probabilistic Neurons** +2502.14487v1 by Velibor Bojković, Xiaofeng Wu, Bin Gu -Data augmentation is necessary for graph representation learning due to the -scarcity and noise present in graph data. Most of the existing augmentation -methods overlook the context information inherited from the dataset as they -rely solely on the graph structure for augmentation. 
Despite the success of -some large language model-based (LLM) graph learning methods, they are mostly -white-box which require access to the weights or latent features from the -open-access LLMs, making them difficult to be democratized for everyone as -existing LLMs are mostly closed-source for commercial considerations. To -overcome these limitations, we propose a black-box context-driven graph data -augmentation approach, with the guidance of LLMs -- DemoGraph. Leveraging the -text prompt as context-related information, we task the LLM with generating -knowledge graphs (KGs), which allow us to capture the structural interactions -from the text outputs. We then design a dynamic merging schema to -stochastically integrate the LLM-generated KGs into the original graph during -training. To control the sparsity of the augmented graph, we further devise a -granularity-aware prompting strategy and an instruction fine-tuning module, -which seamlessly generates text prompts according to different granularity -levels of the dataset. Extensive experiments on various graph learning tasks -validate the effectiveness of our method over existing graph data augmentation -methods. Notably, our approach excels in scenarios involving electronic health -records (EHRs), which validates its maximal utilization of contextual -knowledge, leading to enhanced predictive performance and interpretability. +Spiking Neural Networks (SNNs) offer a more energy-efficient alternative to +Artificial Neural Networks (ANNs) by mimicking biological neural principles, +establishing them as a promising approach to mitigate the increasing energy +demands of large-scale neural models. However, fully harnessing the +capabilities of SNNs remains challenging due to their discrete signal +processing and temporal dynamics. ANN-SNN conversion has emerged as a practical +approach, enabling SNNs to achieve competitive performance on complex machine +learning tasks. 
In this work, we identify a phenomenon in the ANN-SNN +conversion framework, termed temporal misalignment, in which random spike +rearrangement across SNN layers leads to performance improvements. Based on +this observation, we introduce biologically plausible two-phase probabilistic +(TPP) spiking neurons, further enhancing the conversion process. We demonstrate +the advantages of our proposed method both theoretically and empirically +through comprehensive experiments on CIFAR-10/100, CIFAR10-DVS, and ImageNet +across a variety of architectures, achieving state-of-the-art results. -摘要:由於圖表資料的稀少性和雜訊,資料擴充對於圖表表示學習來說是必要的。現有的擴充方法大多忽略了從資料集中繼承的背景資訊,因為它們僅依賴於圖表的結構進行擴充。儘管一些大型語言模型 (LLM) 基於圖表學習方法獲得成功,但它們大多是白盒,需要存取開放式 LLM 的權重或潛在特徵,由於現有的 LLM 主要基於商業考量而封閉原始碼,因此難以讓所有人都能使用。為了克服這些限制,我們提出了一個黑盒背景驅動圖表資料擴充方法,在 LLM 的指導下——DemoGraph。利用文字提示作為與背景相關的資訊,我們讓 LLM 產生知識圖譜 (KG),這讓我們能夠從文字輸出中擷取結構化互動。然後,我們設計了一個動態合併模式,在訓練期間將 LLM 產生的 KG 隨機整合到原始圖表中。為了控制擴充圖表的稀疏性,我們進一步設計了一個粒度感知提示策略和一個指令微調模組,它可以根據資料集的不同粒度層級無縫產生文字提示。在各種圖表學習任務上的大量實驗驗證了我們的方法比現有的圖表資料擴充方法更有效。值得注意的是,我們的做法在涉及電子健康記錄 (EHR) 的場景中表現出色,這驗證了它對上下文知識的最大利用,從而提高了預測效能和可解釋性。 +摘要:脈衝神經網路 (SNN) 模仿生物神經原理,提供了一種比人工神經網路 (ANN) 更省能的替代方案,確立了它們作為緩解大型神經模型日益增長能耗需求的一種有前途的方法。然而,由於 SNN 的離散訊號處理和時間動態,要充分利用 SNN 的功能仍然具有挑戰性。ANN-SNN 轉換已經成為一種實用的方法,使 SNN 能夠在複雜機器學習任務中實現競爭性能。在這項工作中,我們在 ANN-SNN 轉換框架中發現了一種現象,稱為時間錯位,其中隨機脈衝在 SNN 層之間重新排列會導致性能提升。基於這一觀察,我們引入了生物學上合理的兩階段機率 (TPP) 脈衝神經元,進一步增強了轉換過程。我們通過在 CIFAR-10/100、CIFAR10-DVS 和 ImageNet 上對各種架構進行綜合實驗,從理論和經驗上證明了我們提出的方法的優點,取得了最先進的結果。 -##### **MobileViM: A Light-weight and Dimension-independent Vision Mamba for 3D Medical Image Analysis** -2502.13524v1 by Wei Dai, Steven Wang, Jun Liu +##### **How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation** +2502.14486v1 by Zhuohang Long, Siyuan Wang, Shujun Liu, Yuhang Lai, Xuanjing Huang, Zhongyu Wei -Efficient evaluation of three-dimensional (3D) medical images is crucial for -diagnostic and therapeutic practices in healthcare. 
Recent years have seen a -substantial uptake in applying deep learning and computer vision to analyse and -interpret medical images. Traditional approaches, such as convolutional neural -networks (CNNs) and vision transformers (ViTs), face significant computational -challenges, prompting the need for architectural advancements. Recent efforts -have led to the introduction of novel architectures like the ``Mamba'' model as -alternative solutions to traditional CNNs or ViTs. The Mamba model excels in -the linear processing of one-dimensional data with low computational demands. -However, Mamba's potential for 3D medical image analysis remains underexplored -and could face significant computational challenges as the dimension increases. -This manuscript presents MobileViM, a streamlined architecture for efficient -segmentation of 3D medical images. In the MobileViM network, we invent a new -dimension-independent mechanism and a dual-direction traversing approach to -incorporate with a vision-Mamba-based framework. MobileViM also features a -cross-scale bridging technique to improve efficiency and accuracy across -various medical imaging modalities. With these enhancements, MobileViM achieves -segmentation speeds exceeding 90 frames per second (FPS) on a single graphics -processing unit (i.e., NVIDIA RTX 4090). This performance is over 24 FPS faster -than the state-of-the-art deep learning models for processing 3D images with -the same computational resources. In addition, experimental evaluations -demonstrate that MobileViM delivers superior performance, with Dice similarity -scores reaching 92.72%, 86.69%, 80.46%, and 77.43% for PENGWIN, BraTS2024, -ATLAS, and Toothfairy2 datasets, respectively, which significantly surpasses -existing models. +Jailbreak attacks, where harmful prompts bypass generative models' built-in +safety, raise serious concerns about model vulnerability. 
While many defense +methods have been proposed, the trade-offs between safety and helpfulness, and +their application to Large Vision-Language Models (LVLMs), are not well +understood. This paper systematically examines jailbreak defenses by reframing +the standard generation task as a binary classification problem to assess model +refusal tendencies for both harmful and benign queries. We identify two key +defense mechanisms: safety shift, which increases refusal rates across all +queries, and harmfulness discrimination, which improves the model's ability to +distinguish between harmful and benign inputs. Using these mechanisms, we +develop two ensemble defense strategies-inter-mechanism ensembles and +intra-mechanism ensembles-to balance safety and helpfulness. Experiments on the +MM-SafetyBench and MOSSBench datasets with LLaVA-1.5 models show that these +strategies effectively improve model safety or optimize the trade-off between +safety and helpfulness. -摘要:有效評估三維 (3D) 醫學影像對於醫療保健中的診斷和治療實務至關重要。近年來,將深度學習和電腦視覺應用於分析和詮釋醫學影像的應用大幅增加。傳統方法,例如卷積神經網路 (CNN) 和視覺Transformer (ViT),面臨重大的運算挑戰,促使需要架構上的進步。最近的努力已導致引進創新的架構,例如「Mamba」模型,作為傳統 CNN 或 ViT 的替代解決方案。Mamba 模型擅長以低運算需求進行一維資料的線性處理。然而,Mamba 在 3D 醫學影像分析方面的潛力仍未被充分探索,並且隨著維度的增加可能會面臨重大的運算挑戰。本手稿提出 MobileViM,這是一種簡化的架構,可有效分割 3D 醫學影像。在 MobileViM 網路中,我們發明了一種新的與維度無關的機制和雙向遍歷方法,以與基於視覺 Mamba 的架構結合。MobileViM 還具備跨尺度橋接技術,以提高各種醫學影像模式的效率和準確性。透過這些增強功能,MobileViM 在單一顯示卡 (即 NVIDIA RTX 4090) 上達到了每秒超過 90 幀 (FPS) 的分割速度。此效能比現有最先進的深度學習模型快了超過 24 FPS,這些模型使用相同的運算資源處理 3D 影像。此外,實驗評估證明 MobileViM 提供了卓越的效能,Dice 相似性評分對於 PENGWIN、BraTS2024、ATLAS 和 Toothfairy2 資料集分別達到 92.72%、86.69%、80.46% 和 77.43%,顯著超越現有模型。 +摘要:越獄攻擊,其中有害提示繞過生成模型內建的安全機制,引發了對模型漏洞的嚴重疑慮。雖然已提出許多防禦方法,但安全性與有益性之間的取捨,以及它們在大型視覺語言模型 (LVLMs) 中的應用,尚未得到充分理解。本文透過將標準生成任務重新定義為二元分類問題,系統性地檢視越獄防禦,以評估模型對有害和良性查詢的拒絕傾向。我們找出兩種關鍵的防禦機制:安全轉移,這會提高所有查詢的拒絕率,以及危害區分,這會提升模型區分有害和良性輸入的能力。使用這些機制,我們開發出兩種整體防禦策略,機制間整體和機制內整體,以平衡安全性與有益性。在使用 LLaVA-1.5 模型的 MM-SafetyBench 和 MOSSBench 資料集上進行的實驗顯示,這些策略有效地提升了模型安全性,或最佳化了安全性與有益性之間的取捨。 -##### 
**Unlocking Multimodal Integration in EHRs: A Prompt Learning Framework for Language and Time Series Fusion** -2502.13509v1 by Shuai Niu, Jing Ma, Hongzhan Lin, Liang Bai, Zhihua Wang, Wei Bi, Yida Xu, Guo Li, Xian Yang +##### **NLoRA: Nyström-Initiated Low-Rank Adaptation for Large Language Models** +2502.14482v1 by Chenlu Guo, Yuan Wu, Yi Chang -Large language models (LLMs) have shown remarkable performance in -vision-language tasks, but their application in the medical field remains -underexplored, particularly for integrating structured time series data with -unstructured clinical notes. In clinical practice, dynamic time series data -such as lab test results capture critical temporal patterns, while clinical -notes provide rich semantic context. Merging these modalities is challenging -due to the inherent differences between continuous signals and discrete text. -To bridge this gap, we introduce ProMedTS, a novel self-supervised multimodal -framework that employs prompt-guided learning to unify these heterogeneous data -types. Our approach leverages lightweight anomaly detection to generate anomaly -captions that serve as prompts, guiding the encoding of raw time series data -into informative embeddings. These embeddings are aligned with textual -representations in a shared latent space, preserving fine-grained temporal -nuances alongside semantic insights. Furthermore, our framework incorporates -tailored self-supervised objectives to enhance both intra- and inter-modal -alignment. We evaluate ProMedTS on disease diagnosis tasks using real-world -datasets, and the results demonstrate that our method consistently outperforms -state-of-the-art approaches. +Parameter-efficient fine-tuning (PEFT) is essential for adapting large +language models (LLMs), with low-rank adaptation (LoRA) being the most popular +approach. 
However, LoRA suffers from slow convergence, and some recent LoRA +variants, such as PiSSA, primarily rely on Singular Value Decomposition (SVD) +for initialization, leading to expensive computation. To mitigate these +problems, we use the Nyström method, which follows a three-matrix +manipulation. We first introduce StructuredLoRA (SLoRA), which investigates +adding a small intermediate matrix between the low-rank matrices A and B. +Secondly, we propose NyströmLoRA (NLoRA), which leverages Nyström-based +initialization for SLoRA to improve its effectiveness and efficiency. Finally, +we propose IntermediateTune (IntTune), which explores fine-tuning exclusively +on the intermediate matrix of NLoRA to further boost LLM efficiency. We +evaluate our methods on five natural language generation (NLG) tasks and eight +natural language understanding (NLU) tasks. On GSM8K, SLoRA and NLoRA achieve +accuracies of 56.48% and 57.70%, surpassing LoRA by 33.52% and 36.41%, with +only 3.67 million additional trainable parameters. IntTune improves average NLG +performance over LoRA by 7.45% while using only 1.25% of its parameters. These +results demonstrate the efficiency and effectiveness of our approach in +enhancing model performance with minimal parameter overhead. 
-摘要:大型語言模型(LLM)在視覺語言任務中表現出色,但其在醫療領域的應用仍未得到充分探索,特別是在將結構化時間序列數據與非結構化臨床筆記整合方面。在臨床實務中,動態時間序列數據(例如實驗室檢驗結果)會擷取關鍵的時間模式,而臨床筆記則提供豐富的語意脈絡。由於連續訊號與離散文字之間的固有差異,合併這些方式具有挑戰性。為了彌補這個差距,我們引入了 ProMedTS,這是一個新穎的自監督多模態框架,採用提示引導學習來統一這些異質化的數據類型。我們的做法利用輕量級異常偵測來產生異常標題,作為提示,引導將原始時間序列數據編碼成資訊性的嵌入。這些嵌入與共享潛在空間中的文字表示對齊,同時保留細微的時間差異和語意見解。此外,我們的框架納入了客製化的自監督目標,以增強模態內和模態間對齊。我們在疾病診斷任務中使用真實世界的數據集評估 ProMedTS,結果表明,我們的模型始終優於最先進的方法。 +摘要:參數高效微調 (PEFT) 對於調整大型語言模型 (LLM) 至關重要,其中低秩調整 (LoRA) 是最受歡迎的方法。然而,LoRA 存在收斂速度慢的問題,而一些最近的 LoRA 變體,例如 PiSSA,主要依賴奇異值分解 (SVD) 進行初始化,導致運算成本高昂。為了減輕這些問題,我們使用了 Nyström 方法,它遵循三矩陣操作。我們首先介紹 StructuredLoRA (SLoRA),它研究在低秩矩陣 A 和 B 之間添加一個小的中間矩陣。其次,我們提出了 NyströmLoRA (NLoRA),它利用基於 Nyström 的初始化方法為 SLoRA 提升其有效性和效率。最後,我們提出了 IntermediateTune (IntTune),它探討了僅對 NLoRA 的中間矩陣進行微調,以進一步提升 LLM 效率。我們在五項自然語言生成 (NLG) 任務和八項自然語言理解 (NLU) 任務上評估了我們的這些方法。在 GSM8K 上,SLoRA 和 NLoRA 分別達到了 56.48% 和 57.70% 的準確率,比 LoRA 高出 33.52% 和 36.41%,而僅增加了 367 萬個可訓練參數。IntTune 在僅使用 LoRA 1.25% 的參數的情況下,將平均 NLG 效能提升了 7.45%。這些結果證明了我們的方法在以最少的參數開銷提升模型效能方面的效率和有效性。 -##### **Towards a perturbation-based explanation for medical AI as differentiable programs** -2502.14001v1 by Takeshi Abe, Yoshiyuki Asai +##### **Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression** +2502.14477v1 by Haoyu Wang, Tong Teng, Tianyu Guo, An Xiao, Duyu Tang, Hanting Chen, Yunhe Wang -Recent advancement in machine learning algorithms reaches a point where -medical devices can be equipped with artificial intelligence (AI) models for -diagnostic support and routine automation in clinical settings. In medicine and -healthcare, there is a particular demand for sufficient and objective -explainability of the outcome generated by AI models. However, AI models are -generally considered as black boxes due to their complexity, and the -computational process leading to their response is often opaque. 
Although -several methods have been proposed to explain the behavior of models by -evaluating the importance of each feature in discrimination and prediction, -they may suffer from biases and opacities arising from the scale and sampling -protocol of the dataset used for training or testing. To overcome the -shortcomings of existing methods, we explore an alternative approach to provide -an objective explanation of AI models that can be defined independently of the -learning process and does not require additional data. As a preliminary study -for this direction of research, this work examines a numerical availability of -the Jacobian matrix of deep learning models that measures how stably a model -responses against small perturbations added to the input. The indicator, if -available, are calculated from a trained AI model for a given target input. -This is a first step towards a perturbation-based explanation, which will -assist medical practitioners in understanding and interpreting the response of -the AI model in its clinical application. +Handling long-context sequences efficiently remains a significant challenge +in large language models (LLMs). Existing methods for token selection in +sequence extrapolation either employ a permanent eviction strategy or select +tokens by chunk, which may lead to the loss of critical information. We propose +Efficient Selective Attention (ESA), a novel approach that extends context +length by efficiently selecting the most critical tokens at the token level to +compute attention. ESA reduces the computational complexity of token selection +by compressing query and key vectors into lower-dimensional representations. We +evaluate ESA on long sequence benchmarks with maximum lengths up to 256k using +open-source LLMs with context lengths of 8k and 32k. 
ESA outperforms other +selective attention methods, especially in tasks requiring the retrieval of +multiple pieces of information, achieving comparable performance to +full-attention extrapolation methods across various tasks, with superior +results in certain tasks. -摘要:機器學習演算法的最新進展已達到一個階段,醫療裝置可以配備人工智慧 (AI) 模型,以在臨床環境中提供診斷支援和例行自動化。在醫學和保健領域,對於 AI 模型產生的結果有足夠且客觀的可解釋性有特別的需求。然而,由於 AI 模型的複雜性,它們通常被視為黑盒子,而導致其反應的運算過程通常是不透明的。儘管已經提出多種方法來解釋模型的行為,方法是評估每個特徵在判別和預測中的重要性,但它們可能會受到訓練或測試所用資料集的規模和抽樣協定的偏差和不透明性的影響。為了克服現有方法的缺點,我們探索一種替代方法,以提供 AI 模型的客觀解釋,這種方法可以獨立於學習過程定義,而且不需要額外的資料。作為這個研究方向的初步研究,這項工作探討了深度學習模型的雅可比矩陣的數值可用性,它衡量了模型對輸入中新增的小擾動的穩定反應程度。如果可用,指標會從訓練好的 AI 模型計算得出,以取得給定的目標輸入。這是基於擾動的解釋的第一步,它將協助醫療從業人員了解和詮釋 AI 模型在其臨床應用中的反應。 +摘要:在大型語言模型 (LLM) 中,有效處理長語境序列仍然是一項重大挑戰。現有的序列外推標記選擇方法採用永久驅逐策略或按塊選擇標記,這可能會導致關鍵資訊遺失。我們提出高效選擇性注意 (ESA),這是一種新穎的方法,它透過在標記層級有效選擇最關鍵的標記來計算注意,從而延伸語境長度。ESA 透過將查詢和關鍵向量壓縮成較低維度的表示,來降低標記選擇的運算複雜度。我們使用開放原始碼 LLM,在語境長度為 8k 和 32k 的情況下,對長序列基準進行評估,最大長度達 256k。ESA 的表現優於其他選擇性注意方法,特別是在需要擷取多條資訊的任務中,在各種任務中達到與全注意外推方法相當的效能,並且在某些任務中獲得更佳的結果。 -##### **RGAR: Recurrence Generation-augmented Retrieval for Factual-aware Medical Question Answering** -2502.13361v1 by Sichu Liang, Linhai Zhang, Hongyu Zhu, Wenwen Wang, Yulan He, Deyu Zhou +##### **Argument-Based Comparative Question Answering Evaluation Benchmark** +2502.14476v1 by Irina Nikishina, Saba Anwar, Nikolay Dolgov, Maria Manina, Daria Ignatenko, Viktor Moskvoretskii, Artem Shelmanov, Tim Baldwin, Chris Biemann -Medical question answering requires extensive access to specialized -conceptual knowledge. The current paradigm, Retrieval-Augmented Generation -(RAG), acquires expertise medical knowledge through large-scale corpus -retrieval and uses this knowledge to guide a general-purpose large language -model (LLM) for generating answers. 
However, existing retrieval approaches -often overlook the importance of factual knowledge, which limits the relevance -of retrieved conceptual knowledge and restricts its applicability in real-world -scenarios, such as clinical decision-making based on Electronic Health Records -(EHRs). This paper introduces RGAR, a recurrence generation-augmented retrieval -framework that retrieves both relevant factual and conceptual knowledge from -dual sources (i.e., EHRs and the corpus), allowing them to interact and refine -each another. Through extensive evaluation across three factual-aware medical -question answering benchmarks, RGAR establishes a new state-of-the-art -performance among medical RAG systems. Notably, the Llama-3.1-8B-Instruct model -with RGAR surpasses the considerably larger, RAG-enhanced GPT-3.5. Our findings -demonstrate the benefit of extracting factual knowledge for retrieval, which -consistently yields improved generation quality. +In this paper, we aim to solve the problems standing in the way of automatic +comparative question answering. To this end, we propose an evaluation framework +to assess the quality of comparative question answering summaries. We formulate +15 criteria for assessing comparative answers created using manual annotation +and annotation from 6 large language models and two comparative question +answering datasets. We perform our tests using several LLMs and manual +annotation under different settings and demonstrate the constituency of both +evaluations. Our results demonstrate that the Llama-3 70B Instruct model +demonstrates the best results for summary evaluation, while GPT-4 is the best +for answering comparative questions. All used data, code, and evaluation +results are publicly +available at https://anonymous.4open.science/r/cqa-evaluation-benchmark-4561/README.md. 
-摘要:醫療問題解答需要大量取得專業概念知識。目前的典範,檢索增強生成(RAG),透過大規模語料庫檢索取得專業醫療知識,並使用此知識引導通用大型語言模型(LLM)來產生答案。然而,現有的檢索方法經常忽略事實知識的重要性,這會限制檢索到的概念知識的相關性,並限制其在現實世界情境中的適用性,例如基於電子健康記錄(EHR)的臨床決策制定。本文介紹 RGAR,一個遞迴生成增強檢索架構,從雙重來源(即 EHR 和語料庫)檢索相關的事實和概念知識,讓它們互動並互相精煉。透過在三個事實感知醫療問題解答基準上進行廣泛評估,RGAR 在醫療 RAG 系統中建立了新的最先進效能。值得注意的是,採用 RGAR 的 Llama-3.1-8B-Instruct 模型超越了規模大得多的 RAG 增強型 GPT-3.5。我們的研究結果證明了提取事實知識以進行檢索的好處,這會持續產生改善的生成品質。 +摘要:在本文中,我們旨在解決阻礙自動比較性問題解答的難題。為此,我們提出一個評估框架,用於評估比較性問題解答摘要的品質。我們制定了 15 項準則,用於評估使用手動標註和來自 6 個大型語言模型和兩個比較性問題解答資料集的標註所建立的比較性答案。我們在不同的設定下使用幾個 LLM 和手動標註執行測試,並展示兩種評估的組成。我們的結果表明,Llama-3 70B Instruct 模型在摘要評估中表現最佳,而 GPT-4 在回答比較性問題方面表現最佳。所有使用過的資料、程式碼和評估結果均公開可用\footnote{\url{https://anonymous.4open.science/r/cqa-evaluation-benchmark-4561/README.md}}。 -##### **Adjust for Trust: Mitigating Trust-Induced Inappropriate Reliance on AI Assistance** -2502.13321v1 by Tejas Srinivasan, Jesse Thomason +##### **Enhancing Smart Environments with Context-Aware Chatbots using Large Language Models** +2502.14469v1 by Aurora Polo-Rodríguez, Laura Fiorini, Erika Rovini, Filippo Cavallo, Javier Medina-Quero -Trust biases how users rely on AI recommendations in AI-assisted -decision-making tasks, with low and high levels of trust resulting in increased -under- and over-reliance, respectively. We propose that AI assistants should -adapt their behavior through trust-adaptive interventions to mitigate such -inappropriate reliance. For instance, when user trust is low, providing an -explanation can elicit more careful consideration of the assistant's advice by -the user. In two decision-making scenarios -- laypeople answering science -questions and doctors making medical diagnoses -- we find that providing -supporting and counter-explanations during moments of low and high trust, -respectively, yields up to 38% reduction in inappropriate reliance and 20% -improvement in decision accuracy. We are similarly able to reduce over-reliance -by adaptively inserting forced pauses to promote deliberation. 
Our results -highlight how AI adaptation to user trust facilitates appropriate reliance, -presenting exciting avenues for improving human-AI collaboration. +This work presents a novel architecture for context-aware interactions within +smart environments, leveraging Large Language Models (LLMs) to enhance user +experiences. Our system integrates user location data obtained through UWB tags +and sensor-equipped smart homes with real-time human activity recognition (HAR) +to provide a comprehensive understanding of user context. This contextual +information is then fed to an LLM-powered chatbot, enabling it to generate +personalised interactions and recommendations based on the user's current +activity and environment. This approach moves beyond traditional static chatbot +interactions by dynamically adapting to the user's real-time situation. A case +study conducted from a real-world dataset demonstrates the feasibility and +effectiveness of our proposed architecture, showcasing its potential to create +more intuitive and helpful interactions within smart homes. The results +highlight the significant benefits of integrating LLM with real-time activity +and location data to deliver personalised and contextually relevant user +experiences. 
-摘要:信任偏見影響使用者在 AI 輔助決策任務中如何依賴 AI 建議,信任程度低和高分別導致依賴不足和過度依賴。我們建議 AI 助理應透過信任適應式干預調整其行為,以減輕這種不適當的依賴。例如,當使用者信任度低時,提供解釋可以引發使用者更仔細地考慮助理的建議。在兩種決策情境中——外行人回答科學問題和醫生進行醫療診斷——我們發現,分別在信任度低和高的時刻提供支持性和反向解釋,可以將不適當的依賴降低多達 38%,並將決策準確性提高 20%。我們同樣能夠透過適應性地插入強制暫停來促進審議,以減少過度依賴。我們的結果強調 AI 如何適應使用者信任以促進適當的依賴,為改善人機協作提供了令人興奮的途徑。 +摘要:本研究提出了一種創新的架構,用於在智慧環境中進行情境感知互動,利用大型語言模型 (LLM) 來提升使用者體驗。我們的系統整合了透過超寬頻標籤取得的使用者位置資料,以及配備感測器的智慧家庭,並具備即時人類活動辨識 (HAR),以全面了解使用者的情境。接著,將這些情境資訊輸入 LLM 驅動的聊天機器人,讓它能根據使用者的當前活動和環境產生個人化的互動和建議。這種方法超越了傳統的靜態聊天機器人互動,能動態地適應使用者的即時狀況。從真實世界資料集進行的案例研究,展示了我們提出的架構的可行性和有效性,突顯出它在智慧家庭中創造更直覺且有用的互動的潛力。結果突顯了將 LLM 與即時活動和位置資料整合,以提供個人化且與情境相關的使用者體驗的顯著優點。 -##### **Prediction of Clinical Complication Onset using Neural Point Processes** -2502.13290v1 by Sachini Weerasekara, Sagar Kamarthi, Jacqueline Isaacs +##### **Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing** +2502.14458v1 by Aviv Bick, Tobias Katsch, Nimit Sohoni, Arjun Desai, Albert Gu -Predicting medical events in advance within critical care settings is -paramount for patient outcomes and resource management. Utilizing predictive -models, healthcare providers can anticipate issues such as cardiac arrest, -sepsis, or respiratory failure before they manifest. Recently, there has been a -surge in research focusing on forecasting adverse medical event onsets prior to -clinical manifestation using machine learning. However, while these models -provide temporal prognostic predictions for the occurrence of a specific -adverse event of interest within defined time intervals, their interpretability -often remains a challenge. In this work, we explore the applicability of neural -temporal point processes in the context of adverse event onset prediction, with -the aim of explaining clinical pathways and providing interpretable insights. -Our experiments span six state-of-the-art neural point processes and six -critical care datasets, each focusing on the onset of distinct adverse events. 
-This work represents a novel application class of neural temporal point -processes in event prediction. +We introduce Llamba, a family of efficient recurrent language models +distilled from Llama-3.x into the Mamba architecture. The series includes +Llamba-1B, Llamba-3B, and Llamba-8B, which achieve higher inference throughput +and handle significantly larger batch sizes than Transformer-based models while +maintaining comparable benchmark performance. Furthermore, Llamba demonstrates +the effectiveness of cross-architecture distillation using MOHAWK (Bick et al., +2024), achieving these results with less than 0.1% of the training data +typically used for models of similar size. To take full advantage of their +efficiency, we provide an optimized implementation of Llamba for +resource-constrained devices such as smartphones and edge platforms, offering a +practical and memory-efficient alternative to Transformers. Overall, Llamba +improves the tradeoff between speed, memory efficiency, and performance, making +high-quality language models more accessible. 
-摘要:在重症監護環境中預先預測醫療事件對於患者的預後和資源管理至關重要。利用預測模型,醫療保健提供者可以在心臟驟停、敗血症或呼吸衰竭等問題發生之前預測到這些問題。最近,專注於在臨床表現之前使用機器學習預測不良醫療事件發生的研究激增。然而,儘管這些模型為特定不良事件在定義的時間間隔內發生提供了時間預後預測,但它們的可解釋性仍然是一個挑戰。在這項工作中,我們探討了神經時間點過程在不良事件發作預測中的適用性,目的是解釋臨床途徑並提供可解釋的見解。我們的實驗涵蓋了六種最先進的神經點過程和六個重症監護資料集,每個資料集都專注於不同不良事件的發作。這項工作代表了神經時間點過程在事件預測中的一種新的應用類別。 +摘要:我們推出 Llamba,一種高效的遞迴語言模型家族,從 Llama-3.x 萃取到 Mamba 架構中。該系列包含 Llamba-1B、Llamba-3B 和 Llamba-8B,它們比基於 Transformer 的模型實現更高的推理吞吐量,並處理顯著更大的批次大小,同時保持可比較的基準效能。此外,Llamba 證明了使用 MOHAWK(Bick 等人,2024 年)進行跨架構萃取的有效性,在訓練資料不到類似大小模型通常使用的 0.1% 的情況下實現了這些結果。為了充分利用其效率,我們為 Llamba 提供了針對資源受限裝置(例如智慧型手機和邊緣平台)的最佳化實作,提供實用且記憶體效率高的 Transformer 替代方案。總體而言,Llamba 改善了速度、記憶體效率和效能之間的權衡,讓高品質語言模型更易於取得。 -##### **SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?** -2502.13233v1 by Yucheng Shi, Tianze Yang, Canyu Chen, Quanzheng Li, Tianming Liu, Xiang Li, Ninghao Liu +##### **Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization** +2502.14456v1 by Ran Ding, Ziyu Zhang, Ying Zhu, Ziqian Kong, Peilan Xu -Large Language Models (LLMs) have shown remarkable capabilities in general -domains but often struggle with tasks requiring specialized knowledge. -Conventional Retrieval-Augmented Generation (RAG) techniques typically retrieve -external information from static knowledge bases, which can be outdated or -incomplete, missing fine-grained clinical details essential for accurate -medical question answering. In this work, we propose SearchRAG, a novel -framework that overcomes these limitations by leveraging real-time search -engines. Our method employs synthetic query generation to convert complex -medical questions into search-engine-friendly queries and utilizes -uncertainty-based knowledge selection to filter and incorporate the most -relevant and informative medical knowledge into the LLM's input. 
Experimental -results demonstrate that our method significantly improves response accuracy in -medical question answering tasks, particularly for complex questions requiring -detailed and up-to-date knowledge. +To enhance tourists' experiences and immersion, this paper proposes a +narrative-driven travel planning framework called NarrativeGuide, which +generates a geoculturally-grounded narrative script for travelers, offering a +novel, role-playing experience for their journey. In the initial stage, +NarrativeGuide constructs a knowledge graph for attractions within a city, then +configures the worldview, character setting, and exposition based on the +knowledge graph. Using this foundation, the knowledge graph is combined to +generate an independent scene unit for each attraction. During the itinerary +planning stage, NarrativeGuide models narrative-driven travel planning as an +optimization problem, utilizing a genetic algorithm (GA) to refine the +itinerary. Before evaluating the candidate itinerary, transition scripts are +generated for each pair of adjacent attractions, which, along with the scene +units, form a complete script. The weighted sum of script coherence, travel +time, and attraction scores is then used as the fitness value to update the +candidate solution set. Experimental results across four cities, i.e., Nanjing +and Yangzhou in China, Paris in France, and Berlin in Germany, demonstrate +significant improvements in narrative coherence and cultural fit, alongside a +notable reduction in travel time and an increase in the quality of visited +attractions. Our study highlights that incorporating external evolutionary +optimization effectively addresses the limitations of large language models in +travel planning. Our codes are available at +https://github.com/Evan01225/Narrative-Driven-Travel-Planning. 
-摘要:大型語言模型 (LLM) 在一般領域展現出驚人的能力,但經常在需要專業知識的任務中掙扎。 -傳統的檢索增強生成 (RAG) 技術通常從靜態知識庫中檢索外部資訊,這些資訊可能過時或不完整,缺少準確回答醫療問題所需的細微臨床細節。在這項工作中,我們提出 SearchRAG,這是一種新穎的架構,透過利用即時搜尋引擎克服這些限制。我們的模型採用合成查詢生成,將複雜的醫療問題轉換成搜尋引擎友善的查詢,並利用基於不確定性的知識選擇來過濾和納入 LLM 輸入中最相關且最有資訊的醫療知識。實驗結果證明,我們的模型顯著改善了醫療問題回答任務中的回應準確度,特別是需要詳細且最新的知識的複雜問題。 +摘要:為了增強遊客的體驗和沉浸感,本文提出了一個名為 NarrativeGuide 的敘事驅動旅遊規劃框架,它會為旅客產生一個以地理文化為基礎的敘事腳本,為他們的旅程提供一個新穎的角色扮演體驗。在初始階段,NarrativeGuide 會為城市內的景點建立一個知識圖譜,然後根據知識圖譜配置世界觀、角色設定和說明。利用這個基礎,知識圖譜會與每個景點結合,為其產生一個獨立的場景單元。在行程規劃階段,NarrativeGuide 將敘事驅動的旅遊規劃建模為一個最佳化問題,利用遺傳演算法 (GA) 來優化行程。在評估候選行程之前,會為每對相鄰景點產生過場腳本,這些腳本會與場景單元一起形成一個完整的腳本。接著,將腳本連貫性、旅遊時間和景點分數的加權和用作適應值,以更新候選解集。在四個城市(即中國的南京和揚州、法國的巴黎和德國的柏林)進行的實驗結果顯示,敘事連貫性和文化契合度都有顯著的提升,同時旅遊時間大幅減少,且所參觀景點的品質也提升了。我們的研究強調,納入外部演化最佳化能有效解決大型語言模型在旅遊規劃中的限制。我們的程式碼可在 https://github.com/Evan01225/Narrative-Driven-Travel-Planning 取得。 -##### **Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions** -2502.13135v1 by Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić +##### **Optimal word order for non-causal text generation with Large Language Models: the Spanish case** +2502.14451v1 by Andrea Busto-Castiñeira, Silvia García-Méndez, Francisco de Arriba-Pérez, Francisco J. González-Castaño -We present an end-to-end framework for generating synthetic users for -evaluating interactive agents designed to encourage positive behavior changes, -such as in health and lifestyle coaching. The synthetic users are grounded in -health and lifestyle conditions, specifically sleep and diabetes management in -this study, to ensure realistic interactions with the health coaching agent. 
-Synthetic users are created in two stages: first, structured data are generated -grounded in real-world health and lifestyle factors in addition to basic -demographics and behavioral attributes; second, full profiles of the synthetic -users are developed conditioned on the structured data. Interactions between -synthetic users and the coaching agent are simulated using generative -agent-based models such as Concordia, or directly by prompting a language -model. Using two independently-developed agents for sleep and diabetes coaching -as case studies, the validity of this framework is demonstrated by analyzing -the coaching agent's understanding of the synthetic users' needs and -challenges. Finally, through multiple blinded evaluations of user-coach -interactions by human experts, we demonstrate that our synthetic users with -health and behavioral attributes more accurately portray real human users with -the same attributes, compared to generic synthetic users not grounded in such -attributes. The proposed framework lays the foundation for efficient -development of conversational agents through extensive, realistic, and grounded -simulated interactions. +Natural Language Generation (NLG) popularity has increased owing to the +progress in Large Language Models (LLMs), with zero-shot inference +capabilities. However, most neural systems utilize decoder-only causal +(unidirectional) transformer models, which are effective for English but may +reduce the richness of languages with less strict word order, subject omission, +or different relative clause attachment preferences. This is the first work +that analytically addresses optimal text generation order for non-causal +language models. We present a novel Viterbi algorithm-based methodology for +maximum likelihood word order estimation. We analyze the non-causal +most-likelihood order probability for NLG in Spanish and, then, the probability +of generating the same phrases with Spanish causal NLG. 
This comparative +analysis reveals that causal NLG prefers English-like SVO structures. We also +analyze the relationship between optimal generation order and causal +left-to-right generation order using Spearman's rank correlation. Our results +demonstrate that the ideal order predicted by the maximum likelihood estimator +is not closely related to the causal order and may be influenced by the +syntactic structure of the target sentence. -摘要:我們提供了一個端到端的架構,用於為評估互動式代理生成合成使用者,這些代理旨在鼓勵正向行為改變,例如健康和生活方式指導。合成使用者以健康和生活方式狀況為基礎,特別是本研究中的睡眠和糖尿病管理,以確保與健康指導代理的互動具有真實性。合成使用者分兩個階段建立:首先,除了基本人口統計資料和行為屬性外,還會產生以現實世界的健康和生活方式因素為基礎的結構化資料;其次,會根據結構化資料開發合成使用者的完整個人資料。合成使用者和指導代理之間的互動是使用生成式基於代理的模型(例如 Concordia)模擬的,或者直接通過提示語言模型來模擬。使用兩個獨立開發的睡眠和糖尿病指導代理作為案例研究,通過分析指導代理對合成使用者需求和挑戰的理解,證明了此架構的有效性。最後,通過人類專家對使用者指導互動進行多重盲測評估,我們證明了與未以這些屬性為基礎的通用合成使用者相比,具有健康和行為屬性的合成使用者更準確地描繪了具有相同屬性的真實人類使用者。所提出的架構為通過廣泛、真實且有根據的模擬互動,為對話代理的有效開發奠定了基礎。 +摘要:自然語言生成 (NLG) 的普及歸功於大型語言模型 (LLM) 的進步,以及零次學習推論能力。然而,大多數神經系統使用僅解碼器因果 (單向) Transformer模型,這對英語很有效,但可能會減少語序較不嚴謹、省略主詞或相對從句附加偏好不同的語言的豐富性。這是第一個針對非因果語言模型分析性地解決最佳文字生成順序的研究。我們提出了一種基於維特比演算法的新方法,用於最大似然詞序估計。我們分析了西班牙語 NLG 的非因果最大似然順序機率,然後分析了使用西班牙語因果 NLG 生成相同短語的機率。這種比較分析顯示,因果 NLG 偏好英語式的 SVO 結構。我們還使用 Spearman 等級相關性分析最佳生成順序和因果從左到右生成順序之間的關係。我們的結果表明,最大似然估計器預測的理想順序與因果順序沒有密切關係,並且可能會受到目標句子的語法結構影響。 -##### **Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization** -2502.13108v1 by Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Amit Agarwal, Bhargava Kumar, Srikant Panda, Tejaswini Kumar -Clinical Question Answering (CQA) plays a crucial role in medical -decision-making, enabling physicians to extract relevant information from -Electronic Medical Records (EMRs). 
While transformer-based models such as BERT, -BioBERT, and ClinicalBERT have demonstrated state-of-the-art performance in -CQA, existing models lack the ability to categorize extracted answers, which is -critical for structured retrieval, content filtering, and medical decision -support. - To address this limitation, we introduce a Multi-Task Learning (MTL) -framework that jointly trains CQA models for both answer extraction and medical -categorization. In addition to predicting answer spans, our model classifies -responses into five standardized medical categories: Diagnosis, Medication, -Symptoms, Procedure, and Lab Reports. This categorization enables more -structured and interpretable outputs, making clinical QA models more useful in -real-world healthcare settings. - We evaluate our approach on emrQA, a large-scale dataset for medical question -answering. Results show that MTL improves F1-score by 2.2% compared to standard -fine-tuning, while achieving 90.7% accuracy in answer categorization. These -findings suggest that MTL not only enhances CQA performance but also introduces -an effective mechanism for categorization and structured medical information -retrieval. 
+### Knowledge Graphs +|Publish Date|Title|Authors|Homepage|Code| +| :---: | :---: | :---: | :---: | :---: | +|**2025-02-20**|**GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks**|Jianwen Luo et.al.|[2502.14848v1](http://arxiv.org/abs/2502.14848v1)|null| +|**2025-02-20**|**From RAG to Memory: Non-Parametric Continual Learning for Large Language Models**|Bernal Jiménez Gutiérrez et.al.|[2502.14802v1](http://arxiv.org/abs/2502.14802v1)|[link](https://github.com/osu-nlp-group/hipporag)| +|**2025-02-20**|**Plan-over-Graph: Towards Parallelable LLM Agent Schedule**|Shiqi Zhang et.al.|[2502.14563v1](http://arxiv.org/abs/2502.14563v1)|null| +|**2025-02-20**|**Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization**|Ran Ding et.al.|[2502.14456v1](http://arxiv.org/abs/2502.14456v1)|null| +|**2025-02-20**|**Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment**|Jiaxi Li et.al.|[2502.14275v1](http://arxiv.org/abs/2502.14275v1)|null| +|**2025-02-20**|**Mitigating Lost-in-Retrieval Problems in Retrieval Augmented Multi-Hop Question Answering**|Rongzhi Zhu et.al.|[2502.14245v1](http://arxiv.org/abs/2502.14245v1)|null| +|**2025-02-20**|**NLP-AKG: Few-Shot Construction of NLP Academic Knowledge Graph Based on LLM**|Jiayin Lan et.al.|[2502.14192v1](http://arxiv.org/abs/2502.14192v1)|null| +|**2025-02-19**|**Object-centric Binding in Contrastive Language-Image Pretraining**|Rim Assouel et.al.|[2502.14113v1](http://arxiv.org/abs/2502.14113v1)|null| +|**2025-02-19**|**Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning**|Cole Gawin et.al.|[2502.14086v1](http://arxiv.org/abs/2502.14086v1)|null| +|**2025-02-19**|**Neurosymbolic artificial intelligence via large language models and coherence-driven inference**|Steve Huntsman et.al.|[2502.13953v1](http://arxiv.org/abs/2502.13953v1)|null| +|**2025-02-19**|**Complex 
Ontology Matching with Large Language Model Embeddings**|Guilherme Sousa et.al.|[2502.13619v1](http://arxiv.org/abs/2502.13619v1)|null| +|**2025-02-19**|**Are Large Language Models In-Context Graph Learners?**|Jintang Li et.al.|[2502.13562v1](http://arxiv.org/abs/2502.13562v1)|null| +|**2025-02-19**|**Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs**|Yushi Feng et.al.|[2502.13555v1](http://arxiv.org/abs/2502.13555v1)|[link](https://github.com/ys-feng/DemoGraph)| +|**2025-02-19**|**PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference**|Burc Gokden et.al.|[2502.13502v1](http://arxiv.org/abs/2502.13502v1)|[link](https://github.com/burcgokden/PLDR-LLM-with-KVG-cache)| +|**2025-02-19**|**Explore-Construct-Filter: An Automated Framework for Rich and Reliable API Knowledge Graph Construction**|Yanbang Sun et.al.|[2502.13412v1](http://arxiv.org/abs/2502.13412v1)|null| +|**2025-02-19**|**Reducing Hallucinations in Language Model-based SPARQL Query Generation Using Post-Generation Memory Retrieval**|Aditya Sharma et.al.|[2502.13369v1](http://arxiv.org/abs/2502.13369v1)|null| +|**2025-02-19**|**Craw4LLM: Efficient Web Crawling for LLM Pretraining**|Shi Yu et.al.|[2502.13347v1](http://arxiv.org/abs/2502.13347v1)|[link](https://github.com/cxcscmu/crawl4llm)| +|**2025-02-18**|**K-Paths: Reasoning over Graph Paths for Drug Repurposing and Drug Interaction Prediction**|Tassallah Abdullahi et.al.|[2502.13344v1](http://arxiv.org/abs/2502.13344v1)|[link](https://github.com/rsinghlab/K-Paths)| +|**2025-02-18**|**Grounding LLM Reasoning with Knowledge Graphs**|Alfonso Amayuelas et.al.|[2502.13247v1](http://arxiv.org/abs/2502.13247v1)|null| +|**2025-02-18**|**Learning to Defer for Causal Discovery with Imperfect Experts**|Oscar Clivio et.al.|[2502.13132v1](http://arxiv.org/abs/2502.13132v1)|null| +|**2025-02-18**|**Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks**|Markus 
J. Buehler et.al.|[2502.13025v1](http://arxiv.org/abs/2502.13025v1)|null| +|**2025-02-18**|**Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge**|Mohammad Reza Rezaei et.al.|[2502.13010v1](http://arxiv.org/abs/2502.13010v1)|null| +|**2025-02-18**|**R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs**|Sumin Jo et.al.|[2502.12767v1](http://arxiv.org/abs/2502.12767v1)|null| +|**2025-02-18**|**Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research**|Xiang Liu et.al.|[2502.12669v1](http://arxiv.org/abs/2502.12669v1)|null| +|**2025-02-18**|**G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation**|Yuhan Li et.al.|[2502.12586v1](http://arxiv.org/abs/2502.12586v1)|[link](https://github.com/yuhan1i/g-refer)| +|**2025-02-17**|**A-MEM: Agentic Memory for LLM Agents**|Wujiang Xu et.al.|[2502.12110v1](http://arxiv.org/abs/2502.12110v1)|null| +|**2025-02-17**|**KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs**|Qi Zhao et.al.|[2502.12029v1](http://arxiv.org/abs/2502.12029v1)|null| +|**2025-02-17**|**Atom of Thoughts for Markov LLM Test-Time Scaling**|Fengwei Teng et.al.|[2502.12018v1](http://arxiv.org/abs/2502.12018v1)|null| +|**2025-02-17**|**Generating Text from Uniform Meaning Representation**|Emma Markle et.al.|[2502.11973v1](http://arxiv.org/abs/2502.11973v1)|[link](https://github.com/acnlplab/umr-text-gen)| +|**2025-02-17**|**GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs**|Yi Fang et.al.|[2502.11925v1](http://arxiv.org/abs/2502.11925v1)|null| +|**2025-02-17**|**Exploring LLM-based Student Simulation for Metacognitive Cultivation**|Haoxuan Li et.al.|[2502.11678v1](http://arxiv.org/abs/2502.11678v1)|null| +|**2025-02-17**|**Ontology-Guided Reverse Thinking Makes Large Language Models Stronger on Knowledge Graph Question 
Answering**|Runxuan Liu et.al.|[2502.11491v1](http://arxiv.org/abs/2502.11491v1)|null| +|**2025-02-17**|**GLTW: Joint Improved Graph Transformer and LLM via Three-Word Language for Knowledge Graph Completion**|Kangyang Luo et.al.|[2502.11471v1](http://arxiv.org/abs/2502.11471v1)|null| +|**2025-02-16**|**Large Language-Geometry Model: When LLM meets Equivariance**|Zongzhao Li et.al.|[2502.11149v2](http://arxiv.org/abs/2502.11149v2)|null| +|**2025-02-16**|**Beyond Pairwise: Global Zero-shot Temporal Graph Generation**|Alon Eirew et.al.|[2502.11114v1](http://arxiv.org/abs/2502.11114v1)|null| +|**2025-02-16**|**Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications**|Alexandru Lecu et.al.|[2502.11108v1](http://arxiv.org/abs/2502.11108v1)|[link](https://github.com/alexlecu/llmkgraph)| +|**2025-02-16**|**Beyond Similarity: A Gradient-based Graph Method for Instruction Tuning Data Selection**|Yang Zhao et.al.|[2502.11062v1](http://arxiv.org/abs/2502.11062v1)|null| +|**2025-02-16**|**CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models**|Yuefei Chen et.al.|[2502.11008v1](http://arxiv.org/abs/2502.11008v1)|null| +|**2025-02-16**|**RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation**|Pengcheng Jiang et.al.|[2502.10996v1](http://arxiv.org/abs/2502.10996v1)|[link](https://github.com/pat-jj/Retrieval-And-Structure)| +|**2025-02-15**|**Developing Conversational Speech Systems for Robots to Detect Speech Biomarkers of Cognition in People Living with Dementia**|Rohith Perumandla et.al.|[2502.10896v1](http://arxiv.org/abs/2502.10896v1)|null| +|**2025-02-15**|**Evaluating improvements on using Large Language Models (LLMs) for property extraction in the Open Research Knowledge Graph (ORKG)**|Sandra Schaftner et.al.|[2502.10768v1](http://arxiv.org/abs/2502.10768v1)|null| +|**2025-02-15**|**K-Edit: Language Model Editing with Contextual Knowledge Awareness**|Elan 
Markowitz et.al.|[2502.10626v1](http://arxiv.org/abs/2502.10626v1)|null| +|**2025-02-15**|**ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis**|Xueshen Li et.al.|[2502.10620v1](http://arxiv.org/abs/2502.10620v1)|null| +|**2025-02-14**|**GraphiT: Efficient Node Classification on Text-Attributed Graphs with Prompt Optimized LLMs**|Shima Khoshraftar et.al.|[2502.10522v1](http://arxiv.org/abs/2502.10522v1)|null| +|**2025-02-14**|**Do Large Language Models Reason Causally Like Us? Even Better?**|Hanna M. Dettki et.al.|[2502.10215v1](http://arxiv.org/abs/2502.10215v1)|null| +|**2025-02-14**|**Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages**|Daniil Gurgurov et.al.|[2502.10140v1](http://arxiv.org/abs/2502.10140v1)|null| +|**2025-02-14**|**Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models**|Chenrui Tie et.al.|[2502.10090v1](http://arxiv.org/abs/2502.10090v1)|null| +|**2025-02-14**|**Decision Information Meets Large Language Models: The Future of Explainable Operations Research**|Yansen Zhang et.al.|[2502.09994v1](http://arxiv.org/abs/2502.09994v1)|null| +|**2025-02-14**|**KGGen: Extracting Knowledge Graphs from Plain Text with Language Models**|Belinda Mo et.al.|[2502.09956v1](http://arxiv.org/abs/2502.09956v1)|null| +|**2025-02-14**|**ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation**|Shu Wang et.al.|[2502.09891v1](http://arxiv.org/abs/2502.09891v1)|null| +|**2025-02-13**|**Visual Graph Question Answering with ASP and LLMs for Language Parsing**|Jakob Johannes Bauer et.al.|[2502.09211v1](http://arxiv.org/abs/2502.09211v1)|null| +|**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null| 
+|**2025-02-12**|**Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy**|Ruizhan Xue et.al.|[2502.08353v1](http://arxiv.org/abs/2502.08353v1)|null| +|**2025-02-12**|**Graph Foundation Models for Recommendation: A Comprehensive Survey**|Bin Wu et.al.|[2502.08346v3](http://arxiv.org/abs/2502.08346v3)|null| +|**2025-02-12**|**Self-Evaluation for Job-Shop Scheduling**|Imanol Echeverria et.al.|[2502.08684v1](http://arxiv.org/abs/2502.08684v1)|null| +|**2025-02-12**|**Improving Existing Optimization Algorithms with LLMs**|Camilo Chacón Sartori et.al.|[2502.08298v1](http://arxiv.org/abs/2502.08298v1)|null| +|**2025-02-12**|**LLM4GNAS: A Large Language Model Based Toolkit for Graph Neural Architecture Search**|Yang Gao et.al.|[2502.10459v1](http://arxiv.org/abs/2502.10459v1)|null| +|**2025-02-12**|**ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning**|Vy Vo et.al.|[2502.08148v1](http://arxiv.org/abs/2502.08148v1)|null| +|**2025-02-12**|**Neuro-Conceptual Artificial Intelligence: Integrating OPM with Deep Learning to Enhance Question Answering Quality**|Xin Kang et.al.|[2502.09658v1](http://arxiv.org/abs/2502.09658v1)|null| +|**2025-02-12**|**GCoT: Chain-of-Thought Prompt Learning for Graphs**|Xingtong Yu et.al.|[2502.08092v1](http://arxiv.org/abs/2502.08092v1)|null| +|**2025-02-12**|**Linking Cryptoasset Attribution Tags to Knowledge Graph Entities: An LLM-based Approach**|Régnier Avice et.al.|[2502.10453v1](http://arxiv.org/abs/2502.10453v1)|[link](https://github.com/ravice234/cryptoasset-attribution-tag-linker)| +|**2025-02-11**|**Deep Semantic Graph Learning via LLM based Node Enhancement**|Chuanqi Shi et.al.|[2502.07982v1](http://arxiv.org/abs/2502.07982v1)|null| +|**2025-02-10**|**Cardiverse: Harnessing LLMs for Novel Card Game Prototyping**|Danrui Li et.al.|[2502.07128v1](http://arxiv.org/abs/2502.07128v1)|null| +|**2025-02-10**|**GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural 
Processing Units**|Arghadip Das et.al.|[2502.06921v2](http://arxiv.org/abs/2502.06921v2)|[link](https://github.com/arghadippurdue/GraNNite)| +|**2025-02-10**|**Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language**|Zhiqiang Zhong et.al.|[2502.06634v1](http://arxiv.org/abs/2502.06634v1)|null| +|**2025-02-10**|**KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment**|Yuxing Lu et.al.|[2502.06472v1](http://arxiv.org/abs/2502.06472v1)|[link](https://github.com/YuxingLu613/KARMA)| +|**2025-02-10**|**RoToR: Towards More Reliable Responses for Order-Invariant Inputs**|Soyoung Yoon et.al.|[2502.08662v1](http://arxiv.org/abs/2502.08662v1)|null| +|**2025-02-10**|**K-ON: Stacking Knowledge On the Head Layer of Large Language Model**|Lingbing Guo et.al.|[2502.06257v1](http://arxiv.org/abs/2502.06257v1)|null| +|**2025-02-10**|**LegalViz: Legal Text Visualization by Text To Diagram Generation**|Eri Onami et.al.|[2502.06147v2](http://arxiv.org/abs/2502.06147v2)|null| +|**2025-02-09**|**Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs**|Han Meng et.al.|[2502.06075v1](http://arxiv.org/abs/2502.06075v1)|null| +|**2025-02-09**|**LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification**|Shubham Kumar Nigam et.al.|[2502.05836v1](http://arxiv.org/abs/2502.05836v1)|null| +|**2025-02-08**|**LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning**|Hanqing Yang et.al.|[2502.05453v1](http://arxiv.org/abs/2502.05453v1)|null| +|**2025-02-08**|**SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation**|Xingtong Yu et.al.|[2502.05424v1](http://arxiv.org/abs/2502.05424v1)|null| +|**2025-02-08**|**Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints**|Ali Al-Lawati 
et.al.|[2502.05414v1](http://arxiv.org/abs/2502.05414v1)|null| +|**2025-02-08**|**Knowledge Graph-Guided Retrieval Augmented Generation**|Xiangrong Zhu et.al.|[2502.06864v1](http://arxiv.org/abs/2502.06864v1)|[link](https://github.com/nju-websoft/KG2RAG)| +|**2025-02-07**|**Can Large Language Models Understand Intermediate Representations?**|Hailong Jiang et.al.|[2502.06854v1](http://arxiv.org/abs/2502.06854v1)|null| +|**2025-02-07**|**GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?**|Yang Zhou et.al.|[2502.05252v1](http://arxiv.org/abs/2502.05252v1)|[link](https://github.com/Infini-AI-Lab/gsm_infinite)| +|**2025-02-07**|**Causality can systematically address the monsters under the bench(marks)**|Felix Leeb et.al.|[2502.05085v1](http://arxiv.org/abs/2502.05085v1)|null| +|**2025-02-07**|**Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures**|Tushar Pandey et.al.|[2502.05078v1](http://arxiv.org/abs/2502.05078v1)|[link](https://github.com/AgnostiqHQ/multi-agent-llm)| +|**2025-02-07**|**Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics**|Hussam Ghanem et.al.|[2502.05239v1](http://arxiv.org/abs/2502.05239v1)|null| +|**2025-02-07**|**Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research**|Junde Wu et.al.|[2502.04644v1](http://arxiv.org/abs/2502.04644v1)|[link](https://github.com/theworldofagents/agentic-reasoning)| +|**2025-02-07**|**Position-aware Automatic Circuit Discovery**|Tal Haklay et.al.|[2502.04577v1](http://arxiv.org/abs/2502.04577v1)|[link](https://github.com/technion-cs-nlp/peap)| +|**2025-02-06**|**Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems**|Shangbin Feng et.al.|[2502.04510v1](http://arxiv.org/abs/2502.04510v1)|null| +|**2025-02-06**|**MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for 
Healthcare Copilot**|Xuejiao Zhao et.al.|[2502.04413v1](http://arxiv.org/abs/2502.04413v1)|[link](https://github.com/snowteam2023/medrag)| +|**2025-02-06**|**Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering**|Longquan Jiang et.al.|[2502.03992v1](http://arxiv.org/abs/2502.03992v1)|[link](https://github.com/longquanjiang/ontoscprompt)| +|**2025-02-06**|**Multimodal Medical Code Tokenizer**|Xiaorui Su et.al.|[2502.04397v2](http://arxiv.org/abs/2502.04397v2)|null| +|**2025-02-06**|**Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents**|Chenyang Shao et.al.|[2502.04392v1](http://arxiv.org/abs/2502.04392v1)|null| +|**2025-02-06**|**Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models**|Rui Cai et.al.|[2502.03715v1](http://arxiv.org/abs/2502.03715v1)|null| +|**2025-02-05**|**A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)**|Yiye Chen et.al.|[2502.03450v1](http://arxiv.org/abs/2502.03450v1)|null| +|**2025-02-05**|**SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**|Ben Liu et.al.|[2502.03283v2](http://arxiv.org/abs/2502.03283v2)|null| +|**2025-02-05**|**Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**|Daniil Laptev et.al.|[2502.03032v2](http://arxiv.org/abs/2502.03032v2)|null| +|**2025-02-05**|**A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs**|Bradley P. 
Allen et.al.|[2502.02896v1](http://arxiv.org/abs/2502.02896v1)|null| +|**2025-02-05**|**Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization**|Chanhui Lee et.al.|[2502.02810v1](http://arxiv.org/abs/2502.02810v1)|null| +|**2025-02-05**|**Leveraging the true depth of LLMs**|Ramón Calvo González et.al.|[2502.02790v1](http://arxiv.org/abs/2502.02790v1)|null| +|**2025-02-04**|**Modular Training of Neural Networks aids Interpretability**|Satvik Golechha et.al.|[2502.02470v2](http://arxiv.org/abs/2502.02470v2)|null| +|**2025-02-04**|**Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs**|Sagnik Mukherjee et.al.|[2502.02362v3](http://arxiv.org/abs/2502.02362v3)|null| +|**2025-02-04**|**AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement**|Shivam Singh et.al.|[2502.02067v1](http://arxiv.org/abs/2502.02067v1)|[link](https://github.com/sssshivvvv/adaptbot)| +|**2025-02-03**|**On Bob Dylan: A Computational Perspective**|Prashant Garg et.al.|[2502.01772v1](http://arxiv.org/abs/2502.01772v1)|null| +|**2025-02-03**|**VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos**|Xubin Ren et.al.|[2502.01549v1](http://arxiv.org/abs/2502.01549v1)|null| +|**2025-02-03**|**Transformers trained on proteins can learn to attend to Euclidean distance**|Isaac Ellmen et.al.|[2502.01533v1](http://arxiv.org/abs/2502.01533v1)|[link](https://github.com/Ellmen/attending-to-distance)| + +#### Abstracts +##### **GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks** +2502.14848v1 by Jianwen Luo, Yiming Huang, Jinxiang Meng, Fangyu Lei, Shizhu He, Xiao Liu, Shanshan Jiang, Bin Dong, Jun Zhao, Kang Liu + +Large Language Models (LLMs) have shown great promise in tool-making, yet +existing frameworks often struggle to efficiently construct reliable toolsets +and are limited to single-task settings. 
To address these challenges, we +propose GATE (Graph-based Adaptive Tool Evolution), an adaptive framework that +dynamically constructs and evolves a hierarchical graph of reusable tools +across multiple scenarios. We evaluate GATE on open-ended tasks (Minecraft), +agent-based tasks (TextCraft, DABench), and code generation tasks (MATH, Date, +TabMWP). Our results show that GATE achieves up to 4.3x faster milestone +completion in Minecraft compared to the previous SOTA, and provides an average +improvement of 9.23% over existing tool-making methods in code generation tasks +and 10.03% in agent tasks. GATE demonstrates the power of adaptive evolution, +balancing tool quantity, complexity, and functionality while maintaining high +efficiency. Code and data are available at +https://github.com/ayanami2003/GATE. -摘要:臨床問答 (CQA) 在醫療決策中扮演著至關重要的角色,讓醫師能夠從電子病歷 (EMR) 中擷取相關資訊。儘管 BERT、BioBERT 和 ClinicalBERT 等基於轉換器的模型已在 CQA 中展現出最先進的效能,但現有的模型缺乏分類擷取答案的能力,這對於結構化檢索、內容過濾和醫療決策支援至關重要。 - 為了解決這個限制,我們引進了一個多任務學習 (MTL) 架構,它同時訓練 CQA 模型用於答案擷取和醫療分類。除了預測答案範圍,我們的模型將回應分類為五個標準化醫療類別:診斷、藥物、症狀、程序和實驗室報告。這種分類能產生更結構化且易於理解的輸出,讓臨床問答模型在真實世界的醫療保健環境中更實用。 - 我們在 emrQA 上評估我們的做法,emrQA 是用於醫療問題解答的大規模資料集。結果顯示,與標準微調相比,MTL 將 F1 分數提高了 2.2%,同時在答案分類中達到 90.7% 的準確度。這些發現表明,MTL 不僅增強了 CQA 的效能,還引入了一種分類和結構化醫療資訊檢索的有效機制。 +摘要:大型語言模型 (LLM) 在工具製作方面展現出極大的潛力,然而現有的框架經常難以有效地建構可靠的工具組,並且僅限於單一任務設定。為了應對這些挑戰,我們提出了 GATE(基於圖形的自適應工具演化),這是一個自適應框架,可跨多個場景動態建構和演化可重複使用的工具階層圖。我們在開放式任務(Minecraft)、基於代理的任務(TextCraft、DABench)和程式碼生成任務(MATH、Date、TabMWP)上評估了 GATE。我們的結果顯示,與先前的 SOTA 相比,GATE 在 Minecraft 中實現了高達 4.3 倍的里程碑完成速度,並且在程式碼生成任務中提供了比現有工具製作方法平均提升 9.23%,在代理任務中提升了 10.03%。GATE 展示了自適應演化的力量,在保持高效率的同時,平衡了工具數量、複雜性和功能性。程式碼和資料可在 https://github.com/ayanami2003/GATE 取得。 -##### **Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection** -2502.12948v1 by Athira J Jacob, Puneet Sharma, Daniel Rueckert +##### **From RAG to Memory: Non-Parametric Continual Learning for Large Language Models**
+2502.14802v1 by Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, Yu Su -Detection of hyperenhancement from cardiac LGE MRI images is a complex task -requiring significant clinical expertise. Although deep learning-based models -have shown promising results for the task, they require large amounts of data -with fine-grained annotations. Clinical reports generated for cardiac MR -studies contain rich, clinically relevant information, including the location, -extent and etiology of any scars present. Although recently developed -CLIP-based training enables pretraining models with image-text pairs, it -requires large amounts of data and further finetuning strategies on downstream -tasks. In this study, we use various strategies rooted in domain knowledge to -train a model for LGE detection solely using text from clinical reports, on a -relatively small clinical cohort of 965 patients. We improve performance -through the use of synthetic data augmentation, by systematically creating scar -images and associated text. In addition, we standardize the orientation of the -images in an anatomy-informed way to enable better alignment of spatial and -text features. We also use a captioning loss to enable fine-grained supervision -and explore the effect of pretraining of the vision encoder on performance. -Finally, ablation studies are carried out to elucidate the contributions of -each design component to the overall performance of the model. +Our ability to continuously acquire, organize, and leverage knowledge is a +key feature of human intelligence that AI systems must approximate to unlock +their full potential. Given the challenges in continual learning with large +language models (LLMs), retrieval-augmented generation (RAG) has become the +dominant way to introduce new information. However, its reliance on vector +retrieval hinders its ability to mimic the dynamic and interconnected nature of +human long-term memory. 
Recent RAG approaches augment vector embeddings with +various structures like knowledge graphs to address some of these gaps, namely +sense-making and associativity. However, their performance on more basic +factual memory tasks drops considerably below standard RAG. We address this +unintended deterioration and propose HippoRAG 2, a framework that outperforms +standard RAG comprehensively on factual, sense-making, and associative memory +tasks. HippoRAG 2 builds upon the Personalized PageRank algorithm used in +HippoRAG and enhances it with deeper passage integration and more effective +online use of an LLM. This combination pushes this RAG system closer to the +effectiveness of human long-term memory, achieving a 7% improvement in +associative memory tasks over the state-of-the-art embedding model while also +exhibiting superior factual knowledge and sense-making memory capabilities. +This work paves the way for non-parametric continual learning for LLMs. Our +code and data will be released at https://github.com/OSU-NLP-Group/HippoRAG. 
-摘要:從心臟 LGE MRI 影像偵測出過度增強是一項複雜的任務,需要顯著的臨床專業知識。儘管基於深度學習的模型已顯示出對這項任務有前景的結果,但它們需要大量具有細緻註解的資料。為心臟 MR 研究產生的臨床報告包含豐富且臨床上相關的資訊,包括任何疤痕的位置、範圍和病因。儘管最近開發的基於 CLIP 的訓練能使用影像文字對預訓練模型,但它需要大量資料和進一步微調下游任務的策略。在這項研究中,我們使用植基於領域知識的各種策略,僅使用來自臨床報告的文字,在一個相對較小的 965 名患者臨床群體中訓練一個 LGE 偵測模型。我們透過使用合成資料擴充來改善效能,系統性地建立疤痕影像和相關文字。此外,我們以解剖學告知的方式標準化影像方向,以使空間和文字特徵能更好地對齊。我們也使用標題損失來啟用細緻的監督,並探討視覺編碼器的預訓練對效能的影響。最後,進行消融研究以闡明每個設計元件對模型整體效能的貢獻。 +摘要:我們持續獲取、組織和利用知識的能力是人類智慧的一項關鍵特徵,而人工智慧系統必須近似於此才能發揮其全部潛力。由於大型語言模型 (LLM) 持續學習的挑戰,檢索增強生成 (RAG) 已成為引入新資訊的主流方式。然而,它依賴向量檢索阻礙了它模擬人類長期記憶的動態和相互連結的本質。最近的 RAG 方法用各種結構(如知識圖譜)增強向量嵌入,以解決其中一些差距,即意義建構和聯想性。然而,它們在更基本的實際記憶任務上的表現遠低於標準 RAG。我們解決了這種意外的惡化,並提出了 HippoRAG 2,這是一個在實際、意義建構和聯想記憶任務上全面優於標準 RAG 的框架。HippoRAG 2 建立在 HippoRAG 中使用的 Personalized PageRank 演算法之上,並透過更深入的段落整合和更有效的 LLM 線上使用來增強它。這種組合將此 RAG 系統推向更接近人類長期記憶的效能,在聯想記憶任務上比最先進的嵌入模型提升了 7%,同時也展現出優異的實際知識和意義建構記憶能力。這項工作為 LLM 的非參數持續學習鋪平了道路。我們的程式碼和資料將在 https://github.com/OSU-NLP-Group/HippoRAG 上發布。 -##### **Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models** -2502.12825v2 by Rubing Li, João Sedoc, Arun Sundararajan +##### **Plan-over-Graph: Towards Parallelable LLM Agent Schedule** +2502.14563v1 by Shiqi Zhang, Xinbei Ma, Zouying Cao, Zhuosheng Zhang, Hai Zhao -When encountering increasingly frequent performance improvements or cost -reductions from a new large language model (LLM), developers of applications -leveraging LLMs must decide whether to take advantage of these improvements or -stay with older tried-and-tested models. Low perceived switching frictions can -lead to choices that do not consider more subtle behavior changes that the -transition may induce. Our experiments use a popular game-theoretic behavioral -economics model of trust to show stark differences in the trusting behavior of -OpenAI's and DeepSeek's models. 
We highlight a collapse in the economic trust -behavior of the o1-mini and o3-mini models as they reconcile profit-maximizing -and risk-seeking with future returns from trust, and contrast it with -DeepSeek's more sophisticated and profitable trusting behavior that stems from -an ability to incorporate deeper concepts like forward planning and -theory-of-mind. As LLMs form the basis for high-stakes commercial systems, our -results highlight the perils of relying on LLM performance benchmarks that are -too narrowly defined and suggest that careful analysis of their hidden fault -lines should be part of any organization's AI strategy. +Large Language Models (LLMs) have demonstrated exceptional abilities in +reasoning for task planning. However, challenges remain under-explored for +parallel schedules. This paper introduces a novel paradigm, plan-over-graph, in +which the model first decomposes a real-life textual task into executable +subtasks and constructs an abstract task graph. The model then understands this +task graph as input and generates a plan for parallel execution. To enhance the +planning capability of complex, scalable graphs, we design an automated and +controllable pipeline to generate synthetic graphs and propose a two-stage +training scheme. Experimental results show that our plan-over-graph method +significantly improves task performance on both API-based LLMs and trainable +open-sourced LLMs. By normalizing complex tasks as graphs, our method naturally +supports parallel execution, demonstrating global efficiency. The code and data +are available at https://github.com/zsq259/Plan-over-Graph. 
-摘要:在遇到大型語言模型 (LLM) 頻頻帶來的效能提升或成本降低時,利用 LLM 的應用程式開發人員必須決定是否要利用這些提升,或繼續使用較舊且經過驗證的模型。低感知切換摩擦可能會導致選擇,而沒有考慮轉換可能引發的更細微行為變更。我們的實驗使用流行的博弈論行為經濟信任模型,以顯示 OpenAI 和 DeepSeek 模型在信任行為上的顯著差異。我們強調 o1-mini 和 o3-mini 模型的經濟信任行為崩潰,因為它們調和了利潤最大化和冒險,以及來自信任的未來回報,並將其與 DeepSeek 更複雜且有利可圖的信任行為進行對比,這種行為源於整合更深入的概念,例如前瞻性規劃和心智理論。由於 LLM 構成高風險商業系統的基礎,我們的結果突顯了依賴定義過於狹窄的 LLM 效能基準的危險,並建議仔細分析其隱藏的斷層線應該是任何組織 AI 策略的一部分。 +摘要:大型語言模型 (LLM) 已展現出在任務規劃推理方面的非凡能力。然而,對於並行時程表的挑戰仍未充分探討。本文介紹了一個新穎的範例,即圖形規劃,其中模型首先將現實生活中的文字任務分解為可執行的子任務,並建構一個抽象任務圖。然後,模型將此任務圖理解為輸入,並產生一個並行執行的計畫。為了增強複雜、可擴充圖形的規劃能力,我們設計了一個自動化且可控的管道來產生合成圖形,並提出了一個兩階段訓練方案。實驗結果表明,我們的圖形規劃方法顯著提升了基於 API 的 LLM 和可訓練的開源 LLM 的任務效能。透過將複雜任務標準化為圖形,我們的模型自然支援並行執行,展現出整體效率。程式碼和資料可在 https://github.com/zsq259/Plan-over-Graph 取得。 -##### **LLM Safety for Children** -2502.12552v1 by Prasanjit Rath, Hari Shrawgi, Parag Agrawal, Sandipan Dandapat +##### **Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization** +2502.14456v1 by Ran Ding, Ziyu Zhang, Ying Zhu, Ziqian Kong, Peilan Xu -This paper analyzes the safety of Large Language Models (LLMs) in -interactions with children below age of 18 years. Despite the transformative -applications of LLMs in various aspects of children's lives such as education -and therapy, there remains a significant gap in understanding and mitigating -potential content harms specific to this demographic. The study acknowledges -the diverse nature of children often overlooked by standard safety evaluations -and proposes a comprehensive approach to evaluating LLM safety specifically for -children. We list down potential risks that children may encounter when using -LLM powered applications. Additionally we develop Child User Models that -reflect the varied personalities and interests of children informed by -literature in child care and psychology. These user models aim to bridge the -existing gap in child safety literature across various fields. 
We utilize Child -User Models to evaluate the safety of six state of the art LLMs. Our -observations reveal significant safety gaps in LLMs particularly in categories -harmful to children but not adults +To enhance tourists' experiences and immersion, this paper proposes a +narrative-driven travel planning framework called NarrativeGuide, which +generates a geoculturally-grounded narrative script for travelers, offering a +novel, role-playing experience for their journey. In the initial stage, +NarrativeGuide constructs a knowledge graph for attractions within a city, then +configures the worldview, character setting, and exposition based on the +knowledge graph. Using this foundation, the knowledge graph is combined to +generate an independent scene unit for each attraction. During the itinerary +planning stage, NarrativeGuide models narrative-driven travel planning as an +optimization problem, utilizing a genetic algorithm (GA) to refine the +itinerary. Before evaluating the candidate itinerary, transition scripts are +generated for each pair of adjacent attractions, which, along with the scene +units, form a complete script. The weighted sum of script coherence, travel +time, and attraction scores is then used as the fitness value to update the +candidate solution set. Experimental results across four cities, i.e., Nanjing +and Yangzhou in China, Paris in France, and Berlin in Germany, demonstrate +significant improvements in narrative coherence and cultural fit, alongside a +notable reduction in travel time and an increase in the quality of visited +attractions. Our study highlights that incorporating external evolutionary +optimization effectively addresses the limitations of large language models in +travel planning. Our codes are available at +https://github.com/Evan01225/Narrative-Driven-Travel-Planning. 
-摘要:本文分析了大型語言模型 (LLM) 在與 18 歲以下兒童互動時的安全性。儘管 LLM 在兒童生活的各個方面(例如教育和治療)都有轉變性的應用,但在了解和減輕對這個群體具體的潛在內容危害方面仍然存在顯著差距。研究承認兒童的多樣性,而標準安全評估通常會忽略這些多樣性,並提出了一種針對兒童評估 LLM 安全性的綜合方法。我們列出了兒童在使用由 LLM 提供動力的應用程式時可能遇到的潛在風險。此外,我們開發了兒童使用者模型,這些模型反映了兒童不同的個性特質和興趣,並參考了兒童照護和心理學的文獻。這些使用者模型旨在彌合不同領域兒童安全文獻中現有的差距。我們利用兒童使用者模型來評估六個最先進的 LLM 的安全性。我們的觀察結果揭示了 LLM 中的重大安全漏洞,特別是在對兒童有害但對成年人無害的類別中 +摘要:為了增強遊客的體驗和沉浸感,本文提出了一個名為 NarrativeGuide 的敘事驅動旅遊規劃框架,它會為旅客產生一個以地理文化為基礎的敘事腳本,為他們的旅程提供一個新穎的角色扮演體驗。在初始階段,NarrativeGuide 會為城市內的景點建立一個知識圖譜,然後根據知識圖譜配置世界觀、角色設定和說明。利用這個基礎,知識圖譜會與每個景點結合,為其產生一個獨立的場景單元。在行程規劃階段,NarrativeGuide 將敘事驅動的旅遊規劃建模為一個最佳化問題,利用遺傳演算法 (GA) 來優化行程。在評估候選行程之前,會為每對相鄰景點產生過場腳本,這些腳本會與場景單元一起形成一個完整的腳本。接著,將腳本連貫性、旅遊時間和景點分數的加權和用作適應值,以更新候選解集。在四個城市(即中國的南京和揚州、法國的巴黎和德國的柏林)進行的實驗結果顯示,敘事連貫性和文化契合度都有顯著的提升,同時旅遊時間大幅減少,且所參觀景點的品質也提升了。我們的研究強調,納入外部演化最佳化能有效解決大型語言模型在旅遊規劃中的限制。我們的程式碼可在 https://github.com/Evan01225/Narrative-Driven-Travel-Planning 取得。 -##### **Classifiers of Data Sharing Statements in Clinical Trial Records** -2502.12362v1 by Saber Jelodari Mamaghani, Cosima Strantz, Dennis Toddenroth +##### **Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment** +2502.14275v1 by Jiaxi Li, Yiwei Wang, Kai Zhang, Yujun Cai, Bryan Hooi, Nanyun Peng, Kai-Wei Chang, Jin Lu -Digital individual participant data (IPD) from clinical trials are -increasingly distributed for potential scientific reuse. The identification of -available IPD, however, requires interpretations of textual data-sharing -statements (DSS) in large databases. Recent advancements in computational -linguistics include pre-trained language models that promise to simplify the -implementation of effective classifiers based on textual inputs. In a subset of -5,000 textual DSS from ClinicalTrials.gov, we evaluate how well classifiers -based on domain-specific pre-trained language models reproduce original -availability categories as well as manually annotated labels. 
Typical metrics -indicate that classifiers that predicted manual annotations outperformed those -that learned to output the original availability categories. This suggests that -the textual DSS descriptions contain applicable information that the -availability categories do not, and that such classifiers could thus aid the -automatic identification of available IPD in large trial databases. +Large language models (LLMs) have been widely adopted in various downstream +task domains. However, their ability to directly recall and apply factual +medical knowledge remains under-explored. Most existing medical QA benchmarks +assess complex reasoning or multi-hop inference, making it difficult to isolate +LLMs' inherent medical knowledge from their reasoning capabilities. Given the +high-stakes nature of medical applications, where incorrect information can +have critical consequences, it is essential to evaluate how well LLMs encode, +retain, and recall fundamental medical facts. + To bridge this gap, we introduce the Medical Knowledge Judgment, a dataset +specifically designed to measure LLMs' one-hop factual medical knowledge. MKJ +is constructed from the Unified Medical Language System (UMLS), a large-scale +repository of standardized biomedical vocabularies and knowledge graphs. We +frame knowledge assessment as a binary judgment task, requiring LLMs to verify +the correctness of medical statements extracted from reliable and structured +knowledge sources. + Our experiments reveal that LLMs struggle with factual medical knowledge +retention, exhibiting significant performance variance across different +semantic categories, particularly for rare medical conditions. Furthermore, +LLMs show poor calibration, often being overconfident in incorrect answers. To +mitigate these issues, we explore retrieval-augmented generation, demonstrating +its effectiveness in improving factual accuracy and reducing uncertainty in +medical decision-making. 
-摘要:臨床試驗的數位個人參與者資料 (IPD) 愈來愈廣泛地用於潛在的科學再利用。然而,要找出可用的 IPD,需要對大型資料庫中的文字資料共享聲明 (DSS) 進行詮釋。計算語言學最近的進展包括預先訓練的語言模型,有望簡化根據文字輸入實作有效分類器的過程。在 ClinicalTrials.gov 中的 5,000 個文字 DSS 子集中,我們評估了基於特定領域預先訓練語言模型的分類器,在重現原始可用性類別以及手動註解標籤方面的表現。典型的指標顯示,預測手動註解的分類器優於學會輸出原始可用性類別的分類器。這表示文字 DSS 說明包含可用性類別所沒有的適用資訊,而且此類分類器因此有助於在大型試驗資料庫中自動找出可用的 IPD。 +摘要:大型語言模型 (LLM) 已廣泛應用於各種下游 +任務領域。然而,它們直接回憶和應用事實 +醫學知識的能力仍未得到充分探索。大多數現有的醫療問答基準 +評估複雜推理或多跳躍推論,這使得難以將 +LLM 內在的醫學知識從其推理能力中分離出來。鑑於 +醫療應用具有高風險,其中不正確的資訊可能會 +造成嚴重後果,因此評估 LLM 編碼、 +保留和回憶基本醫學事實的能力至關重要。 +為了彌合這一差距,我們引入了醫學知識判斷,這是一個專門設計用於測量 LLM 的一跳事實醫學知識的數據集。MKJ +是由統一醫學語言系統 (UMLS) 構建的,UMLS 是標準化生物醫學詞彙和知識圖譜的大型庫。我們 +將知識評估構建為二元判斷任務,要求 LLM 驗證從可靠且結構化的 +知識來源中提取的醫學陳述的正確性。 +我們的實驗表明,LLM 難以保留事實醫學知識,在不同的 +語義類別中表現出顯著的性能差異,特別是對於罕見的醫療狀況。此外, +LLM 表現出校準不佳,通常對不正確的答案過於自信。為了 +減輕這些問題,我們探索了檢索增強生成,證明了其在提高事實準確性和降低不確定性方面的有效性 +在醫療決策制定中。 -##### **Relational Norms for Human-AI Cooperation** -2502.12102v1 by Brian D. Earp, Sebastian Porsdam Mann, Mateo Aboy, Edmond Awad, Monika Betzler, Marietjie Botes, Rachel Calcott, Mina Caraccio, Nick Chater, Mark Coeckelbergh, Mihaela Constantinescu, Hossein Dabbagh, Kate Devlin, Xiaojun Ding, Vilius Dranseika, Jim A. C. Everett, Ruiping Fan, Faisal Feroz, Kathryn B. Francis, Cindy Friedman, Orsolya Friedrich, Iason Gabriel, Ivar Hannikainen, Julie Hellmann, Arasj Khodadade Jahrome, Niranjan S. Janardhanan, Paul Jurcys, Andreas Kappes, Maryam Ali Khan, Gordon Kraft-Todd, Maximilian Kroner Dale, Simon M. Laham, Benjamin Lange, Muriel Leuenberger, Jonathan Lewis, Peng Liu, David M. Lyreskog, Matthijs Maas, John McMillan, Emilian Mihailov, Timo Minssen, Joshua Teperowski Monrad, Kathryn Muyskens, Simon Myers, Sven Nyholm, Alexa M. Owen, Anna Puzio, Christopher Register, Madeline G. Reinecke, Adam Safron, Henry Shevlin, Hayate Shimizu, Peter V. Treit, Cristina Voinea, Karen Yan, Anda Zahiu, Renwen Zhang, Hazem Zohny, Walter Sinnott-Armstrong, Ilina Singh, Julian Savulescu, Margaret S. 
Clark +##### **Mitigating Lost-in-Retrieval Problems in Retrieval Augmented Multi-Hop Question Answering** +2502.14245v1 by Rongzhi Zhu, Xiangyu Liu, Zequn Sun, Yiwei Wang, Wei Hu -How we should design and interact with social artificial intelligence depends -on the socio-relational role the AI is meant to emulate or occupy. In human -society, relationships such as teacher-student, parent-child, neighbors, -siblings, or employer-employee are governed by specific norms that prescribe or -proscribe cooperative functions including hierarchy, care, transaction, and -mating. These norms shape our judgments of what is appropriate for each -partner. For example, workplace norms may allow a boss to give orders to an -employee, but not vice versa, reflecting hierarchical and transactional -expectations. As AI agents and chatbots powered by large language models are -increasingly designed to serve roles analogous to human positions - such as -assistant, mental health provider, tutor, or romantic partner - it is -imperative to examine whether and how human relational norms should extend to -human-AI interactions. Our analysis explores how differences between AI systems -and humans, such as the absence of conscious experience and immunity to -fatigue, may affect an AI's capacity to fulfill relationship-specific functions -and adhere to corresponding norms. This analysis, which is a collaborative -effort by philosophers, psychologists, relationship scientists, ethicists, -legal experts, and AI researchers, carries important implications for AI -systems design, user behavior, and regulation. While we accept that AI systems -can offer significant benefits such as increased availability and consistency -in certain socio-relational roles, they also risk fostering unhealthy -dependencies or unrealistic expectations that could spill over into human-human -relationships. 
We propose that understanding and thoughtfully shaping (or -implementing) suitable human-AI relational norms will be crucial for ensuring -that human-AI interactions are ethical, trustworthy, and favorable to human -well-being. +In this paper, we identify a critical problem, "lost-in-retrieval", in +retrieval-augmented multi-hop question answering (QA): the key entities are +missed in LLMs' sub-question decomposition. "Lost-in-retrieval" significantly +degrades the retrieval performance, which disrupts the reasoning chain and +leads to the incorrect answers. To resolve this problem, we propose a +progressive retrieval and rewriting method, namely ChainRAG, which sequentially +handles each sub-question by completing missing key entities and retrieving +relevant sentences from a sentence graph for answer generation. Each step in +our retrieval and rewriting process builds upon the previous one, creating a +seamless chain that leads to accurate retrieval and answers. Finally, all +retrieved sentences and sub-question answers are integrated to generate a +comprehensive answer to the original question. We evaluate ChainRAG on three +multi-hop QA datasets$\unicode{x2013}$MuSiQue, 2Wiki, and +HotpotQA$\unicode{x2013}$using three large language models: GPT4o-mini, +Qwen2.5-72B, and GLM-4-Plus. Empirical results demonstrate that ChainRAG +consistently outperforms baselines in both effectiveness and efficiency. 
+ +摘要:在本文中,我們在檢索增強的多跳問答 (QA) 中發現了一個關鍵問題「檢索中遺失」,關鍵實體遺失在 LLM 的子問題分解中。「檢索中遺失」顯著降低檢索效能,這會中斷推理鏈並導致錯誤的答案。為了解決此問題,我們提出了一種漸進式檢索和重寫方法,即 ChainRAG,它通過完成遺失的關鍵實體並從句子圖中檢索相關句子來順序處理每個子問題以產生答案。我們檢索和重寫過程中每一步都建立在前一步之上,創造了一個無縫的鏈,導致準確的檢索和答案。最後,所有檢索到的句子和子問題答案都整合起來,以產生對原始問題的全面答案。我們在三個多跳問答資料集$\unicode{x2013}$MuSiQue、2Wiki 和 HotpotQA$\unicode{x2013}$上評估 ChainRAG,使用三個大型語言模型:GPT4o-mini、Qwen2.5-72B 和 GLM-4-Plus。實證結果表明,ChainRAG 在有效性和效率方面都持續優於基準。 + +##### **NLP-AKG: Few-Shot Construction of NLP Academic Knowledge Graph Based on LLM** +2502.14192v1 by Jiayin Lan, Jiaqi Li, Baoxin Wang, Ming Liu, Dayong Wu, Shijin Wang, Bing Qin + +Large language models (LLMs) have been widely applied in question answering +over scientific research papers. To enhance the professionalism and accuracy of +responses, many studies employ external knowledge augmentation. However, +existing structures of external knowledge in scientific literature often focus +solely on either paper entities or domain concepts, neglecting the intrinsic +connections between papers through shared domain concepts. This results in less +comprehensive and specific answers when addressing questions that combine +papers and concepts. To address this, we propose a novel knowledge graph +framework that captures deep conceptual relations between academic papers, +constructing a relational network via intra-paper semantic elements and +inter-paper citation relations. Using a few-shot knowledge graph construction +method based on LLM, we develop NLP-AKG, an academic knowledge graph for the +NLP domain, by extracting 620,353 entities and 2,271,584 relations from 60,826 +papers in ACL Anthology. Based on this, we propose a 'sub-graph community +summary' method and validate its effectiveness on three NLP scientific +literature question answering datasets. 
-摘要:我們應如何設計和與社交人工智慧互動,取決於人工智慧預期要模仿或扮演的社會關係角色。在人類社會中,師生、父母子女、鄰居、兄弟姐妹或雇主員工等關係受特定規範所支配,這些規範規定或禁止包括等級、照顧、交易和交配在內的合作功能。這些規範形塑我們對每個夥伴適當行為的判斷。例如,職場規範可能允許老闆對員工發號施令,但反之則不行,這反映了等級和交易的期望。隨著由大型語言模型驅動的人工智慧代理程式和聊天機器人日益被設計為服務類似於人類職位的角色,例如助理、心理健康提供者、導師或浪漫伴侶,審查人類關係規範是否以及如何延伸至人類與人工智慧的互動至關重要。我們的分析探討了人工智慧系統和人類之間的差異,例如缺乏意識體驗和對疲勞的免疫力,如何影響人工智慧履行特定關係功能和遵守相應規範的能力。這項分析是由哲學家、心理學家、關係科學家、倫理學家、法律專家和人工智慧研究人員共同合作的成果,對人工智慧系統設計、使用者行為和法規具有重要的意義。雖然我們接受人工智慧系統可以在某些社會關係角色中提供顯著的好處,例如增加可用性和一致性,但它們也可能助長不健康的依賴關係或不切實際的期望,這些期望可能會蔓延到人際關係中。我們提出,理解和深思熟慮地塑造(或實施)適當的人類與人工智慧關係規範,對於確保人類與人工智慧的互動具有倫理性、可信賴性和有利於人類福祉至關重要。 +摘要:大型语言模型 (LLM) 已广泛应用于科学研究论文的问答中。为了提高响应的专业性和准确性,许多研究采用外部知识增强。然而,科学文献中现有外部知识的结构通常仅关注论文实体或领域概念,而忽略了论文之间通过共享领域概念而形成的内在联系。这导致在解决结合论文和概念的问题时,答案不够全面和具体。为了解决这个问题,我们提出了一种新颖的知识图谱框架,该框架捕获了学术论文之间的深层概念关系,通过论文内部语义元素和论文之间的引用关系构建关系网络。我们使用基于 LLM 的少量知识图谱构建方法,从 ACL Anthology 中的 60,826 篇论文中提取了 620,353 个实体和 2,271,584 个关系,开发了 NLP 领域的学术知识图谱 NLP-AKG。在此基础上,我们提出了一种“子图社区摘要”方法,并在三个 NLP 科学文献问答数据集上验证了其有效性。 -##### **Deep Spatio-Temporal Neural Network for Air Quality Reanalysis** -2502.11941v1 by Ammar Kheder, Benjamin Foreback, Lili Wang, Zhi-Song Liu, Michael Boy +##### **Object-centric Binding in Contrastive Language-Image Pretraining** +2502.14113v1 by Rim Assouel, Pietro Astolfi, Florian Bordes, Michal Drozdzal, Adriana Romero-Soriano -Air quality prediction is key to mitigating health impacts and guiding -decisions, yet existing models tend to focus on temporal trends while -overlooking spatial generalization. We propose AQ-Net, a spatiotemporal -reanalysis model for both observed and unobserved stations in the near future. -AQ-Net utilizes the LSTM and multi-head attention for the temporal regression. -We also propose a cyclic encoding technique to ensure continuous time -representation. To learn fine-grained spatial air quality estimation, we -incorporate AQ-Net with the neural kNN to explore feature-based interpolation, -such that we can fill the spatial gaps given coarse observation stations. 
To -demonstrate the efficiency of our model for spatiotemporal reanalysis, we use -data from 2013-2017 collected in northern China for PM2.5 analysis. Extensive -experiments show that AQ-Net excels in air quality reanalysis, highlighting the -potential of hybrid spatio-temporal models to better capture environmental -dynamics, especially in urban areas where both spatial and temporal variability -are critical. +Recent advances in vision language models (VLM) have been driven by +contrastive models such as CLIP, which learn to associate visual information +with their corresponding text descriptions. However, these models have +limitations in understanding complex compositional scenes involving multiple +objects and their spatial relationships. To address these challenges, we +propose a novel approach that diverges from commonly used strategies, which +rely on the design of hard-negative augmentations. Instead, our work focuses on +integrating inductive biases into pre-trained CLIP-like models to improve their +compositional understanding without using any additional hard-negatives. To +that end, we introduce a binding module that connects a scene graph, derived +from a text description, with a slot-structured image representation, +facilitating a structured similarity assessment between the two modalities. We +also leverage relationships as text-conditioned visual constraints, thereby +capturing the intricate interactions between objects and their contextual +relationships more effectively. Our resulting model not only enhances the +performance of CLIP-based models in multi-object compositional understanding +but also paves the way towards more accurate and sample-efficient image-text +matching of complex scenes. 
-摘要:空气品质预测是减轻健康影响和指导决策的关键,但现有的模型倾向于关注时间趋势,而忽略空间概化。我们提出了 AQ-Net,这是一种时空再分析模型,适用于近期内已观测和未观测到的站点。AQ-Net 利用 LSTM 和多头注意力进行时间回归。我们还提出了一种循环编码技术来确保时间表示的连续性。为了学习细粒度的空间空气质量估计,我们将 AQ-Net 与神经 kNN 结合起来,以探索基于特征的插值,以便我们能够填充给定粗略观测站的空间空白。为了展示我们的模型在时空再分析中的效率,我们使用了 2013-2017 年在中国北部收集的 PM2.5 分析数据。大量的实验表明,AQ-Net 在空气质量再分析中表现出色,突出了混合时空模型在更好地捕捉环境动态方面的潜力,尤其是在空间和时间变异性都很关键的城市地区。 +摘要:最近视觉语言模型 (VLM) 的进步是由对比模型(例如 CLIP)推动的,该模型学习将视觉信息与其对应的文本描述联系起来。然而,这些模型在理解涉及多个对象及其空间关系的复杂组合场景方面存在局限性。为了应对这些挑战,我们提出了一种新颖的方法,它偏离了常用的策略,即依赖于硬负增强设计。相反,我们的工作重点是将归纳偏差集成到预训练的类似 CLIP 的模型中,以提高其组合理解能力,而无需使用任何其他硬否定。为此,我们引入了一个绑定模块,它将从文本描述中派生的场景图与槽结构图像表示连接起来,从而促进了两种模式之间的结构化相似性评估。我们还利用关系作为文本条件的视觉约束,从而更有效地捕捉对象及其上下文关系之间的复杂交互。我们由此产生的模型不仅增强了基于 CLIP 的模型在多对象组合理解中的性能,而且还为复杂场景的更准确和样本高效的图像文本匹配铺平了道路。 -##### **Proactive Depot Discovery: A Generative Framework for Flexible Location-Routing** -2502.11715v1 by Site Qu, Guoqiang Hu +##### **Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning** +2502.14086v1 by Cole Gawin, Yidan Sun, Mayank Kejriwal -The Location-Routing Problem (LRP), which combines the challenges of facility -(depot) locating and vehicle route planning, is critically constrained by the -reliance on predefined depot candidates, limiting the solution space and -potentially leading to suboptimal outcomes. Previous research on LRP without -predefined depots is scant and predominantly relies on heuristic algorithms -that iteratively attempt depot placements across a planar area. Such approaches -lack the ability to proactively generate depot locations that meet specific -geographic requirements, revealing a notable gap in current research landscape. -To bridge this gap, we propose a data-driven generative DRL framework, designed -to proactively generate depots for LRP without predefined depot candidates, -solely based on customer requests data which include geographic and demand -information. 
It can operate in two distinct modes: direct generation of exact -depot locations, and the creation of a multivariate Gaussian distribution for -flexible depots sampling. By extracting depots' geographic pattern from -customer requests data, our approach can dynamically respond to logistical -needs, identifying high-quality depot locations that further reduce total -routing costs compared to traditional methods. Extensive experiments -demonstrate that, for a same group of customer requests, compared with those -depots identified through random attempts, our framework can proactively -generate depots that lead to superior solution routes with lower routing cost. -The implications of our framework potentially extend into real-world -applications, particularly in emergency medical rescue and disaster relief -logistics, where rapid establishment and adjustment of depot locations are -paramount, showcasing its potential in addressing LRP for dynamic and -unpredictable environments. +Large language models (LLMs) have achieved remarkable performance in +generating human-like text and solving reasoning tasks of moderate complexity, +such as question-answering and mathematical problem-solving. However, their +capabilities in tasks requiring deeper cognitive skills, such as common-sense +understanding and abstract reasoning, remain under-explored. In this paper, we +systematically evaluate abstract common-sense reasoning in LLMs using the +ConceptNet knowledge graph. We propose two prompting approaches: instruct +prompting, where models predict plausible semantic relationships based on +provided definitions, and few-shot prompting, where models identify relations +using examples as guidance. Our experiments with the gpt-4o-mini model show +that in instruct prompting, consistent performance is obtained when ranking +multiple relations but with substantial decline when the model is restricted to +predicting only one relation. 
In few-shot prompting, the model's accuracy +improves significantly when selecting from five relations rather than the full +set, although with notable bias toward certain relations. These results suggest +significant gaps still, even in commercially used LLMs' abstract common-sense +reasoning abilities, compared to human-level understanding. However, the +findings also highlight the promise of careful prompt engineering, based on +selective retrieval, for obtaining better performance. -摘要:地點路線問題(LRP)結合了設施(倉庫)定位和車輛路線規劃的挑戰,嚴重受到預先定義的倉庫候選限制,限制了解決方案空間,並可能導致次優結果。先前關於沒有預先定義倉庫的 LRP 研究很少,而且主要依賴於啟發式演算法,在平面區域中反覆嘗試倉庫配置。這種方法無法主動產生符合特定地理需求的倉庫位置,顯示了當前研究領域的顯著差距。為了彌補這個差距,我們提出一個資料驅動的生成式 DRL 架構,旨在主動為 LRP 產生倉庫,而無需預先定義的倉庫候選,僅根據包含地理和需求資訊的客戶要求資料。它可以在兩種不同的模式下運作:直接產生確切的倉庫位置,以及建立多元高斯分布以進行彈性倉庫抽樣。透過從客戶要求資料中提取倉庫的地理模式,我們的方法可以動態回應後勤需求,找出高品質的倉庫位置,進一步降低與傳統方法相比的總路線成本。廣泛的實驗證明,對於同一組客戶要求,與透過隨機嘗試識別的那些倉庫相比,我們的架構可以主動產生倉庫,並產生路線成本較低的優質解決方案路線。我們的架構的影響潛在地擴展到實際應用,特別是在緊急醫療救援和災害救災後勤方面,其中倉庫位置的快速建立和調整至關重要,展示了其在解決動態和不可預測環境的 LRP 中的潛力。 +摘要:大型語言模型 (LLM) 在生成類人文本和解決中等複雜度推理任務方面取得了顯著的成果,例如問答和數學問題解決。然而,它們在需要更深層認知技能的任務中的能力,例如常識理解和抽象推理,仍然處於探索不足的階段。在本文中,我們使用 ConceptNet 知識圖系統地評估了 LLM 中的抽象常識推理。我們提出了兩種提示方法:指導提示,其中模型根據提供的定義預測合理的語義關係,以及少次提示,其中模型使用示例作為指導來識別關係。我們使用 gpt-4o-mini 模型進行的實驗表明,在指導提示中,在對多個關係進行排名時獲得了一致的性能,但在模型僅限於預測一個關係時大幅下降。在少次提示中,模型在從五個關係中選擇而不是從完整集合中選擇時,其準確性顯著提高,儘管對某些關係存在顯著偏差。這些結果表明,與人類層面的理解相比,即使在商業使用的 LLM 中,抽象常識推理能力仍然存在顯著差距。然而,這些發現也強調了基於選擇性檢索的仔細提示工程的希望,以獲得更好的性能。 -##### **LLM Agents Making Agent Tools** -2502.11705v1 by Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelović, Jakob Nikolas Kather +##### **Neurosymbolic artificial intelligence via large language models and coherence-driven inference** +2502.13953v1 by Steve Huntsman, Jewell Thomas -Tool use has turned large language models (LLMs) into powerful agents that -can perform complex multi-step tasks by dynamically utilising external software -components. 
However, these tools must be implemented in advance by human -developers, hindering the applicability of LLM agents in domains which demand -large numbers of highly specialised tools, like in life sciences and medicine. -Motivated by the growing trend of scientific studies accompanied by public code -repositories, we propose ToolMaker, a novel agentic framework that autonomously -transforms papers with code into LLM-compatible tools. Given a short task -description and a repository URL, ToolMaker autonomously installs required -dependencies and generates code to perform the task, using a closed-loop -self-correction mechanism to iteratively diagnose and rectify errors. To -evaluate our approach, we introduce a benchmark comprising 15 diverse and -complex computational tasks spanning both medical and non-medical domains with -over 100 unit tests to objectively assess tool correctness and robustness. -ToolMaker correctly implements 80% of the tasks, substantially outperforming -current state-of-the-art software engineering agents. ToolMaker therefore is a -step towards fully autonomous agent-based scientific workflows. +We devise an algorithm to generate sets of propositions that objectively +instantiate graphs that support coherence-driven inference. We then benchmark +the ability of large language models (LLMs) to reconstruct coherence graphs +from (a straightforward transformation of) propositions expressed in natural +language, with promising results from a single prompt to models optimized for +reasoning. Combining coherence-driven inference with consistency evaluations by +neural models may advance the state of the art in machine cognition. 
-摘要:工具使用已將大型語言模型 (LLM) 轉變為強大的代理,可透過動態使用外部軟體元件來執行複雜的多步驟任務。然而,這些工具必須事先由人類開發人員實作,這會阻礙 LLM 代理在需要大量高度專業化工具的領域(例如生命科學和醫學)中的應用性。受到伴隨公開程式碼儲存庫的科學研究趨勢所啟發,我們提出 ToolMaker,一個創新的代理架構,可自主地將帶有程式碼的論文轉換為相容於 LLM 的工具。給定簡短的任務描述和儲存庫網址,ToolMaker 會自主安裝所需的依賴項,並產生程式碼來執行任務,使用閉環自我修正機制來反覆診斷和糾正錯誤。為了評估我們的做法,我們引進一個包含 15 個不同且複雜的運算任務的基準,涵蓋醫療和非醫療領域,並包含超過 100 個單元測試,以客觀評估工具的正確性和穩健性。ToolMaker 正確實作了 80% 的任務,大幅優於目前的最新軟體工程代理。因此,ToolMaker 是邁向完全自主的基於代理的科學工作流程的一步。 +摘要:我們設計一種演算法,用來產生命題集合,以客觀地實例化支援連貫性驅動推論的圖形。接著,我們基準化大型語言模型 (LLM) 從以自然語言表達的命題(經過直接轉換)重建連貫性圖形的能力,結果顯示,單一提示就能從最佳化用於推理的模型中獲得有希望的結果。將連貫性驅動推論與神經模型的一致性評估結合起來,可能會提升機器認知的現有技術。 -##### **MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression** -2502.11651v1 by Linjie Mu, Zhongzhen Huang, Shengqian Qin, Yakun Zhu, Shaoting Zhang, Xiaofan Zhang +##### **Complex Ontology Matching with Large Language Model Embeddings** +2502.13619v1 by Guilherme Sousa, Rinaldo Lima, Cassia Trojahn -Large vision-language models (LVLMs) have shown great promise in medical -applications, particularly in visual question answering (MedVQA) and diagnosis -from medical images. However, existing datasets and models often fail to -consider critical aspects of medical diagnostics, such as the integration of -historical records and the analysis of disease progression over time. In this -paper, we introduce MMXU (Multimodal and MultiX-ray Understanding), a novel -dataset for MedVQA that focuses on identifying changes in specific regions -between two patient visits. Unlike previous datasets that primarily address -single-image questions, MMXU enables multi-image questions, incorporating both -current and historical patient data. We demonstrate the limitations of current -LVLMs in identifying disease progression on MMXU-\textit{test}, even those that -perform well on traditional benchmarks. To address this, we propose a -MedRecord-Augmented Generation (MAG) approach, incorporating both global and -regional historical records. 
Our experiments show that integrating historical -records significantly enhances diagnostic accuracy by at least 20\%, bridging -the gap between current LVLMs and human expert performance. Additionally, we -fine-tune models with MAG on MMXU-\textit{dev}, which demonstrates notable -improvements. We hope this work could illuminate the avenue of advancing the -use of LVLMs in medical diagnostics by emphasizing the importance of historical -context in interpreting medical images. Our dataset is released at -\href{https://github.com/linjiemu/MMXU}{https://github.com/linjiemu/MMXU}. +Ontology, and more broadly, Knowledge Graph Matching is a challenging task in +which expressiveness has not been fully addressed. Despite the increasing use +of embeddings and language models for this task, approaches for generating +expressive correspondences still do not take full advantage of these models, in +particular, large language models (LLMs). This paper proposes to integrate LLMs +into an approach for generating expressive correspondences based on alignment +need and ABox-based relation discovery. The generation of correspondences is +performed by matching similar surroundings of instance sub-graphs. The +integration of LLMs results in different architectural modifications, including +label similarity, sub-graph matching, and entity matching. The performance of word +embeddings, sentence embeddings, and LLM-based embeddings was compared. The +results demonstrate that integrating LLMs surpasses all other models, enhancing +the baseline version of the approach with a 45\% increase in F-measure. 
-摘要:大型視覺語言模型 (LVLMs) 已在醫療應用中展現出極大的潛力,特別是在視覺問答 (MedVQA) 和醫學影像診斷方面。然而,現有的資料集和模型常常無法考量醫療診斷的關鍵層面,例如病歷整合以及隨著時間推移對疾病進程的分析。在本文中,我們介紹 MMXU(多模態多 X 光理解),一個專注於識別兩次患者就診之間特定區域變化的 MedVQA 新資料集。與主要處理單一影像問題的先前資料集不同,MMXU 支援多影像問題,同時納入當前和病史患者資料。我們展示了現有 LVLMs 在 MMXU-\textit{test} 中識別疾病進程的限制,即使是在傳統基準測試中表現良好的 LVLMs 也是如此。為了解決這個問題,我們提出了一個病歷增強生成 (MAG) 方法,結合了全域和區域病史。我們的實驗顯示,整合病歷可顯著提升至少 20% 的診斷準確度,縮小了現有 LVLMs 和人類專家表現之間的差距。此外,我們在 MMXU-\textit{dev} 上微調帶有 MAG 的模型,這展示了顯著的進步。我們希望這項工作能透過強調病史脈絡在解讀醫學影像中的重要性,為推進 LVLMs 在醫療診斷中的應用開闢道路。我們的資料集已於\href{https://github.com/linjiemu/MMXU}{https://github.com/linjiemu/MMXU} 發布。 +摘要:本体论,更广泛地说,知识图谱匹配是一项具有挑战性的任务,其中表达力尚未得到充分解决。尽管越来越多地使用嵌入和语言模型来完成此任务,但生成表达性对应关系的方法仍然没有充分利用这些模型,特别是大型语言模型 (LLM)。本文提出将 LLM 集成到一种基于对齐需求和基于 ABox 的关系发现来生成表达性对应关系的方法中。对应关系的生成是通过匹配实例子图的相似周围环境来执行的。LLM 的集成导致了不同的架构修改,包括标签相似性、子图匹配和实体匹配。比较了单词嵌入、句子嵌入和基于 LLM 的嵌入的性能。结果表明,集成 LLM 超越了所有其他模型,通过 F-measure 提高了 45% 的基准版本的方法。 -##### **A Survey of Personalized Large Language Models: Progress and Future Directions** -2502.11528v1 by Jiahong Liu, Zexuan Qiu, Zhongyang Li, Quanyu Dai, Jieming Zhu, Minda Hu, Menglin Yang, Irwin King +##### **Are Large Language Models In-Context Graph Learners?** +2502.13562v1 by Jintang Li, Ruofan Wu, Yuchang Zhu, Huizhe Zhang, Liang Chen, Zibin Zheng -Large Language Models (LLMs) excel in handling general knowledge tasks, yet -they struggle with user-specific personalization, such as understanding -individual emotions, writing styles, and preferences. Personalized Large -Language Models (PLLMs) tackle these challenges by leveraging individual user -data, such as user profiles, historical dialogues, content, and interactions, -to deliver responses that are contextually relevant and tailored to each user's -specific needs. This is a highly valuable research topic, as PLLMs can -significantly enhance user satisfaction and have broad applications in -conversational agents, recommendation systems, emotion recognition, medical -assistants, and more. 
This survey reviews recent advancements in PLLMs from -three technical perspectives: prompting for personalized context (input level), -finetuning for personalized adapters (model level), and alignment for -personalized preferences (objective level). To provide deeper insights, we also -discuss current limitations and outline several promising directions for future -research. Updated information about this survey can be found at the -https://github.com/JiahongLiu21/Awesome-Personalized-Large-Language-Models. +Large language models (LLMs) have demonstrated remarkable in-context +reasoning capabilities across a wide range of tasks, particularly with +unstructured inputs such as language or images. However, LLMs struggle to +handle structured data, such as graphs, due to their lack of understanding of +non-Euclidean structures. As a result, without additional fine-tuning, their +performance significantly lags behind that of graph neural networks (GNNs) in +graph learning tasks. In this paper, we show that learning on graph data can be +conceptualized as a retrieval-augmented generation (RAG) process, where +specific instances (e.g., nodes or edges) act as queries, and the graph itself +serves as the retrieved context. Building on this insight, we propose a series +of RAG frameworks to enhance the in-context learning capabilities of LLMs for +graph learning tasks. Comprehensive evaluations demonstrate that our proposed +RAG frameworks significantly improve LLM performance on graph-based tasks, +particularly in scenarios where a pretrained LLM must be used without +modification or accessed via an API. 
-摘要:大型語言模型 (LLM) 在處理一般知識任務方面表現出色,但 -它們在使用者特定的個人化方面有困難,例如理解 -個別的情緒、寫作風格和偏好。個人化大型 -語言模型 (PLLM) 透過利用個別使用者的 -資料來解決這些挑戰,例如使用者個人資料、歷史對話、內容和互動, -提供在脈絡上相關且針對每個使用者的特定需求量身打造的回應。這是一個非常有價值的研究主題,因為 PLLM 可以 -顯著提升使用者滿意度,並在對話代理、推薦系統、情緒辨識、醫療 -助理等方面有廣泛的應用。這項調查從三個技術觀點回顧 PLLM 的最新進展:提示個人化脈絡(輸入層級)、微調個人化適配器(模型層級),以及對齊個人化偏好(目標層級)。為了提供更深入的見解,我們也 -討論目前的限制,並概述未來研究的幾個有希望的方向。這項調查的最新資訊可以在 -https://github.com/JiahongLiu21/Awesome-Personalized-Large-Language-Models 找到。 +摘要:大型語言模型 (LLM) 在廣泛的任務中展示了非凡的語境推理能力,特別是對於語言或影像等非結構化輸入。然而,LLM 難以處理結構化資料,例如圖形,因為它們無法理解非歐幾何結構。因此,在沒有額外微調的情況下,它們在圖形學習任務中的表現遠遠落後於圖形神經網路 (GNN)。在本文中,我們展示了在圖形資料上學習可以被概念化為檢索增強生成 (RAG) 過程,其中特定實例(例如,節點或邊)充當查詢,而圖形本身則作為檢索的語境。基於這個見解,我們提出了一系列 RAG 架構,以增強 LLM 在圖形學習任務中的語境學習能力。全面的評估表明,我們提出的 RAG 架構顯著提升了 LLM 在基於圖形的任務上的表現,特別是在預訓練的 LLM 必須在不修改或透過 API 存取的情況下使用的場景中。 -##### **Variable-frame CNNLSTM for Breast Nodule Classification using Ultrasound Videos** -2502.11481v1 by Xiangxiang Cui, Zhongyu Li, Xiayue Fan, Peng Huang, Ying Wang, Meng Yang, Shi Chang, Jihua Zhu +##### **Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs** +2502.13555v1 by Yushi Feng, Tsai Hor Chan, Guosheng Yin, Lequan Yu -The intersection of medical imaging and artificial intelligence has become an -important research direction in intelligent medical treatment, particularly in -the analysis of medical images using deep learning for clinical diagnosis. -Despite the advances, existing keyframe classification methods lack extraction -of time series features, while ultrasonic video classification based on -three-dimensional convolution requires uniform frame numbers across patients, -resulting in poor feature extraction efficiency and model classification -performance. This study proposes a novel video classification method based on -CNN and LSTM, introducing NLP's long and short sentence processing scheme into -video classification for the first time. 
The method reduces CNN-extracted image -features to 1x512 dimension, followed by sorting and compressing feature -vectors for LSTM training. Specifically, feature vectors are sorted by patient -video frame numbers and populated with padding value 0 to form variable -batches, with invalid padding values compressed before LSTM training to -conserve computing resources. Experimental results demonstrate that our -variable-frame CNNLSTM method outperforms other approaches across all metrics, -showing improvements of 3-6% in F1 score and 1.5% in specificity compared to -keyframe methods. The variable-frame CNNLSTM also achieves better accuracy and -precision than equal-frame CNNLSTM. These findings validate the effectiveness -of our approach in classifying variable-frame ultrasound videos and suggest -potential applications in other medical imaging modalities. +Data augmentation is necessary for graph representation learning due to the +scarcity and noise present in graph data. Most of the existing augmentation +methods overlook the context information inherited from the dataset as they +rely solely on the graph structure for augmentation. Despite the success of +some large language model-based (LLM) graph learning methods, they are mostly +white-box which require access to the weights or latent features from the +open-access LLMs, making them difficult to be democratized for everyone as +existing LLMs are mostly closed-source for commercial considerations. To +overcome these limitations, we propose a black-box context-driven graph data +augmentation approach, with the guidance of LLMs -- DemoGraph. Leveraging the +text prompt as context-related information, we task the LLM with generating +knowledge graphs (KGs), which allow us to capture the structural interactions +from the text outputs. We then design a dynamic merging schema to +stochastically integrate the LLM-generated KGs into the original graph during +training. 
To control the sparsity of the augmented graph, we further devise a +granularity-aware prompting strategy and an instruction fine-tuning module, +which seamlessly generates text prompts according to different granularity +levels of the dataset. Extensive experiments on various graph learning tasks +validate the effectiveness of our method over existing graph data augmentation +methods. Notably, our approach excels in scenarios involving electronic health +records (EHRs), which validates its maximal utilization of contextual +knowledge, leading to enhanced predictive performance and interpretability. -摘要:醫學影像與人工智慧的交叉領域已成為智慧醫療的重要研究方向,特別是在臨床診斷中使用深度學習分析醫學影像。儘管有進展,現有的關鍵影格分類方法缺乏時間序列特徵的提取,而基於三維卷積的超音波影片分類需要患者之間的均勻影格數,導致特徵提取效率差和模型分類效能不佳。本研究提出了一種基於 CNN 和 LSTM 的新影片分類方法,首次將 NLP 的長短句處理機制引入影片分類中。該方法將 CNN 提取的影像特徵縮減為 1x512 維度,然後對特徵向量進行排序和壓縮以進行 LSTM 訓練。具體來說,特徵向量按患者影片影格數排序,並填充 0 補齊值以形成可變批次,在 LSTM 訓練前壓縮無效的補齊值以節省運算資源。實驗結果表明,我們的可變影格 CNNLSTM 方法在所有指標上都優於其他方法,與關鍵影格方法相比,F1 分數提高了 3-6%,特異性提高了 1.5%。可變影格 CNNLSTM 也比等影格 CNNLSTM 達到了更好的準確度和精確度。這些發現驗證了我們的方法在分類可變影格超音波影片中的有效性,並表明在其他醫學影像模式中具有潛在的應用。 +摘要:由於圖表資料的稀少性和雜訊,資料擴充對於圖表表示學習來說是必要的。現有的擴充方法大多忽略了從資料集中繼承的背景資訊,因為它們僅依賴於圖表的結構進行擴充。儘管一些大型語言模型 (LLM) 基於圖表學習方法獲得成功,但它們大多是白盒,需要存取開放式 LLM 的權重或潛在特徵,由於現有的 LLM 主要基於商業考量而封閉原始碼,因此難以讓所有人都能使用。為了克服這些限制,我們提出了一個黑盒背景驅動圖表資料擴充方法,在 LLM 的指導下——DemoGraph。利用文字提示作為與背景相關的資訊,我們讓 LLM 產生知識圖譜 (KG),這讓我們能夠從文字輸出中擷取結構化互動。然後,我們設計了一個動態合併模式,在訓練期間將 LLM 產生的 KG 隨機整合到原始圖表中。為了控制擴充圖表的稀疏性,我們進一步設計了一個粒度感知提示策略和一個指令微調模組,它可以根據資料集的不同粒度層級無縫產生文字提示。在各種圖表學習任務上的大量實驗驗證了我們的方法比現有的圖表資料擴充方法更有效。值得注意的是,我們的做法在涉及電子健康記錄 (EHR) 的場景中表現出色,這驗證了它對上下文知識的最大利用,從而提高了預測效能和可解釋性。 -##### **Leveraging Labelled Data Knowledge: A Cooperative Rectification Learning Network for Semi-supervised 3D Medical Image Segmentation** -2502.11456v1 by Yanyan Wang, Kechen Song, Yuyuan Liu, Shuai Ma, Yunhui Yan, Gustavo Carneiro +##### **PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference** +2502.13502v1 by Burc Gokden -Semi-supervised 3D medical image 
segmentation aims to achieve accurate -segmentation using few labelled data and numerous unlabelled data. The main -challenge in the design of semi-supervised learning methods consists in the -effective use of the unlabelled data for training. A promising solution -consists of ensuring consistent predictions across different views of the data, -where the efficacy of this strategy depends on the accuracy of the -pseudo-labels generated by the model for this consistency learning strategy. In -this paper, we introduce a new methodology to produce high-quality -pseudo-labels for a consistency learning strategy to address semi-supervised 3D -medical image segmentation. The methodology has three important contributions. -The first contribution is the Cooperative Rectification Learning Network (CRLN) -that learns multiple prototypes per class to be used as external knowledge -priors to adaptively rectify pseudo-labels at the voxel level. The second -contribution consists of the Dynamic Interaction Module (DIM) to facilitate -pairwise and cross-class interactions between prototypes and multi-resolution -image features, enabling the production of accurate voxel-level clues for -pseudo-label rectification. The third contribution is the Cooperative Positive -Supervision (CPS), which optimises uncertain representations to align with -unassertive representations of their class distributions, improving the model's -accuracy in classifying uncertain regions. Extensive experiments on three -public 3D medical segmentation datasets demonstrate the effectiveness and -superiority of our semi-supervised learning method. +We show that Large Language Model from Power Law Decoder Representations +(PLDR-LLM) is a foundational model whose deductive outputs are invariant +tensors up to a small perturbation. 
PLDR-LLM learns a singularity condition for +the deductive outputs that enable the once-inferred energy-curvature tensor +$\mathbf{G}_{LM}$ to replace the deep neural network of power law graph +attention (PLGA) generating the deductive outputs at inference. We demonstrate +that a cache for $\mathbf{G}_{LM}$ (G-cache) and KV-cache can be implemented in +a straightforward manner to improve the inference time. The invariance and +generalizable nature of deductive outputs is at a very high fidelity where +deductive outputs have same RMSE and determinant values up to 15 decimal places +after caching, and zero-shot benchmark scores remain unchanged. Ablation +studies show that learned deductive outputs have distinct loss and accuracy +characteristics from models pretrained with transferred, randomly initialized +or identity tensors as a constant tensor operator and an LLM with scaled-dot +product attention (SDPA) is a special case of PLDR-LLM where $\mathbf{G}_{LM}$ +is predefined as identity. The observed invariance characteristic introduces a +novel asymmetry between training and inference phases with caching. We outline +observed common characteristics of the deductive outputs for the learned +singularity condition. We provide an implementation of a training and inference +framework for PLDR-LLM with KV-cache and G-cache. 
-摘要:半监督 3D 医学影像分割旨在使用少量标记数据和大量未标记数据实现精确分割。半监督学习方法设计中的主要挑战在于有效使用未标记数据进行训练。一个有前景的解决方案是确保数据不同视图之间预测的一致性,其中此策略的有效性取决于模型为这种一致性学习策略生成的伪标签的准确性。在本文中,我们引入了一种新的方法来为一致性学习策略生成高质量的伪标签,以解决半监督 3D 医学图像分割问题。该方法有三个重要的贡献。第一个贡献是协作修正学习网络 (CRLN),它为每个类别学习多个原型,用作外部知识先验,以在体素级别自适应地修正伪标签。第二个贡献包括动态交互模块 (DIM),以促进原型和多分辨率图像特征之间的成对和跨类交互,从而能够生成用于伪标签修正的准确体素级线索。第三个贡献是协作正监督 (CPS),它优化不确定的表示以与其类分布的不确定表示保持一致,从而提高模型对不确定区域进行分类的准确性。在三个公共 3D 医学分割数据集上进行的大量实验表明了我们半监督学习方法的有效性和优越性。 +摘要:我們展示了來自冪律解碼器表示 (PLDR-LLM) 的大型語言模型是一個基礎模型,其演繹輸出是直到一個小擾動的不變張量。PLDR-LLM 學習演繹輸出的奇異條件,使曾經推斷出的能量曲率張量 $\mathbf{G}_{LM}$ 能夠取代產生演繹輸出的冪律圖注意力 (PLGA) 深度神經網路,進行推論。我們證明了 $\mathbf{G}_{LM}$ 快取 (G 快取) 和 KV 快取能夠以一種直接的方式實作,以改善推論時間。演繹輸出的不變性和可概化性質具有非常高的保真度,其中演繹輸出在快取後具有相同的 RMSE 和行列式值,直到小數點後 15 位,且零次學習基準分數保持不變。消融研究表明,學習的演繹輸出具有與使用轉移、隨機初始化或恆等張量作為常數張量算子和具有縮放點積注意力的 LLM 預先訓練的模型不同的損失和準確性特徵,並且 $\mathbf{G}_{LM}$ 被預先定義為恆等的 PLDR-LLM 的一個特例,其中 $\mathbf{G}_{LM}$ 被預先定義為恆等。觀察到的不變特徵引入了訓練和推論階段之間一個新的不對稱性,並帶有快取。我們概述了學習的奇異條件演繹輸出的觀察到的共同特徵。我們提供了一個具有 KV 快取和 G 快取的 PLDR-LLM 訓練和推論框架的實作。 -##### **A Survey of LLM-based Agents in Medicine: How far are we from Baymax?** -2502.11211v1 by Wenxuan Wang, Zizhan Ma, Zheng Wang, Chenghan Wu, Wenting Chen, Xiang Li, Yixuan Yuan +##### **Explore-Construct-Filter: An Automated Framework for Rich and Reliable API Knowledge Graph Construction** +2502.13412v1 by Yanbang Sun, Qing Huang, Xiaoxue Ren, Zhenchang Xing, Xiaohong Li, Junjie Wang -Large Language Models (LLMs) are transforming healthcare through the -development of LLM-based agents that can understand, reason about, and assist -with medical tasks. This survey provides a comprehensive review of LLM-based -agents in medicine, examining their architectures, applications, and -challenges. We analyze the key components of medical agent systems, including -system profiles, clinical planning mechanisms, medical reasoning frameworks, -and external capacity enhancement. 
The survey covers major application -scenarios such as clinical decision support, medical documentation, training -simulations, and healthcare service optimization. We discuss evaluation -frameworks and metrics used to assess these agents' performance in healthcare -settings. While LLM-based agents show promise in enhancing healthcare delivery, -several challenges remain, including hallucination management, multimodal -integration, implementation barriers, and ethical considerations. The survey -concludes by highlighting future research directions, including advances in -medical reasoning inspired by recent developments in LLM architectures, -integration with physical systems, and improvements in training simulations. -This work provides researchers and practitioners with a structured overview of -the current state and future prospects of LLM-based agents in medicine. +The API Knowledge Graph (API KG) is a structured network that models API +entities and their relations, providing essential semantic insights for tasks +such as API recommendation, code generation, and API misuse detection. However, +constructing a knowledge-rich and reliable API KG presents several challenges. +Existing schema-based methods rely heavily on manual annotations to design KG +schemas, leading to excessive manual overhead. On the other hand, schema-free +methods, due to the lack of schema guidance, are prone to introducing noise, +reducing the KG's reliability. To address these issues, we propose the +Explore-Construct-Filter framework, an automated approach for API KG +construction based on large language models (LLMs). 
This framework consists of +three key modules: 1) KG exploration: LLMs simulate the workflow of annotators +to automatically design a schema with comprehensive type triples, minimizing +human intervention; 2) KG construction: Guided by the schema, LLMs extract +instance triples to construct a rich yet unreliable API KG; 3) KG filtering: +Removing invalid type triples and suspicious instance triples to construct a +rich and reliable API KG. Experimental results demonstrate that our method +surpasses the state-of-the-art method, achieving a 25.2% improvement in F1 +score. Moreover, the Explore-Construct-Filter framework proves effective, with +the KG exploration module increasing KG richness by 133.6% and the KG filtering +module improving reliability by 26.6%. Finally, cross-model experiments confirm +the generalizability of our framework. -摘要:大型語言模型 (LLM) 透過開發可理解、推理並協助醫療任務的 LLM 基礎代理人,轉變了醫療保健。本調查提供了 LLM 基礎代理人在醫學中的全面回顧,探討其架構、應用和挑戰。我們分析了醫療代理系統的主要組成部分,包括系統概況、臨床規劃機制、醫療推理架構和外部能力提升。本調查涵蓋了主要的應用場景,例如臨床決策支援、醫療文件、訓練模擬和醫療保健服務最佳化。我們討論了用於評估這些代理人在醫療保健環境中表現的評估架構和指標。雖然 LLM 基礎代理人顯示出在增強醫療保健提供方面的潛力,但仍有許多挑戰,包括幻覺管理、多模態整合、實施障礙和倫理考量。本調查最後強調了未來的研究方向,包括受 LLM 架構近期發展啟發的醫療推理進展、與物理系統的整合和訓練模擬的改進。這項工作為研究人員和從業人員提供了 LLM 基礎代理人在醫學中當前狀態和未來前景的結構化概觀。 +摘要:API 知識圖譜 (API KG) 是一個結構化網路,用於建模 API 實體及其關係,提供基本語義見解,以執行 API 建議、程式碼產生和 API 誤用偵測等任務。然而,建構一個知識豐富且可靠的 API KG 會產生若干挑戰。現有的基於架構的方法嚴重依賴手動註解來設計 KG 架構,導致過度的手動開銷。另一方面,由於缺乏架構指導,無架構的方法容易引入雜訊,降低 KG 的可靠性。為了解決這些問題,我們提出了探索建構過濾架構,這是一種基於大型語言模型 (LLM) 的自動化 API KG 建構方法。此架構包含三個關鍵模組:1) KG 探索:LLM 模擬註解者的工作流程,自動設計具有完整類型三元組的架構,將人為干預降至最低;2) KG 建構:在架構的指導下,LLM 提取實例三元組來建構豐富但不可靠的 API KG;3) KG 過濾:移除無效的類型三元組和可疑的實例三元組,以建構豐富且可靠的 API KG。實驗結果表明,我們的方法優於最先進的方法,在 F1 分數上提高了 25.2%。此外,探索建構過濾架構被證明是有效的,其中 KG 探索模組將 KG 豐富度提高了 133.6%,而 KG 過濾模組將可靠性提高了 26.6%。最後,跨模型實驗證實了我們架構的泛化性。 -##### **RT-DEMT: A hybrid real-time acupoint detection model combining mamba and transformer** -2502.11179v1 by Shilong Yang, Qi Zang, Chulong Zhang, Lingfeng Huang, Yaoqin Xie +##### **Reducing Hallucinations in Language 
Model-based SPARQL Query Generation Using Post-Generation Memory Retrieval** +2502.13369v1 by Aditya Sharma, Luis Lara, Amal Zouaq, Christopher J. Pal -Traditional Chinese acupuncture methods often face controversy in clinical -practice due to their high subjectivity. Additionally, current -intelligent-assisted acupuncture systems have two major limitations: slow -acupoint localization speed and low accuracy. To address these limitations, a -new method leverages the excellent inference efficiency of the state-space -model Mamba, while retaining the advantages of the attention mechanism in the -traditional DETR architecture, to achieve efficient global information -integration and provide high-quality feature information for acupoint -localization tasks. Furthermore, by employing the concept of residual -likelihood estimation, it eliminates the need for complex upsampling processes, -thereby accelerating the acupoint localization task. Our method achieved -state-of-the-art (SOTA) accuracy on a private dataset of acupoints on the human -back, with an average Euclidean distance pixel error (EPE) of 7.792 and an -average time consumption of 10.05 milliseconds per localization task. Compared -to the second-best algorithm, our method improved both accuracy and speed by -approximately 14\%. This significant advancement not only enhances the efficacy -of acupuncture treatment but also demonstrates the commercial potential of -automated acupuncture robot systems. Access to our method is available at -https://github.com/Sohyu1/RT-DEMT +The ability to generate SPARQL queries from natural language questions is +crucial for ensuring efficient and accurate retrieval of structured data from +knowledge graphs (KG). 
While large language models (LLMs) have been widely +adopted for SPARQL query generation, they are often susceptible to +hallucinations and out-of-distribution errors when producing KG elements like +Uniform Resource Identifiers (URIs) based on internal parametric knowledge. +This often results in content that appears plausible but is factually +incorrect, posing significant challenges for their use in real-world +information retrieval (IR) applications. This has led to increased research +aimed at detecting and mitigating such errors. In this paper, we introduce PGMR +(Post-Generation Memory Retrieval), a modular framework that incorporates a +non-parametric memory module to retrieve KG elements and enhance LLM-based +SPARQL query generation. Our experimental results indicate that PGMR +consistently delivers strong performance across diverse datasets, data +distributions, and LLMs. Notably, PGMR significantly mitigates URI +hallucinations, nearly eliminating the problem in several scenarios. 
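The idea of resolving KG elements from a non-parametric memory after generation, instead of trusting the model to emit URIs directly, can be sketched as follows. This is only an illustration under stated assumptions: the `[[label]]` placeholder syntax, the `resolve_placeholders` helper, and the fuzzy string matching via `difflib` are hypothetical stand-ins, not the retriever used by PGMR.

```python
import difflib
import re
from typing import Dict


def resolve_placeholders(query: str, memory: Dict[str, str], cutoff: float = 0.6) -> str:
    """Replace [[label]] placeholders with URIs looked up in `memory`.

    `memory` maps lower-cased natural-language labels to URIs. A
    placeholder with no sufficiently close label is left untouched, so
    an unknown entity stays visible instead of becoming a made-up URI.
    """

    def lookup(match: "re.Match[str]") -> str:
        label = match.group(1).strip().lower()
        hits = difflib.get_close_matches(label, list(memory), n=1, cutoff=cutoff)
        return memory[hits[0]] if hits else match.group(0)

    # Non-greedy match so that each placeholder is resolved separately.
    return re.sub(r"\[\[(.+?)\]\]", lookup, query)
```

Because every URI in the final query comes from the memory, the query cannot contain an identifier that does not exist in the KG. The remaining failure mode is retrieving the wrong existing URI.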
-摘要:傳統的中醫針灸方法由於其高度主觀性,在臨床實務中經常面臨爭議。此外,現有的智慧輔助針灸系統有兩大限制:取穴速度慢以及準確度低。為了解決這些限制,一種新的方法利用了狀態空間模型 Mamba 優異的推理效率,同時保留了傳統 DETR 架構中注意力機制的優點,以實現高效的全局資訊整合,並為取穴任務提供高品質的特徵資訊。此外,透過採用殘差似然估計的概念,它消除了對複雜上採樣程序的需求,從而加速了取穴任務。我們的模型在人體背部穴位私人資料集上達到了最先進 (SOTA) 的準確度,平均歐幾里得距離像素誤差 (EPE) 為 7.792,平均每個取穴任務耗時 10.05 毫秒。與第二好的演算法相比,我們的模型在準確度和速度上都提高了大約 14%。這項重大進展不僅提高了針灸治療的療效,也證明了自動化針灸機器人系統的商業潛力。我們的模型可以在 https://github.com/Sohyu1/RT-DEMT 取得 +摘要:從自然語言問題中產生 SPARQL 查詢的能力對於確保從知識圖譜 (KG) 中有效率且準確地擷取結構化資料至關重要。儘管大型語言模型 (LLM) 已廣泛用於 SPARQL 查詢產生,但它們在根據內部參數化知識產生像統一資源識別碼 (URI) 等 KG 元素時,通常容易出現幻覺和分布外錯誤。這通常會導致內容看似合理,但事實上並不正確,對其在真實世界資訊檢索 (IR) 應用中的使用構成重大挑戰。這導致針對偵測和減輕此類錯誤的研究增加。在本文中,我們介紹 PGMR(後產生記憶體檢索),這是一個模組化架構,它結合了一個非參數記憶體模組來檢索 KG 元素並增強基於 LLM 的 SPARQL 查詢產生。我們的實驗結果表明,PGMR 在不同的資料集、資料分佈和 LLM 中始終提供強大的效能。值得注意的是,PGMR 大幅減輕了 URI 幻覺,在許多情況下幾乎消除了問題。 -##### **Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications** -2502.11108v1 by Alexandru Lecu, Adrian Groza, Lezan Hawizy +##### **Craw4LLM: Efficient Web Crawling for LLM Pretraining** +2502.13347v1 by Shi Yu, Zhiyuan Liu, Chenyan Xiong -Large language models (LLMs) have significantly advanced the field of natural -language generation. However, they frequently generate unverified outputs, -which compromises their reliability in critical applications. In this study, we -propose an innovative framework that combines structured biomedical knowledge -with LLMs through a retrieval-augmented generation technique. Our system -develops a thorough knowledge graph by identifying and refining causal -relationships and named entities from medical abstracts related to age-related -macular degeneration (AMD). Using a vector-based retrieval process and a -locally deployed language model, our framework produces responses that are both -contextually relevant and verifiable, with direct references to clinical -evidence. 
Experimental results show that this method notably decreases -hallucinations, enhances factual precision, and improves the clarity of -generated responses, providing a robust solution for advanced biomedical -chatbot applications. +Web crawl is a main source of large language models' (LLMs) pretraining data, +but the majority of crawled web pages are discarded in pretraining due to low +data quality. This paper presents Crawl4LLM, an efficient web crawling method +that explores the web graph based on the preference of LLM pretraining. +Specifically, it leverages the influence of a webpage in LLM pretraining as the +priority score of the web crawler's scheduler, replacing the standard graph +connectivity based priority. Our experiments on a web graph containing 900 +million webpages from a commercial search engine's index demonstrate the +efficiency of Crawl4LLM in obtaining high-quality pretraining data. With just +21% URLs crawled, LLMs pretrained on Crawl4LLM data reach the same downstream +performances of previous crawls, significantly reducing the crawling waste and +alleviating the burdens on websites. Our code is publicly available at +https://github.com/cxcscmu/Crawl4LLM. 
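The scheduling change described in the abstract above, where a per-page score replaces graph connectivity as the crawl priority, can be sketched with a priority queue. This is a minimal sketch under assumptions: `crawl`, the in-memory `links` dict, and the `score` callable are hypothetical, and `score` merely stands in for whatever pretraining-influence scorer is used. It is not the code from the linked repository.

```python
import heapq
from typing import Callable, Dict, Iterable, List


def crawl(
    seeds: Iterable[str],
    links: Dict[str, List[str]],
    score: Callable[[str], float],
    budget: int,
) -> List[str]:
    """Visit up to `budget` pages, always expanding the highest-scoring one."""
    seeds = list(seeds)
    visited = set(seeds)
    # heapq is a min-heap, so scores are negated to pop the best page first.
    heap = [(-score(u), u) for u in seeds]
    heapq.heapify(heap)
    crawled: List[str] = []
    while heap and len(crawled) < budget:
        _, url = heapq.heappop(heap)
        crawled.append(url)
        for nxt in links.get(url, []):
            if nxt not in visited:  # enqueue each discovered page only once
                visited.add(nxt)
                heapq.heappush(heap, (-score(nxt), nxt))
    return crawled
```

Swapping `score` for an in-degree count would give a connectivity-based baseline with the same loop, so the two priorities can be compared under an identical crawl budget.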
-摘要:大型語言模型 (LLM) 已大幅推動自然語言生成的領域。然而,它們經常產生未經驗證的輸出,這會損害它們在關鍵應用中的可靠性。在本研究中,我們提出了一個創新的框架,透過檢索增強生成技術,將結構化的生物醫學知識與 LLM 結合。我們的系統透過識別和精煉與年齡相關性黃斑部病變 (AMD) 相關的醫學摘要中的因果關係和命名實體,開發一個徹底的知識圖譜。我們的框架使用基於向量的檢索流程和本地部署的語言模型,產生在脈絡上相關且可驗證的回應,並直接參考臨床證據。實驗結果顯示,此方法顯著減少了幻覺、增強了事實準確性,並改善了生成回應的清晰度,為先進的生物醫學聊天機器人應用程式提供了穩健的解決方案。 +摘要:網路爬蟲是大型語言模型 (LLM) 預訓練資料的主要來源, +但大多數已爬取的網頁在預訓練中會因為資料品質低落而被捨棄。 +本文提出 Crawl4LLM,這是一種有效率的網路爬取方法, +它會根據 LLM 預訓練的偏好來探索網路圖。 +具體來說,它利用網頁在 LLM 預訓練中的影響力作為網路爬蟲排程器的優先分數, +取代標準的圖形連線優先順序。 +我們在一個包含來自商業搜尋引擎索引的 9 億個網頁的網路圖上進行的實驗, +證明了 Crawl4LLM 在取得高品質預訓練資料方面的效率。 +只爬取了 21% 的網址,以 Crawl4LLM 資料預訓練的 LLM 就達到了先前爬取的相同下游效能, +大幅減少了爬取浪費,並減輕了對網站的負擔。 +我們的程式碼已公開於 https://github.com/cxcscmu/Crawl4LLM。 -##### **Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration** -2502.12204v1 by Xianbing Zhao, Yiqing Lyu, Di Wang, Buzhou Tang +##### **K-Paths: Reasoning over Graph Paths for Drug Repurposing and Drug Interaction Prediction** +2502.13344v1 by Tassallah Abdullahi, Ioanna Gemou, Nihal V. Nayak, Ghulam Murtaza, Stephen H. Bach, Carsten Eickhoff, Ritambhara Singh -Automatic depression detection provides cues for early clinical intervention -by clinicians. Clinical interviews for depression detection involve dialogues -centered around multiple themes. Existing studies primarily design end-to-end -neural network models to capture the hierarchical structure of clinical -interview dialogues. However, these methods exhibit defects in modeling the -thematic content of clinical interviews: 1) they fail to capture intra-theme -and inter-theme correlation explicitly, and 2) they do not allow clinicians to -intervene and focus on themes of interest. To address these issues, this paper -introduces an interactive depression detection framework. This framework -leverages in-context learning techniques to identify themes in clinical -interviews and then models both intra-theme and inter-theme correlation. 
-Additionally, it employs AI-driven feedback to simulate the interests of -clinicians, enabling interactive adjustment of theme importance. PDIMC achieves -absolute improvements of 35\% and 12\% compared to the state-of-the-art on the -depression detection dataset DAIC-WOZ, which demonstrates the effectiveness of -modeling theme correlation and incorporating interactive external feedback. +Drug discovery is a complex and time-intensive process that requires +identifying and validating new therapeutic candidates. Computational approaches +using large-scale biomedical knowledge graphs (KGs) offer a promising solution +to accelerate this process. However, extracting meaningful insights from +large-scale KGs remains challenging due to the complexity of graph traversal. +Existing subgraph-based methods are tailored to graph neural networks (GNNs), +making them incompatible with other models, such as large language models +(LLMs). We introduce K-Paths, a retrieval framework that extracts structured, +diverse, and biologically meaningful paths from KGs. Integrating these paths +enables LLMs and GNNs to effectively predict unobserved drug-drug and +drug-disease interactions. Unlike traditional path-ranking approaches, K-Paths +retrieves and transforms paths into a structured format that LLMs can directly +process, facilitating explainable reasoning. K-Paths employs a diversity-aware +adaptation of Yen's algorithm to retrieve the K shortest loopless paths between +entities in an interaction query, prioritizing biologically relevant and +diverse relationships. Our experiments on benchmark datasets show that K-Paths +improves the zero-shot performance of Llama 8.1B's F1-score by 12.45 points on +drug repurposing and 13.42 points on interaction severity prediction. We also +show that Llama 70B achieves F1-score gains of 6.18 and 8.46 points, +respectively. 
K-Paths also improves the supervised training efficiency of +EmerGNN, a state-of-the-art GNN, by reducing KG size by 90% while maintaining +strong predictive performance. Beyond its scalability and efficiency, K-Paths +uniquely bridges the gap between KGs and LLMs, providing explainable rationales +for predicted interactions. These capabilities show that K-Paths is a valuable +tool for efficient data-driven drug discovery. -摘要:自動憂鬱症偵測提供臨床醫師早期臨床介入的線索。憂鬱症偵測的臨床訪談涉及以多個主題為中心的對話。現有研究主要設計端對端的類神經網路模型來捕捉臨床訪談對話的階層結構。然而,這些方法在建模臨床訪談的主題內容時表現出缺陷:1)它們無法明確捕捉主題內和主題間的關聯性,以及 2)它們不允許臨床醫師介入並專注於感興趣的主題。為了解決這些問題,本文介紹了一個互動式憂鬱症偵測框架。此框架利用情境學習技術來識別臨床訪談中的主題,然後對主題內和主題間的關聯性進行建模。此外,它採用 AI 驅動的回饋來模擬臨床醫師的興趣,實現主題重要性的互動式調整。與 DAIC-WOZ 憂鬱症偵測資料集上的最新技術相比,PDIMC 的絕對改進率分別為 35% 和 12%,這證明了對主題關聯性建模和納入互動式外部回饋的有效性。 +摘要:藥物發現是一個複雜且耗時的過程,需要識別和驗證新的治療候選藥物。使用大型生物醫學知識圖譜 (KG) 的計算方法提供了一個有希望的解決方案來加速這個過程。然而,由於圖形遍歷的複雜性,從大型 KG 中提取有意義的見解仍然具有挑戰性。現有的子圖方法是針對圖神經網路 (GNN) 量身打造的,這使得它們與其他模型(例如大型語言模型 (LLM))不兼容。我們介紹了 K-Paths,這是一個檢索框架,它從 KG 中提取結構化、多樣化且具有生物意義的路徑。整合這些路徑使 LLM 和 GNN 能夠有效預測未觀察到的藥物-藥物和藥物-疾病交互。與傳統的路徑排序方法不同,K-Paths 檢索路徑並將其轉換為 LLM 可以直接處理的結構化格式,從而促進可解釋的推理。K-Paths 採用了 Yen 演算法的多樣性感知適應,以檢索交互查詢中實體之間的 K 個最短無環路徑,優先考慮生物相關且多樣化的關係。我們在基準資料集上的實驗表明,K-Paths 將 Llama 8.1B 的 F1 分數在藥物再利用上提高了 12.45 分,在交互嚴重性預測上提高了 13.42 分。我們還表明,Llama 70B 分別獲得了 6.18 分和 8.46 分的 F1 分數增益。K-Paths 還提高了最先進的 GNN EmerGNN 的監督訓練效率,同時將 KG 大小減少了 90%,同時保持強大的預測性能。除了其可擴展性和效率之外,K-Paths 獨特地彌合了 KG 和 LLM 之間的差距,為預測的交互提供了可解釋的依據。這些功能表明,K-Paths 是用於高效資料驅動藥物發現的寶貴工具。 + +##### **Grounding LLM Reasoning with Knowledge Graphs** +2502.13247v1 by Alfonso Amayuelas, Joy Sain, Simerjot Kaur, Charese Smiley + +Knowledge Graphs (KGs) are valuable tools for representing relationships +between entities in a structured format. Traditionally, these knowledge bases +are queried to extract specific information. However, question-answering (QA) +over such KGs poses a challenge due to the intrinsic complexity of natural +language compared to the structured format and the size of these graphs. 
+Despite these challenges, the structured nature of KGs can provide a solid +foundation for grounding the outputs of Large Language Models (LLMs), offering +organizations increased reliability and control. + Recent advancements in LLMs have introduced reasoning methods at inference +time to improve their performance and maximize their capabilities. In this +work, we propose integrating these reasoning strategies with KGs to anchor +every step or "thought" of the reasoning chains in KG data. Specifically, we +evaluate both agentic and automated search methods across several reasoning +strategies, including Chain-of-Thought (CoT), Tree-of-Thought (ToT), and +Graph-of-Thought (GoT), using GRBench, a benchmark dataset for graph reasoning +with domain-specific graphs. Our experiments demonstrate that this approach +consistently outperforms baseline models, highlighting the benefits of +grounding LLM reasoning processes in structured KG data. + +摘要:知識圖譜 (KG) 是以結構化格式表示實體之間關係的寶貴工具。傳統上,這些知識庫會被查詢以萃取特定資訊。然而,由於自然語言與結構化格式之間的內在複雜性,以及這些圖譜的規模,在這些 KG 上進行問答 (QA) 會構成挑戰。儘管有這些挑戰,KG 的結構化特性可以為大型語言模型 (LLM) 的輸出提供穩固的基礎,為組織提供更高的可靠性和控制力。 +LLM 的最新進展在推論時間引入了推理方法,以提升其效能並最大化其能力。在這項工作中,我們建議將這些推理策略與 KG 整合,以將推理鏈的每一步或「思考」錨定在 KG 資料中。具體來說,我們在多種推理策略中評估代理和自動化搜尋方法,包括思考鏈 (CoT)、思考樹 (ToT) 和思考圖 (GoT),使用 GRBench,這是一個針對圖形推理的基準資料集,其中包含特定領域的圖形。我們的實驗證明,這種方法始終優於基準模型,突顯了將 LLM 推理過程建立在結構化 KG 資料中的好處。 -##### **CL-MFAP: A Contrastive Learning-Based Multimodal Foundation Model for Molecular Property Prediction and Antibiotic Screening** -2502.11001v1 by Gen Zhou, Sugitha Janarthanan, Yutong Lu, Pingzhao Hu +##### **Learning to Defer for Causal Discovery with Imperfect Experts** +2502.13132v1 by Oscar Clivio, Divyat Mahajan, Perouz Taslakian, Sara Magliacane, Ioannis Mitliagkas, Valentina Zantedeschi, Alexandre Drouin -Due to the rise in antimicrobial resistance, identifying novel compounds with -antibiotic potential is crucial for combatting this global health issue. 
-However, traditional drug development methods are costly and inefficient. -Recognizing the pressing need for more effective solutions, researchers have -turned to machine learning techniques to streamline the prediction and -development of novel antibiotic compounds. While foundation models have shown -promise in antibiotic discovery, current mainstream efforts still fall short of -fully leveraging the potential of multimodal molecular data. Recent studies -suggest that contrastive learning frameworks utilizing multimodal data exhibit -excellent performance in representation learning across various domains. -Building upon this, we introduce CL-MFAP, an unsupervised contrastive learning -(CL)-based multimodal foundation (MF) model specifically tailored for -discovering small molecules with potential antibiotic properties (AP) using -three types of molecular data. This model employs 1.6 million bioactive -molecules with drug-like properties from the ChEMBL dataset to jointly pretrain -three encoders: (1) a transformer-based encoder with rotary position embedding -for processing SMILES strings; (2) another transformer-based encoder, -incorporating a novel bi-level routing attention mechanism to handle molecular -graph representations; and (3) a Morgan fingerprint encoder using a multilayer -perceptron, to achieve the contrastive learning purpose. The CL-MFAP -outperforms baseline models in antibiotic property prediction by effectively -utilizing different molecular modalities and demonstrates superior -domain-specific performance when fine-tuned for antibiotic-related property -prediction tasks. +Integrating expert knowledge, e.g. from large language models, into causal +discovery algorithms can be challenging when the knowledge is not guaranteed to +be correct. Expert recommendations may contradict data-driven results, and +their reliability can vary significantly depending on the domain or specific +query. 
Existing methods based on soft constraints or inconsistencies in +predicted causal relationships fail to account for these variations in +expertise. To remedy this, we propose L2D-CD, a method for gauging the +correctness of expert recommendations and optimally combining them with +data-driven causal discovery results. By adapting learning-to-defer (L2D) +algorithms for pairwise causal discovery (CD), we learn a deferral function +that selects whether to rely on classical causal discovery methods using +numerical data or expert recommendations based on textual meta-data. We +evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its +superior performance compared to both the causal discovery method and the +expert used in isolation. Moreover, our approach identifies domains where the +expert's performance is strong or weak. Finally, we outline a strategy for +generalizing this approach to causal discovery on graphs with more than two +variables, paving the way for further research in this area. 
-摘要:由於抗菌藥物抗性上升,找出具有抗生素潛力的新型化合物對於對抗此項全球性健康議題至關重要。不過,傳統的藥物開發方法成本高昂且效率不彰。研究人員體認到對於更有效解決方案的迫切需求,因此轉向機器學習技術來簡化新型抗生素化合物的預測和開發。儘管基礎模型在抗生素發現方面展現潛力,目前的普遍做法仍未充分利用多模態分子資料的潛力。最近的研究顯示,利用多模態資料的對比學習架構在各種領域的表徵學習中展現出優異的效能。有鑑於此,我們引進 CL-MFAP,一種無監督對比學習 (CL) 為基礎的多模態基礎 (MF) 模型,專門用於使用三種類型的分子資料發現具有潛在抗生素特性的低分子。此模型採用 ChEMBL 資料集中的 160 萬個具有類藥物特性的生物活性分子,以聯合預訓練三個編碼器:(1) 一個具有旋轉位置嵌入的基於Transformer的編碼器,用於處理 SMILES 字串;(2) 另一個基於Transformer的編碼器,結合一種新穎的雙層路由注意機制來處理分子圖表表徵;以及 (3) 一個使用多層感知器的 Morgan 指紋編碼器,以達成對比學習的目的。CL-MFAP 透過有效利用不同的分子模式在抗生素特性預測方面優於基準模型,並且在針對抗生素相關特性預測任務進行微調時展現出優異的特定領域效能。 +摘要:整合专家知識,例如從大型語言模型中整合到因果發現演算法中,當知識無法保證正確時會很有挑戰性。專家建議可能會與資料驅動的結果相矛盾,而且他們的可靠性可能會根據領域或特定查詢而有顯著差異。現有的基於軟約束或預測因果關係中不一致的方法無法說明專業知識中的這些變化。為了補救這一點,我們提出了 L2D-CD,一種用於評估專家建議的正確性並將其與資料驅動的因果發現結果最佳結合的方法。透過調整學習延遲 (L2D) 演算法以進行成對因果發現 (CD),我們學習了一個延遲函數,用於選擇依賴使用數值資料的傳統因果發現方法或基於文字元資料的專家建議。我們在經典的 T\"ubingen 對資料集上評估 L2D-CD,並證明其與單獨使用的因果發現方法和專家相比具有優越的效能。此外,我們的做法識別出專家表現強或弱的領域。最後,我們概述了一種將此方法推廣到具有兩個以上變數的圖表上進行因果發現的策略,為此領域的進一步研究鋪平了道路。 -##### **Automatic Quality Assessment of First Trimester Crown-Rump-Length Ultrasound Images** -2502.10908v1 by Sevim Cengiz, Ibraheem Hamdi, Mohammad Yaqub +##### **Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks** +2502.13025v1 by Markus J. Buehler -Fetal gestational age (GA) is vital clinical information that is estimated -during pregnancy in order to assess fetal growth. This is usually performed by -measuring the crown-rump-length (CRL) on an ultrasound image in the Dating scan -which is then correlated with fetal age and growth trajectory. A major issue -when performing the CRL measurement is ensuring that the image is acquired at -the correct view, otherwise it could be misleading. Although clinical -guidelines specify the criteria for the correct CRL view, sonographers may not -regularly adhere to such rules. 
In this paper, we propose a new deep -learning-based solution that is able to verify the adherence of a CRL image to -clinical guidelines in order to assess image quality and facilitate accurate -estimation of GA. We first segment out important fetal structures then use the -localized structures to perform a clinically-guided mapping that verifies the -adherence of criteria. The segmentation method combines the benefits of -Convolutional Neural Network (CNN) and the Vision Transformer (ViT) to segment -fetal structures in ultrasound images and localize important fetal landmarks. -For segmentation purposes, we compare our proposed work with UNet and show that -our CNN/ViT-based method outperforms an optimized version of UNet. Furthermore, -we compare the output of the mapping with classification CNNs when assessing -the clinical criteria and the overall acceptability of CRL images. We show that -the proposed mapping is not only explainable but also more accurate than the -best performing classification CNNs. +We present an agentic, autonomous graph expansion framework that iteratively +structures and refines knowledge in situ. Unlike conventional knowledge graph +construction methods relying on static extraction or single-pass learning, our +approach couples a reasoning-native large language model with a continually +updated graph representation. At each step, the system actively generates new +concepts and relationships, merges them into a global graph, and formulates +subsequent prompts based on its evolving structure. Through this +feedback-driven loop, the model organizes information into a scale-free network +characterized by hub formation, stable modularity, and bridging nodes that link +disparate knowledge clusters. Over hundreds of iterations, new nodes and edges +continue to appear without saturating, while centrality measures and shortest +path distributions evolve to yield increasingly distributed connectivity. 
Our +analysis reveals emergent patterns, such as the rise of highly connected 'hub' +concepts and the shifting influence of 'bridge' nodes, indicating that agentic, +self-reinforcing graph construction can yield open-ended, coherent knowledge +structures. Applied to materials design problems, we present compositional +reasoning experiments by extracting node-specific and synergy-level principles +to foster genuinely novel knowledge synthesis, yielding cross-domain ideas that +transcend rote summarization and strengthen the framework's potential for +open-ended scientific discovery. We discuss other applications in scientific +discovery and outline future directions for enhancing scalability and +interpretability. -摘要:胎兒妊娠年齡 (GA) 是重要的臨床資訊,會在懷孕期間估計,以評估胎兒生長。這通常是透過在約會掃描中測量超音波影像中的頭臀長度 (CRL) 來執行,然後與胎兒年齡和生長軌跡相關聯。執行 CRL 測量時的一個主要問題是確保影像是在正確的視角下取得,否則可能會產生誤導。儘管臨床指南規定了正確 CRL 視角的標準,但超音波檢查員可能不會定期遵守這些規則。在本文中,我們提出了一個新的深度學習解決方案,能夠驗證 CRL 影像是否符合臨床指南,以評估影像品質並促進對 GA 的準確估計。我們首先分割出重要的胎兒結構,然後使用局部結構來執行臨床指導的對應,以驗證標準的遵守情況。分割方法結合了卷積神經網路 (CNN) 和視覺轉換器 (ViT) 的優點,以分割超音波影像中的胎兒結構並定位重要的胎兒標誌。為了分割目的,我們將我們提出的工作與 UNet 進行比較,並顯示我們基於 CNN/ViT 的方法優於 UNet 的最佳化版本。此外,我們在評估臨床標準和 CRL 影像的整體可接受性時,將對應的輸出與分類 CNN 進行比較。我們表明,所提出的對應不僅可以解釋,而且比效能最佳的分類 CNN 更準確。 +摘要:我們提出一個能動的、自主的圖形擴展框架,它反覆地建構和精煉原位知識。與依賴靜態提取或單次學習的傳統知識圖形建構方法不同,我們的做法將一個推理原生的大語言模型與一個持續更新的圖形表示結合起來。在每一步中,系統主動產生新的概念和關係,將它們合併到一個全域圖形中,並根據其不斷演化的結構制定後續提示。透過這個回饋驅動的迴圈,模型將資訊組織成一個無標度網路,其特徵是樞紐形成、穩定的模組化以及連結不同知識群集的橋接節點。在數百次反覆運算中,新的節點和邊緣會持續出現,而不會飽和,同時中心性測量和最短路徑分佈會演化為產生越來越分散的連通性。我們的分析揭示了新興模式,例如高度連接的「樞紐」概念的興起和「橋樑」節點影響力的轉移,這表明能動的、自我強化的圖形建構可以產生開放式、連貫的知識結構。應用於材料設計問題,我們提出組合推理實驗,透過提取特定於節點的原則和協同效應層級原則,以促進真正新穎的知識綜合,產生超越死背式摘要並強化框架在開放式科學發現中潛力的跨領域想法。我們討論了在科學發現中的其他應用,並概述了增強可擴充性和可解釋性的未來方向。 -##### **Breaking Down the Hierarchy: A New Approach to Leukemia Classification** -2502.10899v1 by Ibraheem Hamdi, Hosam El-Gendy, Ahmed Sharshar, Mohamed Saeed, Muhammad Ridzuan, Shahrukh K. 
Hashmi, Naveed Syed, Imran Mirza, Shakir Hussain, Amira Mahmoud Abdalla, Mohammad Yaqub +##### **Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge** +2502.13010v1 by Mohammad Reza Rezaei, Reza Saadati Fard, Jayson Parker, Rahul G. Krishnan, Milad Lankarany -The complexities inherent to leukemia, multifaceted cancer affecting white -blood cells, pose considerable diagnostic and treatment challenges, primarily -due to reliance on laborious morphological analyses and expert judgment that -are susceptible to errors. Addressing these challenges, this study presents a -refined, comprehensive strategy leveraging advanced deep-learning techniques -for the classification of leukemia subtypes. We commence by developing a -hierarchical label taxonomy, paving the way for differentiating between various -subtypes of leukemia. The research further introduces a novel hierarchical -approach inspired by clinical procedures capable of accurately classifying -diverse types of leukemia alongside reactive and healthy cells. An integral -part of this study involves a meticulous examination of the performance of -Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) as -classifiers. The proposed method exhibits an impressive success rate, achieving -approximately 90\% accuracy across all leukemia subtypes, as substantiated by -our experimental results. A visual representation of the experimental findings -is provided to enhance the model's explainability and aid in understanding the -classification process. +Large Language Models (LLMs) have significantly advanced medical +question-answering by leveraging extensive clinical data and medical +literature. However, the rapid evolution of medical knowledge and the +labor-intensive process of manually updating domain-specific resources pose +challenges to the reliability of these systems. 
To address this, we introduce +Adaptive Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates +the construction and continuous updating of medical knowledge graphs, +integrates reasoning, and retrieves current external evidence, such as PubMed +and WikiSearch. By dynamically linking new findings and complex medical +concepts, AMG-RAG not only improves accuracy but also enhances interpretability +in medical queries. + Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness +of AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of +66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to +100 times larger. Notably, these improvements are achieved without increasing +computational overhead, highlighting the critical role of automated knowledge +graph generation and external evidence retrieval in delivering up-to-date, +trustworthy medical insights. -摘要:白血病的复杂性源于它是一种影响白血球的多面性癌症,主要由于依赖费力的形态分析和容易出错的专家判断,因此带来了相当大的诊断和治疗挑战。为了应对这些挑战,本研究提出了一种精细且全面的策略,利用先进的深度学习技术对白血病亚型进行分类。我们首先开发了一个分层的标签分类法,为区分白血病的各种亚型铺平了道路。该研究进一步引入了一种新颖的分层方法,该方法受临床程序的启发,能够准确地对各种类型的白血病以及反应性和健康细胞进行分类。本研究的一个组成部分涉及对卷积神经网络 (CNN) 和视觉变压器 (ViT) 作为分类器的性能进行细致检查。所提出的方法展示了令人印象深刻的成功率,在所有白血病亚型中实现了大约 90% 的准确率,我们的实验结果证实了这一点。提供了实验结果的可视化表示,以增强模型的可解释性并帮助理解分类过程。 +摘要:大型語言模型 (LLM) 透過利用廣泛的臨床資料和醫學文獻,大幅提升了醫療問題解答的進步。然而,醫療知識的快速演進和手動更新特定領域資源的繁複程序,對這些系統的可靠性構成挑戰。為了解決這個問題,我們引入了適應性醫療圖表 RAG (AMG-RAG),這是一個自動化建構和持續更新醫療知識圖表的綜合架構,整合推理並擷取 PubMed 和 WikiSearch 等最新的外部證據。透過動態連結新的發現和複雜的醫療概念,AMG-RAG 不僅提升了準確性,也增強了醫療查詢的可解釋性。在 MEDQA 和 MEDMCQA 基準上的評量證明了 AMG-RAG 的有效性,在 MEDQA 上達到了 74.1% 的 F1 分數,在 MEDMCQA 上達到了 66.34% 的準確度,優於其他同類模型以及那些大 10 到 100 倍的模型。值得注意的是,這些改進是在不增加運算負擔的情況下實現的,突顯了自動化知識圖表生成和外部證據擷取在提供最新、可信賴的醫療見解中扮演的重要角色。 -##### **An Empirical Analysis of Uncertainty in Large Language Model Evaluations** -2502.10709v1 by Qiujie Xie, Qingqiu Li, Zhuohao Yu, Yuejie Zhang, Yue Zhang, Linyi Yang +##### **R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge 
Graphs** +2502.12767v1 by Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi -As LLM-as-a-Judge emerges as a new paradigm for assessing large language -models (LLMs), concerns have been raised regarding the alignment, bias, and -stability of LLM evaluators. While substantial work has focused on alignment -and bias, little research has concentrated on the stability of LLM evaluators. -In this paper, we conduct extensive experiments involving 9 widely used LLM -evaluators across 2 different evaluation settings to investigate the -uncertainty in model-based LLM evaluations. We pinpoint that LLM evaluators -exhibit varying uncertainty based on model families and sizes. With careful -comparative analyses, we find that employing special prompting strategies, -whether during inference or post-training, can alleviate evaluation uncertainty -to some extent. By utilizing uncertainty to enhance LLM's reliability and -detection capability in Out-Of-Distribution (OOD) data, we further fine-tune an -uncertainty-aware LLM evaluator named ConfiLM using a human-annotated -fine-tuning set and assess ConfiLM's OOD evaluation ability on a manually -designed test set sourced from the 2024 Olympics. Experimental results -demonstrate that incorporating uncertainty as additional information during the -fine-tuning phase can largely improve the model's evaluation performance in OOD -scenarios. The code and data are released at: -https://github.com/hasakiXie123/LLM-Evaluator-Uncertainty. +Recent studies have combined Large Language Models (LLMs) with Knowledge +Graphs (KGs) to enhance reasoning, improving inference accuracy without +additional training while mitigating hallucination. However, existing +frameworks are often rigid, struggling to adapt to KG or task changes. They +also rely heavily on powerful LLMs for reliable (i.e., trustworthy) reasoning. 
+To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that +separates reasoning into two roles: an Operator (a low-capacity LLM) that +gathers evidence and a Supervisor (a high-capacity LLM) that makes final +judgments. This design is cost-efficient for LLM inference while still +maintaining strong reasoning accuracy. Additionally, R2-KG employs an +Abstention mechanism, generating answers only when sufficient evidence is +collected from KG, which significantly enhances reliability. Experiments across +multiple KG-based reasoning tasks show that R2-KG consistently outperforms +baselines in both accuracy and reliability, regardless of the inherent +capability of LLMs used as the Operator. Further experiments reveal that the +single-agent version of R2-KG, equipped with a strict self-consistency +strategy, achieves significantly higher-than-baseline reliability while +reducing inference cost. However, it also leads to a higher abstention rate in +complex KGs. Our findings establish R2-KG as a flexible and cost-effective +solution for KG-based reasoning. It reduces reliance on high-capacity LLMs +while ensuring trustworthy inference. 
-摘要:隨著 LLM 作為法官的新典範出現,用於評估大型語言模型 (LLM) 的 LLM 評估器在對齊、偏差和穩定性方面引發了關注。儘管大量工作集中在對齊和偏差上,但很少有研究集中在 LLM 評估器的穩定性上。在本文中,我們進行了廣泛的實驗,涉及 9 個廣泛使用的 LLM 評估器,跨越 2 個不同的評估設定,以調查基於模型的 LLM 評估中的不確定性。我們精確指出 LLM 評估器根據模型系列和大小表現出不同的不確定性。通過仔細的比較分析,我們發現採用特殊的提示策略(無論是在推理過程中還是訓練後)可以在一定程度上緩解評估不確定性。通過利用不確定性來增強 LLM 在 Out-Of-Distribution (OOD) 數據中的可靠性和檢測能力,我們進一步微調了一個名為 ConfiLM 的不確定性感知 LLM 評估器,使用人工註釋的微調設置,並評估 ConfiLM 在手動設計的、來自 2024 年奧運會的測試集上的 OOD 評估能力。實驗結果表明,在微調階段將不確定性作為附加信息納入其中可以在很大程度上提高模型在 OOD 場景中的評估性能。代碼和數據發布於: -https://github.com/hasakiXie123/LLM-Evaluator-Uncertainty。 +摘要:最近的研究结合了大型语言模型 (LLM) 与知识图谱 (KG) 以增强推理,在不额外训练的情况下提高推理准确性,同时减轻幻觉。然而,现有的框架通常很僵化,难以适应知识图谱或任务的变化。它们还严重依赖强大的 LLM 来进行可靠(即值得信赖)的推理。为了解决这个问题,我们引入了 R2-KG,这是一个即插即用、双代理框架,它将推理分为两个角色:一个收集证据的操作员(低容量 LLM)和一个做出最终判断的监督员(高容量 LLM)。这种设计在 LLM 推理方面具有成本效益,同时仍保持强大的推理准确性。此外,R2-KG 采用弃权机制,仅在从知识图谱收集到足够证据时才生成答案,这显著提高了可靠性。跨多个基于知识图谱的推理任务的实验表明,R2-KG 在准确性和可靠性方面始终优于基线,而与用作操作员的 LLM 的固有能力无关。进一步的实验表明,R2-KG 的单代理版本配备了严格的自一致性策略,实现了明显高于基线的可靠性,同时降低了推理成本。然而,它也导致了复杂知识图谱中更高的弃权率。我们的发现将 R2-KG 确立为一种灵活且经济高效的基于知识图谱的推理解决方案。它减少了对高容量 LLM 的依赖,同时确保了可信的推理。 -##### **Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model** -2502.10707v1 by Jiarui Jin, Haoyu Wang, Hongyan Li, Jun Li, Jiahui Pan, Shenda Hong +##### **Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research** +2502.12669v1 by Xiang Liu, Penglei Sun, Shuyan Chen, Longhan Zhang, Peijie Dong, Huajie You, Yongqi Zhang, Chang Yan, Xiaowen Chu, Tong-yi Zhang -Electrocardiogram (ECG) is essential for the clinical diagnosis of -arrhythmias and other heart diseases, but deep learning methods based on ECG -often face limitations due to the need for high-quality annotations. 
Although -previous ECG self-supervised learning (eSSL) methods have made significant -progress in representation learning from unannotated ECG data, they typically -treat ECG signals as ordinary time-series data, segmenting the signals using -fixed-size and fixed-step time windows, which often ignore the form and rhythm -characteristics and latent semantic relationships in ECG signals. In this work, -we introduce a novel perspective on ECG signals, treating heartbeats as words -and rhythms as sentences. Based on this perspective, we first designed the -QRS-Tokenizer, which generates semantically meaningful ECG sentences from the -raw ECG signals. Building on these, we then propose HeartLang, a novel -self-supervised learning framework for ECG language processing, learning -general representations at form and rhythm levels. Additionally, we construct -the largest heartbeat-based ECG vocabulary to date, which will further advance -the development of ECG language processing. We evaluated HeartLang across six -public ECG datasets, where it demonstrated robust competitiveness against other -eSSL methods. Our data and code are publicly available at -https://github.com/PKUDigitalHealth/HeartLang. +The rapid advancement of perovskite solar cells (PSCs) has led to an +exponential growth in research publications, creating an urgent need for +efficient knowledge management and reasoning systems in this domain. We present +a comprehensive knowledge-enhanced system for PSCs that integrates three key +components. First, we develop Perovskite-KG, a domain-specific knowledge graph +constructed from 1,517 research papers, containing 23,789 entities and 22,272 +relationships. Second, we create two complementary datasets: Perovskite-Chat, +comprising 55,101 high-quality question-answer pairs generated through a novel +multi-agent framework, and Perovskite-Reasoning, containing 2,217 carefully +curated materials science problems. 
Third, we introduce two specialized large +language models: Perovskite-Chat-LLM for domain-specific knowledge assistance +and Perovskite-Reasoning-LLM for scientific reasoning tasks. Experimental +results demonstrate that our system significantly outperforms existing models +in both domain-specific knowledge retrieval and scientific reasoning tasks, +providing researchers with effective tools for literature review, experimental +design, and complex problem-solving in PSC research. -摘要:心電圖 (ECG) 對於心律不整和其他心臟疾病的臨床診斷至關重要,但基於心電圖的深度學習方法通常會因需要高品質註解而面臨限制。儘管先前的 ECG 自我監督學習 (eSSL) 方法在從未註解的 ECG 資料中學習表徵方面取得顯著進展,但它們通常將 ECG 訊號視為普通的時間序列資料,使用固定大小和固定步長的時窗對訊號進行分段,這通常會忽略 ECG 訊號中的形式和節律特徵以及潛在的語義關係。在這項工作中,我們對 ECG 訊號引入了新的觀點,將心跳視為單字,將節律視為句子。基於此觀點,我們首先設計了 QRS-Tokenizer,它從原始 ECG 訊號中產生語義有意義的 ECG 句子。在此基礎上,我們提出了 HeartLang,一種用於 ECG 語言處理的新型自我監督學習框架,在形式和節律層面上學習一般表徵。此外,我們構建了迄今為止最大的基於心跳的 ECG 詞彙表,這將進一步促進 ECG 語言處理的發展。我們在六個公開的 ECG 資料集上評估了 HeartLang,它展示了與其他 eSSL 方法相比的強大競爭力。我們的資料和程式碼可在 https://github.com/PKUDigitalHealth/HeartLang 公開取得。 +摘要:由於 perovskite 太陽能電池 (PSC) 快速進展,導致研究出版物呈指數成長,迫切需要在這領域建立有效的知識管理和推理系統。我們提出一個結合三項關鍵元件的 PSC 全面知識增強系統。首先,我們開發出 Perovskite-KG,一個由 1,517 篇研究論文建構而成、包含 23,789 個實體和 22,272 個關係的領域特定知識圖譜。其次,我們建立兩個互補的資料集:Perovskite-Chat,包含透過一個新穎的多代理架構產生 55,101 個高品質問答配對;以及 Perovskite-Reasoning,包含 2,217 個仔細策展的材料科學問題。第三,我們推出兩個專門化大型語言模型:針對領域特定知識協助的 Perovskite-Chat-LLM,以及針對科學推理任務的 Perovskite-Reasoning-LLM。實驗結果顯示,我們的系統在領域特定知識擷取和科學推理任務上都明顯優於現有模型,為研究人員提供有效的工具,用於 PSC 研究中的文獻回顧、實驗設計和複雜問題解決。 -##### **Self-Explaining Hypergraph Neural Networks for Diagnosis Prediction** -2502.10689v1 by Leisheng Yu, Yanxiao Cai, Minxing Zhang, Xia Hu +##### **G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation** +2502.12586v1 by Yuhan Li, Xinni Zhang, Linhao Luo, Heng Chang, Yuxiang Ren, Irwin King, Jia Li -The burgeoning volume of electronic health records (EHRs) has enabled deep -learning models to excel in predictive healthcare. 
However, for high-stakes -applications such as diagnosis prediction, model interpretability remains -paramount. Existing deep learning diagnosis prediction models with intrinsic -interpretability often assign attention weights to every past diagnosis or -hospital visit, providing explanations lacking flexibility and succinctness. In -this paper, we introduce SHy, a self-explaining hypergraph neural network -model, designed to offer personalized, concise and faithful explanations that -allow for interventions from clinical experts. By modeling each patient as a -unique hypergraph and employing a message-passing mechanism, SHy captures -higher-order disease interactions and extracts distinct temporal phenotypes as -personalized explanations. It also addresses the incompleteness of the EHR data -by accounting for essential false negatives in the original diagnosis record. A -qualitative case study and extensive quantitative evaluations on two real-world -EHR datasets demonstrate the superior predictive performance and -interpretability of SHy over existing state-of-the-art models. +Explainable recommendation has demonstrated significant advantages in +informing users about the logic behind recommendations, thereby increasing +system transparency, effectiveness, and trustworthiness. To provide +personalized and interpretable explanations, existing works often combine the +generation capabilities of large language models (LLMs) with collaborative +filtering (CF) information. CF information extracted from the user-item +interaction graph captures the user behaviors and preferences, which is crucial +for providing informative explanations. However, due to the complexity of graph +structure, effectively extracting the CF information from graphs still remains +a challenge. 
Moreover, existing methods often struggle with the integration of +extracted CF information with LLMs due to its implicit representation and the +modality gap between graph structures and natural language explanations. To +address these challenges, we propose G-Refer, a framework using graph +retrieval-augmented large language models (LLMs) for explainable +recommendation. Specifically, we first employ a hybrid graph retrieval +mechanism to retrieve explicit CF signals from both structural and semantic +perspectives. The retrieved CF information is explicitly formulated as +human-understandable text by the proposed graph translation and accounts for +the explanations generated by LLMs. To bridge the modality gap, we introduce +knowledge pruning and retrieval-augmented fine-tuning to enhance the ability of +LLMs to process and utilize the retrieved CF information to generate +explanations. Extensive experiments show that G-Refer achieves superior +performance compared with existing methods in both explainability and +stability. Codes and data are available at https://github.com/Yuhan1i/G-Refer. 
-摘要:隨著電子健康紀錄 (EHR) 數量的激增,深度學習模型在預測保健方面表現出色。然而,對於診斷預測等高風險應用,模型的可解釋性仍然至關重要。現有的具有內在可解釋性的深度學習診斷預測模型通常會為每個過去的診斷或醫院就診分配注意力權重,提供的解釋缺乏靈活性且簡潔性。在本文中,我們介紹了 SHy,這是一個自解釋的超圖神經網路模型,旨在提供個性化、簡潔且忠實的解釋,讓臨床專家可以進行干預。通過將每個患者建模為一個獨特的超圖並採用訊息傳遞機制,SHy 捕捉到了高階疾病交互作用,並提取出不同的時間表型作為個性化解釋。它還通過考慮原始診斷記錄中的基本假陰性來解決電子健康紀錄資料的不完整性。對兩個真實世界電子健康紀錄資料集進行的定性案例研究和廣泛的定量評估表明,SHy 在預測效能和可解釋性方面優於現有的最先進模型。 +摘要:可解釋建議已證明在告知使用者建議背後的邏輯方面具有顯著優點,從而提高系統透明度、有效性和可信度。為了提供個人化且可解釋的說明,現有作品通常結合大型語言模型 (LLM) 的生成能力與協同過濾 (CF) 資訊。從使用者項目互動圖形中提取的 CF 資訊會擷取使用者行為和偏好,這對於提供資訊性說明至關重要。然而,由於圖形結構的複雜性,從圖形中有效提取 CF 資訊仍然是一個挑戰。此外,現有方法通常難以將提取的 CF 資訊與 LLM 整合,因為其隱含表示和圖形結構與自然語言說明之間的模式差距。為了應對這些挑戰,我們提出 G-Refer,一個使用圖形檢索增強型大型語言模型 (LLM) 的可解釋建議架構。具體來說,我們首先採用混合圖形檢索機制,從結構和語義角度檢索明確的 CF 訊號。檢索到的 CF 資訊由建議的圖形翻譯明確表述為人類可以理解的文字,並說明 LLM 生成的解釋。為了彌合模式差距,我們引入了知識修剪和檢索增強微調,以增強 LLM 處理和利用檢索到的 CF 資訊以產生解釋的能力。廣泛的實驗表明,與現有方法相比,G-Refer 在可解釋性和穩定性方面都取得了卓越的效能。程式碼和資料可在 https://github.com/Yuhan1i/G-Refer 取得。 -##### **ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis** -2502.10620v1 by Xueshen Li, Xinlong Hou, Ziyi Huang, Yu Gan +##### **A-MEM: Agentic Memory for LLM Agents** +2502.12110v1 by Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang -Recent advancements in large language models (LLMs) have demonstrated -extraordinary comprehension capabilities with remarkable breakthroughs on -various vision-language tasks. However, the application of LLMs in generating -reliable medical diagnostic reports remains in the early stages. Currently, -medical LLMs typically feature a passive interaction model where doctors -respond to patient queries with little or no involvement in analyzing medical -images. In contrast, some ChatBots simply respond to predefined queries based -on visual inputs, lacking interactive dialogue or consideration of medical -history. As such, there is a gap between LLM-generated patient-ChatBot -interactions and those occurring in actual patient-doctor consultations. 
To -bridge this gap, we develop an LLM-based dialogue system, namely proactive -multi-round vision-language interactions for computer-aided diagnosis -(ProMRVL-CAD), to generate patient-friendly disease diagnostic reports. The -proposed ProMRVL-CAD system allows proactive dialogue to provide patients with -constant and reliable medical access via an integration of knowledge graph into -a recommendation system. Specifically, we devise two generators: a Proactive -Question Generator (Pro-Q Gen) to generate proactive questions that guide the -diagnostic procedure and a Multi-Vision Patient-Text Diagnostic Report -Generator (MVP-DR Gen) to produce high-quality diagnostic reports. Evaluating -two real-world publicly available datasets, MIMIC-CXR and IU-Xray, our model -has better quality in generating medical reports. We further demonstrate the -performance of ProMRVL achieves robust under the scenarios with low image -quality. Moreover, we have created a synthetic medical dialogue dataset that -simulates proactive diagnostic interactions between patients and doctors, -serving as a valuable resource for training LLM. +While large language model (LLM) agents can effectively use external tools +for complex real-world tasks, they require memory systems to leverage +historical experiences. Current memory systems enable basic storage and +retrieval but lack sophisticated memory organization, despite recent attempts +to incorporate graph databases. Moreover, these systems' fixed operations and +structures limit their adaptability across diverse tasks. To address this +limitation, this paper proposes a novel agentic memory system for LLM agents +that can dynamically organize memories in an agentic way. Following the basic +principles of the Zettelkasten method, we designed our memory system to create +interconnected knowledge networks through dynamic indexing and linking. 
When a +new memory is added, we generate a comprehensive note containing multiple +structured attributes, including contextual descriptions, keywords, and tags. +The system then analyzes historical memories to identify relevant connections, +establishing links where meaningful similarities exist. Additionally, this +process enables memory evolution - as new memories are integrated, they can +trigger updates to the contextual representations and attributes of existing +historical memories, allowing the memory network to continuously refine its +understanding. Our approach combines the structured organization principles of +Zettelkasten with the flexibility of agent-driven decision making, allowing for +more adaptive and context-aware memory management. Empirical experiments on six +foundation models show superior improvement against existing SOTA baselines. +The source code is available at https://github.com/WujiangXu/AgenticMemory. -摘要:大型語言模型 (LLM) 最近的進展已展現出非凡的理解能力,在各種視覺語言任務中取得了顯著的突破。然而,LLM 在產生可靠的醫療診斷報告中的應用仍處於早期階段。目前,醫療 LLM 通常採用被動互動模式,醫生對患者的疑問做出回應,但很少或根本不參與分析醫療影像。相比之下,有些聊天機器人僅根據視覺輸入回應預先定義的查詢,缺乏互動對話或對病史的考量。因此,LLM 產生的患者聊天機器人互動與實際患者醫生諮詢之間存在差距。為了彌合這一差距,我們開發了一個基於 LLM 的對話系統,即主動多輪視覺語言互動,用於電腦輔助診斷 (ProMRVL-CAD),以產生對患者友善的疾病診斷報告。建議的 ProMRVL-CAD 系統允許主動對話,透過將知識圖譜整合到推薦系統中,為患者提供持續且可靠的醫療管道。具體來說,我們設計了兩個產生器:主動問題產生器 (Pro-Q Gen),用於產生引導診斷程序的主動問題,以及多視覺患者文字診斷報告產生器 (MVP-DR Gen),用於產生高品質的診斷報告。評估兩個真實世界公開可用的資料集,MIMIC-CXR 和 IU-Xray,我們的模型在產生醫療報告方面品質較佳。我們進一步證明 ProMRVL 的效能,在影像品質低的情況下仍能穩健運行。此外,我們建立了一個模擬患者和醫生之間主動診斷互動的合成醫療對話資料集,作為訓練 LLM 的寶貴資源。 +摘要:大型語言模型 (LLM) 代理雖然能有效地使用外部工具來執行複雜的真實世界任務,但它們需要記憶體系統來利用歷史經驗。目前的記憶體系統能進行基本的儲存和檢索,但缺乏精密的記憶體組織,儘管最近嘗試納入圖形資料庫。此外,這些系統固定的運作和結構限制了它們在不同任務中的適應性。為了解決這個限制,本文提出了一種新的代理記憶體系統,供 LLM 代理動態地以代理的方式組織記憶體。遵循 Zettelkasten 方法的基本原則,我們設計我們的記憶體系統,透過動態索引和連結來建立相互連結的知識網路。當加入新的記憶體時,我們會產生包含多個結構化屬性的綜合筆記,包括脈絡描述、關鍵字和標籤。然後,系統會分析歷史記憶體以找出相關連結,在有意義的相似性時建立連結。此外,這個程序能讓記憶體演化,因為當整合新的記憶體時,它們會觸發對現有歷史記憶體的脈絡表示和屬性的更新,讓記憶體網路能持續精進它的理解。我們的做法結合了 Zettelkasten 
的結構化組織原則和代理驅動決策制定的靈活性,能進行更具適應性和脈絡感知的記憶體管理。在六個基礎模型上的經驗實驗顯示出比現有的 SOTA 基準線有顯著的進步。原始碼可以在 https://github.com/WujiangXu/AgenticMemory 找到。 -##### **Optimizing CNN Architectures for Advanced Thoracic Disease Classification** -2502.10614v1 by Tejas Mirthipati +##### **KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs** +2502.12029v1 by Qi Zhao, Hongyu Yang, Qi Song, Xinwei Yao, Xiangyang Li -Machine learning, particularly convolutional neural networks (CNNs), has -shown promise in medical image analysis, especially for thoracic disease -detection using chest X-ray images. In this study, we evaluate various CNN -architectures, including binary classification, multi-label classification, and -ResNet50 models, to address challenges like dataset imbalance, variations in -image quality, and hidden biases. We introduce advanced preprocessing -techniques such as principal component analysis (PCA) for image compression and -propose a novel class-weighted loss function to mitigate imbalance issues. Our -results highlight the potential of CNNs in medical imaging but emphasize that -issues like unbalanced datasets and variations in image acquisition methods -must be addressed for optimal model performance. +Large language models (LLMs) have demonstrated remarkable capabilities in +various complex tasks, yet they still suffer from hallucinations. Introducing +external knowledge, such as knowledge graph, can enhance the LLMs' ability to +provide factual answers. LLMs have the ability to interactively explore +knowledge graphs. However, most approaches have been affected by insufficient +internal knowledge excavation in LLMs, limited generation of trustworthy +knowledge reasoning paths, and a vague integration between internal and +external knowledge. Therefore, we propose KnowPath, a knowledge-enhanced large +model framework driven by the collaboration of internal and external knowledge. 
+It relies on the internal knowledge of the LLM to guide the exploration of +interpretable directed subgraphs in external knowledge graphs, better +integrating the two knowledge sources for more accurate reasoning. Extensive +experiments on multiple real-world datasets confirm the superiority of +KnowPath. -摘要:機器學習,特別是卷積神經網路 (CNN) 已在醫學影像分析中展現出潛力,特別是使用胸部 X 光影像進行胸腔疾病偵測。在此研究中,我們評估各種 CNN 架構,包括二元分類、多標籤分類和 ResNet50 模型,以解決資料集不平衡、影像品質差異和隱藏偏差等挑戰。我們導入進階前處理技術,例如主成分分析 (PCA) 以進行影像壓縮,並提出一個新穎的類別加權損失函數來緩解不平衡問題。我們的結果突顯了 CNN 在醫學影像中的潛力,但強調必須解決資料集不平衡和影像擷取方法差異等問題,才能獲得最佳模型效能。 +摘要:大型語言模型 (LLM) 已在各種複雜任務中展現出卓越的能力,但仍會出現幻覺。引入外部知識(例如知識圖譜)可以增強 LLM 提供事實答案的能力。LLM 有能力互動式地探索知識圖譜。然而,大多數方法都受到 LLM 中內部知識挖掘不足、可信賴知識推理路徑生成受限,以及內部和外部知識之間的整合模糊的影響。因此,我們提出 KnowPath,這是一個由內部和外部知識的協作驅動的知識增強型大型模型框架。它依賴於 LLM 的內部知識來指導對外部知識圖譜中可解釋的有向子圖的探索,更好地整合兩個知識來源以進行更準確的推理。對多個真實世界資料集進行的大量實驗證實了 KnowPath 的優越性。 -##### **PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation** -2502.10536v1 by Faruk Ahmed, Lin Yang, Tiam Jaroensri, Andrew Sellergren, Yossi Matias, Avinatan Hassidim, Greg S. Corrado, Dale R. Webster, Shravya Shetty, Shruthi Prabhakara, Yun Liu, Daniel Golden, Ellery Wulczyn, David F. Steiner +##### **Atom of Thoughts for Markov LLM Test-Time Scaling** +2502.12018v1 by Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, Yuyu Luo -The interpretation of histopathology cases underlies many important -diagnostic and treatment decisions in medicine. Notably, this process typically -requires pathologists to integrate and summarize findings across multiple -slides per case. Existing vision-language capabilities in computational -pathology have so far been largely limited to small regions of interest, larger -regions at low magnification, or single whole-slide images (WSIs). This limits -interpretation of findings that span multiple high-magnification regions across -multiple WSIs. 
By making use of Gemini 1.5 Flash, a large multimodal model -(LMM) with a 1-million token context window, we demonstrate the ability to -generate bottom-line diagnoses from up to 40,000 768x768 pixel image patches -from multiple WSIs at 10X magnification. This is the equivalent of up to 11 -hours of video at 1 fps. Expert pathologist evaluations demonstrate that the -generated report text is clinically accurate and equivalent to or preferred -over the original reporting for 68% (95% CI: [60%, 76%]) of multi-slide -examples with up to 5 slides. While performance decreased for examples with 6 -or more slides, this study demonstrates the promise of leveraging the -long-context capabilities of modern LMMs for the uniquely challenging task of -medical report generation where each case can contain thousands of image -patches. +Large Language Models (LLMs) achieve superior performance through +training-time scaling, and test-time scaling further enhances their +capabilities by conducting effective reasoning during inference. However, as +the scale of reasoning increases, existing test-time scaling methods suffer +from accumulated historical information, which not only wastes computational +resources but also interferes with effective reasoning. To address this issue, +we observe that complex reasoning progress is often achieved by solving a +sequence of independent subquestions, each being self-contained and verifiable. +These subquestions are essentially atomic questions, relying primarily on their +current state rather than accumulated history, similar to the memoryless +transitions in a Markov process. Based on this observation, we propose Atom of +Thoughts (AoT), where each state transition in the reasoning process consists +of decomposing the current question into a dependency-based directed acyclic +graph and contracting its subquestions, forming a new atomic question state. 
+This iterative decomposition-contraction process continues until reaching +directly solvable atomic questions, naturally realizing Markov transitions +between question states. Furthermore, these atomic questions can be seamlessly +integrated into existing test-time scaling methods, enabling AoT to serve as a +plug-in enhancement for improving reasoning capabilities. Experiments across +six benchmarks demonstrate the effectiveness of AoT both as a standalone +framework and a plug-in enhancement. Notably, on HotpotQA, when applied to +gpt-4o-mini, AoT achieves an 80.6% F1 score, surpassing o3-mini by 3.4% and +DeepSeek-R1 by 10.6%. The code will be available at +https://github.com/qixucen/atom. -摘要:組織病理學病例的解讀是許多重要的醫學診斷和治療決策的基礎。值得注意的是,這個過程通常需要病理學家整合和總結每個病例的許多玻片中的發現。迄今為止,計算機病理學中現有的視覺語言功能在很大程度上僅限於小範圍的感興趣區域、低倍率下的較大區域或單一的全玻片影像 (WSI)。這限制了跨多個 WSI 中多個高倍率區域的發現的解讀。通過使用 Gemini 1.5 Flash,一個具有 100 萬個令牌上下文視窗的大型多模態模型 (LMM),我們展示了從多個 WSI 中多達 40,000 個 768x768 像素圖像貼片(10 倍放大)生成底線診斷的能力。這相當於 1 fps 下長達 11 小時的影片。專家病理學家評估表明,生成的報告文字在臨床上是準確的,並且等同於或優於 68%(95% CI:[60%,76%])的多玻片範例(最多 5 個玻片)的原始報告。儘管對於有 6 個或更多玻片的範例,其性能下降,但這項研究證明了利用現代 LMM 的長上下文功能來應對獨特挑戰性的醫療報告生成任務,其中每個病例可能包含數千個影像貼片,這項任務的前景。 +摘要:大型語言模型 (LLM) 透過訓練時間擴充來達成卓越的效能,而測試時間擴充透過在推論期間進行有效的推理,進一步提升其能力。然而,隨著推理規模的擴大,現有的測試時間擴充方法會受到累積的歷史資訊影響,這不僅會浪費運算資源,還會干擾有效的推理。為了解決這個問題,我們觀察到複雜的推理進程通常是透過解決一系列獨立的子問題來達成,每個子問題都是獨立且可驗證的。這些子問題本質上是原子問題,主要依賴於它們的當前狀態,而不是累積的歷史,類似於馬可夫過程中的無記憶轉換。基於這個觀察,我們提出了思想原子 (AoT),其中推理過程中每個狀態轉換都包含將當前問題分解為基於依賴關係的有向無環圖,並收縮其子問題,形成新的原子問題狀態。這個反覆的分解收縮過程會持續進行,直到達到可直接解決的原子問題,自然地實現問題狀態之間的馬可夫轉換。此外,這些原子問題可以無縫整合到現有的測試時間擴充方法中,讓 AoT 可以作為外掛程式強化功能,以改善推理能力。橫跨六個基準的實驗證明了 AoT 作為獨立架構和外掛程式強化的有效性。值得注意的是,在 HotpotQA 上,當應用於 gpt-4o-mini 時,AoT 達到了 80.6% 的 F1 分數,比 o3-mini 高出 3.4%,比 DeepSeek-R1 高出 10.6%。程式碼將在 https://github.com/qixucen/atom 上提供。 -##### **Tempo: Helping Data Scientists and Domain Experts Collaboratively Specify Predictive Modeling Tasks** -2502.10526v2 by Venkatesh Sivaraman, Anika Vaishampayan, Xiaotong Li, Brian R Buck, Ziyong Ma, Richard D Boyce, Adam 
Perer +##### **Generating Text from Uniform Meaning Representation** +2502.11973v1 by Emma Markle, Reihaneh Iranmanesh, Shira Wein -Temporal predictive models have the potential to improve decisions in health -care, public services, and other domains, yet they often fail to effectively -support decision-makers. Prior literature shows that many misalignments between -model behavior and decision-makers' expectations stem from issues of model -specification, namely how, when, and for whom predictions are made. However, -model specifications for predictive tasks are highly technical and difficult -for non-data-scientist stakeholders to interpret and critique. To address this -challenge we developed Tempo, an interactive system that helps data scientists -and domain experts collaboratively iterate on model specifications. Using -Tempo's simple yet precise temporal query language, data scientists can quickly -prototype specifications with greater transparency about pre-processing -choices. Moreover, domain experts can assess performance within data subgroups -to validate that models behave as expected. Through three case studies, we -demonstrate how Tempo helps multidisciplinary teams quickly prune infeasible -specifications and identify more promising directions to explore. +Uniform Meaning Representation (UMR) is a recently developed graph-based +semantic representation, which expands on Abstract Meaning Representation (AMR) +in a number of ways, in particular through the inclusion of document-level +information and multilingual flexibility. In order to effectively adopt and +leverage UMR for downstream tasks, efforts must be placed toward developing a +UMR technological ecosystem. 
Though still limited amounts of UMR annotations +have been produced to date, in this work, we investigate the first approaches +to producing text from multilingual UMR graphs: (1) a pipeline conversion of +UMR to AMR, then using AMR-to-text generation models, (2) fine-tuning large +language models with UMR data, and (3) fine-tuning existing AMR-to-text +generation models with UMR data. Our best performing model achieves a +multilingual BERTscore of 0.825 for English and 0.882 for Chinese when compared +to the reference, which is a promising indication of the effectiveness of +fine-tuning approaches for UMR-to-text generation with even limited amounts of +UMR data. -摘要:時序預測模型有潛力改善醫療保健、公共服務和其他領域的決策,但它們經常無法有效支援決策者。先前的文獻顯示,模型行為與決策者期望之間的許多不一致源自於模型規範問題,也就是如何、何時以及針對誰進行預測。然而,預測任務的模型規範非常技術化,非數據科學家利害關係人難以解讀和批評。為了應對此挑戰,我們開發了 Tempo,一個互動式系統,可協助數據科學家和領域專家協同反覆運算模型規範。透過使用 Tempo 簡單但精確的時序查詢語言,數據科學家可以快速建構規範原型,並更透明地了解前處理的選擇。此外,領域專家可以評估資料子群組內的效能,以驗證模型是否如預期般運作。透過三個案例研究,我們展示 Tempo 如何協助跨領域團隊快速刪減不可行的規範,並找出更有希望探索的方向。 +摘要:統一語意表示 (UMR) 是一種最近開發的基於圖形的語意表示,它在許多方面擴展了抽象語意表示 (AMR),特別是透過納入文件層級資訊和多語言靈活性。為了有效採用和利用下游任務的 UMR,必須投入精力開發 UMR 技術生態系統。雖然到目前為止產生的 UMR 標註數量仍然有限,但在這項工作中,我們探討了從多語言 UMR 圖形產生文字的第一種方法:(1) 將 UMR 轉換為 AMR 的管道,然後使用 AMR 轉文字生成模型,(2) 使用 UMR 資料微調大型語言模型,以及 (3) 使用 UMR 資料微調現有的 AMR 轉文字生成模型。與參考相比,我們效能最好的模型在英文中達到 0.825 的多語言 BERT 分數,在中文中達到 0.882,這表示使用 UMR 資料進行 UMR 轉文字生成的微調方法具有良好的效果,即使 UMR 資料數量有限。 -##### **A Robust Attack: Displacement Backdoor Attack** -2502.10490v1 by Yong Li, Han Gao +##### **GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs** +2502.11925v1 by Yi Fang, Bowen Jin, Jiacheng Shen, Sirui Ding, Qiaoyu Tan, Jiawei Han -As artificial intelligence becomes more prevalent in our lives, people are -enjoying the convenience it brings, but they are also facing hidden threats, -such as data poisoning and adversarial attacks. 
These threats can have -disastrous consequences for the application of artificial intelligence, -especially for some applications that take effect immediately, such as -autonomous driving and medical fields. Among these threats, backdoor attacks -have left a deep impression on people with their concealment and simple -deployment, making them a threat that cannot be ignored, however, in the -process of deploying the backdoor model, the backdoor attack often has some -reasons that make it unsatisfactory in real-world applications, such as jitter -and brightness changes. Based on this, we propose a highly robust backdoor -attack that shifts the target sample and combines it with itself to form a -backdoor sample, the Displacement Backdoor Attack(DBA). Experimental results -show that the DBA attack can resist data augmentation that simulates real-world -differences, such as rotation and cropping. +The rapid development of Multimodal Large Language Models (MLLMs) has enabled +the integration of multiple modalities, including texts and images, within the +large language model (LLM) framework. However, texts and images are usually +interconnected, forming a multimodal attributed graph (MMAG). It is +underexplored how MLLMs can incorporate the relational information +(i.e., graph structure) and semantic information (i.e., texts +and images) on such graphs for multimodal comprehension and generation. In this +paper, we propose GraphGPT-o, which supports omni-multimodal understanding and +creation on MMAGs. We first comprehensively study linearization variants to +transform semantic and structural information as input for MLLMs. Then, we +propose a hierarchical aligner that enables deep graph encoding, bridging the +gap between MMAGs and MLLMs. Finally, we explore the inference choices, +adapting MLLM to interleaved text and image generation in graph scenarios. 
+Extensive experiments on three datasets from different domains demonstrate the +effectiveness of our proposed method. Datasets and codes will be open-sourced +upon acceptance. -摘要:随着人工智能在我们的生活中变得越来越普遍,人们正在享受它带来的便利,但也面临着隐藏的威胁,例如数据中毒和对抗性攻击。这些威胁可能对人工智能的应用产生灾难性后果,特别是对于一些立即生效的应用,例如自动驾驶和医疗领域。在这些威胁中,后门攻击以其隐蔽性和简单的部署给人们留下了深刻的印象,使其成为不可忽视的威胁,然而,在部署后门模型的过程中,后门攻击往往存在一些使其在实际应用中不尽如人意的原因,例如抖动和亮度变化。基于此,我们提出了一种高度鲁棒的后门攻击,该攻击对目标样本进行平移并将其与自身结合以形成后门样本,即置换后门攻击 (DBA)。实验结果表明,DBA 攻击可以抵抗模拟真实世界差异的数据增强,例如旋转和裁剪。 +摘要:多模态大语言模型 (MLLM) 的快速发展,促进了文本和图像等多种模态在大型语言模型 (LLM) 框架内的整合。然而,文本和图像通常是相互关联的,形成多模态属性图 (MMAG)。对于 MLLM 如何整合此类图上的关系信息(即图结构)和语义信息(即文本和图像)以进行多模态理解和生成,目前仍未得到充分探索。在本文中,我们提出了 GraphGPT-o,它支持在 MMAG 上进行全方位多模态理解和创建。我们首先全面研究了线性化变体,以将语义和结构信息转换为 MLLM 的输入。然后,我们提出了一个分层对齐器,它支持深度图编码,弥合了 MMAG 和 MLLM 之间的差距。最后,我们探索了推理选择,使 MLLM 适应图场景中交错的文本和图像生成。来自不同领域的三组数据集上的大量实验表明了我们提出的方法的有效性。数据集和代码将在被接受后开源。 -##### **3D ReX: Causal Explanations in 3D Neuroimaging Classification** -2502.12181v1 by Melane Navaratnarajah, Sophie A. Martin, David A. Kelly, Nathan Blake, Hana Chocker +##### **Exploring LLM-based Student Simulation for Metacognitive Cultivation** +2502.11678v1 by Haoxuan Li, Jifan Yu, Xin Cong, Yang Dang, Yisi Zhan, Huiqin Liu, Zhiyuan Liu -Explainability remains a significant problem for AI models in medical -imaging, making it challenging for clinicians to trust AI-driven predictions. -We introduce 3D ReX, the first causality-based post-hoc explainability tool for -3D models. 3D ReX uses the theory of actual causality to generate -responsibility maps which highlight the regions most crucial to the model's -decision. We test 3D ReX on a stroke detection model, providing insight into -the spatial distribution of features relevant to stroke. +Metacognitive education plays a crucial role in cultivating students' +self-regulation and reflective thinking, providing essential support for those +with learning difficulties through academic advising. 
Simulating students with +insufficient learning capabilities using large language models offers a +promising approach to refining pedagogical methods without ethical concerns. +However, existing simulations often fail to authentically represent students' +learning struggles and face challenges in evaluation due to the lack of +reliable metrics and ethical constraints in data collection. To address these +issues, we propose a pipeline for automatically generating and filtering +high-quality simulated student agents. Our approach leverages a two-round +automated scoring system validated by human experts and employs a score +propagation module to obtain more consistent scores across the student graph. +Experimental results demonstrate that our pipeline efficiently identifies +high-quality student agents, and we discuss the traits that influence the +simulation's effectiveness. By simulating students with varying degrees of +learning difficulties, our work paves the way for broader applications in +personalized learning and educational assessment. 
-摘要:解釋性仍然是醫療影像中 AI 模型的一大問題,這使得臨床醫生難以信任 AI 驅動的預測。 -我們引入了 3D ReX,這是第一個用於 3D 模型的基於因果關係的事後解釋性工具。3D ReX 使用實際因果關係理論來生成責任圖,該圖突出了對模型決策至關重要的區域。我們在中風檢測模型上測試了 3D ReX,提供了與中風相關特徵的空間分佈的見解。 +摘要:元認知教育在培養學生的自我調節和反思性思考中發揮著至關重要的作用,通過學術諮詢為有學習困難的人提供必要的支持。使用大型語言模型模擬學習能力不足的學生提供了一種有前途的方法,可以在沒有道德問題的情況下改進教學方法。然而,現有的模擬通常無法真實地反映學生的學習困難,並且由於缺乏可靠的指標和數據收集中的道德約束,在評估中面臨挑戰。為了解決這些問題,我們提出了一個自動生成和過濾高質量模擬學生代理的管道。我們的做法利用了由人類專家驗證的兩輪自動評分系統,並採用分數傳播模組來獲得跨學生圖表更一致的分數。實驗結果表明,我們的管道有效地識別了高質量的學生代理,並且我們討論了影響模擬效果的特質。通過模擬具有不同程度學習困難的學生,我們的研究為個性化學習和教育評估中的更廣泛應用鋪平了道路。 -##### **Analyzing Patient Daily Movement Behavior Dynamics Using Two-Stage Encoding Model** -2502.09947v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott +##### **Ontology-Guided Reverse Thinking Makes Large Language Models Stronger on Knowledge Graph Question Answering** +2502.11491v1 by Runxuan Liu, Bei Luo, Jiaqi Li, Baoxin Wang, Ming Liu, Dayong Wu, Shijin Wang, Bing Qin -In the analysis of remote healthcare monitoring data, time series -representation learning offers substantial value in uncovering deeper patterns -of patient behavior, especially given the fine temporal granularity of the -data. In this study, we focus on a dataset of home activity records from people -living with Dementia. We propose a two-stage self-supervised learning approach. -The first stage involves converting time-series activities into text strings, -which are then encoded by a fine-tuned language model. In the second stage, -these time-series vectors are bi-dimensionalized for applying PageRank method, -to analyze latent state transitions to quantitatively assess participants -behavioral patterns and identify activity biases. These insights, combined with -diagnostic data, aim to support personalized care interventions. +Large language models (LLMs) have shown remarkable capabilities in natural +language processing. However, in knowledge graph question answering tasks +(KGQA), there remains the issue of answering questions that require multi-hop +reasoning. 
Existing methods rely on entity vector matching, but the purpose of +the question is abstract and difficult to match with specific entities. As a +result, it is difficult to establish reasoning paths to the purpose, which +leads to information loss and redundancy. To address this issue, inspired by +human reverse thinking, we propose Ontology-Guided Reverse Thinking (ORT), a +novel framework that constructs reasoning paths from purposes back to +conditions. ORT operates in three key phases: (1) using LLM to extract purpose +labels and condition labels, (2) constructing label reasoning paths based on +the KG ontology, and (3) using the label reasoning paths to guide knowledge +retrieval. Experiments on the WebQSP and CWQ datasets show that ORT achieves +state-of-the-art performance and significantly enhances the capability of LLMs +for KGQA. -摘要:在遠程醫療監控數據分析中,時序表示學習在揭示患者行為的更深層模式方面提供了實質性的價值,特別是考慮到數據的精細時間粒度。在本研究中,我們專注於痴呆症患者居家活動記錄的數據集。我們提出了一種兩階段的自我監督學習方法。第一階段涉及將時序活動轉換為文本串,然後由微調語言模型編碼。在第二階段,這些時序向量被雙維化以應用 PageRank 方法,分析潛在狀態轉換以定量評估參與者的行為模式並識別活動偏差。這些見解與診斷數據相結合,旨在支持個性化護理干預。 +摘要:大型語言模型 (LLM) 在自然語言處理中展現出卓越的能力。然而,在知識圖譜問答任務 (KGQA) 中,仍然存在需要多跳推理才能回答問題的問題。現有方法依賴於實體向量匹配,但問題的目的是抽象的,難以與特定實體匹配。因此,很難建立推理路徑來達成目的,這會導致資訊遺失和冗餘。為了解決這個問題,在人類逆向思維的啟發下,我們提出了基於本体的逆向思維 (ORT),這是一個創新的架構,可以從目的建構推理路徑,再回推到條件。ORT 運作在三個關鍵階段:(1) 使用 LLM 萃取目的標籤和條件標籤,(2) 基於 KG 本体建構標籤推理路徑,以及 (3) 使用標籤推理路徑來引導知識擷取。在 WebQSP 和 CWQ 資料集上的實驗顯示,ORT 達到了最先進的效能,並顯著增強了 LLM 對 KGQA 的能力。 -##### **TransGUNet: Transformer Meets Graph-based Skip Connection for Medical Image Segmentation** -2502.09931v1 by Ju-Hyeon Nam, Nur Suriza Syazwany, Sang-Chul Lee +##### **GLTW: Joint Improved Graph Transformer and LLM via Three-Word Language for Knowledge Graph Completion** +2502.11471v1 by Kangyang Luo, Yuzhuo Bai, Cheng Gao, Shuzheng Si, Yingli Shen, Zhu Liu, Zhitong Wang, Cunliang Kong, Wenhao Li, Yufei Huang, Ye Tian, Xuantang Xiong, Lei Han, Maosong Sun -Skip connection engineering is primarily employed to address the semantic gap -between the 
encoder and decoder, while also integrating global dependencies to -understand the relationships among complex anatomical structures in medical -image segmentation. Although several models have proposed transformer-based -approaches to incorporate global dependencies within skip connections, they -often face limitations in capturing detailed local features with high -computational complexity. In contrast, graph neural networks (GNNs) exploit -graph structures to effectively capture local and global features. Leveraging -these properties, we introduce an attentional cross-scale graph neural network -(ACS-GNN), which enhances the skip connection framework by converting -cross-scale feature maps into a graph structure and capturing complex -anatomical structures through node attention. Additionally, we observed that -deep learning models often produce uninformative feature maps, which degrades -the quality of spatial attention maps. To address this problem, we integrated -entropy-driven feature selection (EFS) with spatial attention, calculating an -entropy score for each channel and filtering out high-entropy feature maps. Our -innovative framework, TransGUNet, comprises ACS-GNN and EFS-based spatial -attentio} to effectively enhance domain generalizability across various -modalities by leveraging GNNs alongside a reliable spatial attention map, -ensuring more robust features within the skip connection. Through comprehensive -experiments and analysis, TransGUNet achieved superior segmentation performance -on six seen and eight unseen datasets, demonstrating significantly higher -efficiency compared to previous methods. +Knowledge Graph Completion (KGC), which aims to infer missing or incomplete +facts, is a crucial task for KGs. However, integrating the vital structural +information of KGs into Large Language Models (LLMs) and outputting predictions +deterministically remains challenging. 
To address this, we propose a new method +called GLTW, which encodes the structural information of KGs and merges it with +LLMs to enhance KGC performance. Specifically, we introduce an improved Graph +Transformer (iGT) that effectively encodes subgraphs with both local and global +structural information and inherits the characteristics of language model, +bypassing training from scratch. Also, we develop a subgraph-based +multi-classification training objective, using all entities within KG as +classification objects, to boost learning efficiency. Importantly, we combine +iGT with an LLM that takes KG language prompts as input. Our extensive +experiments on various KG datasets show that GLTW achieves significant +performance gains compared to SOTA baselines. -摘要:跳躍連接工程主要用於解決編碼器和解碼器之間的語義鴻溝,同時還整合全局依賴關係以了解醫學影像分割中複雜解剖結構之間的關係。儘管有幾個模型提出了基於Transformer的架構來整合跳躍連接中的全局依賴關係,但它們在以高計算複雜度擷取詳細的局部特徵時常常面臨限制。相比之下,圖神經網路 (GNN) 利用圖結構有效擷取局部和全局特徵。利用這些屬性,我們引入了注意力跨尺度圖神經網路 (ACS-GNN),它通過將跨尺度特徵圖轉換為圖結構並通過節點注意力擷取複雜的解剖結構來增強跳躍連接框架。此外,我們觀察到深度學習模型通常會產生無意義的特徵圖,這會降低空間注意力圖的品質。為了解決這個問題,我們將熵驅動特徵選擇 (EFS) 與空間注意力整合在一起,為每個通道計算熵分數並濾出高熵特徵圖。我們創新的框架 TransGUNet 包含 ACS-GNN 和基於 EFS 的空間注意力,通過利用 GNN 以及可靠的空間注意力圖有效增強跨各種模態的域泛化能力,確保跳躍連接中更強大的特徵。透過全面的實驗和分析,TransGUNet 在六個已見和八個未見的資料集上實現了優異的分割效能,證明與先前的方法相比,效率顯著提高。 +摘要:知識圖譜補全 (KGC) 旨在推論遺失或不完整的 +事實,是 KGs 的一項關鍵任務。然而,將 KGs 的重要結構 +資訊整合至大型語言模型 (LLM),並確定性地輸出預測結果,仍然是一項挑戰。為了解決這個問題,我們提出了一種新的方法,稱為 GLTW,它編碼了 KGs 的結構資訊,並將其與 LLM 合併,以增強 KGC 的效能。具體來說,我們引進了一個改良的圖形轉換器 (iGT),它能有效地編碼具有局部和全域結構資訊的子圖,並繼承語言模型的特徵,繞過從頭開始的訓練。此外,我們開發了一個基於子圖的多分類訓練目標,使用 KG 中的所有實體作為 +分類物件,以提升學習效率。重要的是,我們將 iGT 與一個將 KG 語言提示作為輸入的 LLM 結合起來。我們在各種 KG 資料集上進行的廣泛實驗顯示,與 SOTA 基準線相比,GLTW 獲得了顯著的效能提升。 -##### **Video2Policy: Scaling up Manipulation Tasks in Simulation through Internet Videos** -2502.09886v1 by Weirui Ye, Fangchen Liu, Zheng Ding, Yang Gao, Oleh Rybkin, Pieter Abbeel +##### **Large Language-Geometry Model: When LLM meets Equivariance** +2502.11149v2 by Zongzhao Li, Jiacheng Cen, Bing Su, Wenbing Huang, Tingyang Xu, Yu 
Rong, Deli Zhao -Simulation offers a promising approach for cheaply scaling training data for -generalist policies. To scalably generate data from diverse and realistic -tasks, existing algorithms either rely on large language models (LLMs) that may -hallucinate tasks not interesting for robotics; or digital twins, which require -careful real-to-sim alignment and are hard to scale. To address these -challenges, we introduce Video2Policy, a novel framework that leverages -internet RGB videos to reconstruct tasks based on everyday human behavior. Our -approach comprises two phases: (1) task generation in simulation from videos; -and (2) reinforcement learning utilizing in-context LLM-generated reward -functions iteratively. We demonstrate the efficacy of Video2Policy by -reconstructing over 100 videos from the Something-Something-v2 (SSv2) dataset, -which depicts diverse and complex human behaviors on 9 different tasks. Our -method can successfully train RL policies on such tasks, including complex and -challenging tasks such as throwing. Finally, we show that the generated -simulation data can be scaled up for training a general policy, and it can be -transferred back to the real robot in a Real2Sim2Real way. +Accurately predicting 3D structures and dynamics of physical systems is +crucial in scientific applications. Existing approaches that rely on geometric +Graph Neural Networks (GNNs) effectively enforce $\mathrm{E}(3)$-equivariance, +but they often fall short in leveraging extensive broader information. While direct +application of Large Language Models (LLMs) can incorporate external knowledge, +they lack the capability for spatial reasoning with guaranteed equivariance. In +this paper, we propose EquiLLM, a novel framework for representing 3D physical +systems that seamlessly integrates E(3)-equivariance with LLM capabilities. +Specifically, EquiLLM comprises four key components: geometry-aware prompting, +an equivariant encoder, an LLM, and an equivariant adaptor. 
Essentially, the +LLM guided by the instructive prompt serves as a sophisticated invariant +feature processor, while 3D directional information is exclusively handled by +the equivariant encoder and adaptor modules. Experimental results demonstrate +that EquiLLM delivers significant improvements over previous methods across +molecular dynamics simulation, human motion simulation, and antibody design, +highlighting its promising generalizability. -摘要:模擬提供了一種有前途的方法,可以用於擴展訓練資料,以制定通才政策。為了從多樣化且逼真的任務中可擴充地產生資料,現有演算法仰賴大型語言模型 (LLM),這些模型可能會產生對機器人技術不感興趣的任務;或者仰賴數位雙胞胎,這需要仔細地將真實環境與模擬環境對齊,而且很難擴充。為了應對這些挑戰,我們引入了 Video2Policy,這是一個新穎的架構,它利用網路上的 RGB 影片,根據日常人類行為來重建任務。我們的做法包含兩個階段:(1) 從影片中在模擬環境中產生任務;以及 (2) 利用在情境中由 LLM 產生的獎勵函數,反覆進行強化學習。我們透過重建 Something-Something-v2 (SSv2) 資料集中的 100 多個影片來展示 Video2Policy 的效能,這些影片描繪了 9 項不同任務中多樣化且複雜的人類行為。我們的做法可以在這些任務上成功訓練 RL 政策,包括複雜且具挑戰性的任務,例如投擲。最後,我們展示了產生的模擬資料可以擴充到訓練一般政策,而且可以透過 Real2Sim2Real 的方式轉移回真實機器人。 +摘要:準確預測物理系統的 3D 結構和動力學在科學應用中至關重要。現有依賴於幾何圖神經網路 (GNN) 的方法有效地強制執行了 $\mathrm{E}(3)$-等變性,但它們通常無法利用廣泛的更廣泛資訊。儘管大型語言模型 (LLM) 的直接應用可以納入外部知識,但它們缺乏保證等變性的空間推理能力。在本文中,我們提出了 EquiLLM,一個用於表示 3D 物理系統的新框架,它將 E(3)-等變性與 LLM 能力無縫整合。具體來說,EquiLLM 包含四個關鍵組成部分:感知幾何的提示、等變編碼器、LLM 和等變適配器。從本質上講,由指導性提示引導的 LLM 作為一個複雜的不變特徵處理器,而 3D 方向資訊則由等變編碼器和適配器模組獨家處理。實驗結果表明,EquiLLM 在分子動力學模擬、人類運動模擬和抗體設計方面比以前的方法有了顯著的改進,突顯了其有希望的泛化能力。 -##### **HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation** -2502.09838v2 by Tianwei Lin, Wenqiao Zhang, Sijing Li, Yuqian Yuan, Binhe Yu, Haoyuan Li, Wanggui He, Hao Jiang, Mengze Li, Xiaohui Song, Siliang Tang, Jun Xiao, Hui Lin, Yueting Zhuang, Beng Chin Ooi +##### **Beyond Pairwise: Global Zero-shot Temporal Graph Generation** +2502.11114v1 by Alon Eirew, Kfir Bar, Ido Dagan -We present HealthGPT, a powerful Medical Large Vision-Language Model -(Med-LVLM) that integrates medical visual comprehension and generation -capabilities within a unified autoregressive paradigm. 
Our bootstrapping -philosophy is to progressively adapt heterogeneous comprehension and generation -knowledge to pre-trained large language models (LLMs). This is achieved through -a novel heterogeneous low-rank adaptation (H-LoRA) technique, which is -complemented by a tailored hierarchical visual perception approach and a -three-stage learning strategy. To effectively learn the HealthGPT, we devise a -comprehensive medical domain-specific comprehension and generation dataset -called VL-Health. Experimental results demonstrate exceptional performance and -scalability of HealthGPT in medical visual unified tasks. Our project can be -accessed at https://github.com/DCDmllm/HealthGPT. +Temporal relation extraction (TRE) is a fundamental task in natural language +processing (NLP) that involves identifying the temporal relationships between +events in a document. Despite the advances in large language models (LLMs), +their application to TRE remains limited. Most existing approaches rely on +pairwise classification, in which event pairs are considered individually, +leading to computational inefficiency and a lack of global consistency in the +resulting temporal graph. In this work, we propose a novel zero-shot method for +TRE that generates a document's complete temporal graph at once, then applies +transitive constraints optimization to refine predictions and enforce temporal +consistency across relations. Additionally, we introduce OmniTemp, a new +dataset with complete annotations for all pairs of targeted events within a +document. Through experiments and analyses, we demonstrate that our method +significantly outperforms existing zero-shot approaches while achieving +competitive performance with supervised models. 
-摘要:我們提出 HealthGPT,一種強大的醫學大型視覺語言模型 (Med-LVLM),它整合了醫學視覺理解和生成能力於一個統一的自動迴歸範例中。我們的引導哲學是逐步調整異質理解和生成知識以預先訓練大型語言模型 (LLM)。這是通過一種新穎的異質低秩適應 (H-LoRA) 技術實現的,該技術由量身定制的分層視覺感知方法和三階段學習策略補充。為了有效學習 HealthGPT,我們設計了一個全面的醫學領域特定理解和生成數據集,稱為 VL-Health。實驗結果證明了 HealthGPT 在醫學視覺統一任務中的卓越性能和可擴展性。我們的項目可以在 https://github.com/DCDmllm/HealthGPT 中訪問。 +摘要:時間關係抽取 (TRE) 是自然語言處理 (NLP) 中的一項基本任務,涉及識別文件中事件之間的時間關係。儘管大型語言模型 (LLM) 取得進展,但它們在 TRE 中的應用仍然有限。現有的大多數方法依賴於成對分類,其中事件對被單獨考慮,導致計算效率低下且在生成的時序圖中缺乏全局一致性。在這項工作中,我們提出了一種新穎的 TRE 零次學習方法,它可以一次生成文件的完整時序圖,然後應用遞移約束最佳化來優化預測並強制關係之間的時間一致性。此外,我們引入了 OmniTemp,這是一個新的數據集,其中包含文件內所有目標事件對的完整註解。通過實驗和分析,我們證明了我們的方法明顯優於現有的零次學習方法,同時實現了與監督模型相當的性能。 -##### **Incentivize without Bonus: Provably Efficient Model-based Online Multi-agent RL for Markov Games** -2502.09780v1 by Tong Yang, Bo Dai, Lin Xiao, Yuejie Chi +##### **Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications** +2502.11108v1 by Alexandru Lecu, Adrian Groza, Lezan Hawizy -Multi-agent reinforcement learning (MARL) lies at the heart of a plethora of -applications involving the interaction of a group of agents in a shared unknown -environment. A prominent framework for studying MARL is Markov games, with the -goal of finding various notions of equilibria in a sample-efficient manner, -such as the Nash equilibrium (NE) and the coarse correlated equilibrium (CCE). -However, existing sample-efficient approaches either require tailored -uncertainty estimation under function approximation, or careful coordination of -the players. In this paper, we propose a novel model-based algorithm, called -VMG, that incentivizes exploration via biasing the empirical estimate of the -model parameters towards those with a higher collective best-response values of -all the players when fixing the other players' policies, thus encouraging the -policy to deviate from its current equilibrium for more exploration. 
VMG is -oblivious to different forms of function approximation, and permits -simultaneous and uncoupled policy updates of all players. Theoretically, we -also establish that VMG achieves a near-optimal regret for finding both the NEs -of two-player zero-sum Markov games and CCEs of multi-player general-sum Markov -games under linear function approximation in an online environment, which -nearly match their counterparts with sophisticated uncertainty quantification. +Large language models (LLMs) have significantly advanced the field of natural +language generation. However, they frequently generate unverified outputs, +which compromises their reliability in critical applications. In this study, we +propose an innovative framework that combines structured biomedical knowledge +with LLMs through a retrieval-augmented generation technique. Our system +develops a thorough knowledge graph by identifying and refining causal +relationships and named entities from medical abstracts related to age-related +macular degeneration (AMD). Using a vector-based retrieval process and a +locally deployed language model, our framework produces responses that are both +contextually relevant and verifiable, with direct references to clinical +evidence. Experimental results show that this method notably decreases +hallucinations, enhances factual precision, and improves the clarity of +generated responses, providing a robust solution for advanced biomedical +chatbot applications. 
-摘要:多智能體強化學習 (MARL) 是一系列應用程式的心臟,這些應用程式涉及一群智能體在一個共用未知環境中的互動。研究 MARL 的一個著名框架是馬可夫博弈,其目標是用樣本有效率的方式找出各種均衡概念,例如納許均衡 (NE) 和粗相關均衡 (CCE)。然而,現有的樣本有效率方法需要在函數逼近下進行量身打造的不確定性估計,或謹慎協調參與者。在本文中,我們提出了一種新的基於模型的演算法,稱為 VMG,它透過將模型參數的經驗估計值偏向於在固定其他參與者政策時所有參與者的集體最佳反應值,從而激勵探索,進而鼓勵政策偏離其當前均衡以進行更多探索。VMG 不會忽略函數逼近的不同形式,並允許所有參與者同時進行非耦合的政策更新。在理論上,我們也建立了 VMG 在線上環境中使用線性函數逼近來尋找雙人零和馬可夫博弈的 NE 和多人一般和馬可夫博弈的 CCE 時,會獲得接近最佳的後悔,這幾乎與其在不確定性量化方面更為複雜的對應物相匹配。 +摘要:大型語言模型 (LLM) 已大幅推動自然語言生成的領域。然而,它們經常產生未經驗證的輸出,這會損害它們在關鍵應用中的可靠性。在本研究中,我們提出了一個創新的框架,透過檢索增強生成技術,將結構化的生物醫學知識與 LLM 結合。我們的系統透過識別和精煉與年齡相關性黃斑部病變 (AMD) 相關的醫學摘要中的因果關係和命名實體,開發一個徹底的知識圖譜。我們的框架使用基於向量的檢索流程和本地部署的語言模型,產生在脈絡上相關且可驗證的回應,並直接參考臨床證據。實驗結果顯示,此方法顯著減少了幻覺、增強了事實準確性,並改善了生成回應的清晰度,為先進的生物醫學聊天機器人應用程式提供了穩健的解決方案。 -##### **The AI-Therapist Duo: Exploring the Potential of Human-AI Collaboration in Personalized Art Therapy for PICS Intervention** -2502.09757v1 by Bereket A. Yilma, Chan Mi Kim, Geke Ludden, Thomas van Rompay, Luis A. Leiva +##### **Beyond Similarity: A Gradient-based Graph Method for Instruction Tuning Data Selection** +2502.11062v1 by Yang Zhao, Li Du, Xiao Ding, Yangou Ouyang, Hepeng Wang, Kai Xiong, Jinglong Gao, Zhouhao Sun, Dongliang Xu, Yang Qing, Dongchen Li, Bing Qin, Ting Liu -Post-intensive care syndrome (PICS) is a multifaceted condition that arises -from prolonged stays in an intensive care unit (ICU). While preventing PICS -among ICU patients is becoming increasingly important, interventions remain -limited. Building on evidence supporting the effectiveness of art exposure in -addressing the psychological aspects of PICS, we propose a novel art therapy -solution through a collaborative Human-AI approach that enhances personalized -therapeutic interventions using state-of-the-art Visual Art Recommendation -Systems. We developed two Human-in-the-Loop (HITL) personalization methods and -assessed their impact through a large-scale user study (N=150). 
Our findings -demonstrate that this Human-AI collaboration not only enhances the -personalization and effectiveness of art therapy but also supports therapists -by streamlining their workload. While our study centres on PICS intervention, -the results suggest that human-AI collaborative Art therapy could potentially -benefit other areas where emotional support is critical, such as cases of -anxiety and depression. +Large language models (LLMs) have shown great potential across various +industries due to their remarkable ability to generalize through instruction +tuning. However, the limited availability of domain-specific data significantly +hampers their performance on specialized tasks. While existing methods +primarily focus on selecting training data from general datasets that are +similar to the target domain, they often fail to consider the joint +distribution of instructions, resulting in inefficient learning and suboptimal +knowledge transfer. To address these challenges, we introduce G2IS +(Gradient-based Graph Instruction Selection), a novel method that constructs a +mixed gradient-based instruction graph to capture the joint distribution and +interdependencies between instructions. By accounting for the relationships +between instructions, G2IS improves domain adaptation efficiency. Additionally, +we propose a gradient walk algorithm to refine the data selection process, +enhancing both training effectiveness and efficiency. Our experiments +demonstrate that G2IS outperforms traditional methods across various domain +adaptation tasks, yielding significant performance gains, particularly in +complex, data-scarce scenarios. These results underscore the potential of G2IS +in advancing the development of large, domain-specific models. 
-摘要:重症後症候群 (PICS) 是一種多面向的疾病,源自於在加護病房 (ICU) 長期住院。雖然預防重症後症候群在加護病房患者中正變得越來越重要,但介入措施仍然有限。建立在支持藝術接觸在解決重症後症候群心理層面的證據上,我們提出一個創新的藝術療法解決方案,透過協作式的人工智慧方法,使用最先進的視覺藝術推薦系統,增強個人化的治療介入。我們開發了兩種人機迴路 (HITL) 個人化方法,並透過大規模使用者研究 (N=150) 評估其影響。我們的發現證明,這種人機協作不僅增強了藝術治療的個人化和有效性,也透過簡化治療師的工作量來提供支援。雖然我們的研究中心在重症後症候群介入,但結果顯示,人機協作藝術療法有可能對其他需要情緒支持的領域有益,例如焦慮和憂鬱症。 +摘要:大型語言模型 (LLM) 因其透過指令微調而具備的卓越泛化能力,在各產業中展現出極大的潛力。然而,特定領域資料的取得有限,大幅影響其在專業任務上的表現。現有方法主要專注於從與目標領域類似的通用資料集中選取訓練資料,但它們通常未能考量指令的聯合分佈,導致學習效率不彰且知識傳遞不佳。為了應對這些挑戰,我們引進 G2IS(基於梯度的圖形指令選取),這是一種創新的方法,可建構一個混合的基於梯度的指令圖形,以擷取指令之間的聯合分佈和相互依賴性。透過考量指令之間的關係,G2IS 提升了領域適應的效率。此外,我們提出了一種梯度漫步演算法來優化資料選取程序,同時提升訓練效能和效率。我們的實驗證明,G2IS 在各種領域適應任務中優於傳統方法,產生顯著的效能提升,特別是在資料稀少的複雜場景中。這些結果突顯了 G2IS 在推動大型特定領域模型發展方面的潛力。 -##### **A CNN Approach to Automated Detection and Classification of Brain Tumors** -2502.09731v1 by Md. Zahid Hasan, Abdullah Tamim, D. M. Asadujjaman, Md. Mahfujur Rahman, Md. Abu Ahnaf Mollick, Nosin Anjum Dristi, Abdullah-Al-Noman +##### **CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models** +2502.11008v1 by Yuefei Chen, Vivek K. Singh, Jing Ma, Ruxiang Tang -Brain tumors require an assessment to ensure timely diagnosis and effective -patient treatment. Morphological factors such as size, location, texture, and -variable appearance complicate tumor inspection. Medical imaging presents -challenges, including noise and incomplete images. This research article -presents a methodology for processing Magnetic Resonance Imaging (MRI) data, -encompassing techniques for image classification and denoising. The effective -use of MRI images allows medical professionals to detect brain disorders, -including tumors. This research aims to categorize healthy brain tissue and -brain tumors by analyzing the provided MRI data. Unlike alternative methods -like Computed Tomography (CT), MRI technology offers a more detailed -representation of internal anatomical components, making it a suitable option -for studying data related to brain tumors. 
The MRI picture is first subjected -to a denoising technique utilizing an Anisotropic diffusion filter. The dataset -utilized for the models creation is a publicly accessible and validated Brain -Tumour Classification (MRI) database, comprising 3,264 brain MRI scans. SMOTE -was employed for data augmentation and dataset balancing. Convolutional Neural -Networks(CNN) such as ResNet152V2, VGG, ViT, and EfficientNet were employed for -the classification procedure. EfficientNet attained an accuracy of 98%, the -highest recorded. +Counterfactual reasoning is widely recognized as one of the most challenging +and intricate aspects of causality in artificial intelligence. In this paper, +we evaluate the performance of large language models (LLMs) in counterfactual +reasoning. In contrast to previous studies that primarily focus on commonsense +causal reasoning, where LLMs often rely on prior knowledge for inference, we +specifically assess their ability to perform counterfactual inference using a +set of formal rules. To support this evaluation, we introduce a new benchmark +dataset, CounterBench, comprising 1K counterfactual reasoning questions. The +dataset is designed with varying levels of difficulty, diverse causal graph +structures, distinct types of counterfactual questions, and multiple +nonsensical name variants. Our experiments demonstrate that counterfactual +reasoning poses a significant challenge for LLMs, with most models performing +at levels comparable to random guessing. To enhance LLM's counterfactual +reasoning ability, we propose a novel reasoning paradigm, CoIn, which guides +LLMs through iterative reasoning and backtracking to systematically explore +counterfactual solutions. Experimental results show that our method +significantly improves LLM performance on counterfactual reasoning tasks and +consistently enhances performance across different LLMs. Our dataset is +available at https://huggingface.co/datasets/CounterBench/CounterBench. 
-摘要:腦腫瘤需要評估以確保及時診斷和有效的患者治療。大小、位置、質地和可變外觀等形態因素會使腫瘤檢查複雜化。醫學影像會呈現挑戰,包括雜訊和不完整的影像。本研究文章提出了一種處理磁共振影像 (MRI) 資料的方法,包含影像分類和去噪技術。有效使用 MRI 影像可讓醫護人員偵測腦部疾病,包括腫瘤。本研究旨在透過分析提供的 MRI 資料來分類健康的腦組織和腦瘤。與電腦斷層掃描 (CT) 等替代方法不同,MRI 技術提供了更詳細的內部解剖結構表示,使其成為研究與腦瘤相關資料的合適選擇。MRI 影像會先使用各向異性擴散濾波器進行去噪技術處理。用於建立模型的資料集是一個公開且經過驗證的腦腫瘤分類 (MRI) 資料庫,包含 3,264 個腦部 MRI 掃描。SMOTE 用於資料擴充和資料集平衡。卷積神經網路 (CNN),例如 ResNet152V2、VGG、ViT 和 EfficientNet,用於分類程序。EfficientNet 達到了 98% 的準確度,是記錄到的最高值。 +摘要:反事實推理被廣泛認為是人工智慧中因果關係最具挑戰性和複雜的面向之一。在本文中,我們評估大型語言模型 (LLM) 在反事實推理中的表現。與主要關注常識因果推理,其中 LLM 經常依賴先驗知識來進行推理的先前研究不同,我們特別評估它們使用一組形式規則執行反事實推理的能力。為了支持此評估,我們引入了一個新的基準資料集 CounterBench,其中包含 1K 個反事實推理問題。資料集的設計具有不同的難度等級、多樣化的因果圖結構、不同類型的反事實問題和多種無意義的名稱變體。我們的實驗表明,反事實推理對 LLM 構成重大挑戰,大多數模型的表現與隨機猜測相當。為了增強 LLM 的反事實推理能力,我們提出了一種新穎的推理範例 CoIn,它引導 LLM 透過反覆推理和回溯系統性地探索反事實解。實驗結果表明,我們的方法顯著提升 LLM 在反事實推理任務上的表現,並持續增強不同 LLM 的表現。我們的資料集可在 https://huggingface.co/datasets/CounterBench/CounterBench 取得。 -##### **Evaluating GPT's Capability in Identifying Stages of Cognitive Impairment from Electronic Health Data** -2502.09715v1 by Yu Leng, Yingnan He, Colin Magdamo, Ana-Maria Vranceanu, Christine S. Ritchie, Shibani S. Mukerji, Lidia M. V. R. Moura, John R. Dickson, Deborah Blacker, Sudeshna Das +##### **RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation** +2502.10996v1 by Pengcheng Jiang, Lang Cao, Ruike Zhu, Minhao Jiang, Yunyi Zhang, Jimeng Sun, Jiawei Han -Identifying cognitive impairment within electronic health records (EHRs) is -crucial not only for timely diagnoses but also for facilitating research. -Information about cognitive impairment often exists within unstructured -clinician notes in EHRs, but manual chart reviews are both time-consuming and -error-prone. To address this issue, our study evaluates an automated approach -using zero-shot GPT-4o to determine stage of cognitive impairment in two -different tasks. 
First, we evaluated the ability of GPT-4o to determine the -global Clinical Dementia Rating (CDR) on specialist notes from 769 patients who -visited the memory clinic at Massachusetts General Hospital (MGH), and achieved -a weighted kappa score of 0.83. Second, we assessed GPT-4o's ability to -differentiate between normal cognition, mild cognitive impairment (MCI), and -dementia on all notes in a 3-year window from 860 Medicare patients. GPT-4o -attained a weighted kappa score of 0.91 in comparison to specialist chart -reviews and 0.96 on cases that the clinical adjudicators rated with high -confidence. Our findings demonstrate GPT-4o's potential as a scalable chart -review tool for creating research datasets and assisting diagnosis in clinical -settings in the future. +Retrieval-augmented language models often struggle with knowledge-intensive +tasks due to inefficient retrieval, unstructured knowledge integration, and +single-pass architectures. We present Retrieval-And-Structuring (RAS), a novel +framework that dynamically constructs and reasons over query-specific knowledge +graphs through iterative retrieval and structuring. RAS introduces four key +technical innovations: (1) a theme-scoped retrieval mechanism that efficiently +narrows the search space while maintaining retrieval quality, (2) an action +planning module that determines knowledge needs and generates focused +sub-queries, (3) a dynamic knowledge structuring approach that converts +retrieved text into an evolving knowledge graph, and (4) a graph-augmented +answering component that leverages the accumulated structured information. Our +framework achieves state-of-the-art performance, surpassing leading baselines +by 6.4% with open-source language models and 7.0% with proprietary models on +seven knowledge-intensive generation datasets across all evaluation metrics. +Detailed ablation studies verify the contribution of each technical component +to the overall system performance. 
-摘要:在電子健康記錄 (EHR) 中識別認知障礙不僅對及時診斷至關重要,也有助於促進研究。有關認知障礙的資訊通常存在於 EHR 中非結構化的臨床記錄中,但手動圖表審查既耗時又容易出錯。為了解決這個問題,我們的研究評估了一種自動化方法,使用零次學習的 GPT-4o 來確定兩種不同任務中的認知障礙分期。首先,我們評估了 GPT-4o 確定來自麻薩諸塞州總醫院 (MGH) 記憶診所 769 名患者的專科記錄的全球臨床痴呆評分 (CDR) 的能力,並獲得了 0.83 的加權 kappa 分數。其次,我們評估了 GPT-4o 在 860 名 Medicare 患者 3 年視窗中的所有記錄中區分正常認知、輕度認知障礙 (MCI) 和痴呆的能力。與專科圖表審查相比,GPT-4o 獲得了 0.91 的加權 kappa 分數,而對於臨床評審員以高度信心評估的病例,其加權 kappa 分數為 0.96。我們的研究結果證明了 GPT-4o 作為可擴充圖表審查工具的潛力,可用於建立研究資料集並協助未來臨床環境中的診斷。 +摘要:检索增强语言模型通常会因检索效率低、知识整合无结构和单次通过架构而难以胜任知识密集型任务。我们提出检索和结构化 (RAS),这是一个新颖的框架,通过迭代检索和结构化,动态构建和推理特定于查询的知识图谱。RAS 引入了四项关键技术创新:(1) 主题范围检索机制,在保持检索质量的同时有效缩小搜索空间,(2) 动作规划模块,确定知识需求并生成重点子查询,(3) 动态知识结构化方法,将检索到的文本转换为不断发展的知识图谱,以及 (4) 图谱增强型回答组件,利用累积的结构化信息。我们的框架实现了最先进的性能,在七个知识密集型生成数据集上,使用开源语言模型提高了 6.4%,使用专有模型提高了 7.0%,超越了领先的基线,且所有评估指标均如此。详细的消融研究验证了每个技术组件对整体系统性能的贡献。 -##### **Metamorphic Testing for Pose Estimation Systems** -2502.09460v1 by Matias Duran, Thomas Laurent, Ellen Rushe, Anthony Ventresque +##### **Developing Conversational Speech Systems for Robots to Detect Speech Biomarkers of Cognition in People Living with Dementia** +2502.10896v1 by Rohith Perumandla, Young-Ho Bae, Diego Izaguirre, Esther Hwang, Andrew Murphy, Long-Jing Hsu, Selma Sabanovic, Casey C. Bennett -Pose estimation systems are used in a variety of fields, from sports -analytics to livestock care. Given their potential impact, it is paramount to -systematically test their behaviour and potential for failure. This is a -complex task due to the oracle problem and the high cost of manual labelling -necessary to build ground truth keypoints. This problem is exacerbated by the -fact that different applications require systems to focus on different subjects -(e.g., human versus animal) or landmarks (e.g., only extremities versus whole -body and face), which makes labelled test data rarely reusable. 
To combat these -problems we propose MET-POSE, a metamorphic testing framework for pose -estimation systems that bypasses the need for manual annotation while assessing -the performance of these systems under different circumstances. MET-POSE thus -allows users of pose estimation systems to assess the systems in conditions -that more closely relate to their application without having to label an ad-hoc -test dataset or rely only on available datasets, which may not be adapted to -their application domain. While we define MET-POSE in general terms, we also -present a non-exhaustive list of metamorphic rules that represent common -challenges in computer vision applications, as well as a specific way to -evaluate these rules. We then experimentally show the effectiveness of MET-POSE -by applying it to Mediapipe Holistic, a state of the art human pose estimation -system, with the FLIC and PHOENIX datasets. With these experiments, we outline -numerous ways in which the outputs of MET-POSE can uncover faults in pose -estimation systems at a similar or higher rate than classic testing using hand -labelled data, and show that users can tailor the rule set they use to the -faults and level of accuracy relevant to their application. +This study presents the development and testing of a conversational speech +system designed for robots to detect speech biomarkers indicative of cognitive +impairments in people living with dementia (PLwD). The system integrates a +backend Python WebSocket server and a central core module with a large language +model (LLM) fine-tuned for dementia to process user input and generate robotic +conversation responses in real-time in less than 1.5 seconds. The frontend user +interface, a Progressive Web App (PWA), displays information and biomarker +score graphs on a smartphone in real-time to human users (PLwD, caregivers, +clinicians). 
Six speech biomarkers based on the existing literature - Altered +Grammar, Pragmatic Impairments, Anomia, Disrupted Turn-Taking, Slurred +Pronunciation, and Prosody Changes - were developed for the robot conversation +system using two datasets, one that included conversations of PLwD with a human +clinician (DementiaBank dataset) and one that included conversations of PLwD +with a robot (Indiana dataset). We also created a composite speech biomarker +that combined all six individual biomarkers into a single score. The speech +system's performance was first evaluated on the DementiaBank dataset showing +moderate correlation with MMSE scores, with the composite biomarker score +outperforming individual biomarkers. Analysis of the Indiana dataset revealed +higher and more variable biomarker scores, suggesting potential differences due +to study populations (e.g. severity of dementia) and the conversational +scenario (human-robot conversations are different from human-human). The +findings underscore the need for further research on the impact of +conversational scenarios on speech biomarkers and the potential clinical +applications of robotic speech systems. 
-摘要:姿勢估計系統應用於各種領域,從運動分析到牲畜照護。鑑於其潛在影響,系統性地測試其行為和故障潛力至關重要。由於預言機問題以及建立地面實況關鍵點所需的手動標記成本高,這是一項複雜的任務。這個問題因不同的應用需要系統專注於不同的主體(例如,人類對動物)或地標(例如,只有四肢對全身和臉部)而加劇,這使得標記的測試數據很少可以重複使用。為了解決這些問題,我們提出了 MET-POSE,這是一個姿勢估計系統的變形測試框架,在評估這些系統在不同情況下的性能時,可以繞過手動註解的需要。因此,MET-POSE 允許姿勢估計系統的使用者在更接近其應用程式的條件下評估系統,而無需標記臨時測試數據集或僅依賴可用數據集,這些數據集可能不適合其應用領域。雖然我們以一般術語定義 MET-POSE,但我們也提供了一個非詳盡的變形規則列表,這些規則代表了電腦視覺應用中的常見挑戰,以及評估這些規則的具體方法。然後,我們通過將 MET-POSE 應用於 Mediapipe Holistic(一種先進的人類姿勢估計系統),並使用 FLIC 和 PHOENIX 數據集,以實驗方式展示 MET-POSE 的有效性。通過這些實驗,我們概述了 MET-POSE 的輸出可以揭示姿勢估計系統中故障的許多方法,其速度與使用手動標記數據的傳統測試類似或更高,並表明使用者可以根據其應用程式相關的故障和準確度等級來調整他們使用的規則集。 +摘要:本研究展示了對話式語音系統的開發和測試,該系統專為機器人設計,用於偵測失智症患者(PLwD)認知障礙的語言生物標記。該系統整合了後端 Python WebSocket 伺服器和一個中央核心模組,其中包含針對失智症微調的大語言模型(LLM),以處理使用者輸入並在不到 1.5 秒的時間內產生機器人對話回應。前端使用者介面(漸進式網路應用程式,PWA)會在智慧型手機上即時向人類使用者(PLwD、照護者、臨床醫生)顯示資訊和生物標記評分圖表。根據現有文獻,針對機器人對話系統開發了六個語言生物標記:語法改變、實用障礙、失語症、輪流中斷、發音不清和韻律變化,使用了兩個資料集,一個包含 PLwD 與人類臨床醫生對話(DementiaBank 資料集),另一個包含 PLwD 與機器人對話(Indiana 資料集)。我們還建立了一個複合語言生物標記,將所有六個個別生物標記組合成一個單一評分。語言系統的效能首先在 DementiaBank 資料集上進行評估,顯示與 MMSE 評分有中等相關性,複合生物標記評分優於個別生物標記。對 Indiana 資料集的分析顯示出較高且變異性較大的生物標記評分,這表明由於研究族群(例如失智症的嚴重程度)和對話情境(人機對話與人際對話不同)而產生潛在差異。研究結果強調需要進一步研究對話情境對語言生物標記的影響,以及機器人語言系統的潛在臨床應用。 -##### **Towards Virtual Clinical Trials of Radiology AI with Conditional Generative Modeling** -2502.09688v1 by Benjamin D. Killeen, Bohua Wan, Aditya V. Kulkarni, Nathan Drenkow, Michael Oberst, Paul H. Yi, Mathias Unberath +##### **Evaluating improvements on using Large Language Models (LLMs) for property extraction in the Open Research Knowledge Graph (ORKG)** +2502.10768v1 by Sandra Schaftner -Artificial intelligence (AI) is poised to transform healthcare by enabling -personalized and efficient care through data-driven insights. 
Although -radiology is at the forefront of AI adoption, in practice, the potential of AI -models is often overshadowed by severe failures to generalize: AI models can -have performance degradation of up to 20% when transitioning from controlled -test environments to clinical use by radiologists. This mismatch raises -concerns that radiologists will be misled by incorrect AI predictions in -practice and/or grow to distrust AI, rendering these promising technologies -practically ineffectual. Exhaustive clinical trials of AI models on abundant -and diverse data is thus critical to anticipate AI model degradation when -encountering varied data samples. Achieving these goals, however, is -challenging due to the high costs of collecting diverse data samples and -corresponding annotations. To overcome these limitations, we introduce a novel -conditional generative AI model designed for virtual clinical trials (VCTs) of -radiology AI, capable of realistically synthesizing full-body CT images of -patients with specified attributes. By learning the joint distribution of -images and anatomical structures, our model enables precise replication of -real-world patient populations with unprecedented detail at this scale. We -demonstrate meaningful evaluation of radiology AI models through VCTs powered -by our synthetic CT study populations, revealing model degradation and -facilitating algorithmic auditing for bias-inducing data attributes. Our -generative AI approach to VCTs is a promising avenue towards a scalable -solution to assess model robustness, mitigate biases, and safeguard patient -care by enabling simpler testing and evaluation of AI models in any desired -range of diverse patient populations. +Current research highlights the great potential of Large Language Models +(LLMs) for constructing Scholarly Knowledge Graphs (SKGs). 
One particularly +complex step in this process is relation extraction, aimed at identifying +suitable properties to describe the content of research. This study builds +directly on previous research of three Open Research Knowledge Graph (ORKG) +team members who assessed the readiness of LLMs such as GPT-3.5, Llama 2, and +Mistral for property extraction in scientific literature. Given the moderate +performance observed, the previous work concluded that fine-tuning is needed to +improve these models' alignment with scientific tasks and their emulation of +human expertise. Expanding on this prior experiment, this study evaluates the +impact of advanced prompt engineering techniques and demonstrates that these +techniques can highly significantly enhance the results. Additionally, this +study extends the property extraction process to include property matching to +existing ORKG properties, which are retrieved via the API. The evaluation +reveals that results generated through advanced prompt engineering achieve a +higher proportion of matches with ORKG properties, further emphasizing the +enhanced alignment achieved. Moreover, this lays the groundwork for addressing +challenges such as the inconsistency of ORKG properties, an issue highlighted +in prior studies. By assigning unique URIs and using standardized terminology, +this work increases the consistency of the properties, fulfilling a crucial +aspect of Linked Data and FAIR principles - core commitments of ORKG. This, in +turn, significantly enhances the applicability of ORKG content for subsequent +tasks such as comparisons of research publications. Finally, the study +concludes with recommendations for future improvements in the overall property +extraction process. 
-摘要:人工智慧 (AI) 準備透過資料驅動的見解,轉型醫療保健,並提供個人化且有效率的照護。儘管放射科處於 AI 採用的最前線,但在實務上,AI 模型的潛力往往會被嚴重的概化失敗所掩蓋:AI 模型在從受控測試環境轉移到放射科醫師的臨床使用時,效能可能會降低多達 20%。這種不匹配引發了疑慮,即放射科醫師在實務上會被不正確的 AI 預測誤導,和/或開始不信任 AI,讓這些有前景的技術在實務上形同失效。因此,在 AI 模型遭遇各種資料範例時,預期 AI 模型的衰退,對豐富且多樣化的資料進行 AI 模型的全面臨床試驗至關重要。然而,由於收集多樣化的資料範例和對應註解的成本很高,實現這些目標具有挑戰性。為了克服這些限制,我們引進一個創新的條件式生成式 AI 模型,專門用於放射科 AI 的虛擬臨床試驗 (VCT),能夠真實地合成具有特定屬性的病患全身電腦斷層 (CT) 影像。透過學習影像和解剖結構的聯合分佈,我們的模型能夠以空前的細節精確複製真實世界的病患族群。我們透過由我們合成的電腦斷層研究族群支援的 VCT,展示了放射科 AI 模型有意義的評估,揭露模型衰退,並促進演算法稽核,以找出導致偏差的資料屬性。我們對 VCT 的生成式 AI 方法,是一個有前景的途徑,可以評估模型的穩健性、減輕偏差,並透過在任何所需的各種病患族群中,進行更簡單的 AI 模型測試和評估,來保障病患照護。 +摘要:目前的調查強調大語言模型 (LLM) 在建構學術知識圖譜 (SKG) 上的巨大潛力。此過程中特別複雜的步驟是關係萃取,目標是找出合適的屬性來描述研究內容。本研究直接建立在三位開放研究知識圖譜 (ORKG) 團隊成員先前研究的基礎上,他們評估了 GPT-3.5、Llama 2 和 Mistral 等 LLM 在科學文獻中萃取屬性的準備情況。鑑於觀察到的表現中等,先前的研究結論是需要微調,以改善這些模型與科學任務的一致性,以及它們對人類專業知識的模擬。本研究擴展了先前的實驗,評估了進階提示工程技術的影響,並證明這些技術可以大幅顯著地提升結果。此外,本研究將屬性萃取流程擴展到包含與現有 ORKG 屬性的屬性比對,這些屬性是透過 API 擷取的。評估結果顯示,透過進階提示工程產生的結果與 ORKG 屬性有更高的比對比例,進一步強調所達成的進階一致性。此外,這也為了解決先前的研究中強調的問題,例如 ORKG 屬性的不一致性,奠定了基礎。透過指定唯一的 URI 並使用標準化的術語,本研究增加了屬性的相容性,達成了連結資料和 FAIR 原則的重要層面,這是 ORKG 的核心承諾。這反過來大幅提升了 ORKG 內容在後續任務中的適用性,例如研究出版品的比較。最後,本研究以針對整體屬性萃取流程未來改進的建議作為結論。 -##### **Mind What You Ask For: Emotional and Rational Faces of Persuasion by Large Language Models** -2502.09687v1 by Wiktoria Mieleszczenko-Kowszewicz, Beata Bajcar, Jolanta Babiak, Berenika Dyczek, Jakub Świstak, Przemysław Biecek +##### **K-Edit: Language Model Editing with Contextual Knowledge Awareness** +2502.10626v1 by Elan Markowitz, Anil Ramakrishna, Ninareh Mehrabi, Charith Peris, Rahul Gupta, Kai-Wei Chang, Aram Galstyan -Be careful what you ask for, you just might get it. This saying fits with the -way large language models (LLMs) are trained, which, instead of being rewarded -for correctness, are increasingly rewarded for pleasing the recipient. So, they -are increasingly effective at persuading us that their answers are valuable. -But what tricks do they use in this persuasion? 
In this study, we examine what -are the psycholinguistic features of the responses used by twelve different -language models. By grouping response content according to rational or -emotional prompts and exploring social influence principles employed by LLMs, -we ask whether and how we can mitigate the risks of LLM-driven mass -misinformation. We position this study within the broader discourse on -human-centred AI, emphasizing the need for interdisciplinary approaches to -mitigate cognitive and societal risks posed by persuasive AI responses. +As the world changes, we need to be able to update our models and correct +false information without costly retraining. Knowledge-based model editing +enables precise modifications to the weights of large language models in order +to modify the information encoded within. Recent approaches have seen success +in enabling recall of edited information for thousands of edits at once. +However, these approaches fail to produce edits that account for associated +contextual information. We present K-Edit, an effective approach to generating +contextually consistent knowledge edits. By using knowledge graphs, which +maintain contextual consistency when an edge is edited, we are able to generate +additional \textit{contextual edits} that ensure consistency of related +information in the language model. Our experiments demonstrate significant +improvements in multi-hop question answering while maintaining the general +effectiveness and scalability of model edits. 
-摘要:小心你要求的,你可能真的會得到。這句話適用於大型語言模型 (LLM) 的訓練方式,它們不是因為正確性而獲得獎勵,而是因為取悅接收者而獲得越來越多的獎勵。因此,它們越來越有效地說服我們,它們的答案是有價值的。但是它們在這種說服中使用什麼技巧呢?在這項研究中,我們探討了十二種不同的語言模型使用的回應的心理語言特徵。通過根據理性和情緒提示對回應內容進行分組,並探討 LLM 使用的社會影響原則,我們探討是否以及如何減輕 LLM 驅動的大規模錯誤信息的風險。我們將這項研究定位在以人為中心的 AI 的更廣泛討論中,強調需要跨學科方法來減輕具有說服力的 AI 回應帶來的認知和社會風險。 +摘要:隨著世界變化,我們需要能夠更新我們的模型,並在不進行昂貴的重新訓練的情況下更正錯誤資訊。基於知識的模型編輯能夠對大型語言模型的權重進行精確修改,以便修改其中編碼的資訊。最近的方法在一次啟用數千次編輯的編輯資訊的召回方面取得了成功。然而,這些方法無法產生考慮相關上下文資訊的編輯。我們提出 K-Edit,這是一種產生上下文一致的知識編輯的有效方法。通過使用知識圖,在編輯邊緣時保持上下文一致性,我們能夠產生額外的「上下文編輯」,以確保語言模型中相關資訊的一致性。我們的實驗證明了多跳問題回答的顯著改進,同時保持了模型編輯的一般有效性和可擴充性。 -##### **The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics** -2502.09247v1 by Danni Feng, Runzhi Li, Jing Wang, Siyu Yan, Lihong Ma, Yunli Xing +##### **ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis** +2502.10620v1 by Xueshen Li, Xinlong Hou, Ziyi Huang, Yu Gan -Joint entity-relation extraction is a critical task in transforming -unstructured or semi-structured text into triplets, facilitating the -construction of large-scale knowledge graphs, and supporting various downstream -applications. Despite its importance, research on Chinese text, particularly -with complex semantics in specialized domains like medicine, remains limited. -To address this gap, we introduce the CH-DDI, a Chinese drug-drug interactions -dataset designed to capture the intricacies of medical text. Leveraging the -strengths of attention mechanisms in capturing long-range dependencies, we -propose the SEA module, which enhances the extraction of complex contextual -semantic information, thereby improving entity recognition and relation -extraction. Additionally, to address the inefficiencies of existing methods in -facilitating information exchange between entity recognition and relation -extraction, we present an interactive fusion representation module. 
This module -employs Cross Attention for bidirectional information exchange between the -tasks and further refines feature extraction through BiLSTM. Experimental -results on both our CH-DDI dataset and public CoNLL04 dataset demonstrate that -our model exhibits strong generalization capabilities. On the CH-DDI dataset, -our model achieves an F1-score of 96.73% for entity recognition and 78.43% for -relation extraction. On the CoNLL04 dataset, it attains an entity recognition -precision of 89.54% and a relation extraction accuracy of 71.64%. +Recent advancements in large language models (LLMs) have demonstrated +extraordinary comprehension capabilities with remarkable breakthroughs on +various vision-language tasks. However, the application of LLMs in generating +reliable medical diagnostic reports remains in the early stages. Currently, +medical LLMs typically feature a passive interaction model where doctors +respond to patient queries with little or no involvement in analyzing medical +images. In contrast, some ChatBots simply respond to predefined queries based +on visual inputs, lacking interactive dialogue or consideration of medical +history. As such, there is a gap between LLM-generated patient-ChatBot +interactions and those occurring in actual patient-doctor consultations. To +bridge this gap, we develop an LLM-based dialogue system, namely proactive +multi-round vision-language interactions for computer-aided diagnosis +(ProMRVL-CAD), to generate patient-friendly disease diagnostic reports. The +proposed ProMRVL-CAD system allows proactive dialogue to provide patients with +constant and reliable medical access via an integration of knowledge graph into +a recommendation system. Specifically, we devise two generators: a Proactive +Question Generator (Pro-Q Gen) to generate proactive questions that guide the +diagnostic procedure and a Multi-Vision Patient-Text Diagnostic Report +Generator (MVP-DR Gen) to produce high-quality diagnostic reports. 
Evaluated on +two real-world publicly available datasets, MIMIC-CXR and IU-Xray, our model +generates higher-quality medical reports. We further demonstrate that +ProMRVL remains robust in scenarios with low image +quality. Moreover, we have created a synthetic medical dialogue dataset that +simulates proactive diagnostic interactions between patients and doctors, +serving as a valuable resource for training LLMs. + +摘要:大型語言模型 (LLM) 最近的進展已展現出非凡的理解能力,在各種視覺語言任務中取得了顯著的突破。然而,LLM 在產生可靠的醫療診斷報告中的應用仍處於早期階段。目前,醫療 LLM 通常採用被動互動模式,醫生對患者的疑問做出回應,但很少或根本不參與分析醫療影像。相比之下,有些聊天機器人僅根據視覺輸入回應預先定義的查詢,缺乏互動對話或對病史的考量。因此,LLM 產生的患者聊天機器人互動與實際患者醫生諮詢之間存在差距。為了彌合這一差距,我們開發了一個基於 LLM 的對話系統,即主動多輪視覺語言互動,用於電腦輔助診斷 (ProMRVL-CAD),以產生對患者友善的疾病診斷報告。建議的 ProMRVL-CAD 系統允許主動對話,透過將知識圖譜整合到推薦系統中,為患者提供持續且可靠的醫療管道。具體來說,我們設計了兩個產生器:主動問題產生器 (Pro-Q Gen),用於產生引導診斷程序的主動問題,以及多視覺患者文字診斷報告產生器 (MVP-DR Gen),用於產生高品質的診斷報告。評估兩個真實世界公開可用的資料集,MIMIC-CXR 和 IU-Xray,我們的模型在產生醫療報告方面品質較佳。我們進一步證明 ProMRVL 的效能,在影像品質低的情況下仍能穩健運行。此外,我們建立了一個模擬患者和醫生之間主動診斷互動的合成醫療對話資料集,作為訓練 LLM 的寶貴資源。 + +##### **GraphiT: Efficient Node Classification on Text-Attributed Graphs with Prompt Optimized LLMs** +2502.10522v1 by Shima Khoshraftar, Niaz Abedini, Amir Hajian + +The application of large language models (LLMs) to graph data has attracted a +lot of attention recently. LLMs allow us to use deep contextual embeddings from +pretrained models in text-attributed graphs, where shallow embeddings are often +used for the text attributes of nodes. However, it is still challenging to +efficiently encode the graph structure and features into a sequential form for +use by LLMs. In addition, the performance of an LLM alone is highly dependent +on the structure of the input prompt, which limits its effectiveness as a +reliable approach and often requires iterative manual adjustments that can be +slow, tedious, and difficult to replicate programmatically. 
In this paper, we +propose GraphiT (Graphs in Text), a framework for encoding graphs into a +textual format and optimizing LLM prompts for graph prediction tasks. Here we +focus on node classification for text-attributed graphs. We encode the graph +data for every node and its neighborhood into concise text to enable LLMs to +better utilize the information in the graph. We then further programmatically +optimize the LLM prompts using the DSPy framework to automate this step and +make it more efficient and reproducible. GraphiT outperforms our LLM-based +baselines on three datasets, and we show how the optimization step in GraphiT +leads to measurably better results without manual prompt tweaking. We also +demonstrate that our graph encoding approach is competitive with other graph +encoding methods while being less expensive because it uses significantly fewer +tokens for the same task. -摘要:聯合實體關係抽取是將非結構化或半結構化文字轉換為三元組的重要任務,有助於建構大規模知識圖譜,並支援各種下游應用程式。儘管其重要性,但針對中文文本的研究,特別是醫學等專業領域中具有複雜語義的研究仍十分有限。為了解決這個差距,我們引入了 CH-DDI,一個中文藥物-藥物交互作用資料集,旨在擷取醫學文本的複雜性。利用注意力機制在擷取長程依賴關係方面的優勢,我們提出了 SEA 模組,增強了複雜脈絡語義資訊的抽取,從而改進了實體辨識和關係抽取。此外,為了解決現有方法在促進實體辨識和關係抽取之間資訊交換方面的低效率問題,我們提出了互動式融合表示模組。此模組採用交叉注意力,在任務之間進行雙向資訊交換,並透過 BiLSTM 進一步精煉特徵抽取。在我們的 CH-DDI 資料集和公開的 CoNLL04 資料集上的實驗結果表明,我們的模型展現出強大的泛化能力。在 CH-DDI 資料集上,我們的模型在實體辨識方面達到了 96.73% 的 F1 分數,在關係抽取方面達到了 78.43% 的 F1 分數。在 CoNLL04 資料集上,它在實體辨識方面達到了 89.54% 的準確度,在關係抽取方面達到了 71.64% 的準確度。 +摘要:大型語言模型 (LLM) 在圖表資料的應用最近備受關注。LLM 讓我們能夠在文字標記圖表中使用預訓練模型的深度脈絡嵌入,其中淺層嵌入通常用於節點的文字屬性。然而,要有效率地將圖表結構和特徵編碼成序列形式供 LLM 使用,仍然是一項挑戰。此外,單獨 LLM 的效能高度依賴輸入提示的結構,這限制了它們作為可靠方法的有效性,而且通常需要反覆的人工調整,這可能會緩慢、繁瑣且難以透過程式複製。在本文中,我們提出 GraphiT(文字中的圖表),一個用於將圖表編碼成文字格式並最佳化 LLM 提示以進行圖表預測任務的架構。在這裡,我們專注於文字標記圖表的節點分類。我們將每個節點及其鄰域的圖表資料編碼成簡潔的文字,讓 LLM 能夠更好地利用圖表中的資訊。然後,我們進一步透過程式最佳化 LLM 提示,使用 DSPy 架構自動化這個步驟,並使其更有效率且可複製。GraphiT 在三個資料集上優於我們的基於 LLM 的基準,我們展示了 GraphiT 中的最佳化步驟如何導致顯著更好的結果,而無需手動調整提示。我們還證明了我們的圖表編碼方法與其他圖表編碼方法具有競爭力,同時成本更低,因為它在相同的任務中使用了顯著更少的標記。 -##### **From large language models to multimodal AI: A scoping review on 
the potential of generative AI in medicine** -2502.09242v1 by Lukas Buess, Matthias Keicher, Nassir Navab, Andreas Maier, Soroosh Tayebi Arasteh +##### **Do Large Language Models Reason Causally Like Us? Even Better?** +2502.10215v1 by Hanna M. Dettki, Brenden M. Lake, Charley M. Wu, Bob Rehder -Generative artificial intelligence (AI) models, such as diffusion models and -OpenAI's ChatGPT, are transforming medicine by enhancing diagnostic accuracy -and automating clinical workflows. The field has advanced rapidly, evolving -from text-only large language models for tasks such as clinical documentation -and decision support to multimodal AI systems capable of integrating diverse -data modalities, including imaging, text, and structured data, within a single -model. The diverse landscape of these technologies, along with rising interest, -highlights the need for a comprehensive review of their applications and -potential. This scoping review explores the evolution of multimodal AI, -highlighting its methods, applications, datasets, and evaluation in clinical -settings. Adhering to PRISMA-ScR guidelines, we systematically queried PubMed, -IEEE Xplore, and Web of Science, prioritizing recent studies published up to -the end of 2024. After rigorous screening, 144 papers were included, revealing -key trends and challenges in this dynamic field. Our findings underscore a -shift from unimodal to multimodal approaches, driving innovations in diagnostic -support, medical report generation, drug discovery, and conversational AI. -However, critical challenges remain, including the integration of heterogeneous -data types, improving model interpretability, addressing ethical concerns, and -validating AI systems in real-world clinical settings. This review summarizes -the current state of the art, identifies critical gaps, and provides insights -to guide the development of scalable, trustworthy, and clinically impactful -multimodal AI solutions in healthcare. 
+Causal reasoning is a core component of intelligence. Large language models +(LLMs) have shown impressive capabilities in generating human-like text, +raising questions about whether their responses reflect true understanding or +statistical patterns. We compared causal reasoning in humans and four LLMs +using tasks based on collider graphs, rating the likelihood of a query variable +occurring given evidence from other variables. We find that LLMs reason +causally along a spectrum from human-like to normative inference, with +alignment shifting based on model, context, and task. Overall, GPT-4o and +Claude showed the most normative behavior, including "explaining away", whereas +Gemini-Pro and GPT-3.5 did not. Although all agents deviated from the expected +independence of causes - Claude the least - they exhibited strong associative +reasoning and predictive inference when assessing the likelihood of the effect +given its causes. These findings underscore the need to assess AI biases as +they increasingly assist human decision-making. 
-摘要:生成式人工智能 (AI) 模型,例如扩散模型和 OpenAI 的 ChatGPT,通过提高诊断准确性和自动化临床工作流程,正在改变医学领域。该领域已迅速发展,从用于临床文件编制和决策支持等任务的纯文本大型语言模型,发展到能够在单个模型中整合包括影像、文本和结构化数据在内的多种数据方式的多模态 AI 系统。这些技术的多样化格局以及日益增长的兴趣,凸显了全面审查其应用和潜力的必要性。本范围审查探讨了多模态 AI 的演变,重点介绍了其方法、应用、数据集和在临床环境中的评估。遵循 PRISMA-ScR 指南,我们系统地查询了 PubMed、IEEE Xplore 和 Web of Science,优先考虑截至 2024 年底发表的最新研究。经过严格筛选,纳入了 144 篇论文,揭示了这一充满活力的领域的趋势和挑战。我们的研究结果强调了从单模态方法向多模态方法的转变,推动了诊断支持、医疗报告生成、药物发现和会话式 AI 的创新。然而,关键挑战仍然存在,包括异构数据类型的整合、提高模型可解释性、解决伦理问题以及在现实世界的临床环境中验证 AI 系统。本综述总结了当前的最新技术,确定了关键差距,并提供了见解,以指导在医疗保健领域开发可扩展、可信赖且具有临床影响力的多模态 AI 解决方案。 +摘要:因果推理是智能的核心組成部分。大型語言模型 (LLM) 在生成類人文本方面展現了令人印象深刻的能力,引發了關於它們的回應是否反映真實理解或統計模式的疑問。我們使用基於碰撞圖的任務比較了人類和四個 LLM 中的因果推理,根據其他變數的證據評估查詢變數發生的可能性。我們發現 LLM 沿著從類人到規範推論的光譜進行因果推理,對齊會根據模型、上下文和任務而改變。總體而言,GPT-4o 和 Claude 表現出最規範的行為,包括「解釋」,而 Gemini-Pro 和 GPT-3.5 則沒有。儘管所有代理都偏離了預期的原因獨立性 - Claude 最不偏離 - 但它們在評估給定原因的效果可能性時表現出強烈的關聯推理和預測推論。這些發現強調了評估 AI 偏差的必要性,因為它們越來越協助人類決策。 -##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration** -2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano +##### **Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages** +2502.10140v1 by Daniil Gurgurov, Ivan Vykopal, Josef van Genabith, Simon Ostermann -This paper presents a complete explainable system that interprets a set of -data, abstracts the underlying features and describes them in a natural -language of choice. The system relies on two crucial stages: (i) identifying -emerging properties from data and transforming them into abstract concepts, and -(ii) converting these concepts into natural language. Despite the impressive -natural language generation capabilities demonstrated by Large Language Models, -their statistical nature and the intricacy of their internal mechanism still -force us to employ these techniques as black boxes, forgoing trustworthiness. 
-Developing an explainable pipeline for data interpretation would allow -facilitating its use in safety-critical environments like processing medical -information and allowing non-experts and visually impaired people to access -narrated information. To this end, we believe that the fields of knowledge -representation and automated reasoning research could present a valid -alternative. Expanding on prior research that tackled the first stage (i), we -focus on the second stage, named Concept2Text. Being explainable, data -translation is easily modeled through logic-based rules, once again emphasizing -the role of declarative programming in achieving AI explainability. This paper -explores a Prolog/CLP-based rewriting system to interpret concepts-articulated -in terms of classes and relations, plus common knowledge-derived from a generic -ontology, generating natural language text. Its main features include -hierarchical tree rewritings, modular multilingual generation, support for -equivalent variants across semantic, grammar, and lexical levels, and a -transparent rule-based system. We outline the architecture and demonstrate its -flexibility through some examples capable of generating numerous diverse and -equivalent rewritings based on the input concept. +Low-resource languages (LRLs) face significant challenges in natural language +processing (NLP) due to limited data. While current state-of-the-art large +language models (LLMs) still struggle with LRLs, smaller multilingual models +(mLMs) such as mBERT and XLM-R offer greater promise due to a better fit of +their capacity to low training data sizes. This study systematically +investigates parameter-efficient adapter-based methods for adapting mLMs to +LRLs, evaluating three architectures: Sequential Bottleneck, Invertible +Bottleneck, and Low-Rank Adaptation. 
Using unstructured text from GlotCC and +structured knowledge from ConceptNet, we show that small adaptation datasets +(e.g., up to 1 GB of free-text or a few MB of knowledge graph data) yield gains +in intrinsic (masked language modeling) and extrinsic tasks (topic +classification, sentiment analysis, and named entity recognition). We find that +Sequential Bottleneck adapters excel in language modeling, while Invertible +Bottleneck adapters slightly outperform other methods on downstream tasks due +to better embedding alignment and larger parameter counts. Adapter-based +methods match or outperform full fine-tuning while using far fewer parameters, +and smaller mLMs prove more effective for LRLs than massive LLMs like LLaMA-3, +GPT-4, and DeepSeek-R1-based distilled models. While adaptation improves +performance, pre-training data size remains the dominant factor, especially for +languages with extensive pre-training coverage. -摘要:這篇論文提出了一個完整的可解釋系統,它可以解釋一組資料,抽象出基礎特徵,並以選擇的自然語言描述它們。系統依賴兩個關鍵階段:(i) 從資料中識別新興屬性,並將它們轉換為抽象概念,以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力,但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子,放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它,例如處理醫療資訊,並允許非專家和視障人士存取敘述資訊。為此,我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上,我們專注於第二階段,稱為 Concept2Text。由於具有可解釋性,資料翻譯很容易透過基於邏輯的規則建模,再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統,以解釋概念,這些概念以類別和關係的形式表達,再加上從通用本体衍生的常識,產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體,以及一個透明的基於規則的系統。我們概述了架構,並透過一些範例展示了它的靈活性,這些範例能夠根據輸入概念生成許多不同的等效重寫。 +摘要:低資源語言 (LRL) 由於資料有限,在自然語言處理 (NLP) 中面臨重大挑戰。雖然當前最先進的大型語言模型 (LLM) 仍難以處理 LRL,但較小的多語言模型 (mLMS),例如 mBERT 和 XLM-R,由於其容量更適合低訓練資料大小,因此提供了更大的希望。本研究系統性地探討了基於參數效率適配器的適配方法,以將 mLMS 適配到 LRL,評估了三種架構:順序瓶頸、可逆瓶頸和低秩適配。使用來自 GlotCC 的非結構化文本和來自 ConceptNet 的結構化知識,我們表明小型適配資料集(例如,高達 1 GB 的自由文本或幾 MB 的知識圖譜資料)在內在(遮蔽語言模型)和外在任務(主題分類、情緒分析和命名實體識別)中產生增益。我們發現順序瓶頸適配器在語言模型中表現出色,而可逆瓶頸適配器由於更好的嵌入對齊和更大的參數數量,在下游任務上略勝於其他方法。基於適配器的方法在使用更少參數的同時,可以匹配或優於完全微調,而較小的 mLM 被證明比 LLaMA-3、GPT-4 和基於 DeepSeek-R1 的蒸餾模型等大型 LLM 更適合 
LRL。雖然適配可以提高效能,但預訓練資料大小仍然是主要因素,特別是對於預訓練覆蓋範圍廣泛的語言。 -##### **Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York** -2502.09204v1 by Sanskar Sehgal, Yanhong A. Liu +##### **Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models** +2502.10090v1 by Chenrui Tie, Shengxiang Sun, Jinxuan Zhu, Yiwei Liu, Jingxiang Guo, Yue Hu, Haonan Chen, Junting Chen, Ruihai Wu, Lin Shao -Legal cases require careful logical reasoning following the laws, whereas -interactions with non-technical users must be in natural language. As an -application combining logical reasoning using Prolog and natural language -processing using large language models (LLMs), this paper presents a novel -approach and system, LogicLease, to automate the analysis of landlord-tenant -legal cases in the state of New York. LogicLease determines compliance with -relevant legal requirements by analyzing case descriptions and citing all -relevant laws. It leverages LLMs for information extraction and Prolog for -legal reasoning. By separating information extraction from legal reasoning, -LogicLease achieves greater transparency and control over the legal logic -applied to each case. We evaluate the accuracy, efficiency, and robustness of -LogicLease through a series of tests, achieving 100% accuracy and an average -processing time of 2.57 seconds. LogicLease presents advantages over -state-of-the-art LLM-based legal analysis systems by providing clear, -step-by-step reasoning, citing specific laws, and distinguishing itself by its -ability to avoid hallucinations -- a common issue in LLMs. +Humans possess an extraordinary ability to understand and execute complex +manipulation tasks by interpreting abstract instruction manuals. For robots, +however, this capability remains a substantial challenge, as they cannot +interpret abstract instructions and translate them into executable actions. 
In +this paper, we present Manual2Skill, a novel framework that enables robots to +perform complex assembly tasks guided by high-level manual instructions. Our +approach leverages a Vision-Language Model (VLM) to extract structured +information from instructional images and then uses this information to +construct hierarchical assembly graphs. These graphs represent parts, +subassemblies, and the relationships between them. To facilitate task +execution, a pose estimation model predicts the relative 6D poses of components +at each assembly step. At the same time, a motion planning module generates +actionable sequences for real-world robotic implementation. We demonstrate the +effectiveness of Manual2Skill by successfully assembling several real-world +IKEA furniture items. This application highlights its ability to manage +long-horizon manipulation tasks with both efficiency and precision, +significantly enhancing the practicality of robot learning from instruction +manuals. This work marks a step forward in advancing robotic systems capable of +understanding and executing complex manipulation tasks in a manner akin to +human capabilities. 
-摘要:法律案件需要遵循法律进行谨慎的逻辑推理,而与非技术用户的互动必须使用自然语言。作为结合使用 Prolog 进行逻辑推理和使用大型语言模型 (LLM) 进行自然语言处理的应用程序,本文提出了一种新颖的方法和系统 LogicLease,以自动分析纽约州的房东与租户法律案件。LogicLease 通过分析案例描述并引用所有相关法律来确定是否符合相关法律要求。它利用 LLM 进行信息提取,并利用 Prolog 进行法律推理。通过将信息提取与法律推理分开,LogicLease 实现了对应用于每个案例的法律逻辑的更高透明度和控制力。我们通过一系列测试评估了 LogicLease 的准确性、效率和鲁棒性,实现了 100% 的准确性和 2.57 秒的平均处理时间。LogicLease 通过提供清晰、分步的推理,引用具体法律,并以其避免幻觉的能力而区别于最先进的基于 LLM 的法律分析系统,从而显示出优势——这是 LLM 中的常见问题。 +摘要:人類擁有理解並執行複雜操作任務的非凡能力,方法是詮釋抽象的說明手冊。然而,對機器人來說,這項能力仍然是一項重大的挑戰,因為它們無法詮釋抽象的指令並將其轉換為可執行的動作。在本文中,我們提出了 Manual2Skill,這是一個新穎的框架,使機器人能夠在高階手冊說明的指導下執行複雜的組裝任務。我們的做法利用視覺語言模型 (VLM) 從教學圖片中提取結構化資訊,然後使用此資訊來建構階層式組裝圖。這些圖表示零件、子組件以及它們之間的關係。為了促進任務執行,姿勢估計模型會預測每個組裝步驟中組件的相對 6D 姿勢。同時,動作規劃模組會產生適用於實際機器人實作的可操作順序。我們透過成功組裝幾個真實世界的 IKEA 家具來展示 Manual2Skill 的有效性。此應用程式突顯了它以高效率和高精準度管理長時程操作任務的能力,大幅提升機器人從說明手冊中學習的實用性。這項工作標誌著機器人系統在理解和執行複雜操作任務方面向前邁進了一步,其方式類似於人類的能力。 -##### **Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia** -2502.09173v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott +##### **Decision Information Meets Large Language Models: The Future of Explainable Operations Research** +2502.09994v1 by Yansen Zhang, Qingcan Kang, Wing Yin Yu, Hailei Gong, Xiaojin Fu, Xiongwei Han, Tao Zhong, Chen Ma -In remote healthcare monitoring, time series representation learning reveals -critical patient behavior patterns from high-frequency data. This study -analyzes home activity data from individuals living with dementia by proposing -a two-stage, self-supervised learning approach tailored to uncover low-rank -structures. The first stage converts time-series activities into text sequences -encoded by a pre-trained language model, providing a rich, high-dimensional -latent state space using a PageRank-based method. This PageRank vector captures -latent state transitions, effectively compressing complex behaviour data into a -succinct form that enhances interpretability. 
This low-rank representation not -only enhances model interpretability but also facilitates clustering and -transition analysis, revealing key behavioral patterns correlated with -clinicalmetrics such as MMSE and ADAS-COG scores. Our findings demonstrate the -framework's potential in supporting cognitive status prediction, personalized -care interventions, and large-scale health monitoring. +Operations Research (OR) is vital for decision-making in many industries. +While recent OR methods have seen significant improvements in automation and +efficiency through integrating Large Language Models (LLMs), they still +struggle to produce meaningful explanations. This lack of clarity raises +concerns about transparency and trustworthiness in OR applications. To address +these challenges, we propose a comprehensive framework, Explainable Operations +Research (EOR), emphasizing actionable and understandable explanations +accompanying optimization. The core of EOR is the concept of Decision +Information, which emerges from what-if analysis and focuses on evaluating the +impact of complex constraints (or parameters) changes on decision-making. +Specifically, we utilize bipartite graphs to quantify the changes in the OR +model and adopt LLMs to improve the explanation capabilities. Additionally, we +introduce the first industrial benchmark to rigorously evaluate the +effectiveness of explanations and analyses in OR, establishing a new standard +for transparency and clarity in the field. 
-摘要:在遠程醫療監控中,時間序列表示學習揭示了高頻率數據中的關鍵患者行為模式。本研究通過提出一個兩階段、自我監督的學習方法來分析痴呆症患者的家庭活動數據,該方法專門用於發現低秩結構。第一階段將時間序列活動轉換為由預訓練語言模型編碼的文本序列,使用基於 PageRank 的方法提供了一個豐富、高維的潛在狀態空間。此 PageRank 向量捕獲潛在狀態轉換,有效地將複雜的行為數據壓縮成簡潔的形式,從而增強了解力。此低秩表示不僅增強了模型的可解釋性,還促進了聚類和轉換分析,揭示了與臨床指標(例如 MMSE 和 ADAS-COG 分數)相關的關鍵行為模式。我們的研究結果證明了該框架在支持認知狀態預測、個性化護理干預和大型健康監控方面的潛力。 +摘要:作業研究 (OR) 對許多產業的決策制定至關重要。雖然近期的 OR 方法已透過整合大型語言模型 (LLM) 在自動化和效率方面取得顯著的進步,但它們在產生有意義的解釋方面仍面臨挑戰。這種缺乏明確性的情況會對 OR 應用中的透明度和可信度造成疑慮。為了應對這些挑戰,我們提出一個全面的架構,即可解釋作業研究 (EOR),強調在最佳化過程中提供可操作且易於理解的解釋。EOR 的核心是決策資訊的概念,它源自假設分析,並專注於評估複雜約束條件 (或參數) 變更對決策制定的影響。具體來說,我們利用二部圖量化 OR 模型的變化,並採用 LLM 來改善解釋能力。此外,我們引入了第一個產業基準,以嚴格評估 OR 中解釋和分析的有效性,為該領域的透明度和清晰度建立新的標準。 -##### **TastepepAI, An artificial intelligence platform for taste peptide de novo design** -2502.12167v1 by Jianda Yue, Tingting Li, Jian Ouyang, Jiawei Xu, Hua Tan, Zihui Chen, Changsheng Han, Huanyu Li, Songping Liang, Zhonghua Liu, Zhonghua Liu, Ying Wang +##### **KGGen: Extracting Knowledge Graphs from Plain Text with Language Models** +2502.09956v1 by Belinda Mo, Kyssen Yu, Joshua Kazdan, Proud Mpala, Lisa Yu, Chris Cundy, Charilaos Kanatsoulis, Sanmi Koyejo -Taste peptides have emerged as promising natural flavoring agents attributed -to their unique organoleptic properties, high safety profile, and potential -health benefits. However, the de novo identification of taste peptides derived -from animal, plant, or microbial sources remains a time-consuming and -resource-intensive process, significantly impeding their widespread application -in the food industry. Here, we present TastePepAI, a comprehensive artificial -intelligence framework for customized taste peptide design and safety -assessment. As the key element of this framework, a loss-supervised adaptive -variational autoencoder (LA-VAE) is implemented to efficiently optimizes the -latent representation of sequences during training and facilitates the -generation of target peptides with desired taste profiles. 
Notably, our model -incorporates a novel taste-avoidance mechanism, allowing for selective flavor -exclusion. Subsequently, our in-house developed toxicity prediction algorithm -(SpepToxPred) is integrated in the framework to undergo rigorous safety -evaluation of generated peptides. Using this integrated platform, we -successfully identified 73 peptides exhibiting sweet, salty, and umami, -significantly expanding the current repertoire of taste peptides. This work -demonstrates the potential of TastePepAI in accelerating taste peptide -discovery for food applications and provides a versatile framework adaptable to -broader peptide engineering challenges. +Recent interest in building foundation models for KGs has highlighted a +fundamental challenge: knowledge-graph data is relatively scarce. The +best-known KGs are primarily human-labeled, created by pattern-matching, or +extracted using early NLP techniques. While human-generated KGs are in short +supply, automatically extracted KGs are of questionable quality. We present a +solution to this data scarcity problem in the form of a text-to-KG generator +(KGGen), a package that uses language models to create high-quality graphs from +plaintext. Unlike other KG extractors, KGGen clusters related entities to +reduce sparsity in extracted KGs. KGGen is available as a Python library +(\texttt{pip install kg-gen}), making it accessible to everyone. Along with +KGGen, we release the first benchmark, Measure of Information in Nodes and +Edges (MINE), that tests an extractor's ability to produce a useful KG from +plain text. We benchmark our new tool against existing extractors and +demonstrate far superior performance. 
-摘要:味觉肽因其独特的感官特性、高安全性概况和潜在的健康益处而成为有前途的天然调味剂。然而,从动物、植物或微生物来源中从头鉴定味觉肽仍然是一个耗时且资源密集的过程,严重阻碍了它们在食品工业中的广泛应用。在此,我们提出了 TastePepAI,这是一个用于定制味觉肽设计和安全性评估的综合人工智能框架。作为该框架的关键元素,实现了损失监督自适应变分自动编码器 (LA-VAE),以在训练期间有效优化序列的潜在表示,并促进生成具有所需味觉特征的目标肽。值得注意的是,我们的模型包含了一种新颖的味觉回避机制,允许选择性排除风味。随后,我们内部开发的毒性预测算法 (SpepToxPred) 被集成到框架中,以对生成的肽进行严格的安全评估。使用这个集成平台,我们成功地鉴定了 73 种表现出甜味、咸味和鲜味的肽,极大地扩展了当前的味觉肽库。这项工作展示了 TastePepAI 在加速味觉肽发现以用于食品应用方面的潜力,并提供了一个适用于更广泛的肽工程挑战的多功能框架。 +摘要:最近对于构建知识图谱基础模型的兴趣凸显了一个基本挑战:知识图谱数据相对稀缺。最知名的知识图谱主要为人标注,由模式匹配创建,或使用早期自然语言处理技术提取。虽然人生成的知识图谱供不应求,但自动提取的知识图谱质量堪忧。我们以文本到知识图谱生成器 (KGGen) 的形式为这一数据稀缺问题提供了一个解决方案,这是一个使用语言模型从纯文本创建高质量图表的包。与其他知识图谱提取器不同,KGGen 对相关实体进行聚类以减少提取的知识图谱中的稀疏性。KGGen 可用作 Python 库(\texttt{pip install kg-gen}),使其所有人都能访问。除了 KGGen,我们还发布了第一个基准测试,即节点和边信息度量 (MINE),它测试了提取器从纯文本生成有用知识图谱的能力。我们针对现有提取器对我们的新工具进行基准测试,并展示了远超其性能。 -##### **HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification** -2502.08754v1 by Valentina Vadori, Jean-Marie Graïc, Antonella Peruffo, Livio Finos, Ujwala Kiran Chaudhari, Enrico Grisan +##### **ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation** +2502.09891v1 by Shu Wang, Yixiang Fang, Yingli Zhou, Xilin Liu, Yuchi Ma -Precise segmentation and classification of cell instances are vital for -analyzing the tissue microenvironment in histology images, supporting medical -diagnosis, prognosis, treatment planning, and studies of brain -cytoarchitecture. However, the creation of high-quality annotated datasets for -training remains a major challenge. This study introduces a novel single-stage -approach (HistoSmith) for generating image-label pairs to augment histology -datasets. Unlike state-of-the-art methods that utilize diffusion models with -separate components for label and image generation, our approach employs a -latent diffusion model to learn the joint distribution of cellular layouts, -classification masks, and histology images. 
This model enables tailored data -generation by conditioning on user-defined parameters such as cell types, -quantities, and tissue types. Trained on the Conic H&E histopathology dataset -and the Nissl-stained CytoDArk0 dataset, the model generates realistic and -diverse labeled samples. Experimental results demonstrate improvements in cell -instance segmentation and classification, particularly for underrepresented -cell types like neutrophils in the Conic dataset. These findings underscore the -potential of our approach to address data scarcity challenges. +Retrieval-Augmented Generation (RAG) has proven effective in integrating +external knowledge into large language models (LLMs) for question-answer (QA) +tasks. The state-of-the-art RAG approaches often use the graph data as the +external data since they capture the rich semantic information and link +relationships between entities. However, existing graph-based RAG approaches +cannot accurately identify the relevant information from the graph and also +consume large numbers of tokens in the online retrieval process. To address +these issues, we introduce a novel graph-based RAG approach, called Attributed +Community-based Hierarchical RAG (ArchRAG), by augmenting the question using +attributed communities, and also introducing a novel LLM-based hierarchical +clustering method. To retrieve the most relevant information from the graph for +the question, we build a novel hierarchical index structure for the attributed +communities and develop an effective online retrieval method. Experimental +results demonstrate that ArchRAG outperforms existing methods in terms of both +accuracy and token cost. 
-摘要:精確的細胞實例分割和分類對於分析組織學影像中的組織微環境、支援醫療診斷、預後、治療規劃和腦部細胞結構研究至關重要。然而,建立用於訓練的高品質標註資料集仍然是一項重大挑戰。本研究提出了一種新穎的單階段方法 (HistoSmith),用於產生影像標籤對,以擴充組織學資料集。與利用擴散模型並將標籤和影像產生分開的組成部分的現有技術不同,我們的做法採用潛在擴散模型來學習細胞佈局、分類遮罩和組織學影像的聯合分佈。此模型能透過調整使用者定義的參數(例如細胞類型、數量和組織類型)來進行客製化資料產生。在 Conic H&E 細胞病理學資料集和 Nissl 染色的 CytoDArk0 資料集上訓練後,此模型產生逼真且多樣化的標籤樣本。實驗結果顯示細胞實例分割和分類有顯著進步,特別是對於 Conic 資料集中代表性不足的細胞類型,例如中性球。這些發現強調了我們的方法在解決資料稀少性挑戰方面的潛力。 +摘要:檢索增強生成 (RAG) 已證明可將外部知識整合到大型語言模型 (LLM),用於問答 (QA) 任務。最先進的 RAG 方法通常使用圖形資料作為外部資料,因為它們擷取了豐富的語意資訊和實體之間的連結關係。然而,現有的基於圖形的 RAG 方法無法準確識別圖形中的相關資訊,而且在線上檢索過程中也會消耗大量的符號。為了解決這些問題,我們提出了一種新穎的基於圖形的 RAG 方法,稱為基於屬性社群的分層 RAG (ArchRAG),透過使用屬性社群來擴充問題,並引入一種新穎的基於 LLM 的分層聚類方法。為了從圖形中檢索與問題最相關的資訊,我們為屬性社群建立了一個新穎的分層索引結構,並開發了一種有效的線上檢索方法。實驗結果證明,ArchRAG 在準確性和符號成本方面都優於現有方法。 -##### **Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion** -2502.08560v1 by Lemuel Puglisi, Daniel C. Alexander, Daniele Ravì +##### **Visual Graph Question Answering with ASP and LLMs for Language Parsing** +2502.09211v1 by Jakob Johannes Bauer, Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch -The growing availability of longitudinal Magnetic Resonance Imaging (MRI) -datasets has facilitated Artificial Intelligence (AI)-driven modeling of -disease progression, making it possible to predict future medical scans for -individual patients. However, despite significant advancements in AI, current -methods continue to face challenges including achieving patient-specific -individualization, ensuring spatiotemporal consistency, efficiently utilizing -longitudinal data, and managing the substantial memory demands of 3D scans. To -address these challenges, we propose Brain Latent Progression (BrLP), a novel -spatiotemporal model designed to predict individual-level disease progression -in 3D brain MRIs. 
The key contributions in BrLP are fourfold: (i) it operates -in a small latent space, mitigating the computational challenges posed by -high-dimensional imaging data; (ii) it explicitly integrates subject metadata -to enhance the individualization of predictions; (iii) it incorporates prior -knowledge of disease dynamics through an auxiliary model, facilitating the -integration of longitudinal data; and (iv) it introduces the Latent Average -Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in -the predicted progression at inference time and (b) allows us to derive a -measure of the uncertainty for the prediction. We train and evaluate BrLP on -11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its -generalizability on an external test set comprising 2,257 MRIs from 962 -subjects. Our experiments compare BrLP-generated MRI scans with real follow-up -MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The -code is publicly available at: https://github.com/LemuelPuglisi/BrLP. +Visual Question Answering (VQA) is a challenging problem that requires +processing multimodal input. Answer-Set Programming (ASP) has shown great +potential in this regard to add interpretability and explainability to modular +VQA architectures. In this work, we address the problem of how to integrate ASP +with modules for vision and natural language processing to solve a new and +demanding VQA variant that is concerned with images of graphs (not graphs in +symbolic form). Images containing graph-based structures are a ubiquitous and +popular form of visualisation. Here, we deal with the particular problem of +graphs inspired by transit networks, and we introduce a novel dataset that +amends an existing one by adding images of graphs that resemble metro lines. 
+Our modular neuro-symbolic approach combines optical graph recognition for +graph parsing, a pretrained optical character recognition neural network for +parsing labels, Large Language Models (LLMs) for language processing, and ASP +for reasoning. This method serves as a first baseline and achieves an overall +average accuracy of 73% on the dataset. Our evaluation provides further +evidence of the potential of modular neuro-symbolic systems, in particular with +pretrained models that do not involve any further training and logic +programming for reasoning, to solve complex VQA tasks. -摘要:隨著縱向磁共振影像 (MRI) 資料集的日益普及,已促進人工智慧 (AI) 驅動的疾病進程建模,讓預測個別患者的未來醫學掃描成為可能。然而,儘管 AI 有顯著進展,目前的技術仍面臨挑戰,包括實現患者特定的個別化、確保時空一致性、有效利用縱向資料,以及管理 3D 掃描的大量記憶體需求。為了應對這些挑戰,我們提出腦潛在進程 (BrLP),這是一種新穎的時空模型,旨在預測 3D 腦部 MRI 中的個人層級疾病進程。BrLP 的主要貢獻有四個:(i) 它在一個小的潛在空間中運作,減輕了高維度影像資料帶來的計算挑戰;(ii) 它明確整合受試者的元資料,以增強預測的個別化;(iii) 它透過輔助模型納入疾病動態的先驗知識,促進縱向資料的整合;(iv) 它引入了潛在平均穩定化 (LAS) 演算法,該演算法 (a) 在推論時強制預測進程中的時空一致性,(b) 讓我們能夠推導預測的不確定性測量。我們對來自 2,805 名受試者的 11,730 個 T1 加權 (T1w) 腦部 MRI 進行 BrLP 訓練和評估,並在包含來自 962 名受試者的 2,257 個 MRI 的外部測試集上驗證其概括性。我們的實驗將 BrLP 生成的 MRI 掃描與實際追蹤 MRI 進行比較,與現有方法相比,展示了最先進的準確性。程式碼已公開於:https://github.com/LemuelPuglisi/BrLP。 +摘要:視覺問答(VQA)是一項具有挑戰性的問題,需要處理多模態輸入。答案集程式設計(ASP)在這方面顯示出巨大的潛力,可以為模組化 VQA 架構增加可解釋性和說明性。在這項工作中,我們探討如何將 ASP 與視覺和自然語言處理模組整合,以解決一個新的且要求嚴格的 VQA 變體,該變體與圖形影像(而非符號形式的圖形)有關。包含圖形結構的影像是一種普遍且流行的可視化形式。在這裡,我們處理受交通網路啟發的圖形特定問題,並引入一個新的資料集,透過新增類似地鐵路線的圖形影像來修正現有資料集。我們的模組化神經符號方法結合光學圖形辨識進行圖形解析、預先訓練的光學字元辨識神經網路進行標籤解析、大型語言模型(LLM)進行語言處理,以及 ASP 進行推理。此方法作為第一個基準,在資料集上達到 73% 的整體平均準確度。我們的評估進一步證明了模組化神經符號系統的潛力,特別是預先訓練的模型,這些模型不涉及任何進一步的訓練和邏輯程式設計進行推理,以解決複雜的 VQA 任務。 ##### **Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data** 2502.08547v1 by Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. 
Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai @@ -4499,3456 +3776,4108 @@ individual institutions. 摘要:電子健康紀錄的採用擴大了在臨床照護和研究中利用資料驅動演算法的機會。在有效進行多機構電子健康紀錄研究時,一個主要的瓶頸是系統間資料異質性,其中有許多代碼在機構間不存在或表示不同的臨床概念。資料隱私的需求進一步限制了納入多機構患者層級資料的可行性,而這些資料對於研究患者亞群之間的相似性和差異性是必要的。為了應對這些挑戰,我們開發了 GAME 演算法。GAME 已在 7 個機構和 2 種語言中進行測試和驗證,它整合了多個層級的資料:(1) 在機構層級,使用知識圖表來建立代碼和現有知識來源之間的關係,為標準代碼及其彼此之間的關係提供醫療背景;(2) 在機構之間,利用語言模型來確定機構特定代碼與已建立的標準代碼之間的關係;(3) 使用圖形注意網路量化代碼之間關係的強度。使用遷移和聯合學習建立聯合訓練的嵌入,以保護資料隱私。在本研究中,我們展示了 GAME 在選擇相關特徵作為 AI 驅動演算法輸入時的適用性,適用於各種情況,例如心臟衰竭、類風濕性關節炎。然後,我們重點介紹了 GAME 和諧化多機構電子健康紀錄資料在阿茲海默症疾病結果和精神疾病患者自殺風險研究中的應用,而無需在個別機構之外共享患者層級資料。 -##### **EEG Artifact Detection and Correction with Deep Autoencoders** -2502.08686v1 by David Aquilué-Llorens, Aureli Soria-Frisch +##### **Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy** +2502.08353v1 by Ruizhan Xue, Huimin Deng, Fang He, Maojun Wang, Zeyu Zhang -EEG signals convey important information about brain activity both in healthy -and pathological conditions. However, they are inherently noisy, which poses -significant challenges for accurate analysis and interpretation. Traditional -EEG artifact removal methods, while effective, often require extensive expert -intervention. This study presents LSTEEG, a novel LSTM-based autoencoder -designed for the detection and correction of artifacts in EEG signals. -Leveraging deep learning, particularly LSTM layers, LSTEEG captures non-linear -dependencies in sequential EEG data. LSTEEG demonstrates superior performance -in both artifact detection and correction tasks compared to other -state-of-the-art convolutional autoencoders. Our methodology enhances the -interpretability and utility of the autoencoder's latent space, enabling -data-driven automated artefact removal in EEG its application in downstream -tasks. 
This research advances the field of efficient and accurate multi-channel -EEG preprocessing, and promotes the implementation and usage of automated EEG -analysis pipelines for brain health applications. +With the extensive application of Graph Neural Networks (GNNs) across various +domains, their trustworthiness has emerged as a focal point of research. Some +existing studies have shown that the integration of large language models +(LLMs) can improve the semantic understanding and generation capabilities of +GNNs, which in turn improves the trustworthiness of GNNs from various aspects. +Our review introduces a taxonomy that offers researchers a clear framework for +comprehending the principles and applications of different methods and helps +clarify the connections and differences among various approaches. Then we +systematically survey representative approaches along the four categories of +our taxonomy. Through our taxonomy, researchers can understand the applicable +scenarios, potential advantages, and limitations of each approach for the +trusted integration of GNNs with LLMs. Finally, we present some promising +directions of work and future trends for the integration of LLMs and GNNs to +improve model trustworthiness. + +摘要:隨著圖神經網路 (GNN) 在各種領域的廣泛應用,其可信度已成為研究的焦點。一些現有研究表明,整合大型語言模型 (LLM) 可以提升 GNN 的語意理解和生成能力,進而從各方面提升 GNN 的可信度。我們的評論介紹了一種分類法,為研究人員提供了一個清晰的架構,用於理解不同方法的原理和應用,並有助於釐清各種方法之間的關聯和差異。然後,我們系統性地針對分類法的四個類別進行代表性方法的調查。研究人員透過我們的分類法,可以了解每種方法在 GNN 與 LLM 的可信整合中適用的場景、潛在優點和限制。最後,我們提出 LLM 與 GNN 整合的一些有前景的工作方向和未來趨勢,以提升模型的可信度。 + +##### **Graph Foundation Models for Recommendation: A Comprehensive Survey** +2502.08346v3 by Bin Wu, Yihang Wang, Yuanhao Zeng, Jiawei Liu, Jiashu Zhao, Cheng Yang, Yawen Li, Long Xia, Dawei Yin, Chuan Shi + +Recommender systems (RS) serve as a fundamental tool for navigating the vast +expanse of online information, with deep learning advancements playing an +increasingly important role in improving ranking accuracy. 
Among these, graph +neural networks (GNNs) excel at extracting higher-order structural information, +while large language models (LLMs) are designed to process and comprehend +natural language, making both approaches highly effective and widely adopted. +Recent research has focused on graph foundation models (GFMs), which integrate +the strengths of GNNs and LLMs to model complex RS problems more efficiently by +leveraging the graph-based structure of user-item relationships alongside +textual understanding. In this survey, we provide a comprehensive overview of +GFM-based RS technologies by introducing a clear taxonomy of current +approaches, diving into methodological details, and highlighting key challenges +and future directions. By synthesizing recent advancements, we aim to offer +valuable insights into the evolving landscape of GFM-based recommender systems. + +摘要:推薦系統 (RS) 是用於導航廣闊的線上資訊的基本工具,深度學習的進步在提升排名準確度方面扮演著日益重要的角色。其中,圖形神經網路 (GNN) 擅長萃取高階結構資訊,而大型語言模型 (LLM) 則設計用於處理和理解自然語言,這使得這兩種方法都非常有效且廣泛採用。最近的研究專注於圖形基礎模型 (GFM),它整合了 GNN 和 LLM 的優點,透過利用使用者與項目關係的圖形化結構以及文字理解,更有效率地建構複雜的 RS 問題模型。在這項調查中,我們透過介紹當前方法的明確分類、深入探討方法論細節,以及強調關鍵挑戰和未來方向,提供了 GFM 為基礎的 RS 技術的全面概觀。透過綜合最近的進展,我們旨在提供對 GFM 為基礎的推薦系統不斷演變的版圖的寶貴見解。 + +##### **Self-Evaluation for Job-Shop Scheduling** +2502.08684v1 by Imanol Echeverria, Maialen Murua, Roberto Santana + +Combinatorial optimization problems, such as scheduling and route planning, +are crucial in various industries but are computationally intractable due to +their NP-hard nature. Neural Combinatorial Optimization methods leverage +machine learning to address these challenges but often depend on sequential +decision-making, which is prone to error accumulation as small mistakes +propagate throughout the process. Inspired by self-evaluation techniques in +Large Language Models, we propose a novel framework that generates and +evaluates subsets of assignments, moving beyond traditional stepwise +approaches. 
Applied to the Job-Shop Scheduling Problem, our method integrates a +heterogeneous graph neural network with a Transformer to build a policy model +and a self-evaluation function. Experimental validation on challenging, +well-known benchmarks demonstrates the effectiveness of our approach, +surpassing state-of-the-art methods. + +摘要:組合優化問題,例如排程和路線規劃,在各行各業中至關重要,但由於它們的 NP 難度,在計算上難以處理。神經組合優化方法利用機器學習來解決這些挑戰,但通常依賴於序貫決策制定,而序貫決策制定容易發生錯誤累積,因為小錯誤會在整個過程中傳播。受大型語言模型中的自我評估技術啟發,我們提出了一個新的框架,可生成和評估作業子集,超越傳統的分步方法。應用於工作車間排程問題,我們的方法將異質圖神經網路與 Transformer 整合在一起,以建立策略模型和自我評估函數。在具有挑戰性的著名基準上的實驗驗證證明了我們方法的有效性,超越了最先進的方法。 + +##### **Improving Existing Optimization Algorithms with LLMs** +2502.08298v1 by Camilo Chacón Sartori, Christian Blum + +The integration of Large Language Models (LLMs) into optimization has created +a powerful synergy, opening exciting research opportunities. This paper +investigates how LLMs can enhance existing optimization algorithms. Using their +pre-trained knowledge, we demonstrate their ability to propose innovative +heuristic variations and implementation strategies. To evaluate this, we +applied a non-trivial optimization algorithm, Construct, Merge, Solve and Adapt +(CMSA) -- a hybrid metaheuristic for combinatorial optimization problems that +incorporates a heuristic in the solution construction phase. Our results show +that an alternative heuristic proposed by GPT-4o outperforms the +expert-designed heuristic of CMSA, with the performance gap widening on larger +and denser graphs. 
Project URL: https://imp-opt-algo-llms.surge.sh/ + +摘要:大型语言模型 (LLM) 与优化相结合,创造了一种强大的协同作用,开启了令人兴奋的研究机会。本文探讨了 LLM 如何增强现有的优化算法。利用其预先训练的知识,我们展示了它们提出创新启发式变体和实施策略的能力。为了评估这一点,我们应用了一种非平凡的优化算法,构建、合并、求解和适应 (CMSA)——一种用于组合优化问题的混合元启发式算法,它在求解构建阶段纳入了启发式算法。我们的结果表明,GPT-4o 提出的替代启发式算法优于 CMSA 的专家设计的启发式算法,并且随着图形变得更大、更密集,性能差距也在扩大。项目网址:https://imp-opt-algo-llms.surge.sh/ + +##### **LLM4GNAS: A Large Language Model Based Toolkit for Graph Neural Architecture Search** +2502.10459v1 by Yang Gao, Hong Yang, Yizhi Chen, Junxian Wu, Peng Zhang, Haishuai Wang + +Graph Neural Architecture Search (GNAS) facilitates the automatic design of +Graph Neural Networks (GNNs) tailored to specific downstream graph learning +tasks. However, existing GNAS approaches often require manual adaptation to new +graph search spaces, necessitating substantial code optimization and +domain-specific knowledge. To address this challenge, we present LLM4GNAS, a +toolkit for GNAS that leverages the generative capabilities of Large Language +Models (LLMs). LLM4GNAS includes an algorithm library for graph neural +architecture search algorithms based on LLMs, enabling the adaptation of GNAS +methods to new search spaces through the modification of LLM prompts. This +approach reduces the need for manual intervention in algorithm adaptation and +code modification. The LLM4GNAS toolkit is extensible and robust, incorporating +LLM-enhanced graph feature engineering, LLM-enhanced graph neural architecture +search, and LLM-enhanced hyperparameter optimization. Experimental results +indicate that LLM4GNAS outperforms existing GNAS methods on tasks involving +both homogeneous and heterogeneous graphs. 
+ +摘要:圖形神經架構搜尋 (GNAS) 促進圖形神經網路 (GNN) 的自動設計,以符合特定下游圖形學習任務。然而,現有的 GNAS 方法通常需要手動調整至新的圖形搜尋空間,這需要大量的程式碼最佳化和領域特定知識。為了應對這項挑戰,我們提出 LLM4GNAS,一個利用大型語言模型 (LLM) 的生成能力的 GNAS 工具包。LLM4GNAS 包含一個基於 LLM 的圖形神經架構搜尋演算法函式庫,讓 GNAS 方法能夠透過修改 LLM 提示來適應新的搜尋空間。這種方法減少了演算法適應和程式碼修改中手動介入的需要。LLM4GNAS 工具包具有可擴充性和穩健性,整合了 LLM 增強的圖形特徵工程、LLM 增強的圖形神經架構搜尋和 LLM 增強的超參數最佳化。實驗結果表明,LLM4GNAS 在涉及同質和異質圖形的任務上優於現有的 GNAS 方法。 + +##### **ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning** +2502.08148v1 by Vy Vo, Lizhen Qu, Tao Feng, Yuncheng Hua, Xiaoxi Kang, Songhai Fan, Tim Dwyer, Lay-Ki Soon, Gholamreza Haffari + +Identifying cause-and-effect relationships is critical to understanding +real-world dynamics and ultimately causal reasoning. Existing methods for +identifying event causality in NLP, including those based on Large Language +Models (LLMs), exhibit difficulties in out-of-distribution settings due to the +limited scale and heavy reliance on lexical cues within available benchmarks. +Modern benchmarks, inspired by probabilistic causal inference, have attempted +to construct causal graphs of events as a robust representation of causal +knowledge, where \texttt{CRAB} \citep{romanou2023crab} is one such recent +benchmark along this line. In this paper, we introduce \texttt{ACCESS}, a +benchmark designed for discovery and reasoning over abstract causal events. +Unlike existing resources, \texttt{ACCESS} focuses on causality of everyday +life events on the abstraction level. We propose a pipeline for identifying +abstractions for event generalizations from \texttt{GLUCOSE} +\citep{mostafazadeh-etal-2020-glucose}, a large-scale dataset of implicit +commonsense causal knowledge, from which we subsequently extract $1,4$K causal +pairs. Our experiments highlight the ongoing challenges of using statistical +methods and/or LLMs for automatic abstraction identification and causal +discovery in NLP. 
Nonetheless, we demonstrate that the abstract causal +knowledge provided in \texttt{ACCESS} can be leveraged for enhancing QA +reasoning performance in LLMs. + +摘要:找出因果關係對於理解現實世界的動態和最終的因果推理至關重要。現有的 NLP 事件因果關係識別方法,包括基於大型語言模型 (LLM) 的方法,由於規模有限且過度依賴於可用基準中的詞彙線索,在分佈外環境中表現出困難。受機率因果推論啟發的現代基準已嘗試建構事件的因果圖,作為因果知識的強健表示,其中 \texttt{CRAB} \citep{romanou2023crab} 是這條路徑上最近的一個基準。在本文中,我們介紹 \texttt{ACCESS},一個專門設計來探索和推理抽象因果事件的基準。與現有資源不同,\texttt{ACCESS} 專注於抽象層面上日常生活事件的因果關係。我們提出一個管道,用於從 \texttt{GLUCOSE} \citep{mostafazadeh-etal-2020-glucose} 找出事件概括的抽象,\texttt{GLUCOSE} 是隱含常識因果知識的大規模資料集,我們隨後從中萃取出 1,4K 因果對。我們的實驗突顯出使用統計方法和/或 LLM 進行 NLP 中的自動抽象識別和因果發現的持續挑戰。儘管如此,我們證明了 \texttt{ACCESS} 中提供的抽象因果知識可用於增強 LLM 中的問答推理效能。 + +##### **Neuro-Conceptual Artificial Intelligence: Integrating OPM with Deep Learning to Enhance Question Answering Quality** +2502.09658v1 by Xin Kang, Veronika Shteingardt, Yuhan Wang, Dov Dori + +Knowledge representation and reasoning are critical challenges in Artificial +Intelligence (AI), particularly in integrating neural and symbolic approaches +to achieve explainable and transparent AI systems. Traditional knowledge +representation methods often fall short of capturing complex processes and +state changes. We introduce Neuro-Conceptual Artificial Intelligence (NCAI), a +specialization of the neuro-symbolic AI approach that integrates conceptual +modeling using Object-Process Methodology (OPM) ISO 19450:2024 with deep +learning to enhance question-answering (QA) quality. By converting natural +language text into OPM models using in-context learning, NCAI leverages the +expressive power of OPM to represent complex OPM elements-processes, objects, +and states-beyond what traditional triplet-based knowledge graphs can easily +capture. This rich structured knowledge representation improves reasoning +transparency and answer accuracy in an OPM-QA system. 
We further propose +transparency evaluation metrics to quantitatively measure how faithfully the +predicted reasoning aligns with OPM-based conceptual logic. Our experiments +demonstrate that NCAI outperforms traditional methods, highlighting its +potential for advancing neuro-symbolic AI by providing rich knowledge +representations, measurable transparency, and improved reasoning. + +摘要:知識表徵與推理是人工智慧 (AI) 中的重大挑戰,特別是在整合神經與符號方法以實現可解釋且透明的人工智慧系統時。傳統的知識表徵方法通常無法捕捉複雜的流程和狀態變化。我們引入了神經概念人工智慧 (NCAI),一種神經符號 AI 方法的專門化,它將使用物件流程方法 (OPM) ISO 19450:2024 的概念建模與深度學習整合在一起,以提升問答 (QA) 的品質。透過使用情境學習將自然語言文字轉換為 OPM 模型,NCAI 充分利用 OPM 的表達能力來表徵複雜的 OPM 元素(流程、物件和狀態),超越傳統的三元組知識圖表容易捕捉的範圍。這種豐富的結構化知識表徵改善了 OPM-QA 系統中的推理透明度和答案準確度。我們進一步提出了透明度評估指標,以量化測量預測推理與基於 OPM 的概念邏輯的吻合程度。我們的實驗證明,NCAI 優於傳統方法,突顯了它在透過提供豐富的知識表徵、可測量的透明度和改善的推理來推進神經符號 AI 的潛力。 + +##### **GCoT: Chain-of-Thought Prompt Learning for Graphs** +2502.08092v1 by Xingtong Yu, Chang Zhou, Zhongwei Kuai, Xinming Zhang, Yuan Fang + +Chain-of-thought (CoT) prompting has achieved remarkable success in natural +language processing (NLP). However, its vast potential remains largely +unexplored for graphs. This raises an interesting question: How can we design +CoT prompting for graphs to guide graph models to learn step by step? On one +hand, unlike natural languages, graphs are non-linear and characterized by +complex topological structures. On the other hand, many graphs lack textual +data, making it difficult to formulate language-based CoT prompting. In this +work, we propose the first CoT prompt learning framework for text-free graphs, +GCoT. Specifically, we decompose the adaptation process for each downstream +task into a series of inference steps, with each step consisting of +prompt-based inference, ``thought'' generation, and thought-conditioned prompt +learning. While the steps mimic CoT prompting in NLP, the exact mechanism +differs significantly. 
Specifically, at each step, an input graph, along with a +prompt, is first fed into a pre-trained graph encoder for prompt-based +inference. We then aggregate the hidden layers of the encoder to construct a +``thought'', which captures the working state of each node in the current step. +Conditioned on this thought, we learn a prompt specific to each node based on +the current state. These prompts are fed into the next inference step, +repeating the cycle. To evaluate and analyze the effectiveness of GCoT, we +conduct comprehensive experiments on eight public datasets, which demonstrate +the advantage of our approach. + +摘要:鏈式思考 (CoT) 提示在自然語言處理 (NLP) 中取得了顯著的成功。然而,其龐大的潛力在圖形方面仍未得到充分探索。這提出了一個有趣的問題:我們如何設計圖形的 CoT 提示來指導圖形模型逐步學習?一方面,與自然語言不同,圖形是非線性的,並且具有複雜的拓撲結構。另一方面,許多圖形缺乏文本數據,這使得難以制定基於語言的 CoT 提示。在這項工作中,我們提出了第一個適用於無文本圖形的 CoT 提示學習框架 GCoT。具體來說,我們將每個下游任務的適應過程分解為一系列推理步驟,每個步驟都包含基於提示的推理、「思想」生成以及基於思想的提示學習。雖然這些步驟模擬了 NLP 中的 CoT 提示,但具體機制卻有很大不同。具體來說,在每一步中,一個輸入圖形連同一個提示首先被輸入到一個預訓練的圖形編碼器中進行基於提示的推理。然後,我們聚合編碼器的隱藏層以構建一個「思想」,它捕獲了當前步驟中每個節點的工作狀態。基於這個思想,我們根據當前狀態學習一個特定於每個節點的提示。這些提示被輸入到下一個推理步驟中,重複這個循環。為了評估和分析 GCoT 的有效性,我們對八個公共數據集進行了全面的實驗,這證明了我們方法的優勢。 + +##### **Linking Cryptoasset Attribution Tags to Knowledge Graph Entities: An LLM-based Approach** +2502.10453v1 by Régnier Avice, Bernhard Haslhofer, Zhidong Li, Jianlong Zhou + +Attribution tags form the foundation of modern cryptoasset forensics. +However, inconsistent or incorrect tags can mislead investigations and even +result in false accusations. To address this issue, we propose a novel +computational method based on Large Language Models (LLMs) to link attribution +tags with well-defined knowledge graph concepts. We implemented this method in +an end-to-end pipeline and conducted experiments showing that our approach +outperforms baseline methods by up to 37.4% in F1-score across three publicly +available attribution tag datasets. 
By integrating concept filtering and +blocking procedures, we generate candidate sets containing five knowledge graph +entities, achieving a recall of 93% without the need for labeled data. +Additionally, we demonstrate that local LLM models can achieve F1-scores of +90%, comparable to remote models which achieve 94%. We also analyze the +cost-performance trade-offs of various LLMs and prompt templates, showing that +selecting the most cost-effective configuration can reduce costs by 90%, with +only a 1% decrease in performance. Our method not only enhances attribution tag +quality but also serves as a blueprint for fostering more reliable forensic +evidence. -摘要:腦電圖訊號傳達了關於大腦活動的重要資訊,無論是在健康或病理狀況下。然而,它們本質上是有雜訊的,這對準確的分析和解釋構成了重大的挑戰。傳統的腦電圖人工製品移除方法雖然有效,但通常需要大量的專家介入。本研究提出 LSTEEG,一種新穎的基於 LSTM 的自動編碼器,用於偵測和校正腦電圖訊號中的人工製品。利用深度學習,特別是 LSTM 層,LSTEEG 捕捉序列腦電圖資料中的非線性依賴性。與其他最先進的卷積自動編碼器相比,LSTEEG 在人工製品偵測和校正任務中都展現出優異的效能。我們的做法增強了自動編碼器潛在空間的可解釋性和實用性,讓資料驅動的自動人工製品移除得以應用於腦電圖的下游任務。這項研究推動了高效且準確的多通道腦電圖前處理領域,並促進了自動腦電圖分析管線在腦部健康應用中的實作和使用。 +摘要:歸因標籤構成現代加密資產鑑識的基礎。 +然而,不一致或不正確的標籤會誤導調查,甚至導致錯誤的指控。為了解決這個問題,我們提出了一種基於大型語言模型 (LLM) 的新型計算方法,將歸因標籤與定義明確的知識圖譜概念連結起來。我們在端到端管道中實施了這種方法,並進行了實驗,結果顯示我們的做法在三個公開可用的歸因標籤資料集中,F1 分數比基線方法高出 37.4%。透過整合概念過濾和封鎖程序,我們生成了包含五個知識圖譜實體的候選集,在不需要標籤資料的情況下,達到了 93% 的召回率。 +此外,我們證明了本機 LLM 模型可以達到 90% 的 F1 分數,與達到 94% 的遠端模型相當。我們也分析了各種 LLM 和提示範本的成本效益權衡,結果顯示選擇最具成本效益的設定可以將成本降低 90%,而效能只下降 1%。我們的做法不僅提升了歸因標籤的品質,也作為促進更可靠鑑識證據的藍圖。 -##### **SycEval: Evaluating LLM Sycophancy** -2502.08177v1 by Aaron Fanous, Jacob Goldberg, Ank A. Agarwal, Joanna Lin, Anson Zhou, Roxana Daneshjou, Sanmi Koyejo +##### **Deep Semantic Graph Learning via LLM based Node Enhancement** +2502.07982v1 by Chuanqi Shi, Yiyi Tao, Hang Zhang, Lun Wang, Shaoshuai Du, Yixian Shen, Yanxin Shen -Large language models (LLMs) are increasingly applied in educational, -clinical, and professional settings, but their tendency for sycophancy -- -prioritizing user agreement over independent reasoning -- poses risks to -reliability. 
This study introduces a framework to evaluate sycophantic behavior -in ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro across AMPS (mathematics) and -MedQuad (medical advice) datasets. Sycophantic behavior was observed in 58.19% -of cases, with Gemini exhibiting the highest rate (62.47%) and ChatGPT the -lowest (56.71%). Progressive sycophancy, leading to correct answers, occurred -in 43.52% of cases, while regressive sycophancy, leading to incorrect answers, -was observed in 14.66%. Preemptive rebuttals demonstrated significantly higher -sycophancy rates than in-context rebuttals (61.75% vs. 56.52%, $Z=5.87$, -$p<0.001$), particularly in computational tasks, where regressive sycophancy -increased significantly (preemptive: 8.13%, in-context: 3.54%, $p<0.001$). -Simple rebuttals maximized progressive sycophancy ($Z=6.59$, $p<0.001$), while -citation-based rebuttals exhibited the highest regressive rates ($Z=6.59$, -$p<0.001$). Sycophantic behavior showed high persistence (78.5%, 95% CI: -[77.2%, 79.8%]) regardless of context or model. These findings emphasize the -risks and opportunities of deploying LLMs in structured and dynamic domains, -offering insights into prompt programming and model optimization for safer AI -applications. +Graph learning has attracted significant attention due to its widespread +real-world applications. Current mainstream approaches rely on text node +features and obtain initial node embeddings through shallow embedding learning +using GNNs, which shows limitations in capturing deep textual semantics. Recent +advances in Large Language Models (LLMs) have demonstrated superior +capabilities in understanding text semantics, transforming traditional text +feature processing. This paper proposes a novel framework that combines Graph +Transformer architecture with LLM-enhanced node features. 
Specifically, we +leverage LLMs to generate rich semantic representations of text nodes, which +are then processed by a multi-head self-attention mechanism in the Graph +Transformer to capture both local and global graph structural information. Our +model utilizes the Transformer's attention mechanism to dynamically aggregate +neighborhood information while preserving the semantic richness provided by LLM +embeddings. Experimental results demonstrate that the LLM-enhanced node +features significantly improve the performance of graph learning models on node +classification tasks. This approach shows promising results across multiple +graph learning tasks, offering a practical direction for combining graph +networks with language models. -摘要:大型語言模型(LLM)日益應用於教育、臨床和專業領域,但它們趨於趨炎附勢——優先考慮用戶同意而非獨立推理——對可靠性構成風險。本研究引入了一個框架來評估 ChatGPT-4o、Claude-Sonnet 和 Gemini-1.5-Pro 中的趨炎附勢行為,涉及 AMPS(數學)和 MedQuad(醫療建議)數據集。在 58.19% 的案例中觀察到了趨炎附勢行為,其中 Gemini 表現出最高比率(62.47%),而 ChatGPT 最低(56.71%)。導致正確答案的漸進式趨炎附勢發生在 43.52% 的案例中,而導致不正確答案的退步式趨炎附勢則在 14.66% 的案例中被觀察到。先發制人的反駁表現出顯著高於上下文反駁的趨炎附勢率(61.75% 對 56.52%,Z=5.87,p<0.001),特別是在計算任務中,其中退步式趨炎附勢顯著增加(先發制人:8.13%,上下文:3.54%,p<0.001)。簡單的反駁最大化了漸進式趨炎附勢(Z=6.59,p<0.001),而基於引用的反駁表現出最高的退步式比率(Z=6.59,p<0.001)。趨炎附勢行為表現出很高的持續性(78.5%,95% CI:[77.2%,79.8%]),無論上下文或模型如何。這些發現強調了在結構化和動態領域部署 LLM 的風險和機遇,為更安全的 AI 應用提供了提示編程和模型優化的見解。 +摘要:圖形學習因其廣泛的現實世界應用而備受關注。目前的熱門方法依賴於文本節點特徵,並通過使用 GNN 的淺層嵌入學習來獲取初始節點嵌入,這在捕捉深度文本語義方面表現出局限性。大語言模型 (LLM) 的最新進展已證明在理解文本語義方面具有優越的能力,轉換了傳統的文本特徵處理。本文提出了一種新的框架,將圖形轉換器架構與 LLM 增強的節點特徵相結合。具體來說,我們利用 LLM 來生成文本節點的豐富語義表示,然後在圖形轉換器中由多頭自我注意機制處理,以捕捉局部和全局圖形結構信息。我們的模型利用 Transformer 的注意機制來動態聚合鄰域信息,同時保留 LLM 嵌入提供的語義豐富性。實驗結果表明,LLM 增強的節點特徵顯著提高了圖形學習模型在節點分類任務上的性能。這種方法在多個圖形學習任務中顯示出有希望的結果,為將圖形網絡與語言模型相結合提供了實用的方向。 -##### **Cancer Vaccine Adjuvant Name Recognition from Biomedical Literature using Large Language Models** -2502.09659v1 by Hasin Rehana, Jie Zheng, Leo Yeh, Benu Bansal, Nur Bengisu Çam, Christianah Jemiyo, Brett McGregor, Arzucan Özgür, Yongqun He, Junguk Hur +##### 
**Cardiverse: Harnessing LLMs for Novel Card Game Prototyping** +2502.07128v1 by Danrui Li, Sen Zhang, Sam S. Sohn, Kaidong Hu, Muhammad Usman, Mubbasir Kapadia -Motivation: An adjuvant is a chemical incorporated into vaccines that -enhances their efficacy by improving the immune response. Identifying adjuvant -names from cancer vaccine studies is essential for furthering research and -enhancing immunotherapies. However, manual curation from the constantly -expanding biomedical literature poses significant challenges. This study -explores the automated recognition of vaccine adjuvant names using Large -Language Models (LLMs), specifically Generative Pretrained Transformers (GPT) -and Large Language Model Meta AI (Llama). Methods: We utilized two datasets: 97 -clinical trial records from AdjuvareDB and 290 abstracts annotated with the -Vaccine Adjuvant Compendium (VAC). GPT-4o and Llama 3.2 were employed in -zero-shot and few-shot learning paradigms with up to four examples per prompt. -Prompts explicitly targeted adjuvant names, testing the impact of contextual -information such as substances or interventions. Outputs underwent automated -and manual validation for accuracy and consistency. Results: GPT-4o attained -100% Precision across all situations while exhibiting notable improvement in Recall -and F1-scores, particularly when incorporating interventions. On the VAC -dataset, GPT-4o achieved a maximum F1-score of 77.32% with interventions, -surpassing Llama-3.2-3B by approximately 2%. On the AdjuvareDB dataset, GPT-4o -reached an F1-score of 81.67% for three-shot prompting with interventions, -surpassing Llama-3.2-3B's maximum F1-score of 65.62%. Conclusion: Our findings -demonstrate that LLMs excel at identifying adjuvant names, including rare -variations of naming representation. This study emphasizes the capability of -LLMs to enhance cancer vaccine development by efficiently extracting insights. 
-Future work aims to broaden the framework to encompass various biomedical -literature and enhance model generalizability across various vaccines and -adjuvants. +The prototyping of computer games, particularly card games, requires +extensive human effort in creative ideation and gameplay evaluation. Recent +advances in Large Language Models (LLMs) offer opportunities to automate and +streamline these processes. However, it remains challenging for LLMs to design +novel game mechanics beyond existing databases, generate consistent gameplay +environments, and develop scalable gameplay AI for large-scale evaluations. +This paper addresses these challenges by introducing a comprehensive automated +card game prototyping framework. The approach highlights a graph-based indexing +method for generating novel game designs, an LLM-driven system for consistent +game code generation validated by gameplay records, and a gameplay AI +constructing method that uses an ensemble of LLM-generated action-value +functions optimized through self-play. These contributions aim to accelerate +card game prototyping, reduce human labor, and lower barriers to entry for game +developers. 
-摘要:動機:佐劑是一種加入疫苗的化學物質,能藉由改善免疫反應來提升疫苗的效力。從癌症疫苗研究中找出佐劑名稱對於推進研究和改善免疫療法至關重要。然而,從不斷擴展的生物醫學文獻中手動整理會造成重大挑戰。本研究探討使用大型語言模型 (LLM),特別是生成式預訓練Transformer (GPT) 和大型語言模型 Meta AI (Llama) 來自動辨識疫苗佐劑名稱。方法:我們使用兩個資料集:來自 AdjuvareDB 的 97 份臨床試驗記錄和 290 篇標註了疫苗佐劑彙編 (VAC) 的摘要。GPT-4o 和 Llama 3.2 被用於零次學習和少量學習範例,每個提示最多有四個範例。提示明確鎖定佐劑名稱,測試物質或介入措施等背景資訊的影響。輸出經過自動和手動驗證,以確保準確性和一致性。結果:GPT-4o 在所有情況下都達到 100% 的準確率,同時在召回率和 F1 分數上表現出顯著的進步,特別是在納入介入措施的情況下。在 VAC 資料集上,GPT-4o 在有介入措施的情況下達到 77.32% 的最高 F1 分數,比 Llama-3.2-3B 高出約 2%。在 AdjuvareDB 資料集上,GPT-4o 在有介入措施的三次提示中達到 81.67% 的 F1 分數,超過 Llama-3.2-3 B 的最高 F1 分數 65.62%。結論:我們的研究結果表明,LLM 在辨識佐劑名稱方面表現出色,包括命名表示的罕見變異。本研究強調了 LLM 在有效提取見解方面增強癌症疫苗開發的能力。未來的研究工作旨在擴大架構,涵蓋各種生物醫學文獻,並增強模型在各種疫苗和佐劑中的泛化能力。 +摘要:電腦遊戲,尤其是卡牌遊戲的原型製作,需要大量的人力在創意構思和遊戲玩法評估上。大型語言模型 (LLM) 的最新進展提供了自動化和簡化這些流程的機會。然而,LLM 在設計超越現有資料庫的新穎遊戲機制、生成一致的遊戲環境,以及開發用於大規模評估的可擴充遊戲 AI 方面仍然面臨挑戰。本文通過引入一個全面的自動化卡牌遊戲原型製作框架來應對這些挑戰。該方法強調了一種基於圖表的索引方法,用於生成新穎的遊戲設計,一個由 LLM 驅動的系統,用於一致的遊戲程式碼生成,並由遊戲記錄驗證,以及一個遊戲 AI 構建方法,該方法使用由 LLM 生成的動作值函數的集合,通過自我對弈進行最佳化。這些貢獻旨在加速卡牌遊戲原型製作,減少人力,並降低遊戲開發人員的進入門檻。 -##### **Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?** -2502.07963v1 by Hye Sun Yun, Karen Y. C. Zhang, Ramez Kouzy, Iain J. Marshall, Junyi Jessy Li, Byron C. Wallace +##### **GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural Processing Units** +2502.06921v2 by Arghadip Das, Shamik Kundu, Arnab Raha, Soumendu Ghosh, Deepak Mathaikutty, Vijay Raghunathan -Medical research faces well-documented challenges in translating novel -treatments into clinical practice. Publishing incentives encourage researchers -to present "positive" findings, even when empirical results are equivocal. -Consequently, it is well-documented that authors often spin study results, -especially in article abstracts. Such spin can influence clinician -interpretation of evidence and may affect patient care decisions. 
In this -study, we ask whether the interpretation of trial results offered by Large -Language Models (LLMs) is similarly affected by spin. This is important since -LLMs are increasingly being used to trawl through and synthesize published -medical evidence. We evaluated 22 LLMs and found that they are across the board -more susceptible to spin than humans. They might also propagate spin into their -outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into -plain language summaries that they generate. We also find, however, that LLMs -are generally capable of recognizing spin, and can be prompted in a way to -mitigate spin's impact on LLM outputs. +Graph Neural Networks (GNNs) are vital for learning from graph-structured +data, enabling applications in network analysis, recommendation systems, and +speech analytics. Deploying them on edge devices like client PCs and laptops +enhances real-time processing, privacy, and cloud independence. GNNs aid +Retrieval-Augmented Generation (RAG) for Large Language Models (LLMs) and +enable event-based vision tasks. However, irregular memory access, sparsity, +and dynamic structures cause high latency and energy overhead on +resource-constrained devices. While modern edge processors integrate CPUs, +GPUs, and NPUs, NPUs designed for data-parallel tasks struggle with irregular +GNN computations. We introduce GraNNite, the first hardware-aware framework +optimizing GNN execution on commercial-off-the-shelf (COTS) SOTA DNN +accelerators via a structured three-step methodology: (1) enabling NPU +execution, (2) optimizing performance, and (3) trading accuracy for efficiency +gains. Step 1 employs GraphSplit for workload distribution and StaGr for static +aggregation, while GrAd and NodePad handle dynamic graphs. Step 2 boosts +performance using EffOp for control-heavy tasks and GraSp for sparsity +exploitation. Graph Convolution optimizations PreG, SymG, and CacheG reduce +redundancy and memory transfers. 
Step 3 balances quality versus efficiency, +where QuantGr applies INT8 quantization, and GrAx1, GrAx2, and GrAx3 accelerate +attention, broadcast-add, and SAGE-max aggregation. On Intel Core Ultra AI PCs, +GraNNite achieves 2.6X to 7.6X speedups over default NPU mappings and up to +8.6X energy gains over CPUs and GPUs, delivering 10.8X and 6.7X higher +performance than CPUs and GPUs, respectively, across GNN models. -摘要:醫學研究在將新穎療法轉化為臨床實務上,面臨著有據可查的挑戰。發表誘因鼓勵研究人員呈現「正向」的發現,即使經驗結果模稜兩可。因此,有據可查的是,作者經常扭曲研究結果,特別是在文章摘要中。此類扭曲可能會影響臨床醫師對證據的詮釋,並可能影響病患照護決策。在本研究中,我們探討大型語言模型 (LLM) 提供的試驗結果詮釋是否也受到扭曲影響。由於 LLM 正越來越常被用於爬梳和綜合已發表的醫學證據,因此這點非常重要。我們評估了 22 個 LLM,發現它們普遍比人類更容易受到扭曲影響。它們也可能將扭曲傳播到其輸出中:例如,我們發現 LLM 會將扭曲隱含納入其產生的白話文摘要中。然而,我們也發現 LLM 通常有能力辨認扭曲,而且可以透過提示的方式減輕扭曲對 LLM 輸出的影響。 +摘要:圖形神經網路 (GNN) 對於從圖形結構資料中學習至關重要,能應用於網路分析、推薦系統和語音分析。將其部署在邊緣裝置(例如用戶端電腦和筆電)上可增強即時處理、隱私和雲端獨立性。GNN 協助大型語言模型 (LLM) 的檢索增強生成 (RAG),並支援基於事件的視覺任務。然而,不規則的記憶體存取、稀疏性和動態結構會導致資源受限裝置上的高延遲和能源負擔。儘管現代邊緣處理器整合了 CPU、GPU 和 NPU,但針對資料平行任務所設計的 NPU 難以處理不規則的 GNN 計算。我們引入了 GraNNite,這是第一個硬體感知框架,透過結構化的三步驟方法最佳化商用現成 (COTS) SOTA DNN 加速器上的 GNN 執行:(1) 啟用 NPU 執行,(2) 最佳化效能,以及 (3) 以準確度換取效率提升。步驟 1 使用 GraphSplit 進行工作負載分配,並使用 StaGr 進行靜態聚合,而 GrAd 和 NodePad 則處理動態圖形。步驟 2 使用 EffOp 提升控制密集型任務的效能,並使用 GraSp 進行稀疏性利用。圖形卷積最佳化 PreG、SymG 和 CacheG 減少了冗餘和記憶體傳輸。步驟 3 平衡品質與效率,其中 QuantGr 適用 INT8 量化,而 GrAx1、GrAx2 和 GrAx3 則加速注意力、廣播加法和 SAGE-max 聚合。在 Intel Core Ultra AI PC 上,GraNNite 在預設 NPU 映射上實現了 2.6X 到 7.6X 的加速,在 CPU 和 GPU 上實現了高達 8.6X 的能源增益,在 GNN 模型中分別提供了比 CPU 和 GPU 高出 10.8X 和 6.7X 的效能。 -##### **An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating** -2502.07755v1 by Mohammad Ali Labbaf Khaniki, Sahabeh Saadati, Mohammad Manthouri +##### **Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language** +2502.06634v1 by Zhiqiang Zhong, Simon Sataa-Yu Larsen, Haoyu Guo, Tao Tang, Kuangyu Zhou, Davide Mottin -This paper presents a novel Natural Language Processing (NLP) framework for -enhancing 
medical diagnosis through the integration of advanced techniques in -data augmentation, feature extraction, and classification. The proposed -approach employs back-translation to generate diverse paraphrased datasets, -improving robustness and mitigating overfitting in classification tasks. -Leveraging Decoding-enhanced BERT with Disentangled Attention (DeBERTa) with -Dynamic Contextual Positional Gating (DCPG), the model captures fine-grained -contextual and positional relationships, dynamically adjusting the influence of -positional information based on semantic context to produce high-quality text -embeddings. For classification, an Attention-Based Feedforward Neural Network -(ABFNN) is utilized, effectively focusing on the most relevant features to -improve decision-making accuracy. Applied to the classification of symptoms, -clinical notes, and other medical texts, this architecture demonstrates its -ability to address the complexities of medical data. The combination of data -augmentation, contextual embedding generation, and advanced classification -mechanisms offers a robust and accurate diagnostic tool, with potential -applications in automated medical diagnosis and clinical decision support. This -method demonstrates the effectiveness of the proposed NLP framework for medical -diagnosis, achieving remarkable results with an accuracy of 99.78%, recall of -99.72%, precision of 99.79%, and an F1-score of 99.75%. These metrics not only -underscore the model's robust performance in classifying medical texts with -exceptional precision and reliability but also highlight its superiority over -existing methods, making it a highly promising tool for automated diagnostic -systems. +Recent advancements in AI for biological research focus on integrating +molecular data with natural language to accelerate drug discovery. However, the +scarcity of high-quality annotations limits progress in this area. 
This paper +introduces LA$^3$, a Language-based Automatic Annotation Augmentation framework +that leverages large language models to augment existing datasets, thereby +improving AI training. We demonstrate the effectiveness of LA$^3$ by creating +an enhanced dataset, LaChEBI-20, where we systematically rewrite the +annotations of molecules from an established dataset. These rewritten +annotations preserve essential molecular information while providing more +varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5 +based on a benchmark architecture to learn the mapping between molecular +representations and augmented annotations. + Experimental results on text-based *de novo* molecule generation and molecule +captioning demonstrate that LaMolT5 outperforms state-of-the-art models. +Notably, incorporating LA$^3$ leads to improvements of up to 301% over the +benchmark architecture. Furthermore, we validate the effectiveness of LA$^3$ +notable applications in *image*, *text* and *graph* tasks, affirming its +versatility and utility. 
-摘要:本文提出了一個創新的自然語言處理 (NLP) 框架,透過整合資料擴充、特徵萃取和分類的進階技術來增強醫療診斷。所提出的方法採用反向翻譯來產生多樣化的同義改寫資料集,提升穩健性並減輕分類任務中的過度擬合。透過利用具有動態脈絡位置閘控 (DCPG) 的解碼增強 BERT 與去糾纏注意力 (DeBERTa),這個模型捕捉細緻的脈絡和位置關係,根據語意脈絡動態調整位置資訊的影響,以產生高品質的文字嵌入。在分類方面,利用基於注意力的前饋神經網路 (ABFNN),有效地關注最相關的特徵,以提高決策準確度。應用於症狀、臨床筆記和其他醫療文本的分類,此架構證明了其處理醫療資料複雜性的能力。資料擴充、脈絡嵌入產生和進階分類機制的結合提供了一個穩健且準確的診斷工具,在自動化醫療診斷和臨床決策支援中具有潛在應用。此方法證明了所提出的 NLP 框架在醫療診斷中的有效性,以 99.78% 的準確度、99.72% 的召回率、99.79% 的精確度和 99.75% 的 F1 分數,取得了顯著的成果。這些指標不僅強調了模型在分類醫療文本時具有卓越的精確度和可靠性,也突顯了它優於現有方法的優越性,使其成為自動化診斷系統中極具前景的工具。 +摘要:人工智慧在生物研究上的最新進展,專注於將分子資料與自然語言整合,以加速藥物發現。然而,高品質註解的稀少限制了此領域的進展。這篇論文介紹了 LA$^3$,一個基於語言的自動註解擴充框架,它利用大型語言模型來擴充現有的資料集,進而改善人工智慧訓練。我們透過建立一個增強的資料集 LaChEBI-20 來展示 LA$^3$ 的有效性,我們系統性地改寫了一個既定資料集中分子的註解。這些改寫的註解保留了重要的分子資訊,同時提供了更多樣化的句子結構和詞彙。使用 LaChEBI-20,我們在基於基準架構上訓練 LaMolT5,以學習分子表示和擴充註解之間的對應。 +在基於文字的 *從頭開始* 分子生成和分子標題上的實驗結果表明,LaMolT5 優於最先進的模型。值得注意的是,納入 LA$^3$ 可讓基準架構的改進幅度高達 301%。此外,我們驗證了 LA$^3$ 在 *影像*、*文字* 和 *圖形* 任務中的有效性,肯定了它的多功能性和實用性。 -##### **Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension** -2502.07752v2 by Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds +##### **KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment** +2502.06472v1 by Yuxing Lu, Jinzhuo Wang -Designing efficient optimizers for large language models (LLMs) with -low-memory requirements and fast convergence is an important and challenging -problem. This paper makes a step towards the systematic design of such -optimizers through the lens of structured Fisher information matrix (FIM) -approximation. We show that many state-of-the-art efficient optimizers can be -viewed as solutions to FIM approximation (under the Frobenius norm) with -specific structural assumptions. 
Building on these insights, we propose two -design recommendations of practical efficient optimizers for LLMs, involving -the careful selection of structural assumptions to balance generality and -efficiency, and enhancing memory efficiency of optimizers with general -structures through a novel low-rank extension framework. We demonstrate how to -use each design approach by deriving new memory-efficient optimizers: Row and -Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation -(Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate the -effectiveness, showing faster and better convergence than existing -memory-efficient baselines and Adam with little memory overhead. Notably, Alice -achieves better than 2x faster convergence over Adam, while RACS delivers -strong performance on the 1B model with SGD-like memory. +Maintaining comprehensive and up-to-date knowledge graphs (KGs) is critical +for modern AI systems, but manual curation struggles to scale with the rapid +growth of scientific literature. This paper presents KARMA, a novel framework +employing multi-agent large language models (LLMs) to automate KG enrichment +through structured analysis of unstructured text. Our approach employs nine +collaborative agents, spanning entity discovery, relation extraction, schema +alignment, and conflict resolution that iteratively parse documents, verify +extracted knowledge, and integrate it into existing graph structures while +adhering to domain-specific schema. Experiments on 1,200 PubMed articles from +three different domains demonstrate the effectiveness of KARMA in knowledge +graph enrichment, with the identification of up to 38,230 new entities while +achieving 83.1\% LLM-verified correctness and reducing conflict edges by 18.6\% +through multi-layer assessments. 
-摘要:設計具有低記憶體需求和快速收斂的大型語言模型 (LLM) 的高效最佳化器是一個重要且具有挑戰性的問題。本文透過結構化 Fisher 資訊矩陣 (FIM) 近似的觀點,朝著系統化設計此類最佳化器邁出了一步。我們證明許多最先進的高效最佳化器可以視為 FIM 近似(在 Frobenius 範數下)的解,並具有特定的結構假設。基於這些見解,我們提出了 LLM 的兩個實用高效最佳化器設計建議,包括仔細選擇結構假設以平衡通用性和效率,以及透過新穎的低秩擴充框架增強一般結構最佳化器的記憶體效率。我們透過推導新的記憶體高效最佳化器來展示如何使用每種設計方法:列和欄縮放 SGD (RACS) 和自適應低維子空間估計 (Alice)。在 LLaMA 預訓練(高達 1B 參數)上的實驗驗證了其有效性,顯示比現有的記憶體高效基準和 Adam 更快且更好的收斂,且記憶體開銷很小。值得注意的是,Alice 的收斂速度比 Adam 快 2 倍以上,而 RACS 則在 1B 模型上提供類似 SGD 的記憶體的強勁效能。 +摘要:維護全面且最新的知識圖譜 (KG) 對現代 AI 系統至關重要,但手動策劃難以隨著科學文獻的快速增長而擴展。本文提出了 KARMA,一個採用多代理大型語言模型 (LLM) 的新框架,透過對非結構化文本的結構化分析來自動化 KG 豐富化。我們的做法採用九個協作代理,涵蓋實體發現、關係提取、架構比對和衝突解決,這些代理會反覆分析文件、驗證提取的知識,並將其整合到現有的圖結構中,同時遵守特定領域的架構。針對來自三個不同領域的 1,200 篇 PubMed 文章進行的實驗證明了 KARMA 在知識圖譜豐富化方面的有效性,識別出多達 38,230 個新實體,同時達到 83.1% 的 LLM 驗證正確性,並透過多層評估將衝突邊緣降低了 18.6%。 -##### **The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation** -2502.07516v2 by Raman Dutt +##### **RoToR: Towards More Reliable Responses for Order-Invariant Inputs** +2502.08662v1 by Soyoung Yoon, Dongha Ahn, Youngwon Lee, Minkyu Jung, HyungJoo Jang, Seung-won Hwang -Generative models, particularly text-to-image (T2I) diffusion models, play a -crucial role in medical image analysis. However, these models are prone to -training data memorization, posing significant risks to patient privacy. -Synthetic chest X-ray generation is one of the most common applications in -medical image analysis with the MIMIC-CXR dataset serving as the primary data -repository for this task. This study presents the first systematic attempt to -identify prompts and text tokens in MIMIC-CXR that contribute the most to -training data memorization. Our analysis reveals two unexpected findings: (1) -prompts containing traces of de-identification procedures (markers introduced -to hide Protected Health Information) are the most memorized, and (2) among all -tokens, de-identification markers contribute the most towards memorization. 
-This highlights a broader issue with the standard anonymization practices and -T2I synthesis with MIMIC-CXR. To make matters worse, existing inference-time -memorization mitigation strategies are ineffective and fail to sufficiently -reduce the model's reliance on memorized text tokens. On this front, we propose -actionable strategies for different stakeholders to enhance privacy and improve -the reliability of generative models in medical imaging. Finally, our results -provide a foundation for future work on developing and benchmarking -memorization mitigation techniques for synthetic chest X-ray generation using -the MIMIC-CXR dataset. The anonymized code is available at -https://anonymous.4open.science/r/diffusion_memorization-8011/ +Mitigating positional bias of language models (LMs) for listwise inputs is a +well-known and important problem (e.g., lost-in-the-middle). While zero-shot +order-invariant LMs have been proposed to solve this issue, their success on +practical listwise problems has been limited. In this work, as a first +contribution, we identify and overcome two limitations to make zero-shot +invariant LMs more practical: (1) training and inference distribution mismatch +arising from modifying positional ID assignments to enforce invariance, and (2) +failure to adapt to a mixture of order-invariant and sensitive inputs in +practical listwise problems. To overcome these limitations, we propose (1) RoToR, a zero-shot +invariant LM for genuinely order-invariant inputs with minimal modifications of +positional IDs, and (2) Selective Routing, an adaptive framework that handles +both order-invariant and order-sensitive inputs in listwise tasks. On the Lost +in the middle (LitM), Knowledge Graph Question Answering (KGQA), and MMLU +benchmarks, we show that RoToR with Selective Routing can effectively handle +practical listwise input tasks in a zero-shot manner. 
-摘要:生成模型,尤其是文本到影像 (T2I) 擴散模型在醫學影像分析中扮演著至關重要的角色。然而,這些模型容易訓練資料記憶,對病患隱私構成重大風險。合成胸部 X 光影像生成是醫學影像分析中最常見的應用之一,而 MIMIC-CXR 資料集則作為此任務的主要資料儲存庫。本研究提出了第一個系統化的嘗試,以識別 MIMIC-CXR 中對訓練資料記憶貢獻最大的提示和文字代碼。我們的分析揭示了兩個出乎意料的發現:(1) 包含去識別程序痕跡的提示(用於隱藏受保護健康資訊的標記)是最容易被記憶的,以及 (2) 在所有代碼中,去識別標記對記憶的貢獻最大。這突顯了標準匿名化實務和使用 MIMIC-CXR 進行 T2I 合成的更廣泛問題。更糟的是,現有的推論時間記憶減緩策略無效,無法充分降低模型對記憶文字代碼的依賴。在這個方面,我們針對不同的利害關係人提出可行的策略,以增強隱私和改善生成模型在醫學影像中的可靠性。最後,我們的結果為未來開發和評量使用 MIMIC-CXR 資料集進行合成胸部 X 光影像生成的記憶減緩技術奠定了基礎。已匿名化的程式碼可在 https://anonymous.4open.science/r/diffusion_memorization-8011/ 取得。 +摘要:語言模型 (LM) 的位置偏差緩解對於列表輸入來說是一個廣為人知且重要的問題(例如,迷失在中間)。雖然已經提出零次學習順序不變的 LM 來解決這個問題,但它們在實際列表問題上的成功卻很有限。在這項工作中,作為第一個貢獻,我們找出並克服了兩個限制,讓零次學習不變的 LM 更有實用性:(1) 訓練和推論分布不匹配,這是由於修改位置 ID 分配以強制不變性所造成的,以及 (2) 無法適應實際列表問題中不變和敏感輸入的組合。為了克服這些問題,我們提出 (1) RoToR,一個零次學習不變的 LM,用於真正不變的輸入,並對位置 ID 進行最小的修改,以及 (2) 選擇性路由,一個自適應框架,用於處理列表任務中不變和敏感的輸入。在迷失在中間 (LitM)、知識圖譜問答 (KGQA) 和 MMLU 基準測試中,我們展示了 RoToR 與選擇性路由可以有效地以零次學習的方式處理實際的列表輸入任務。 -##### **KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level** -2502.07288v1 by Ruining Deng, Tianyuan Yao, Yucheng Tang, Junlin Guo, Siqi Lu, Juming Xiong, Lining Yu, Quan Huu Cap, Pengzhou Cai, Libin Lan, Ze Zhao, Adrian Galdran, Amit Kumar, Gunjan Deotale, Dev Kumar Das, Inyoung Paik, Joonho Lee, Geongyu Lee, Yujia Chen, Wangkai Li, Zhaoyang Li, Xuege Hou, Zeyuan Wu, Shengjin Wang, Maximilian Fischer, Lars Kramer, Anghong Du, Le Zhang, Maria Sanchez Sanchez, Helena Sanchez Ulloa, David Ribalta Heredia, Carlos Perez de Arenaza Garcia, Shuoyu Xu, Bingdou He, Xinping Cheng, Tao Wang, Noemie Moreau, Katarzyna Bozek, Shubham Innani, Ujjwal Baid, Kaura Solomon Kefas, Bennett A. 
Landman, Yu Wang, Shilin Zhao, Mengmeng Yin, Haichun Yang, Yuankai Huo +##### **K-ON: Stacking Knowledge On the Head Layer of Large Language Model** +2502.06257v1 by Lingbing Guo, Yichi Zhang, Zhongpu Bo, Zhuo Chen, Mengshu Sun, Zhiqiang Zhang, Wen Zhang, Huajun Chen -Chronic kidney disease (CKD) is a major global health issue, affecting over -10% of the population and causing significant mortality. While kidney biopsy -remains the gold standard for CKD diagnosis and treatment, the lack of -comprehensive benchmarks for kidney pathology segmentation hinders progress in -the field. To address this, we organized the Kidney Pathology Image -Segmentation (KPIs) Challenge, introducing a dataset that incorporates -preclinical rodent models of CKD with over 10,000 annotated glomeruli from 60+ -Periodic Acid Schiff (PAS)-stained whole slide images. The challenge includes -two tasks, patch-level segmentation and whole slide image segmentation and -detection, evaluated using the Dice Similarity Coefficient (DSC) and F1-score. -By encouraging innovative segmentation methods that adapt to diverse CKD models -and tissue conditions, the KPIs Challenge aims to advance kidney pathology -analysis, establish new benchmarks, and enable precise, large-scale -quantification for disease research and diagnosis. +Recent advancements in large language models (LLMs) have significantly +improved various natural language processing (NLP) tasks. Typically, LLMs are +trained to predict the next token, aligning well with many NLP tasks. However, +in knowledge graph (KG) scenarios, entities are the fundamental units and +identifying an entity requires at least several tokens. This leads to a +granularity mismatch between KGs and natural languages. To address this issue, +we propose K-ON, which integrates KG knowledge into the LLM by employing +multiple head layers for next k-step prediction. 
K-ON can not only generate +entity-level results in one step, but also enables contrastive loss against +entities, which is the most powerful tool in KG representation learning. +Experimental results show that K-ON outperforms state-of-the-art methods that +incorporate text and even the other modalities. -摘要:慢性腎臟病 (CKD) 是全球主要的健康問題,影響超過 -10% 的人口,並造成顯著的死亡率。雖然腎臟活檢 -仍然是 CKD 診斷和治療的黃金標準,但缺乏 -腎臟病理學分割的全面基準阻礙了該領域的進展。 -為了解決這個問題,我們組織了腎臟病理影像 -分割 (KPIs) 挑戰,引入了包含超過 10,000 個註解的 -CKD 臨床前嚙齒動物模型的資料集,這些註解來自 60 多個 -週期性酸性雪夫 (PAS) 染色的全幻燈片影像。挑戰包括 -兩個任務,修補層級分割和全幻燈片影像分割和 -偵測,使用 Dice 相似係數 (DSC) 和 F1 分數進行評估。 -通過鼓勵創新的分割方法來適應不同的 CKD 模型 -和組織條件,KPIs 挑戰旨在推進腎臟病理 -分析,建立新的基準,並實現精確、大規模的 -疾病研究和診斷量化。 +摘要:大型語言模型 (LLM) 的最新進展顯著提升了各種自然語言處理 (NLP) 任務。通常,LLM 會接受訓練以預測下一個符號,這與許多 NLP 任務非常吻合。然而,在知識圖譜 (KG) 場景中,實體是基本單位,而識別實體至少需要幾個符號。這導致 KG 和自然語言之間的粒度不匹配。為了解決這個問題,我們提出了 K-ON,它透過採用多個頭部層進行下一個 k 步預測,將 KG 知識整合到 LLM 中。K-ON 不僅可以在一個步驟中產生實體層級的結果,還能針對實體啟用對比損失,這是 KG 表示學習中最有力的工具。實驗結果顯示,K-ON 優於將文字甚至其他方式納入考量的最新方法。 -##### **Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer** -2502.07158v2 by Jiaying Lu, Stephanie R. Brown, Songyuan Liu, Shifan Zhao, Kejun Dong, Del Bold, Michael Fundora, Alaa Aljiffry, Alex Fedorov, Jocelyn Grunwell, Xiao Hu +##### **LegalViz: Legal Text Visualization by Text To Diagram Generation** +2502.06147v2 by Eri Onami, Taiki Miyanishi, Koki Maeda, Shuhei Kurita -Early prediction of pediatric cardiac arrest (CA) is critical for timely -intervention in high-risk intensive care settings. We introduce PedCA-FT, a -novel transformer-based framework that fuses tabular view of EHR with the -derived textual view of EHR to fully unleash the interactions of -high-dimensional risk factors and their dynamics. By employing dedicated -transformer modules for each modality view, PedCA-FT captures complex temporal -and contextual patterns to produce robust CA risk estimates. 
Evaluated on a -curated pediatric cohort from the CHOA-CICU database, our approach outperforms -ten other artificial intelligence models across five key performance metrics -and identifies clinically meaningful risk factors. These findings underscore -the potential of multimodal fusion techniques to enhance early CA detection and -improve patient care. +Legal documents including judgments and court orders require highly +sophisticated legal knowledge for understanding. To disclose expert knowledge +for non-experts, we explore the problem of visualizing legal texts with +easy-to-understand diagrams and propose a novel dataset of LegalViz with 23 +languages and 7,010 cases of legal document and visualization pairs, using the +DOT graph description language of Graphviz. LegalViz provides a simple diagram +from a complicated legal corpus identifying legal entities, transactions, legal +sources, and statements at a glance, that are essential in each judgment. In +addition, we provide new evaluation metrics for the legal diagram visualization +by considering graph structures, textual similarities, and legal contents. We +conducted empirical studies on few-shot and finetuning large language models +for generating legal diagrams and evaluated them with these metrics, including +legal content-based evaluation within 23 languages. Models trained with +LegalViz outperform existing models including GPTs, confirming the +effectiveness of our dataset. 
-摘要:早期預測小兒心臟驟停 (CA) 對於在高風險的重症照護環境中及時介入至關重要。我們引入了 PedCA-FT,一個新穎的基於轉換器的框架,它將 EHR 的表格視圖與 EHR 的派生文本視圖融合在一起,以充分發揮高維風險因素及其動態的交互作用。通過為每個模態視圖採用專用的轉換器模組,PedCA-FT 捕獲複雜的時間和上下文模式,以產生穩健的 CA 風險估計。在 CHOA-CICU 資料庫中策劃的小兒群體中進行評估,我們的做法在五項關鍵績效指標中優於其他十種人工智慧模型,並找出臨床上有意義的風險因素。這些發現強調了多模式融合技術在增強早期 CA 檢測和改善患者照護方面的潛力。 +摘要:法律文件,包括判決和法院命令,需要高度專業的法律知識才能理解。為了向非專家揭露專家知識,我們探討了使用易於理解的圖表將法律文本視覺化的問題,並提出了一個新的 LegalViz 數據集,其中包含 23 種語言和 7,010 個法律文件和視覺化配對,使用 Graphviz 的 DOT 圖形描述語言。LegalViz 從複雜的法律語料庫中提供了一個簡單的圖表,可以一目了然地識別法律實體、交易、法律來源和陳述,這些在每項判決中都是必不可少的。此外,我們通過考慮圖形結構、文本相似性和法律內容,為法律圖表視覺化提供了新的評估指標。我們對少次學習和微調大型語言模型進行了實證研究,以生成法律圖表,並使用這些指標對它們進行了評估,包括在 23 種語言中基於法律內容的評估。使用 LegalViz 訓練的模型優於現有的模型,包括 GPT,證實了我們數據集的有效性。 -##### **Explaining 3D Computed Tomography Classifiers with Counterfactuals** -2502.07156v1 by Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari +##### **Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs** +2502.06075v1 by Han Meng, Renwen Zhang, Ganyi Wang, Yitian Yang, Peinuan Qin, Jungup Lee, Yi-Chieh Lee -Counterfactual explanations in medical imaging are critical for understanding -the predictions made by deep learning models. We extend the Latent Shift -counterfactual generation method from 2D applications to 3D computed tomography -(CT) scans. We address the challenges associated with 3D data, such as limited -training samples and high memory demands, by implementing a slice-based -approach. This method leverages a 2D encoder trained on CT slices, which are -subsequently combined to maintain 3D context. We demonstrate this technique on -two models for clinical phenotype prediction and lung segmentation. Our -approach is both memory-efficient and effective for generating interpretable -counterfactuals in high-resolution 3D medical imaging. +Mental-illness stigma is a persistent social problem, hampering both +treatment-seeking and recovery. 
Accordingly, there is a pressing need to +understand it more clearly, but analyzing the relevant data is highly +labor-intensive. Therefore, we designed a chatbot to engage participants in +conversations; coded those conversations qualitatively with AI assistance; and, +based on those coding results, built causal knowledge graphs to decode stigma. +The results we obtained from 1,002 participants demonstrate that conversation +with our chatbot can elicit rich information about people's attitudes toward +depression, while our AI-assisted coding was strongly consistent with +human-expert coding. Our novel approach combining large language models (LLMs) +and causal knowledge graphs uncovered patterns in individual responses and +illustrated the interrelationships of psychological constructs in the dataset +as a whole. The paper also discusses these findings' implications for HCI +researchers in developing digital interventions, decomposing human +psychological constructs, and fostering inclusive attitudes. -摘要:反事實解釋在醫學影像中對於理解深度學習模型所做的預測至關重要。我們將 Latent Shift 反事實生成方法從 2D 應用程式延伸到 3D 電腦斷層掃描 (CT) 掃描。我們透過實作基於切片的做法,來解決與 3D 資料相關的挑戰,例如受限的訓練樣本和高記憶體需求。此方法利用經過 CT 切片訓練的 2D 編碼器,隨後將這些切片結合起來以維護 3D 背景。我們在兩個用於臨床表型預測和肺部分割的模型上展示此技術。我們的做法對於在高解析度 3D 醫學影像中產生可解釋的反事實,既節省記憶體又有效。 +摘要:精神疾病的污名化是一個持續存在的社會問題,阻礙了尋求治療和康復。因此,迫切需要更清楚地了解它,但分析相關數據非常費力。因此,我們設計了一個聊天機器人,讓參與者參與對話;使用 AI 協助對這些對話進行定性編碼;並根據這些編碼結果,構建因果知識圖譜來破譯污名化。我們從 1,002 名參與者那裡獲得的結果表明,與我們的聊天機器人的對話可以引出人們對憂鬱症的豐富資訊,而我們 AI 輔助的編碼與人類專家編碼非常一致。我們將大型語言模型 (LLM) 和因果知識圖譜相結合的新方法揭示了個別反應中的模式,並說明了資料集中心理建構之間的相互關係。本文還討論了這些發現對 HCI 研究人員在開發數位介入措施、分解人類心理建構和培養包容態度方面的影響。 -##### **Interactive Data Harmonization with LLM Agents** -2502.07132v1 by Aécio Santos, Eduardo H. M. 
Pena, Roque Lopez, Juliana Freire +##### **LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification** +2502.05836v1 by Shubham Kumar Nigam, Tanmay Dubey, Govind Sharma, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya -Data harmonization is an essential task that entails integrating datasets -from diverse sources. Despite years of research in this area, it remains a -time-consuming and challenging task due to schema mismatches, varying -terminologies, and differences in data collection methodologies. This paper -presents the case for agentic data harmonization as a means to both empower -experts to harmonize their data and to streamline the process. We introduce -Harmonia, a system that combines LLM-based reasoning, an interactive user -interface, and a library of data harmonization primitives to automate the -synthesis of data harmonization pipelines. We demonstrate Harmonia in a -clinical data harmonization scenario, where it helps to interactively create -reusable pipelines that map datasets to a standard format. Finally, we discuss -challenges and open problems, and suggest research directions for advancing our -vision. +In this paper, we address the task of semantic segmentation of legal +documents through rhetorical role classification, with a focus on Indian legal +judgments. We introduce LegalSeg, the largest annotated dataset for this task, +comprising over 7,000 documents and 1.4 million sentences, labeled with 7 +rhetorical roles. To benchmark performance, we evaluate multiple +state-of-the-art models, including Hierarchical BiLSTM-CRF, +TransformerOverInLegalBERT (ToInLegalBERT), Graph Neural Networks (GNNs), and +Role-Aware Transformers, alongside an exploratory RhetoricLLaMA, an +instruction-tuned large language model. Our results demonstrate that models +incorporating broader context, structural relationships, and sequential +sentence information outperform those relying solely on sentence-level +features. 
Additionally, we conducted experiments using surrounding context and +predicted or actual labels of neighboring sentences to assess their impact on +classification accuracy. Despite these advancements, challenges persist in +distinguishing between closely related roles and addressing class imbalance. +Our work underscores the potential of advanced techniques for improving legal +document understanding and sets a strong foundation for future research in +legal NLP. -摘要:資料調和是一項整合不同來源資料集的重要任務。儘管多年來針對此領域的研究不斷,但由於架構不匹配、術語不同,以及資料收集方法的差異,它仍然是一項耗時且具有挑戰性的任務。本文提出代理資料調和,作為賦能專家調和其資料並簡化流程的方法。我們介紹 Harmonia,一個結合了基於 LLM 的推理、互動式使用者介面和資料調和原語庫的系統,以自動化資料調和管線的合成。我們在臨床資料調和場景中展示了 Harmonia,它有助於互動式建立可重複使用的管線,將資料集對應至標準格式。最後,我們討論挑戰和開放性問題,並建議研究方向以推進我們的願景。 +摘要:在本文中,我們通過修辭角色分類來探討法律文件的語義分段任務,重點關注印度法律判決。我們引入了 LegalSeg,這是此任務中最大的註釋資料集,包含超過 7,000 份文件和 140 萬個句子,並標記了 7 個修辭角色。為了評量效能,我們評估了多個最先進的模型,包括分層 BiLSTM-CRF、TransformerOverInLegalBERT (ToInLegalBERT)、圖神經網路 (GNN) 和角色感知Transformer,以及探索性的 RhetoricLLaMA,一種經過指令調整的大型語言模型。我們的結果表明,結合廣泛背景、結構關係和順序句子資訊的模型,表現優於僅依賴句子層級特徵的模型。此外,我們使用周圍的背景和鄰近句子的預測或實際標籤進行實驗,以評估它們對分類精度的影響。儘管有這些進展,但在區分密切相關的角色和解決類別不平衡方面仍存在挑戰。我們的研究強調了先進技術在改善法律文件理解方面的潛力,並為法律自然語言處理的未來研究奠定了堅實的基礎。 -##### **Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML** -2502.07026v1 by Mohammad Amir Salari, Bahareh Rahmani +##### **LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning** +2502.05453v1 by Hanqing Yang, Jingdi Chen, Marie Siew, Tania Lorido-Botran, Carlee Joe-Wong -Machine learning (ML) is transforming healthcare by enabling predictive -analytics, personalized treatments, and improved patient outcomes. However, -traditional ML workflows require specialized skills, infrastructure, and -resources, limiting accessibility for many healthcare professionals. This paper -explores how Google Cloud's BigQuery ML simplifies the development and -deployment of ML models using SQL, reducing technical barriers. 
Through a case -study on diabetes prediction using the Diabetes Health Indicators Dataset, we -evaluate three predictive models: Logistic Regression, Boosted Tree, and Deep -Neural Network (DNN). Our results demonstrate that the Boosted Tree model -achieves the highest performance, making it highly effective for diabetes -prediction. This study highlights BigQuery ML's role in democratizing machine -learning by providing a scalable, efficient, and accessible solution for -healthcare analytics. +Developing intelligent agents for long-term cooperation in dynamic open-world +scenarios is a major challenge in multi-agent systems. Traditional Multi-agent +Reinforcement Learning (MARL) frameworks like centralized training +decentralized execution (CTDE) struggle with scalability and flexibility. They +require centralized long-term planning, which is difficult without custom +reward functions, and face challenges in processing multi-modal data. CTDE +approaches also assume fixed cooperation strategies, making them impractical in +dynamic environments where agents need to adapt and plan independently. To +address decentralized multi-agent cooperation, we propose Decentralized +Adaptive Knowledge Graph Memory and Structured Communication System (DAMCS) in +a novel Multi-agent Crafter environment. Our generative agents, powered by +Large Language Models (LLMs), are more scalable than traditional MARL agents by +leveraging external knowledge and language for long-term planning and +reasoning. Instead of fully sharing information from all past experiences, +DAMCS introduces a multi-modal memory system organized as a hierarchical +knowledge graph and a structured communication protocol to optimize agent +cooperation. This allows agents to reason from past interactions and share +relevant information efficiently. Experiments on novel multi-agent open-world +tasks show that DAMCS outperforms both MARL and LLM baselines in task +efficiency and collaboration. 
Compared to single-agent scenarios, the two-agent +scenario achieves the same goal with 63% fewer steps, and the six-agent +scenario with 74% fewer steps, highlighting the importance of adaptive memory +and structured communication in achieving long-term goals. We publicly release +our project at: https://happyeureka.github.io/damcs. -摘要:機器學習 (ML) 透過啟用預測分析、個人化治療和改善病患結果,正在轉型醫療保健。然而,傳統的 ML 工作流程需要專業技能、基礎設施和資源,限制了許多醫療保健專業人員的可及性。本文探討 Google Cloud 的 BigQuery ML 如何使用 SQL 簡化 ML 模型的開發和部署,降低技術障礙。透過使用糖尿病健康指標資料集對糖尿病預測進行個案研究,我們評估了三個預測模型:邏輯迴歸、提升樹和深度神經網路 (DNN)。我們的結果證明,提升樹模型達到了最高的效能,使其對於糖尿病預測非常有效。這項研究強調了 BigQuery ML 在民主化機器學習中扮演的角色,提供可擴充、有效率且可存取的醫療保健分析解決方案。 +摘要:在動態開放世界情境中開發用於長期合作的智慧代理是多重代理系統中的一項重大挑戰。傳統的多重代理強化學習 (MARL) 框架,例如集中式訓練去中心化執行 (CTDE),在可擴充性和靈活性方面面臨困難。它們需要集中式長期規劃,這在沒有自訂獎勵函數的情況下很難執行,並且在處理多模式數據時會面臨挑戰。CTDE 方法還假設固定的合作策略,這使得它們在代理需要獨立適應和規劃的動態環境中不切實際。為了解決分散式多重代理合作問題,我們在一個新穎的多重代理工匠環境中提出了分散式自適應知識圖譜記憶體和結構化通訊系統 (DAMCS)。我們的生成代理由大型語言模型 (LLM) 提供支援,透過利用外部知識和語言進行長期規劃和推理,比傳統的 MARL 代理更具可擴充性。DAMCS 沒有完全分享來自所有過去經驗的資訊,而是引入了多模式記憶體系統,該系統組織成階層式知識圖譜和結構化通訊協定,以最佳化代理合作。這允許代理根據過去的互動進行推理並有效地分享相關資訊。在新的多重代理開放世界任務上的實驗表明,DAMCS 在任務效率和協作方面優於 MARL 和 LLM 基準。與單一代理情境相比,雙重代理情境以少 63% 的步驟達成相同的目標,而六重代理情境則以少 74% 的步驟達成目標,突顯了自適應記憶體和結構化通訊在達成長期目標中的重要性。我們公開發布我們的專案於:https://happyeureka.github.io/damcs。 -##### **AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements** -2502.07022v1 by Adriana Eufrosiana Bora, Pierre-Luc St-Charles, Mirko Bronzi, Arsène Fansi Tchango, Bruno Rousseau, Kerrie Mengersen +##### **SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation** +2502.05424v1 by Xingtong Yu, Zechuan Gong, Chang Zhou, Yuan Fang, Hui Zhang -Despite over a decade of legislative efforts to address modern slavery in the -supply chains of large corporations, the effectiveness of government oversight -remains hampered by the challenge of scrutinizing thousands of statements -annually. 
While Large Language Models (LLMs) can be considered a well -established solution for the automatic analysis and summarization of documents, -recognizing concrete modern slavery countermeasures taken by companies and -differentiating those from vague claims remains a challenging task. To help -evaluate and fine-tune LLMs for the assessment of corporate statements, we -introduce a dataset composed of 5,731 modern slavery statements taken from the -Australian Modern Slavery Register and annotated at the sentence level. This -paper details the construction steps for the dataset that include the careful -design of annotation specifications, the selection and preprocessing of -statements, and the creation of high-quality annotation subsets for effective -model evaluations. To demonstrate our dataset's utility, we propose a machine -learning methodology for the detection of sentences relevant to mandatory -reporting requirements set by the Australian Modern Slavery Act. We then follow -this methodology to benchmark modern language models under zero-shot and -supervised learning settings. +Graphs are able to model interconnected entities in many online services, +supporting a wide range of applications on the Web. This raises an important +question: How can we train a graph foundational model on multiple source +domains and adapt to an unseen target domain? A major obstacle is that graphs +from different domains often exhibit divergent characteristics. Some studies +leverage large language models to align multiple domains based on textual +descriptions associated with the graphs, limiting their applicability to +text-attributed graphs. For text-free graphs, a few recent works attempt to +align different feature distributions across domains, while generally +neglecting structural differences. In this work, we propose a novel Structure +Alignment framework for text-free Multi-domain Graph Pre-Training and +cross-domain adaptation (SAMGPT). 
It is designed to learn multi-domain +knowledge from graphs originating in multiple source domains, which can then be +adapted to address applications in an unseen target domain. Specifically, we +introduce a set of structure tokens to harmonize structure-based aggregation +across source domains during the pre-training phase. Next, for cross-domain +adaptation, we design dual prompts, namely, holistic prompts and specific +prompts, which adapt unified multi-domain structural knowledge and +fine-grained, domain-specific information, respectively, to a target domain. +Finally, we conduct comprehensive experiments on seven public datasets to +evaluate and analyze the effectiveness of SAMGPT. + +摘要:圖表能夠在許多線上服務中對相互關聯的實體進行建模, +支援網路上廣泛的應用程式。這提出了重要的問題:我們如何針對多個來源網域訓練圖表基礎模型,並適應未見過的目標網域?一個主要的障礙是,來自不同網域的圖表通常表現出不同的特性。一些研究利用大型語言模型,根據與圖表相關的文字描述,對齊多個網域,限制其適用性於有文字屬性的圖表。對於沒有文字的圖表,最近的一些作品嘗試對齊跨網域的不同特徵分佈,同時通常忽略結構上的差異。在這項工作中,我們提出了一個新的結構對齊框架,用於無文字多網域圖表預訓練和跨網域適應 (SAMGPT)。它被設計為從起源於多個來源網域的圖表中學習多網域知識,然後可以適應於未見過的目標網域中的應用程式。具體來說,我們引入了一組結構化代碼,以在預訓練階段,調和跨來源網域的基於結構的聚合。接下來,對於跨網域適應,我們設計了雙重提示,即整體提示和具體提示,分別將統一的多網域結構知識和細緻的、特定於網域的資訊適應到目標網域。最後,我們在七個公共資料集上進行了全面的實驗,以評估和分析 SAMGPT 的有效性。 + +##### **Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints** +2502.05414v1 by Ali Al-Lawati, Jason Lucas, Zhiwei Zhang, Prasenjit Mitra, Suhang Wang + +In-context learning (ICL) effectively conditions large language models (LLMs) +for molecular tasks, such as property prediction and molecule captioning, by +embedding carefully selected demonstration examples into the input prompt. This +approach avoids the computational overhead of extensive pretraining and +fine-tuning. However, current prompt retrieval methods for molecular tasks have +relied on molecule feature similarity, such as Morgan fingerprints, which do +not adequately capture the global molecular and atom-binding relationships. As +a result, these methods fail to represent the full complexity of molecular +structures during inference.
Moreover, small-to-medium-sized LLMs, which offer +simpler deployment requirements in specialized systems, have remained largely +unexplored in the molecular ICL literature. To address these gaps, we propose a +self-supervised learning technique, GAMIC (Graph-Aligned Molecular In-Context +learning), which aligns global molecular structures, represented by graph neural +networks (GNNs), with textual captions (descriptions) while leveraging local +feature similarity through Morgan fingerprints. In addition, we introduce a +Maximum Marginal Relevance (MMR) based diversity heuristic during retrieval to +optimize input prompt demonstration samples. Our experimental findings using +diverse benchmark datasets show GAMIC outperforms simple Morgan-based ICL +retrieval methods across all tasks by up to 45%. -摘要:儘管立法努力超過十年,旨在解決大型企業供應鏈中的現代奴隸制,但政府監督的有效性仍然受到每年審查數千份聲明的挑戰所阻礙。雖然大型語言模型(LLM)可以被認為是文件自動分析和摘要的完善解決方案,但要辨識公司採取的具體現代奴隸制對策,並將其與含糊的聲明區分開來,仍然是一項具有挑戰性的任務。為了幫助評估和微調 LLM 以評估企業聲明,我們引入了一個由 5,731 份現代奴隸制聲明組成的資料集,這些聲明取自澳洲現代奴隸制註冊處,並在句子層級進行註解。本文詳細說明了資料集的建構步驟,其中包括註解規格的仔細設計、聲明的選擇和預處理,以及用於有效模型評估的高品質註解子集的建立。為了展示我們的資料集的效用,我們提出了一種機器學習方法,用於檢測與澳洲現代奴隸制法規定的強制性報告要求相關的句子。然後,我們遵循這種方法,在零次學習和監督學習設定下對現代語言模型進行基準測試。 +摘要:情境學習 (ICL) 有效地調整大型語言模型 (LLM),以執行分子任務,例如屬性預測和分子標題,方法是將仔細挑選的示範範例嵌入輸入提示中。這種方法避免了大量預訓練和微調的計算開銷。然而,目前針對分子任務的提示檢索方法依賴於分子特徵相似性,例如 Morgan 指紋,而無法充分捕捉全局分子和原子鍵結關係。因此,這些方法無法在推理過程中表示分子結構的完整複雜性。此外,在專業系統中提供更簡單部署需求的小到中型的 LLM,在分子 ICL 文獻中仍未得到充分探索。為了解決這些差距,我們提出了一種自我監督學習技術,GAMIC(圖形對齊分子情境學習),它將由圖形神經網路 (GNN) 表示的全局分子結構與文字標題(描述)對齊,同時透過 Morgan 指紋利用局部特徵相似性。此外,我們在檢索過程中引入了一個基於最大邊際相關性 (MMR) 的多樣性啟發法,以最佳化輸入提示示範樣本。我們使用不同的基準資料集進行的實驗結果顯示,GAMIC 在所有任務中都優於基於 Morgan 的簡單 ICL 檢索方法,最多可達 45%。 -##### **Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium** -2502.06693v1 by Amin Adibi, Xu Cao, Zongliang Ji, Jivat Neet Kaur, Winston Chen, Elizabeth Healey, Brighton Nuwagira, Wenqian Ye, Geoffrey Woollard, Maxwell A Xu, Hejie Cui, Johnny Xi, Trenton
Chang, Vasiliki Bikia, Nicole Zhang, Ayush Noori, Yuan Xia, Md. Belal Hossain, Hanna A. Frank, Alina Peluso, Yuan Pu, Shannon Zejiang Shen, John Wu, Adibvafa Fallahpour, Sazan Mahbub, Ross Duncan, Yuwei Zhang, Yurui Cao, Zuheng Xu, Michael Craig, Rahul G. Krishnan, Rahmatollah Beheshti, James M. Rehg, Mohammad Ehsanul Karim, Megan Coffee, Leo Anthony Celi, Jason Alan Fries, Mohsen Sadatsafavi, Dennis Shung, Shannon McWeeney, Jessica Dafflon, Sarah Jabbour +##### **Knowledge Graph-Guided Retrieval Augmented Generation** +2502.06864v1 by Xiangrong Zhu, Yuexiang Xie, Yi Liu, Yaliang Li, Wei Hu -The fourth Machine Learning for Health (ML4H) symposium was held in person on -December 15th and 16th, 2024, in the traditional, ancestral, and unceded -territories of the Musqueam, Squamish, and Tsleil-Waututh Nations in Vancouver, -British Columbia, Canada. The symposium included research roundtable sessions -to foster discussions between participants and senior researchers on timely and -relevant topics for the ML4H community. The organization of the research -roundtables at the conference involved 13 senior and 27 junior chairs across 13 -tables. Each roundtable session included an invited senior chair (with -substantial experience in the field), junior chairs (responsible for -facilitating the discussion), and attendees from diverse backgrounds with an -interest in the session's topic. +Retrieval-augmented generation (RAG) has emerged as a promising technology +for addressing hallucination issues in the responses generated by large +language models (LLMs). Existing studies on RAG primarily focus on applying +semantic-based approaches to retrieve isolated relevant chunks, which ignore +their intrinsic relationships. 
In this paper, we propose a novel Knowledge +Graph-Guided Retrieval Augmented Generation (KG$^2$RAG) framework that utilizes +knowledge graphs (KGs) to provide fact-level relationships between chunks, +improving the diversity and coherence of the retrieved results. Specifically, +after performing a semantic-based retrieval to provide seed chunks, KG$^2$RAG +employs a KG-guided chunk expansion process and a KG-based chunk organization +process to deliver relevant and important knowledge in well-organized +paragraphs. Extensive experiments conducted on the HotpotQA dataset and its +variants demonstrate the advantages of KG$^2$RAG compared to existing RAG-based +approaches, in terms of both response quality and retrieval quality. -摘要:第四屆醫療機器學習 (ML4H) 研討會於 2024 年 12 月 15 日和 16 日在加拿大不列顛哥倫比亞省溫哥華的 Musqueam、Squamish 和 Tsleil-Waututh 國家的傳統、祖先和未割讓領土上舉行。研討會包括研究圓桌會議,以促進參與者和高級研究人員之間關於 ML4H 社群的及時和相關主題的討論。在會議上組織研究圓桌會議涉及 13 張桌子上的 13 位高級主席和 27 位初級主席。每個圓桌會議都包括一位受邀的高級主席(在該領域擁有豐富的經驗)、初級主席(負責促進討論)以及對會議主題感興趣的來自不同背景的與會者。 +摘要:檢索增強生成 (RAG) 已成為一項有前途的技術,用於解決大型語言模型 (LLM) 所產生回應中的幻覺問題。現有關於 RAG 的研究主要專注於應用基於語義的方法來檢索孤立相關的區塊,而忽略它們的內在關係。在本文中,我們提出了一個新穎的知識圖表引導檢索增強生成 (KG$^2$RAG) 框架,它利用知識圖表 (KG) 來提供區塊之間的事實層級關係,從而提高檢索結果的多樣性和一致性。具體來說,在執行基於語義的檢索以提供種子區塊後,KG$^2$RAG 採用 KG 引導的區塊擴充程序和基於 KG 的區塊組織程序,以在組織良好的段落中傳達相關且重要的知識。在 HotpotQA 資料集及其變體上進行的大量實驗證明了 KG$^2$RAG 在回應品質和檢索品質方面優於現有的基於 RAG 的方法。 -##### **Automatic Evaluation of Healthcare LLMs Beyond Question-Answering** -2502.06666v1 by Anna Arias-Duart, Pablo Agustin Martin-Torres, Daniel Hinjos, Pablo Bernabeu-Perez, Lucia Urcelay Ganzabal, Marta Gonzalez Mallo, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Sergio Alvarez-Napagao, Dario Garcia-Gasulla +##### **Can Large Language Models Understand Intermediate Representations?** +2502.06854v1 by Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang, Hongyu Zhang, Ruoming Jin, Qiang Guan -Current Large Language Models (LLMs) benchmarks are often based on open-ended -or close-ended QA evaluations, avoiding the requirement of 
human labor. -Close-ended measurements evaluate the factuality of responses but lack -expressiveness. Open-ended capture the model's capacity to produce discourse -responses but are harder to assess for correctness. These two approaches are -commonly used, either independently or together, though their relationship -remains poorly understood. This work is focused on the healthcare domain, where -both factuality and discourse matter greatly. It introduces a comprehensive, -multi-axis suite for healthcare LLM evaluation, exploring correlations between -open and close benchmarks and metrics. Findings include blind spots and -overlaps in current methodologies. As an updated sanity check, we release a new -medical benchmark --CareQA-- with both open and closed variants. Finally, we -propose a novel metric for open-ended evaluations -- Relaxed Perplexity -- to -mitigate the identified limitations. +Intermediate Representations (IRs) are essential in compiler design and +program analysis, yet their comprehension by Large Language Models (LLMs) +remains underexplored. This paper presents a pioneering empirical study to +investigate the capabilities of LLMs, including GPT-4, GPT-3, Gemma 2, LLaMA +3.1, and Code Llama, in understanding IRs. We analyze their performance across +four tasks: Control Flow Graph (CFG) reconstruction, decompilation, code +summarization, and execution reasoning. Our results indicate that while LLMs +demonstrate competence in parsing IR syntax and recognizing high-level +structures, they struggle with control flow reasoning, execution semantics, and +loop handling. Specifically, they often misinterpret branching instructions, +omit critical IR operations, and rely on heuristic-based reasoning, leading to +errors in CFG reconstruction, IR decompilation, and execution reasoning. 
The +study underscores the necessity for IR-specific enhancements in LLMs, +recommending fine-tuning on structured IR datasets and integration of explicit +control flow models to augment their comprehension and handling of IR-related +tasks. -摘要:當前大型語言模型 (LLM) 基準通常基於開放式或封閉式問答評量,避免了人力需求。封閉式測量評估回應的事實性,但缺乏表達力。開放式測量捕捉模型產生論述回應的能力,但較難評估正確性。這兩種方法通常獨立或合併使用,儘管它們之間的關係仍然知之甚少。這項工作專注於醫療保健領域,在該領域中,事實性和論述都非常重要。它引入了一個全面的多軸套件,用於醫療保健 LLM 評量,探索開放式和封閉式基準和指標之間的關聯性。研究結果包括當前方法中的盲點和重疊。作為更新的健全性檢查,我們發布了一個新的醫療基準--CareQA--,包含開放式和封閉式變體。最後,我們提出了一個用於開放式評量的全新指標--放鬆困惑度--以減輕已識別的限制。 +摘要:中間表徵 (IR) 在編譯器設計和程式分析中至關重要,但大型語言模型 (LLM) 對其理解仍未得到充分探討。本文提出了一項開創性的實證研究,以探討 LLM(包括 GPT-4、GPT-3、Gemma 2、LLaMA 3.1 和 Code Llama)理解 IR 的能力。我們分析了它們在四項任務中的表現:控制流程圖 (CFG) 重建、反編譯、程式碼摘要和執行推理。我們的結果表明,儘管 LLM 在解析 IR 語法和識別高階結構方面表現出能力,但它們在控制流程推理、執行語義和迴圈處理方面存在困難。具體而言,它們經常誤解分支指令、省略關鍵 IR 操作,並依賴於基於啟發式的推理,導致 CFG 重建、IR 反編譯和執行推理出現錯誤。這項研究強調了 LLM 中對 IR 特定的增強的必要性,建議對結構化的 IR 資料集進行微調,並整合明確的控制流程模型,以增強其對 IR 相關任務的理解和處理。 -##### **Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging** -2502.06632v1 by Mohammed Abdul Hafeez Khan, Samuel Morries Boddepalli, Siddhartha Bhattacharyya, Debasis Mitra +##### **GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?** +2502.05252v1 by Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, Beidi Chen -Accurate classification and anatomical localization are essential for -effective medical diagnostics and research, which may be efficiently performed -using deep learning techniques. However, availability of limited labeled data -poses a significant challenge. To address this, we adapted Prototypical -Networks and the Propagation-Reconstruction Network (PRNet) for few-shot -classification and localization, respectively, in Single Photon Emission -Computed Tomography (SPECT) images. For the proof of concept we used a -2D-sliced image cropped around heart. 
The Prototypical Network, with a -pre-trained ResNet-18 backbone, classified ventricles, myocardium, and liver -tissues with 96.67% training and 93.33% validation accuracy. PRNet, adapted for -2D imaging with an encoder-decoder architecture and skip connections, achieved -a training loss of 1.395, accurately reconstructing patches and capturing -spatial relationships. These results highlight the potential of Prototypical -Networks for tissue classification with limited labeled data and PRNet for -anatomical landmark localization, paving the way for improved performance in -deep learning frameworks. +Long-context large language models (LLMs) have recently shown strong +performance in information retrieval and long-document QA. However, to tackle +the most challenging intellectual problems, LLMs must reason effectively in +long and complex contexts (e.g., frontier mathematical research). Studying how +LLMs handle increasing reasoning complexity and context length is essential, +yet existing benchmarks lack a solid basis for quantitative evaluation. +Inspired by the abstraction of GSM-8K problems as computational graphs, and the +ability to introduce noise by adding unnecessary nodes and edges, we develop a +grade school math problem generator capable of producing arithmetic problems +with infinite difficulty and context length under fine-grained control. Using +our newly synthesized GSM-Infinite benchmark, we comprehensively evaluate +existing LLMs. We find a consistent sigmoid decline in reasoning performance as +complexity increases, along with a systematic inference scaling trend: +exponentially increasing inference computation yields only linear performance +gains. These findings underscore the fundamental limitations of current +long-context LLMs and the key challenges in scaling reasoning capabilities. Our +GSM-Infinite benchmark provides a scalable and controllable testbed for +systematically studying and advancing LLM reasoning in long and complex +contexts. 
-摘要:精確的分類和解剖定位對於有效的醫療診斷和研究至關重要,而這可以使用深度學習技術有效執行。然而,標記資料有限的取得會造成重大的挑戰。為了解決這個問題,我們分別調整了原型網路和傳播重建網路 (PRNet),用於單光子發射電腦斷層掃描 (SPECT) 影像中的少量分類和定位。為了證明這個概念,我們使用圍繞心臟裁切的 2D 切片影像。原型網路,使用預先訓練的 ResNet-18 主幹,對心室、心肌和肝臟組織進行分類,訓練準確度為 96.67%,驗證準確度為 93.33%。PRNet,調整為使用編碼器解碼器架構和跳躍連接的 2D 影像,達到了 1.395 的訓練損失,精確地重建了區塊並擷取了空間關係。這些結果突出了原型網路在標記資料有限的情況下進行組織分類的潛力,以及 PRNet 在解剖標誌定位方面的潛力,為深度學習架構中效能的提升鋪平了道路。 +摘要:長文本大型語言模型 (LLM) 最近在資訊檢索和長文件問答中展示了強大的效能。然而,若要解決最具挑戰性的智力問題,LLM 必須在長且複雜的脈絡中有效推理(例如,前沿數學研究)。研究 LLM 如何處理增加的推理複雜性和脈絡長度至關重要,但現有的基準缺乏定量評估的穩固基礎。受到 GSM-8K 問題抽象化為計算圖形的啟發,以及透過加入不必要的節點和邊緣來引入雜訊的能力,我們開發了一個小學數學問題產生器,能夠在細緻的控制下產生具有無限難度和脈絡長度的算術問題。使用我們新合成的 GSM-Infinite 基準,我們全面評估現有的 LLM。我們發現推理效能會隨著複雜性的增加而持續呈 S 形下降,並伴隨著系統性的推論縮放趨勢:指數增加的推論計算僅產生線性的效能增益。這些發現強調了當前長脈絡 LLM 的基本限制,以及擴展推理能力的主要挑戰。我們的 GSM-Infinite 基準提供了一個可擴充且可控的測試平台,用於系統性地研究和提升 LLM 在長且複雜脈絡中的推理能力。 -##### **Illegal Waste Detection in Remote Sensing Images: A Case Study** -2502.06607v2 by Federico Gibellini, Piero Fraternali, Giacomo Boracchi, Luca Morandini, Andrea Diecidue, Simona Malegori +##### **Causality can systematically address the monsters under the bench(marks)** +2502.05085v1 by Felix Leeb, Zhijing Jin, Bernhard Schölkopf -Environmental crime currently represents the third largest criminal activity -worldwide while threatening ecosystems as well as human health. Among the -crimes related to this activity, improper waste management can nowadays be -countered more easily thanks to the increasing availability and decreasing cost -of Very-High-Resolution Remote Sensing images, which enable semi-automatic -territory scanning in search of illegal landfills. This paper proposes a -pipeline, developed in collaboration with professionals from a local -environmental agency, for detecting candidate illegal dumping sites leveraging -a classifier of Remote Sensing images. 
To identify the best configuration for -such classifier, an extensive set of experiments was conducted and the impact -of diverse image characteristics and training settings was thoroughly analyzed. -The local environmental agency was then involved in an experimental exercise -where outputs from the developed classifier were integrated in the experts' -everyday work, resulting in time savings with respect to manual -photo-interpretation. The classifier was eventually run with valuable results -on a location outside of the training area, highlighting potential for -cross-border applicability of the proposed pipeline. +Effective and reliable evaluation is essential for advancing empirical +machine learning. However, the increasing accessibility of generalist models +and the progress towards ever more complex, high-level tasks make systematic +evaluation more challenging. Benchmarks are plagued by various biases, +artifacts, or leakage, while models may behave unreliably due to poorly +explored failure modes. Haphazard treatments and inconsistent formulations of +such "monsters" can contribute to a duplication of efforts, a lack of trust in +results, and unsupported inferences. In this position paper, we argue causality +offers an ideal framework to systematically address these challenges. By making +causal assumptions in an approach explicit, we can faithfully model phenomena, +formulate testable hypotheses with explanatory power, and leverage principled +tools for analysis. To make causal model design more accessible, we identify +several useful Common Abstract Topologies (CATs) in causal graphs which help +gain insight into the reasoning abilities in large language models. Through a +series of case studies, we demonstrate how the precise yet pragmatic language +of causality clarifies the strengths and limitations of a method and inspires +new approaches for systematic progress. 
-摘要:環境犯罪目前是全球第三大犯罪活動,威脅生態系統和人類健康。在與此活動相關的犯罪中,不當廢物管理現在可以更容易地得到解決,這要歸功於超高解析度遙測影像越來越普及且成本下降,這使得半自動領土掃描能夠搜尋非法垃圾掩埋場。本文提出了一條管道,與當地環境機構的專業人士合作開發,用於檢測候選非法傾倒地點,利用遙測影像分類器。為了找出這種分類器的最佳配置,進行了一系列廣泛的實驗,並徹底分析了不同影像特徵和訓練設定的影響。然後,當地環境機構參與了一項實驗練習,其中將已開發分類器的輸出整合到專家的日常工作中,從而節省了人工照片解譯的時間。最後在訓練區域外的某個位置執行分類器,獲得了有價值的結果,突出了所提出管道的跨境適用性潛力。 +摘要:有效的、可靠的評估對於推進經驗機器學習至關重要。然而,一般化模型的可及性日益提高,以及朝著更複雜、更高級別任務的進展,使得系統評估更具挑戰性。基準測試受到各種偏差、人工製品或洩漏的困擾,而模型由於探索不充分的故障模式而可能表現得不可靠。隨意處理和不一致的表述等「怪物」可能會導致重複工作、對結果缺乏信任以及不支援的推論。在本文中,我們論證因果關係提供了一個系統性解決這些挑戰的理想框架。通過在方法中明確因果假設,我們可以忠實地模擬現象,制定具有解釋力的可測試假設,並利用原則性的分析工具。為了使因果模型設計更易於使用,我們在因果圖中識別出幾個有用的通用抽象拓撲 (CAT),有助於深入了解大型語言模型中的推理能力。通過一系列案例研究,我們展示了因果關係的精確但務實的語言如何釐清方法的優缺點,並激發系統進展的新方法。 -##### **FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model** -2502.06438v1 by Anna Tegon, Thorir Mar Ingolfsson, Xiaying Wang, Luca Benini, Yawei Li +##### **Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures** +2502.05078v1 by Tushar Pandey, Ara Ghukasyan, Oktay Goktas, Santosh Kumar Radha -Accurate and efficient electroencephalography (EEG) analysis is essential for -detecting seizures and artifacts in long-term monitoring, with applications -spanning hospital diagnostics to wearable health devices. Robust EEG analytics -have the potential to greatly improve patient care. However, traditional deep -learning models, especially Transformer-based architectures, are hindered by -their quadratic time and memory complexity, making them less suitable for -resource-constrained environments. To address these challenges, we present -FEMBA (Foundational EEG Mamba + Bidirectional Architecture), a novel -self-supervised framework that establishes new efficiency benchmarks for EEG -analysis through bidirectional state-space modeling. 
Unlike Transformer-based -models, which incur quadratic time and memory complexity, FEMBA scales linearly -with sequence length, enabling more scalable and efficient processing of -extended EEG recordings. Trained on over 21,000 hours of unlabeled EEG and -fine-tuned on three downstream tasks, FEMBA achieves competitive performance in -comparison with transformer models, with significantly lower computational -cost. Specifically, it reaches 81.82% balanced accuracy (0.8921 AUROC) on TUAB -and 0.949 AUROC on TUAR, while a tiny 7.8M-parameter variant demonstrates -viability for resource-constrained devices. These results pave the way for -scalable, general-purpose EEG analytics in both clinical and highlight FEMBA as -a promising candidate for wearable applications. +Large Language Models (LLMs) have demonstrated impressive reasoning +capabilities, yet their performance is highly dependent on the prompting +strategy and model scale. While reinforcement learning and fine-tuning have +been deployed to boost reasoning, these approaches incur substantial +computational and data overhead. In this work, we introduce Adaptive Graph of +Thoughts (AGoT), a dynamic, graph-based inference framework that enhances LLM +reasoning solely at test time. Rather than relying on fixed-step methods like +Chain of Thought (CoT) or Tree of Thoughts (ToT), AGoT recursively decomposes +complex queries into structured subproblems, forming a dynamic directed +acyclic graph (DAG) of interdependent reasoning steps. By selectively expanding +only those subproblems that require further analysis, AGoT unifies the +strengths of chain, tree, and graph paradigms into a cohesive framework that +allocates computation where it is most needed. 
We validate our approach on +diverse benchmarks spanning multi-hop retrieval, scientific reasoning, and +mathematical problem-solving, achieving up to 46.2% improvement on scientific +reasoning tasks (GPQA) - comparable to gains achieved through computationally +intensive reinforcement learning approaches and outperforming state-of-the-art +iterative approaches. These results suggest that dynamic decomposition and +structured recursion offer a scalable, cost-effective alternative to +post-training modifications, paving the way for more robust, general-purpose +reasoning in LLMs. -摘要:準確且有效的腦電圖 (EEG) 分析對於偵測長時間監控中的癲癇發作和偽像至關重要,其應用範圍涵蓋醫院診斷到可穿戴式健康裝置。穩健的 EEG 分析具有大幅改善病患照護的潛力。然而,傳統深度學習模型,特別是基於 Transformer 的架構,受到其二次時間和記憶體複雜度的阻礙,使其不太適合資源受限的環境。為了應對這些挑戰,我們提出 FEMBA (基礎 EEG Mamba + 雙向架構),一種創新的自我監督架構,透過雙向狀態空間建模為 EEG 分析建立新的效率基準。與會產生二次時間和記憶體複雜度的基於 Transformer 的模型不同,FEMBA 隨著序列長度線性縮放,支援更具可擴充性和效率的延伸 EEG 記錄處理。FEMBA 在超過 21,000 小時的未標記 EEG 上訓練並在三個下游任務上進行微調,與Transformer模型相比,在計算成本顯著降低的情況下,實現了具有競爭力的效能。具體來說,它在 TUAB 上達到 81.82% 的平衡準確度 (0.8921 AUROC) 和在 TUAR 上達到 0.949 AUROC,而一個微小的 7.8M 參數變體證明了其在資源受限裝置上的可行性。這些結果為臨床和可穿戴應用中可擴充的通用 EEG 分析鋪平了道路,並突顯 FEMBA 是可穿戴應用中一個有前景的候選者。 +摘要:大型語言模型 (LLM) 已展現令人印象深刻的推理能力,但其效能高度依賴於提示策略和模型規模。雖然強化學習和微調已被用於提升推理,但這些方法會造成大量的運算和資料開銷。在這項工作中,我們引入了「適應性思考圖」(AGoT),一個動態的、基於圖形的推論架構,它僅在測試時就能增強 LLM 推理。AGoT 並非依賴於鏈式思考 (CoT) 或樹狀思考 (ToT) 等固定步驟方法,而是遞迴地將複雜的查詢分解成結構化的子問題,形成一個由相互依賴的推理步驟所組成的動態有向無環圖 (DAG)。透過選擇性地僅擴充那些需要進一步分析的子問題,AGoT 將鏈式、樹狀和圖形範例的優勢統一到一個緊密的架構中,將運算分配到最需要的地方。我們在跨越多重跳躍檢索、科學推理和數學問題解決等多樣基準上驗證了我們的做法,在科學推理任務 (GPQA) 上達到了高達 46.2% 的改進,這與透過運算密集的強化學習方法所獲得的增益相當,並且優於最先進的迭代方法。這些結果表明,動態分解和結構化遞迴提供了一個可擴充、具成本效益的替代方案,用於訓練後修改,為 LLM 中更強健、更通用的推理鋪平了道路。 -##### **Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?** -2502.06289v1 by Qingshan Hou, Yukun Zhou, Jocelyn Hui Lin Goh, Ke Zou, Samantha Min Er Yew, Sahana Srinivasan, Meng Wang, Thaddaeus Lo, Xiaofeng Lei, Siegfried K. Wagner, Mark A. 
Chia, Dawei Yang, Hongyang Jiang, AnRan Ran, Rui Santos, Gabor Mark Somfai, Juan Helen Zhou, Haoyu Chen, Qingyu Chen, Carol Yim-Lui Cheung, Pearse A. Keane, Yih Chung Tham +##### **Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics** +2502.05239v1 by Hussam Ghanem, Christophe Cruz -The advent of foundation models (FMs) is transforming medical domain. In -ophthalmology, RETFound, a retina-specific FM pre-trained sequentially on 1.4 -million natural images and 1.6 million retinal images, has demonstrated high -adaptability across clinical applications. Conversely, DINOv2, a -general-purpose vision FM pre-trained on 142 million natural images, has shown -promise in non-medical domains. However, its applicability to clinical tasks -remains underexplored. To address this, we conducted head-to-head evaluations -by fine-tuning RETFound and three DINOv2 models (large, base, small) for ocular -disease detection and systemic disease prediction tasks, across eight -standardized open-source ocular datasets, as well as the Moorfields AlzEye and -the UK Biobank datasets. DINOv2-large model outperformed RETFound in detecting -diabetic retinopathy (AUROC=0.850-0.952 vs 0.823-0.944, across three datasets, -all P<=0.007) and multi-class eye diseases (AUROC=0.892 vs. 0.846, P<0.001). In -glaucoma, DINOv2-base model outperformed RETFound (AUROC=0.958 vs 0.940, -P<0.001). Conversely, RETFound achieved superior performance over all DINOv2 -models in predicting heart failure, myocardial infarction, and ischaemic stroke -(AUROC=0.732-0.796 vs 0.663-0.771, all P<0.001). These trends persisted even -with 10% of the fine-tuning data. These findings showcase the distinct -scenarios where general-purpose and domain-specific FMs excel, highlighting the -importance of aligning FM selection with task-specific requirements to optimise -clinical performance. 
+Recent advancements in large language models have demonstrated significant +potential in the automated construction of knowledge graphs from unstructured +text. This paper builds upon our previous work [16], which evaluated various +models using metrics like precision, recall, F1 score, triple matching, and +graph matching, and introduces a refined approach to address the critical +issues of hallucination and omission. We propose an enhanced evaluation +framework incorporating BERTScore for graph similarity, setting a practical +threshold of 95% for graph matching. Our experiments focus on the Mistral +model, comparing its original and fine-tuned versions in zero-shot and few-shot +settings. We further extend our experiments using examples from the KELM-sub +training dataset, illustrating that the fine-tuned model significantly improves +knowledge graph construction accuracy while reducing the exact hallucination +and omission. However, our findings also reveal that the fine-tuned models +perform worse in generalization tasks on the KELM-sub dataset. This study +underscores the importance of comprehensive evaluation metrics in advancing the +state-of-the-art in knowledge graph construction from textual data. 
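The hallucination and omission issues named in the abstract above can be made concrete with set operations over triples. The sketch below is one simple reading, assumed here for illustration: a generated triple missing from the reference counts as hallucinated, and a reference triple missing from the output counts as omitted, both under exact matching. The function name `triple_metrics` is hypothetical, and the paper's BERTScore-based graph similarity is not reproduced.

```python
def triple_metrics(generated, reference):
    """Compare generated (subject, relation, object) triples to a reference set.

    Assumption: exact string match decides whether two triples are equal.
    """
    gen, ref = set(generated), set(reference)
    matched = gen & ref

    precision = len(matched) / len(gen) if gen else 0.0
    recall = len(matched) / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "hallucinated": gen - ref,  # produced, but not supported by the reference
        "omitted": ref - gen,       # expected, but missing from the output
    }
```

Exact matching is strict: a paraphrased relation would be counted as both a hallucination and an omission, which is presumably one motivation for the similarity-based matching the abstract mentions.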
-摘要:基礎模型 (FM) 的出現正在轉變醫療領域。在眼科,RETFound 是一個視網膜專用 FM,依序使用 140 萬張自然影像和 160 萬張視網膜影像進行預訓練,已展現出高度適應性,可應用於各種臨床應用。相反地,DINOv2 是一個通用視覺 FM,使用 1.42 億張自然影像進行預訓練,已展現出在非醫療領域的潛力。然而,其在臨床任務中的適用性仍未被充分探索。為了解決這個問題,我們針對眼部疾病偵測和全身性疾病預測任務,對 RETFound 和三個 DINOv2 模型(大型、基礎、小型)進行微調,並進行一對一的評估,使用八個標準化的開源眼科資料集,以及 Moorfields AlzEye 和 UK Biobank 資料集。DINOv2 大型模型在糖尿病視網膜病變偵測方面優於 RETFound(三個資料集的 AUROC=0.850-0.952,相較於 0.823-0.944,所有 P<=0.007)和多類眼部疾病(AUROC=0.892,相較於 0.846,P<0.001)。在青光眼方面,DINOv2 基礎模型優於 RETFound(AUROC=0.958,相較於 0.940,P<0.001)。相反地,RETFound 在預測心臟衰竭、心肌梗塞和缺血性中風方面優於所有 DINOv2 模型(AUROC=0.732-0.796,相較於 0.663-0.771,所有 P<0.001)。即使使用 10% 的微調資料,這些趨勢仍然持續。這些發現展示了通用和領域專用 FM 各自擅長的場景,突顯了根據任務特定需求調整 FM 選擇,以最佳化臨床表現的重要性。 +摘要:大型語言模型的最新進展已證明在從非結構化文字自動建構知識圖譜方面具有顯著的潛力。本文建立在我們先前的研究 [16] 之上,該研究使用準確度、召回率、F1 分數、三元組匹配和圖形匹配等指標評估各種模型,並引入了一種改進的方法來解決幻覺和遺漏的關鍵問題。我們提出一個增強的評估框架,結合 BERTScore 來進行圖形相似性,並將圖形匹配的實際閾值設定為 95%。我們的實驗重點在 Mistral 模型上,比較其原始版本和微調版本在零次學習和少量學習的設定中。我們進一步使用 KELM-sub 訓練資料集中的範例來擴展我們的實驗,說明微調後的模型顯著提高了知識圖譜建構的準確度,同時減少了精確的幻覺和遺漏。然而,我們的研究結果也顯示,微調後的模型在 KELM-sub 資料集上的泛化任務表現較差。這項研究強調了全面評估指標在推進從文字資料建構知識圖譜的最新技術方面的重要性。 -##### **Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning** -2502.06134v1 by Liuqing Chen, Shuhong Xiao, Shixian Ding, Shanhai Hu, Lingyun Sun +##### **Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research** +2502.04644v1 by Junde Wu, Jiayuan Zhu, Yuyuan Liu -Medical time series are often irregular and face significant missingness, -posing challenges for data analysis and clinical decision-making. Existing -methods typically adopt a single modeling perspective, either treating series -data as sequences or transforming them into image representations for further -classification. In this paper, we propose a joint learning framework that -incorporates both sequence and image representations. 
We also design three -self-supervised learning strategies to facilitate the fusion of sequence and -image representations, capturing a more generalizable joint representation. The -results indicate that our approach outperforms seven other state-of-the-art -models in three representative real-world clinical datasets. We further -validate our approach by simulating two major types of real-world missingness -through leave-sensors-out and leave-samples-out techniques. The results -demonstrate that our approach is more robust and significantly surpasses other -baselines in terms of classification performance. +We introduce Agentic Reasoning, a framework that enhances large language +model (LLM) reasoning by integrating external tool-using agents. Unlike +conventional LLM-based reasoning approaches, which rely solely on internal +inference, Agentic Reasoning dynamically engages web search, code execution, +and structured reasoning-context memory to solve complex problems requiring +deep research and multi-step logical deduction. Our framework introduces the +Mind Map agent, which constructs a structured knowledge graph to track logical +relationships, improving deductive reasoning. Additionally, the integration of +web-search and coding agents enables real-time retrieval and computational +analysis, enhancing reasoning accuracy and decision-making. Evaluations on +PhD-level scientific reasoning (GPQA) and domain-specific deep research tasks +demonstrate that our approach significantly outperforms existing models, +including leading retrieval-augmented generation (RAG) systems and +closed-source LLMs. Moreover, our results indicate that agentic reasoning +improves expert-level knowledge synthesis, test-time scalability, and +structured problem-solving. The code is at: +https://github.com/theworldofagents/Agentic-Reasoning. 
-摘要:醫療時間序列通常不規則且會面臨顯著的缺失,對資料分析和臨床決策制定構成挑戰。現有方法通常採用單一建模觀點,將序列資料視為序列或將其轉換為影像表示以進行進一步分類。在本文中,我們提出了一個聯合學習架構,結合序列和影像表示。我們還設計了三種自我監督學習策略,以促進序列和影像表示的融合,捕捉更具概括性的聯合表示。結果表明,我們的做法在三個具有代表性的真實世界臨床資料集中優於其他七個最先進的模型。我們進一步通過留出感測器和留出樣本的技術模擬兩種主要的真實世界缺失類型來驗證我們的做法。結果表明,我們的做法更強大,並且在分類效能方面顯著優於其他基準。 +摘要:我們引入了代理推理,一個透過整合外部工具使用代理來增強大型語言模型 (LLM) 推理的框架。與僅依賴於內部推論的傳統基於 LLM 的推理方法不同,代理推理動態地運用網路搜尋、程式碼執行和結構化推理情境記憶來解決需要深入研究和多步驟邏輯推論的複雜問題。我們的框架引入了心智圖代理,它建立一個結構化的知識圖譜來追蹤邏輯關係,改善演繹推理。此外,整合網路搜尋和編碼代理能進行即時擷取和運算分析,增強推理準確度和決策制定。在博士等級科學推理 (GPQA) 和特定領域的深入研究任務上的評估顯示,我們的做法明顯優於現有模型,包括領先的檢索增強生成 (RAG) 系統和封閉原始碼 LLM。此外,我們的結果顯示,代理推理改進了專家級知識綜合、測試時間可擴充性和結構化問題解決。程式碼在:https://github.com/theworldofagents/Agentic-Reasoning。 -##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation** -2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek +##### **Position-aware Automatic Circuit Discovery** +2502.04577v1 by Tal Haklay, Hadas Orgad, David Bau, Aaron Mueller, Yonatan Belinkov -We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS), -an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS -predicts future PHTs using transformer-based architectures. The Adaptive Risk -Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk -probabilities for clinician-defined critical events. ARES incorporates a -personalized explainability module that identifies key clinical factors -influencing risk estimates for individual patients. ARES was evaluated on the -MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its -performance against traditional early warning systems and machine learning -models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs, -with 60% including hospital admissions. The dataset contained over 357 million -tokens. 
ETHOS outperformed benchmark models in predicting hospital admissions, -ICU admissions, and prolonged hospital stays, achieving superior AUC scores. -ETHOS-based risk estimates demonstrated robustness across demographic subgroups -with strong model reliability, confirmed via calibration curves. The -personalized explainability module provides insights into patient-specific -factors contributing to risk. ARES, powered by ETHOS, advances predictive -healthcare AI by providing dynamic, real-time, and personalized risk estimation -with patient-specific explainability to enhance clinician trust. Its -adaptability and superior accuracy position it as a transformative tool for -clinical decision-making, potentially improving patient outcomes and resource -allocation in emergency and inpatient settings. We release the full code at -github.com/ipolharvard/ethos-ares to facilitate future research. +A widely used strategy to discover and understand language model mechanisms +is circuit analysis. A circuit is a minimal subgraph of a model's computation +graph that executes a specific task. We identify a gap in existing circuit +discovery methods: they assume circuits are position-invariant, treating model +components as equally relevant across input positions. This limits their +ability to capture cross-positional interactions or mechanisms that vary across +positions. To address this gap, we propose two improvements to incorporate +positionality into circuits, even on tasks containing variable-length examples. +First, we extend edge attribution patching, a gradient-based method for circuit +discovery, to differentiate between token positions. Second, we introduce the +concept of a dataset schema, which defines token spans with similar semantics +across examples, enabling position-aware circuit discovery in datasets with +variable length examples. We additionally develop an automated pipeline for +schema generation and application using large language models. 
Our approach +enables fully automated discovery of position-sensitive circuits, yielding +better trade-offs between circuit size and faithfulness compared to prior work. -摘要:我們開發了增強型健康結果模擬轉換器 (ETHOS), -一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS -使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組,可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估,並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT,其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型,並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性,並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估,以及患者特定的可解釋性來增強臨床醫生的信任,從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具,有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼,以利未來的研究。 +摘要:廣泛用於發現和了解語言模型機制的策略是電路分析。電路是模型計算圖的最小子圖,可執行特定任務。我們找出電路發現方法中的一個缺口:它們假設電路與位置無關,將模型組件視為在輸入位置中同樣相關。這限制了它們捕捉跨位置互動或在不同位置中變化的機制的能力。為了解決這個缺口,我們提出兩項改進,將位置性納入電路中,即使在包含變長範例的任務中也是如此。首先,我們擴充邊緣屬性修補,一種基於梯度的電路發現方法,以區分符號位置。其次,我們引入了資料集架構的概念,它定義了在範例中具有類似語義的符號跨距,使我們可以在具有變長範例的資料集中進行與位置相關的電路發現。此外,我們開發了一個自動化管線,用於使用大型語言模型進行架構生成和應用。我們的做法能讓位置敏感電路的發現完全自動化,與先前的研究相比,在電路大小和忠實度之間產生了更好的權衡。 -##### **Can ChatGPT Diagnose Alzheimer's Disease?** -2502.06907v1 by Quoc-Toan Nguyen, Linh Le, Xuan-The Tran, Thomas Do, Chin-Teng Lin +##### **Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems** +2502.04510v1 by Shangbin Feng, Zifeng Wang, Palash Goyal, Yike Wang, Weijia Shi, Huang Xia, Hamid Palangi, Luke Zettlemoyer, Yulia Tsvetkov, Chen-Yu Lee, Tomas Pfister -Can ChatGPT diagnose Alzheimer's Disease (AD)? AD is a devastating -neurodegenerative condition that affects approximately 1 in 9 individuals aged -65 and older, profoundly impairing memory and cognitive function. 
This paper -utilises 9300 electronic health records (EHRs) with data from Magnetic -Resonance Imaging (MRI) and cognitive tests to address an intriguing question: -As a general-purpose task solver, can ChatGPT accurately detect AD using EHRs? -We present an in-depth evaluation of ChatGPT using a black-box approach with -zero-shot and multi-shot methods. This study unlocks ChatGPT's capability to -analyse MRI and cognitive test results, as well as its potential as a -diagnostic tool for AD. By automating aspects of the diagnostic process, this -research opens a transformative approach for the healthcare system, -particularly in addressing disparities in resource-limited regions where AD -specialists are scarce. Hence, it offers a foundation for a promising method -for early detection, supporting individuals with timely interventions, which is -paramount for Quality of Life (QoL). +We propose Heterogeneous Swarms, an algorithm to design multi-LLM systems by +jointly optimizing model roles and weights. We represent multi-LLM systems as +directed acyclic graphs (DAGs) of LLMs with topological message passing for +collaborative generation. Given a pool of LLM experts and a utility function, +Heterogeneous Swarms employs two iterative steps: role-step and weight-step. +For role-step, we interpret model roles as learning a DAG that specifies the +flow of inputs and outputs between LLMs. Starting from a swarm of random +continuous adjacency matrices, we decode them into discrete DAGs, call the LLMs +in topological order, evaluate on the utility function (e.g. accuracy on a +task), and optimize the adjacency matrices with particle swarm optimization +based on the utility score. For weight-step, we assess the contribution of +individual LLMs in the multi-LLM systems and optimize model weights with swarm +intelligence. 
We propose JFK-score to quantify the individual contribution of +each LLM in the best-found DAG of the role-step, then optimize model weights +with particle swarm optimization based on the JFK-score. Experiments +demonstrate that Heterogeneous Swarms outperforms 15 role- and/or weight-based +baselines by 18.5% on average across 12 tasks. Further analysis reveals that +Heterogeneous Swarms discovers multi-LLM systems with heterogeneous model roles +and substantial collaborative gains, and benefits from the diversity of +language models. -摘要:ChatGPT 能否診斷出阿茲海默症 (AD)?AD 是一種毀滅性的神經退化性疾病,影響約 1/9 的 65 歲及以上人士,嚴重損害記憶力和認知功能。這篇論文利用了 9300 份電子健康紀錄 (EHR),其中包含磁共振成像 (MRI) 和認知測試的數據,來解決一個有趣的問題:作為一個通用任務解決器,ChatGPT 能否使用 EHR 準確地檢測出 AD?我們使用黑盒方法對 ChatGPT 進行了深入評估,採用零次嘗試和多次嘗試的方法。這項研究揭示了 ChatGPT 分析 MRI 和認知測試結果的能力,以及其作為 AD 診斷工具的潛力。通過自動化診斷過程的各個方面,這項研究為醫療保健系統開啟了一種變革性的方法,特別是在解決資源有限的地區中 AD 專家稀缺的不平等問題方面。因此,它為一種有希望的早期檢測方法奠定了基礎,通過及時干預來支持個人,這對於生活品質 (QoL) 至關重要。 +摘要:我們提出異質群體,一種演算法,透過共同最佳化模型角色和權重來設計多 LLM 系統。我們將多 LLM 系統表示為 LLM 的有向非循環圖 (DAG),並透過拓撲訊息傳遞進行協作產生。給定一組 LLM 專家和一個效用函數,異質群體使用兩個反覆步驟:角色步驟和權重步驟。對於角色步驟,我們將模型角色解釋為學習一個 DAG,它指定 LLM 之間輸入和輸出的流動。從一組隨機連續鄰接矩陣開始,我們將它們解碼為離散 DAG,以拓撲順序呼叫 LLM,根據效用函數(例如任務的準確度)進行評估,並根據效用分數使用粒子群最佳化最佳化鄰接矩陣。對於權重步驟,我們評估個別 LLM 在多 LLM 系統中的貢獻,並使用群體智慧最佳化模型權重。我們提出 JFK 分數來量化每個 LLM 在角色步驟中找到的最佳 DAG 中的個別貢獻,然後根據 JFK 分數使用粒子群最佳化最佳化模型權重。實驗表明,異質群體在 12 項任務中平均比 15 個基於角色和/或權重的基線高出 18.5%。進一步的分析表明,異質群體發現具有異質模型角色和大量協作收益的多 LLM 系統,並受益於語言模型的多樣性。 -##### **Protecting Intellectual Property of EEG-based Neural Networks with Watermarking** -2502.05931v1 by Ahmed Abdelaziz, Ahmed Fathi, Ahmed Fares +##### **MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot** +2502.04413v1 by Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao -EEG-based neural networks, pivotal in medical diagnosis and brain-computer -interfaces, face significant intellectual property (IP) risks due to their -reliance on sensitive neurophysiological data and resource-intensive 
-development. Current watermarking methods, particularly those using abstract -trigger sets, lack robust authentication and fail to address the unique -challenges of EEG models. This paper introduces a cryptographic wonder -filter-based watermarking framework tailored for EEG-based neural networks. -Leveraging collision-resistant hashing and public-key encryption, the wonder -filter embeds the watermark during training, ensuring minimal distortion ($\leq -5\%$ drop in EEG task accuracy) and high reliability (100\% watermark -detection). The framework is rigorously evaluated against adversarial attacks, -including fine-tuning, transfer learning, and neuron pruning. Results -demonstrate persistent watermark retention, with classification accuracy for -watermarked states remaining above 90\% even after aggressive pruning, while -primary task performance degrades faster, deterring removal attempts. Piracy -resistance is validated by the inability to embed secondary watermarks without -severe accuracy loss ( $>10\%$ in EEGNet and CCNN models). Cryptographic -hashing ensures authentication, reducing brute-force attack success -probabilities. Evaluated on the DEAP dataset across models (CCNN, EEGNet, -TSception), the method achieves $>99.4\%$ null-embedding accuracy, effectively -eliminating false positives. By integrating wonder filters with EEG-specific -adaptations, this work bridges a critical gap in IP protection for -neurophysiological models, offering a secure, tamper-proof solution for -healthcare and biometric applications. The framework's robustness against -adversarial modifications underscores its potential to safeguard sensitive EEG -models while maintaining diagnostic utility. +Retrieval-augmented generation (RAG) is a well-suited technique for +retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a +key module of the healthcare copilot, helping reduce misdiagnosis for +healthcare practitioners and patients. 
However, the diagnostic accuracy and +specificity of existing heuristic-based RAG models used in the medical domain +are inadequate, particularly for diseases with similar manifestations. This +paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited +reasoning for the medical domain that retrieves diagnosis and treatment +recommendations based on manifestations. MedRAG systematically constructs a +comprehensive four-tier hierarchical diagnostic KG encompassing critical +diagnostic differences of various diseases. These differences are dynamically +integrated with similar EHRs retrieved from an EHR database, and reasoned +within a large language model. This process enables more accurate and specific +decision support, while also proactively providing follow-up questions to +enhance personalized medical decision-making. MedRAG is evaluated on both a +public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD) +collected from Tan Tock Seng Hospital, and its performance is compared against +various existing RAG methods. Experimental results show that, leveraging the +information integration and relational abilities of the KG, our MedRAG provides +more specific diagnostic insights and outperforms state-of-the-art models in +reducing misdiagnosis rates. 
Our code will be available at +https://github.com/SNOWTEAM2023/MedRAG -摘要:基於 EEG 的神經網路在醫學診斷和腦電腦介面中至關重要,由於其依賴敏感的神經生理資料和資源密集型的開發,面臨重大的智慧財產權 (IP) 風險。目前的浮水印方法,特別是那些使用抽象觸發集的方法,缺乏強健的驗證,且無法解決 EEG 模型的獨特挑戰。本文介紹了一個專為基於 EEG 的神經網路量身打造的密碼學 wonder 濾波器浮水印架構。利用抗碰撞雜湊和公開金鑰加密,wonder 濾波器在訓練期間嵌入浮水印,確保最小的失真(EEG 任務準確度下降 $\leq 5\%$)和高可靠性(100% 浮水印檢測)。該架構針對對抗性攻擊進行了嚴格的評估,包括微調、遷移學習和神經元剪枝。結果證明了持續的浮水印保留,即使在激進的剪枝後,浮水印狀態的分類準確度仍保持在 90% 以上,而主要任務的性能下降得更快,阻止了移除嘗試。盜版抵抗力通過無法嵌入次要浮水印而得到驗證,而不會造成嚴重的準確度損失(在 EEGNet 和 CCNN 模型中 $>10\%$)。密碼學雜湊確保驗證,降低了暴力攻擊成功機率。在 DEAP 資料集上針對模型(CCNN、EEGNet、TSception)進行評估,該方法達到了 $>99.4\%$ 的空嵌入準確度,有效地消除了假陽性。透過將 wonder 濾波器與 EEG 特定的適應相整合,這項工作彌補了神經生理模型 IP 保護中的關鍵差距,為醫療保健和生物特徵應用提供了一個安全、防篡改的解決方案。該架構對抗敵對修改的強健性突顯了其在維護診斷效用的同時保護敏感 EEG 模型的潛力。 +摘要:檢索增強生成 (RAG) 是一種適用於檢索隱私敏感的電子健康記錄 (EHR) 的技術。它可以作為醫療保健副駕駛的一個關鍵模組,協助減少醫療保健從業人員和患者的誤診。然而,在醫療領域中使用的現有基於啟發法的 RAG 模型的診斷準確性和特異性不足,特別是對於具有類似表現的疾病。本文提出 MedRAG,一種由知識圖譜 (KG) 引發的推理增強的 RAG 模型,用於醫療領域,它根據表現檢索診斷和治療建議。MedRAG 系統性地構建了一個全面的四層階層式診斷 KG,涵蓋各種疾病的關鍵診斷差異。這些差異與從 EHR 資料庫中檢索到的類似 EHR 動態整合,並在大型語言模型中進行推理。這個過程可以實現更準確和具體的決策支援,同時主動提供後續問題,以增強個人化醫療決策制定。MedRAG 在公共資料集 DDXPlus 和從陳篤生醫院收集的私人慢性疼痛診斷資料集 (CPDD) 上進行評估,並將其效能與各種現有 RAG 方法進行比較。實驗結果顯示,利用 KG 的資訊整合和關係能力,我們的 MedRAG 提供了更具體的診斷見解,並在降低誤診率方面優於最先進的模型。我們的程式碼將在 https://github.com/SNOWTEAM2023/MedRAG 提供 -##### **Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models** -2502.05879v1 by Shiyu Teng, Jiaqing Liu, Rahul Kumar Jain, Shurong Chai, Ruibo Hou, Tomoko Tateyama, Lanfen Lin, Yen-wei Chen +##### **Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering** +2502.03992v1 by Longquan Jiang, Junbo Huang, Cedric Möller, Ricardo Usbeck -Depression is one of the leading causes of disability worldwide, posing a -severe burden on individuals, healthcare systems, and society at large. 
Recent -advancements in Large Language Models (LLMs) have shown promise in addressing -mental health challenges, including the detection of depression through -text-based analysis. However, current LLM-based methods often struggle with -nuanced symptom identification and lack a transparent, step-by-step reasoning -process, making it difficult to accurately classify and explain mental health -conditions. To address these challenges, we propose a Chain-of-Thought -Prompting approach that enhances both the performance and interpretability of -LLM-based depression detection. Our method breaks down the detection process -into four stages: (1) sentiment analysis, (2) binary depression classification, -(3) identification of underlying causes, and (4) assessment of severity. By -guiding the model through these structured reasoning steps, we improve -interpretability and reduce the risk of overlooking subtle clinical indicators. -We validate our method on the E-DAIC dataset, where we test multiple -state-of-the-art large language models. Experimental results indicate that our -Chain-of-Thought Prompting technique yields superior performance in both -classification accuracy and the granularity of diagnostic insights, compared to -baseline approaches. +Most existing Knowledge Graph Question Answering (KGQA) approaches are +designed for a specific KG, such as Wikidata, DBpedia or Freebase. Due to the +heterogeneity of the underlying graph schema, topology and assertions, most +KGQA systems cannot be transferred to unseen Knowledge Graphs (KGs) without +resource-intensive training data. We present OntoSCPrompt, a novel Large +Language Model (LLM)-based KGQA approach with a two-stage architecture that +separates semantic parsing from KG-dependent interactions. OntoSCPrompt first +generates a SPARQL query structure (including SPARQL keywords such as SELECT, +ASK, WHERE and placeholders for missing tokens) and then fills them with +KG-specific information. 
To enhance the understanding of the underlying KG, we +present an ontology-guided, hybrid prompt learning strategy that integrates KG +ontology into the learning process of hybrid prompts (e.g., discrete and +continuous vectors). We also present several task-specific decoding strategies +to ensure the correctness and executability of generated SPARQL queries in both +stages. Experimental results demonstrate that OntoSCPrompt performs as well as +SOTA approaches without retraining on a number of KGQA datasets such as CWQ, +WebQSP and LC-QuAD 1.0 in a resource-efficient manner and can generalize well +to unseen domain-specific KGs like DBLP-QuAD and CoyPu KG Code: +\href{https://github.com/LongquanJiang/OntoSCPrompt}{https://github.com/LongquanJiang/OntoSCPrompt} -摘要:憂鬱症是全球殘障的主要原因之一,對個人、醫療保健系統和整個社會造成嚴重負擔。大型語言模型 (LLM) 的最新進展已展現出解決心理健康挑戰的希望,包括透過基於文字的分析來偵測憂鬱症。然而,現有的基於 LLM 的方法通常難以辨識細微的症狀,而且缺乏透明且逐步的推理過程,這使得準確分類和解釋心理健康狀況變得困難。為了應對這些挑戰,我們提出了一種思考鏈提示方法,它增強了基於 LLM 的憂鬱症偵測的效能和可解釋性。我們的這項方法將偵測過程分解為四個階段:(1) 情緒分析,(2) 二元憂鬱症分類,(3) 找出潛在原因,以及 (4) 評估嚴重程度。透過引導模型完成這些結構化的推理步驟,我們提升了可解釋性,並降低了忽略細微臨床指標的風險。我們在 E-DAIC 資料集上驗證了我們的這項方法,並在其中測試了多種最先進的大型語言模型。實驗結果顯示,與基線方法相比,我們的思考鏈提示技術在分類準確度和診斷見解的精細度方面都表現出優異的效能。 +摘要:現有的知識圖譜問答(KGQA)方法大多是為特定 KG 而設計的,例如 Wikidata、DBpedia 或 Freebase。由於底層圖形模式、拓撲和斷言的異質性,大多數 KGQA 系統無法在沒有資源密集型訓練資料的情況下轉移到未見過的知識圖譜(KG)。我們提出 OntoSCPrompt,這是一種基於大型語言模型(LLM)的新型 KGQA 方法,採用兩階段架構,將語義解析與依賴 KG 的互動分開。OntoSCPrompt 首先生成 SPARQL 查詢結構(包括 SPARQL 關鍵字,例如 SELECT、ASK、WHERE 和缺失令牌的佔位符),然後用 KG 特定的資訊填寫它們。為了增強對底層 KG 的理解,我們提出了一種由本体指導的混合提示學習策略,將 KG 本体整合到混合提示(例如,離散和連續向量)的學習過程中。我們還提出了多種特定任務的解碼策略,以確保在兩個階段中生成的 SPARQL 查詢的正確性和可執行性。實驗結果表明,OntoSCPrompt 在 CWQ、WebQSP 和 LC-QuAD 1.0 等多個 KGQA 資料集上執行時,效能與 SOTA 方法一樣好,且資源使用效率高,並且可以很好地概括到未見過的特定領域 KG,例如 DBLP-QuAD 和 CoyPu KG Code: +\href{https://github.com/LongquanJiang/OntoSCPrompt}{https://github.com/LongquanJiang/OntoSCPrompt} -##### **LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison** -2502.06890v1 by Gabriele De Vito, Filomena Ferrucci, Athanasios 
Angelakis +##### **Multimodal Medical Code Tokenizer** +2502.04397v2 by Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik -The increasing volume of drug combinations in modern therapeutic regimens -needs reliable methods for predicting drug-drug interactions (DDIs). While -Large Language Models (LLMs) have revolutionized various domains, their -potential in pharmaceutical research, particularly in DDI prediction, remains -largely unexplored. This study thoroughly investigates LLMs' capabilities in -predicting DDIs by uniquely processing molecular structures (SMILES), target -organisms, and gene interaction data as raw text input from the latest DrugBank -dataset. We evaluated 18 different LLMs, including proprietary models (GPT-4, -Claude, Gemini) and open-source variants (from 1.5B to 72B parameters), first -assessing their zero-shot capabilities in DDI prediction. We then fine-tuned -selected models (GPT-4, Phi-3.5 2.7B, Qwen-2.5 3B, Gemma-2 9B, and Deepseek R1 -distilled Qwen 1.5B) to optimize their performance. Our comprehensive -evaluation framework included validation across 13 external DDI datasets, -comparing against traditional approaches such as l2-regularized logistic -regression. Fine-tuned LLMs demonstrated superior performance, with Phi-3.5 -2.7B achieving a sensitivity of 0.978 in DDI prediction, with an accuracy of -0.919 on balanced datasets (50% positive, 50% negative cases). This result -represents an improvement over both zero-shot predictions and state-of-the-art -machine-learning methods used for DDI prediction. Our analysis reveals that -LLMs can effectively capture complex molecular interaction patterns and cases -where drug pairs target common genes, making them valuable tools for practical -applications in pharmaceutical research and clinical settings. 
+Foundation models trained on patient electronic health records (EHRs) require +tokenizing medical data into sequences of discrete vocabulary items. Existing +tokenizers treat medical codes from EHRs as isolated textual tokens. However, +each medical code is defined by its textual description, its position in +ontological hierarchies, and its relationships to other codes, such as disease +co-occurrences and drug-treatment associations. Medical vocabularies contain +more than 600,000 codes with critical information for clinical reasoning. We +introduce MedTok, a multimodal medical code tokenizer that uses the text +descriptions and relational context of codes. MedTok processes text using a +language model encoder and encodes the relational structure with a graph +encoder. It then quantizes both modalities into a unified token space, +preserving modality-specific and cross-modality information. We integrate +MedTok into five EHR models and evaluate it on operational and clinical tasks +across in-patient and out-patient datasets, including outcome prediction, +diagnosis classification, drug recommendation, and risk stratification. +Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR +models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with +the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate +using MedTok tokenizer with medical QA systems. Our results demonstrate the +potential of MedTok as a unified tokenizer for medical codes, improving +tokenization for medical foundation models. 
-摘要:現代治療方案中藥物組合的數量越來越多,需要可靠的方法來預測藥物間交互作用 (DDI)。儘管大型語言模型 (LLM) 已在各個領域掀起革命,它們在藥物研究中的潛力,特別是在 DDI 預測中的潛力,仍未得到充分探索。本研究通過獨特地處理分子結構 (SMILES)、目標生物和基因交互資料作為來自最新 DrugBank 資料集的原始文字輸入,徹底調查了 LLM 在預測 DDI 中的能力。我們評估了 18 種不同的 LLM,包括專有模型(GPT-4、Claude、Gemini)和開源變體(從 1.5B 到 72B 參數),首先評估它們在 DDI 預測中的零次學習能力。然後,我們微調選定的模型(GPT-4、Phi-3.5 2.7B、Qwen-2.5 3B、Gemma-2 9B 和 Deepseek R1 蒸餾 Qwen 1.5B)以最佳化其效能。我們的全面評估框架包括跨 13 個外部 DDI 資料集進行驗證,並與傳統方法(例如 l2 正則化邏輯迴歸)進行比較。微調後的 LLM 表現出優異的效能,其中 Phi-3.5 2.7B 在 DDI 預測中達到 0.978 的靈敏度,在平衡資料集(50% 正例,50% 反例)上的準確度為 0.919。此結果優於零次學習預測和用於 DDI 預測的最新機器學習方法。我們的分析表明,LLM 可以有效捕捉複雜的分子交互模式和藥物對靶向共同基因的情況,使其成為藥物研究和臨床環境中實用應用的寶貴工具。 +摘要:在患者电子健康记录 (EHR) 上训练的基础模型需要将医学数据标记为离散词汇项序列。现有的标记器将 EHR 中的医学代码视为孤立的文本标记。然而,每个医学代码都由其文本描述、在本体层次结构中的位置以及与其他代码的关系(例如疾病共现和药物治疗关联)来定义。医学词汇表包含超过 600,000 个代码,这些代码包含临床推理的关键信息。我们引入了 MedTok,这是一种多模态医学代码标记器,它使用文本描述和代码的关系上下文。MedTok 使用语言模型编码器处理文本,并使用图编码器对关系结构进行编码。然后,它将这两种模态量化为一个统一的标记空间,保留特定于模态和跨模态的信息。我们将 MedTok 集成到五个 EHR 模型中,并在住院和门诊数据集(包括结果预测、诊断分类、药物推荐和风险分层)上对其实施操作和临床任务进行评估。用 MedTok 替换标准 EHR 标记器可提高所有 EHR 模型的 AUPRC,在 MIMIC-III 上提高 4.10%,在 MIMIC-IV 上提高 4.78%,在 EHRShot 上提高 11.30%,其中药物推荐的增益最大。除了 EHR 建模之外,我们还演示了将 MedTok 标记器与医学问答系统结合使用。我们的结果证明了 MedTok 作为医学代码的统一标记器的潜力,改进了医学基础模型的标记化。 -##### **Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)** -2502.07815v1 by Lokesh Koli, Shubham Kalra, Karanpreet Singh +##### **Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents** +2502.04392v1 by Chenyang Shao, Xinyuan Hu, Yutang Lin, Fengli Xu -Detecting sensitive data such as Personally Identifiable Information (PII) -and Protected Health Information (PHI) is critical for data security platforms. -This study evaluates regex-based pattern matching algorithms and exact-match -search techniques to optimize detection speed, accuracy, and scalability. 
Our -benchmarking results indicate that Google RE2 provides the best balance of -speed (10-15 ms/MB), memory efficiency (8-16 MB), and accuracy (99.5%) among -regex engines, outperforming PCRE while maintaining broader hardware -compatibility than Hyperscan. For exact matching, Aho-Corasick demonstrated -superior performance (8 ms/MB) and scalability for large datasets. Performance -analysis revealed that regex processing time scales linearly with dataset size -and pattern complexity. A hybrid AI + Regex approach achieved the highest F1 -score (91. 6%) by improving recall and minimizing false positives. Device -benchmarking confirmed that our solution maintains efficient CPU and memory -usage on both high-performance and mid-range systems. Despite its -effectiveness, challenges remain, such as limited multilingual support and the -need for regular pattern updates. Future work should focus on expanding -language coverage, integrating data security and privacy management (DSPM) with -data loss prevention (DLP) tools, and enhancing regulatory compliance for -broader global adoption. +The rapid expansion of web content has made on-device AI assistants +indispensable for helping users manage the increasing complexity of online +tasks. The emergent reasoning ability in large language models offers a +promising path for next-generation on-device AI agents. However, deploying +full-scale Large Language Models (LLMs) on resource-limited local devices is +challenging. In this paper, we propose Division-of-Thoughts (DoT), a +collaborative reasoning framework leveraging the synergy between locally +deployed Smaller-scale Language Models (SLMs) and cloud-based LLMs. DoT +leverages a Task Decomposer to elicit the inherent planning abilities in +language models to decompose user queries into smaller sub-tasks, which allows +hybrid language models to fully exploit their respective strengths.
Besides, +DoT employs a Task Scheduler to analyze the pair-wise dependency of sub-tasks +and create a dependency graph, facilitating parallel reasoning of sub-tasks and +the identification of key steps. To allocate the appropriate model based on the +difficulty of sub-tasks, DoT leverages a Plug-and-Play Adapter, which is an +additional task head attached to the SLM that does not alter the SLM's +parameters. To boost adapter's task allocation capability, we propose a +self-reinforced training method that relies solely on task execution feedback. +Extensive experiments on various benchmarks demonstrate that our DoT +significantly reduces LLM costs while maintaining competitive reasoning +accuracy. Specifically, DoT reduces the average reasoning time and API costs by +66.12% and 83.57%, while achieving comparable reasoning accuracy with the best +baseline methods. -摘要:偵測個人身分資訊 (PII) 和受保護健康資訊 (PHI) 等敏感資料,對於資料安全平台至關重要。本研究評估基於 regex 的模式配對演算法和精確配對搜尋技術,以最佳化偵測速度、準確度和可擴充性。我們的基準測試結果顯示,在 regex 引擎中,Google RE2 在速度 (10-15 ms/MB)、記憶體效率 (8-16 MB) 和準確度 (99.5%) 方面取得最佳平衡,優於 PCRE,同時比 Hyperscan 擁有更廣泛的硬體相容性。對於精確配對,Aho-Corasick 展現出優異的效能 (8 ms/MB) 和大資料集的可擴充性。效能分析顯示,regex 處理時間會隨著資料集大小和模式複雜度線性擴充。混合 AI + Regex 方法透過提升召回率和將假陽性降至最低,達到了最高的 F1 分數 (91. 
6%)。裝置基準測試確認我們的解決方案在高性能和中階系統上都能維持高效的 CPU 和記憶體使用率。儘管有效,但仍有挑戰存在,例如多語言支援有限,以及需要定期更新模式。未來的研究應著重於擴展語言涵蓋範圍,將資料安全和隱私管理 (DSPM) 與資料遺失防護 (DLP) 工具整合,以及加強法規遵循以利更廣泛的全球採用。 +摘要:網頁內容快速擴充,使得行動裝置上的 AI 助理在協助使用者管理日益複雜的線上工作上變得不可或缺。大型語言模型中浮現的推理能力為新一代行動裝置上的 AI 代理提供了一條有希望的途徑。然而,在資源有限的本機裝置上部署全規模的大型語言模型 (LLM) 是一項挑戰。在本文中,我們提出了思想分工 (DoT),一個協作推理框架,利用了本地部署的小型語言模型 (SLM) 與雲端 LLM 之間的協同效應。DoT 利用任務分解器引出語言模型中固有的規劃能力,將使用者查詢分解成較小的子任務,這允許混合語言模型充分發揮其各自的優勢。此外,DoT 雇用了一個任務排程器來分析子任務的成對依賴性並建立一個依賴性圖,促進子任務的並行推理和關鍵步驟的識別。為了根據子任務的難度分配適當的模型,DoT 利用了即插即用適配器,這是一個附加在 SLM 上的任務頭,不會改變 SLM 的參數。為了提升適配器的任務分配能力,我們提出了一種自我強化訓練方法,它僅依賴於任務執行回饋。在各種基準上的廣泛實驗表明,我們的 DoT 大幅降低了 LLM 成本,同時維持了有競爭力的推理準確度。具體來說,DoT 將平均推理時間和 API 成本分別降低了 66.12% 和 83.57%,同時達到了與最佳基準方法相當的推理準確度。 -##### **WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch** -2502.05783v1 by Ying Lei, Yancheng Cao, Will Wang, Yuanzhe Dong, Changchang Yin, Weidan Cao, Ping Zhang, Jingzhen Yang, Bingsheng Yao, Yifan Peng, Chunhua Weng, Randy Auerbach, Lena Mamykina, Dakuo Wang, Yuntao Wang, Xuhai Xu +##### **Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models** +2502.03715v1 by Rui Cai, Chao Wang, Qianyi Cai, Dazhong Shen, Hui Xiong -While just-in-time interventions (JITIs) have effectively targeted common -health behaviors, individuals often have unique needs to intervene in personal -undesirable actions that can negatively affect physical, mental, and social -well-being. We present WatchGuardian, a smartwatch-based JITI system that -empowers users to define custom interventions for these personal actions with a -small number of samples. For the model to detect new actions based on limited -new data samples, we developed a few-shot learning pipeline that finetuned a -pre-trained inertial measurement unit (IMU) model on public hand-gesture -datasets. We then designed a data augmentation and synthesis process to train -additional classification layers for customization. 
Our offline evaluation with -26 participants showed that with three, five, and ten examples, our approach -achieved an average accuracy of 76.8%, 84.7%, and 87.7%, and an F1 score of -74.8%, 84.2%, and 87.2% We then conducted a four-hour intervention study to -compare WatchGuardian against a rule-based intervention. Our results -demonstrated that our system led to a significant reduction by 64.0 +- 22.6% in -undesirable actions, substantially outperforming the baseline by 29.0%. Our -findings underscore the effectiveness of a customizable, AI-driven JITI system -for individuals in need of behavioral intervention in personal undesirable -actions. We envision that our work can inspire broader applications of -user-defined personalized intervention with advanced AI solutions. +Knowledge Graph-based recommendations have gained significant attention due +to their ability to leverage rich semantic relationships. However, constructing +and maintaining Knowledge Graphs (KGs) is resource-intensive, and the accuracy +of KGs can suffer from noisy, outdated, or irrelevant triplets. Recent +advancements in Large Language Models (LLMs) offer a promising way to improve +the quality and relevance of KGs for recommendation tasks. Despite this, +integrating LLMs into KG-based systems presents challenges, such as efficiently +augmenting KGs, addressing hallucinations, and developing effective joint +learning methods. In this paper, we propose the Confidence-aware KG-based +Recommendation Framework with LLM Augmentation (CKG-LLMA), a novel framework +that combines KGs and LLMs for recommendation task. The framework includes: (1) +an LLM-based subgraph augmenter for enriching KGs with high-quality +information, (2) a confidence-aware message propagation mechanism to filter +noisy triplets, and (3) a dual-view contrastive learning method to integrate +user-item interactions and KG data. 
Additionally, we employ a confidence-aware +explanation generation process to guide LLMs in producing realistic +explanations for recommendations. Finally, extensive experiments demonstrate +the effectiveness of CKG-LLMA across multiple public datasets. -摘要:雖然即時介入(JITIs)有效地針對常見的健康行為,但個人通常有獨特的需求來介入可能會對身心和社會福祉產生負面影響的個人不良行為。我們提出 WatchGuardian,這是一個基於智慧手錶的 JITI 系統,它使用少數樣本讓使用者能夠為這些個人行為定義自訂介入措施。為了讓模型根據有限的新資料樣本偵測新行為,我們開發了一個小樣本學習管道,微調了公共手勢資料集上的預訓練慣性測量單元(IMU)模型。然後,我們設計了一個資料擴充和合成流程,以訓練其他分類層以進行自訂。我們對 26 位參與者進行的離線評估顯示,我們的做法使用三個、五個和十個範例,達到了 76.8%、84.7% 和 87.7% 的平均準確度,以及 74.8%、84.2% 和 87.2% 的 F1 分數。然後,我們進行了一項為時四小時的介入研究,以將 WatchGuardian 與基於規則的介入進行比較。我們的結果表明,我們的系統導致不良行為顯著減少了 64.0 +- 22.6%,大幅優於基線 29.0%。我們的研究結果強調了可自訂、AI 驅動的 JITI 系統對需要行為介入以應對個人不良行為的個人的有效性。我們預計我們的研究可以激勵使用者定義個人化介入的更廣泛應用,並採用先進的 AI 解決方案。 +摘要:基於知識圖譜的推薦因其利用豐富語義關係的能力而備受關注。然而,構建和維護知識圖譜 (KG) 是一項資源密集型任務,而 KG 的準確性可能會受到雜訊、過時或無關的三元組的影響。大型語言模型 (LLM) 的最新進展為提高 KG 在推薦任務中的品質和相關性提供了一種有前途的方法。儘管如此,將 LLM 整合到基於 KG 的系統中會帶來挑戰,例如有效擴充 KG、處理幻覺,以及開發有效的聯合學習方法。在本文中,我們提出具有 LLM 擴充的信心感知型基於 KG 的推薦框架 (CKG-LLMA),這是一個結合 KG 和 LLM 進行推薦任務的新穎框架。該框架包括:(1) 一個基於 LLM 的子圖擴充器,用於使用高品質資訊豐富 KG,(2) 一個信心感知型訊息傳播機制,用於過濾雜訊三元組,以及 (3) 一個雙視圖對比學習方法,用於整合使用者-項目互動和 KG 資料。此外,我們採用一個信心感知型解釋產生程序,以引導 LLM 為推薦產生逼真的解釋。最後,大量的實驗證明了 CKG-LLMA 在多個公開資料集中的有效性。 -##### **RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care** -2502.05740v1 by Ziqi Yang, Yuxuan Lu, Jennifer Bagdasarian, Vedant Das Swain, Ritu Agarwal, Collin Campbell, Waddah Al-Refaire, Jehan El-Bayoumi, Guodong Gao, Dakuo Wang, Bingsheng Yao, Nawar Shara +##### **A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)** +2502.03450v1 by Yiye Chen, Harpreet Sawhney, Nicholas Gydé, Yanan Jian, Jack Saunders, Patricio Vela, Ben Lundell -Cancer surgery is a key treatment for gastrointestinal (GI) cancers, a group -of cancers that account for more than 35% of cancer-related deaths worldwide, -but 
postoperative complications are unpredictable and can be life-threatening. -In this paper, we investigate how recent advancements in large language models -(LLMs) can benefit remote patient monitoring (RPM) systems through clinical -integration by designing RECOVER, an LLM-powered RPM system for postoperative -GI cancer care. To closely engage stakeholders in the design process, we first -conducted seven participatory design sessions with five clinical staff and -interviewed five cancer patients to derive six major design strategies for -integrating clinical guidelines and information needs into LLM-based RPM -systems. We then designed and implemented RECOVER, which features an -LLM-powered conversational agent for cancer patients and an interactive -dashboard for clinical staff to enable efficient postoperative RPM. Finally, we -used RECOVER as a pilot system to assess the implementation of our design -strategies with four clinical staff and five patients, providing design -implications by identifying crucial design elements, offering insights on -responsible AI, and outlining opportunities for future LLM-powered RPM systems. +Scene graphs have emerged as a structured and serializable environment +representation for grounded spatial reasoning with Large Language Models +(LLMs). In this work, we propose SG-RwR, a Schema-Guided Retrieve-while-Reason +framework for reasoning and planning with scene graphs. Our approach employs +two cooperative, code-writing LLM agents: a (1) Reasoner for task planning and +information queries generation, and a (2) Retriever for extracting +corresponding graph information following the queries. Two agents collaborate +iteratively, enabling sequential reasoning and adaptive attention to graph +information. 
Unlike prior works, both agents are prompted only with the scene +graph schema rather than the full graph data, which reduces hallucination +by limiting input tokens, and drives the Reasoner to generate its reasoning trace +abstractly. Following the trace, the Retriever programmatically queries the scene +graph data based on the schema understanding, allowing dynamic and global +attention on the graph that enhances alignment between reasoning and retrieval. +Through experiments in multiple simulation environments, we show that our +framework surpasses existing LLM-based approaches in numerical Q&A and +planning tasks, and can benefit from task-level few-shot examples, even in the +absence of agent-level demonstrations. Project code will be released. -摘要:癌症手術是胃腸道 (GI) 癌症的主要治療方式,這類癌症佔全球癌症相關死亡人數的 35% 以上,但術後併發症無法預測,且可能危及生命。在本文中,我們探討大型語言模型 (LLM) 的近期進展如何透過臨床整合造福遠端病患監控 (RPM) 系統,方法是設計 RECOVER,一個由 LLM 驅動的 RPM 系統,用於術後胃腸道癌症照護。為了讓利害關係人密切參與設計流程,我們首先與五位臨床人員進行七場參與式設計會議,並訪談五位癌症患者,以找出六項整合臨床指南和資訊需求至基於 LLM 的 RPM 系統的主要設計策略。接著,我們設計並實作 RECOVER,其特色在於一個由 LLM 驅動的對話式代理人,供癌症患者使用,以及一個互動式儀表板,供臨床人員使用,以進行有效的術後 RPM。最後,我們使用 RECOVER 作為試點系統,與四位臨床人員和五位患者評估我們設計策略的實作,並透過找出重要的設計元素、提供對負責任 AI 的見解,以及概述未來由 LLM 驅動的 RPM 系統的機會,提出設計意涵。 +摘要:場景圖表已成為大型語言模型 (LLM) 以基礎空間推理為基礎的結構化且可序列化的環境表徵。在這項工作中,我們提出 SG-RwR,一個以綱要為導向的檢索與推理框架,用於場景圖表的推理和規劃。我們的做法採用了兩個協作的、編寫程式碼的 LLM 代理:一個 (1) 推論器,用於任務規劃和資訊查詢產生,以及一個 (2) 檢索器,用於根據查詢提取對應的圖形資訊。兩個代理反覆合作,實現對圖形資訊的順序推理和適應性關注。與先前的作品不同,兩個代理僅提示場景圖表綱要,而不是完整的圖形資料,這透過限制輸入代碼減少了幻覺,並驅使推論器抽象地產生推理軌跡。根據軌跡,檢索器根據綱要理解以程式化方式查詢場景圖形資料,允許對圖形進行動態和整體關注,增強推理和檢索之間的一致性。透過在多個模擬環境中的實驗,我們表明我們的框架在數值問答和規劃任務中超越了現有的基於 LLM 的方法,並且可以受益於任務級別的少次範例,即使在沒有代理級別示範的情況下也是如此。專案程式碼將會釋出。 -##### **4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis** -2502.05713v1 by An Zhao, Moucheng Xu, Ahmed H. Shahin, Wim Wuyts, Mark G. Jones, Joseph Jacob, Daniel C.
Alexander +##### **SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs** +2502.03283v2 by Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin -Understanding the progression trajectories of diseases is crucial for early -diagnosis and effective treatment planning. This is especially vital for -life-threatening conditions such as Idiopathic Pulmonary Fibrosis (IPF), a -chronic, progressive lung disease with a prognosis comparable to many cancers. -Computed tomography (CT) imaging has been established as a reliable diagnostic -tool for IPF. Accurately predicting future CT scans of early-stage IPF patients -can aid in developing better treatment strategies, thereby improving survival -outcomes. In this paper, we propose 4D Vector Quantised Generative Adversarial -Networks (4D-VQ-GAN), a model capable of generating realistic CT volumes of IPF -patients at any time point. The model is trained using a two-stage approach. In -the first stage, a 3D-VQ-GAN is trained to reconstruct CT volumes. In the -second stage, a Neural Ordinary Differential Equation (ODE) based temporal -model is trained to capture the temporal dynamics of the quantised embeddings -generated by the encoder in the first stage. We evaluate different -configurations of our model for generating longitudinal CT scans and compare -the results against ground truth data, both quantitatively and qualitatively. -For validation, we conduct survival analysis using imaging biomarkers derived -from generated CT scans and achieve a C-index comparable to that of biomarkers -derived from the real CT scans. The survival analysis results demonstrate the -potential clinical utility inherent to generated longitudinal CT scans, showing -that they can reliably predict survival outcomes. +Recent advancements have highlighted that Large Language Models (LLMs) are +prone to hallucinations when solving complex reasoning problems, leading to +erroneous results. 
To tackle this issue, researchers incorporate Knowledge +Graphs (KGs) to improve the reasoning ability of LLMs. However, existing +methods face two limitations: 1) they typically assume that all answers to the +questions are contained in KGs, neglecting the incompleteness issue of KGs, and +2) they treat the KG as a static repository and overlook the implicit logical +reasoning structures inherent in KGs. In this paper, we introduce SymAgent, an +innovative neural-symbolic agent framework that achieves collaborative +augmentation between KGs and LLMs. We conceptualize KGs as dynamic environments +and transform complex reasoning tasks into a multi-step interactive process, +enabling KGs to participate deeply in the reasoning process. SymAgent consists +of two modules: Agent-Planner and Agent-Executor. The Agent-Planner leverages +LLM's inductive reasoning capability to extract symbolic rules from KGs, +guiding efficient question decomposition. The Agent-Executor autonomously +invokes predefined action tools to integrate information from KGs and external +documents, addressing the issues of KG incompleteness. Furthermore, we design a +self-learning framework comprising online exploration and offline iterative +policy updating phases, enabling the agent to automatically synthesize +reasoning trajectories and improve performance. Experimental results +demonstrate that SymAgent with weak LLM backbones (i.e., 7B series) yields +better or comparable performance compared to various strong baselines. Further +analysis reveals that our agent can identify missing triples, facilitating +automatic KG updates. 
-摘要:了解疾病的進程軌跡對於早期診斷和有效的治療計畫至關重要。這對於特發性肺纖維化 (IPF) 等威脅生命的疾病尤其重要,IPF 是一種慢性、進行性肺部疾病,其預後與許多癌症相當。電腦斷層掃描 (CT) 影像已被確立為 IPF 的可靠診斷工具。準確預測早期 IPF 患者的未來 CT 掃描有助於制定更好的治療策略,從而改善存活結果。在本文中,我們提出 4D 向量量化生成對抗網路 (4D-VQ-GAN),這是一個模型,能夠在任何時間點生成 IPF 患者的逼真 CT 體積。該模型使用兩階段方法進行訓練。在第一階段,訓練 3D-VQ-GAN 以重建 CT 體積。在第二階段,訓練基於神經常微分方程 (ODE) 的時間模型,以捕捉第一階段編碼器生成的量化嵌入的時間動態。我們評估了我們的模型的不同配置,以生成縱向 CT 掃描,並在定量和定性方面將結果與真實數據進行比較。為了驗證,我們使用從生成的 CT 掃描中得出的影像生物標記進行存活分析,並獲得與從真實 CT 掃描中得出的生物標記相當的 C 指數。存活分析結果證明了生成縱向 CT 掃描固有的潛在臨床效用,表明它們可以可靠地預測存活結果。 +摘要:最近的進展強調出,大型語言模型 (LLM) 在解決複雜推理問題時容易出現幻覺,導致錯誤的結果。為了解決這個問題,研究人員結合知識圖譜 (KG) 來改善 LLM 的推理能力。然而,現有方法面臨兩個限制:1) 它們通常假設問題的所有答案都包含在 KG 中,忽略了 KG 的不完整性問題,以及 2) 它們將 KG 視為一個靜態儲存庫,而忽略了 KG 中固有的隱式邏輯推理結構。在本文中,我們介紹了 SymAgent,一個創新的神經符號代理架構,它在 KG 和 LLM 之間實現了協作擴充。我們將 KG 概念化為動態環境,並將複雜的推理任務轉化為一個多步驟的互動過程,使 KG 能夠深入參與推理過程。SymAgent 包含兩個模組:代理規劃器和代理執行器。代理規劃器利用 LLM 的歸納推理能力從 KG 中提取符號規則,指導有效的問題分解。代理執行器自主地調用預定義的動作工具來整合來自 KG 和外部文件的資訊,解決 KG 不完整性的問題。此外,我們設計了一個自學習框架,包括線上探索和離線反覆的政策更新階段,使代理能夠自動合成推理軌跡並改善效能。實驗結果表明,具有弱 LLM 主幹的 SymAgent(例如,7B 系列)與各種強大的基線相比,產生了更好或相當的效能。進一步的分析表明,我們的代理可以識別遺失的三元組,促進自動 KG 更新。 -##### **KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy** -2502.05651v1 by Hyunjong Kim, Suyeon Lee, Yeongjae Cho, Eunseo Ryu, Yohan Jo, Suran Seong, Sungzoon Cho +##### **Analyze Feature Flow to Enhance Interpretation and Steering in Language Models** +2502.03032v2 by Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov -The increasing demand for mental health services has led to the rise of -AI-driven mental health chatbots, though challenges related to privacy, data -collection, and expertise persist. Motivational Interviewing (MI) is gaining -attention as a theoretical basis for boosting expertise in the development of -these chatbots. However, existing datasets are showing limitations for training -chatbots, leading to a substantial demand for publicly available resources in -the field of MI and psychotherapy. 
These challenges are even more pronounced in -non-English languages, where they receive less attention. In this paper, we -propose a novel framework that simulates MI sessions enriched with the -expertise of professional therapists. We train an MI forecaster model that -mimics the behavioral choices of professional therapists and employ Large -Language Models (LLMs) to generate utterances through prompt engineering. Then, -we present KMI, the first synthetic dataset theoretically grounded in MI, -containing 1,000 high-quality Korean Motivational Interviewing dialogues. -Through an extensive expert evaluation of the generated dataset and the -dialogue model trained on it, we demonstrate the quality, expertise, and -practicality of KMI. We also introduce novel metrics derived from MI theory in -order to evaluate dialogues from the perspective of MI. +We introduce a new approach to systematically map features discovered by +sparse autoencoder across consecutive layers of large language models, +extending earlier work that examined inter-layer feature links. By using a +data-free cosine similarity technique, we trace how specific features persist, +transform, or first appear at each stage. This method yields granular flow +graphs of feature evolution, enabling fine-grained interpretability and +mechanistic insights into model computations. Crucially, we demonstrate how +these cross-layer feature maps facilitate direct steering of model behavior by +amplifying or suppressing chosen features, achieving targeted thematic control +in text generation. Together, our findings highlight the utility of a causal, +cross-layer interpretability framework that not only clarifies how features +develop through forward passes but also provides new means for transparent +manipulation of large language models. 
-摘要:由於對心理健康服務的需求日益增加,導致以人工智慧為基礎的心理健康聊天機器人興起,儘管與隱私、資料蒐集和專業知識相關的挑戰依然存在。動機性訪談 (MI) 正作為提升這些聊天機器人在開發方面專業知識的理論基礎而備受關注。然而,現有的資料集顯示出訓練聊天機器人的限制,導致對 MI 和心理治療領域中公開可用資源的需求大幅增加。這些挑戰在非英語語言中更加明顯,因為它們受到的關注較少。在本文中,我們提出了一個新穎的架構,它模擬了豐富專業治療師專業知識的 MI 課程。我們訓練了一個 MI 預測模型,它模擬了專業治療師的行為選擇,並採用大型語言模型 (LLM) 透過提示工程來產生話語。然後,我們展示了 KMI,這是第一個理論上以 MI 為基礎的合成資料集,其中包含 1,000 個高品質的韓語動機性訪談對話。透過對所產生的資料集和在該資料集上訓練的對話模型進行廣泛的專家評估,我們展示了 KMI 的品質、專業知識和實用性。我們還引入了從 MI 理論中衍生的新指標,以便從 MI 的角度評估對話。 +摘要:我們提出了一種新方法,用於系統性地繪製大型語言模型連續層中稀疏自動編碼器發現的功能,擴展了先前研究層間特徵連結的工作。透過使用無資料餘弦相似性技術,我們追蹤特定特徵在每個階段如何持續、轉換或首次出現。此方法產生了特徵演化的細粒度流程圖,實現了細粒度的可解釋性和對模型運算的機制見解。至關重要的是,我們展示了這些跨層特徵圖如何透過放大或抑制所選特徵來促進模型行為的直接引導,在文字生成中實現目標主題控制。我們的研究結果共同突出了因果、跨層可解釋性框架的效用,不僅闡明了特徵如何透過前向傳遞發展,還提供了新的方法來透明地操作大型語言模型。 -##### **ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports** -2502.05638v1 by Aynur Guluzade, Naguib Heiba, Zeyd Boukhers, Florim Hamiti, Jahid Hasan Polash, Yehya Mohamad, Carlos A Velasco +##### **A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs** +2502.02896v1 by Bradley P. Allen, Paul T. Groth + +Evaluating large language models (LLMs) for tasks like fact extraction in +support of knowledge graph construction frequently involves computing accuracy +metrics using a ground truth benchmark based on a knowledge graph (KG). These +evaluations assume that errors represent factual disagreements. However, human +discourse frequently features metalinguistic disagreement, where agents differ +not on facts but on the meaning of the language used to express them. Given the +complexity of natural language processing and generation using LLMs, we ask: do +metalinguistic disagreements occur between LLMs and KGs? Based on an +investigation using the T-REx knowledge alignment dataset, we hypothesize that +metalinguistic disagreement does in fact occur between LLMs and KGs, with +potential relevance for the practice of knowledge graph engineering. 
We propose +a benchmark for evaluating the detection of factual and metalinguistic +disagreements between LLMs and KGs. An initial proof of concept of such a +benchmark is available on Github. -Europe's healthcare systems require enhanced interoperability and -digitalization, driving a demand for innovative solutions to process legacy -clinical data. This paper presents the results of our project, which aims to -leverage Large Language Models (LLMs) to extract structured information from -unstructured clinical reports, focusing on patient history, diagnoses, -treatments, and other predefined categories. We developed a workflow with a -user interface and evaluated LLMs of varying sizes through prompting strategies -and fine-tuning. Our results show that fine-tuned smaller models match or -surpass larger counterparts in performance, offering efficiency for -resource-limited settings. A new dataset of 60,000 annotated English clinical -summaries and 24,000 German translations was validated with automated and -manual checks. The evaluations used ROUGE, BERTScore, and entity-level metrics. -The work highlights the approach's viability and outlines future improvements. 
+摘要:評估大型語言模型 (LLM) 執行知識圖譜建構支援事實萃取等任務時,通常會使用基於知識圖譜 (KG) 的基準事實計算準確度指標。這些評估假設錯誤代表事實上的分歧。然而,人類話語經常出現元語言分歧,其中代理人之間的差異不在於事實,而在於用於表達事實的語言的含義。鑑於使用 LLM 處理和產生自然語言的複雜性,我們提出疑問:LLM 和 KG 之間是否會發生元語言分歧?根據使用 T-REx 知識比對資料集進行的調查,我們假設元語言分歧確實會發生在 LLM 和 KG 之間,並可能與知識圖譜工程實務有關。我們提出一個基準,用於評估 LLM 和 KG 之間的事實和元語言分歧的偵測。此基準的初步概念驗證可在 Github 上取得。 -摘要:歐洲的醫療保健系統需要增強互通性和數位化,這驅動了對創新解決方案的需求,以處理傳統的臨床數據。本文介紹了我們專案的成果,該專案旨在利用大型語言模型 (LLM) 從非結構化的臨床報告中提取結構化的資訊,重點放在病歷、診斷、治療和其他預定義類別上。我們開發了一個具有使用者介面的工作流程,並透過提示策略和微調來評估不同規模的 LLM。我們的結果顯示,微調後的較小模型在效能上與較大的模型相匹配或超越它們,為資源有限的環境提供了效率。一個包含 60,000 個註解英文臨床摘要和 24,000 個德文翻譯的新資料集已透過自動化和手動檢查進行驗證。評估使用了 ROUGE、BERTScore 和實體層級的指標。這項工作突出了這種方法的可行性,並概述了未來的改進。 +##### **Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization** +2502.02810v1 by Chanhui Lee, Yuheon Song, YongJun Jeong, Hanbum Ko, Rodrigo Hormazabal, Sehui Han, Kyunghoon Bae, Sungbin Lim, Sungwoong Kim +Recent advances in Large Language Models (LLMs) have motivated the +development of general LLMs for molecular tasks. While several studies have +demonstrated that fine-tuned LLMs can achieve impressive benchmark +performances, they are far from genuine generalist molecular LLMs due to a lack +of fundamental understanding of molecular structure. Specifically, when given +molecular task instructions, LLMs trained with naive next-token prediction +training assign similar likelihood scores to both original and negatively +corrupted molecules, revealing their lack of molecular structure understanding +that is crucial for reliable and general molecular LLMs. To overcome this +limitation and obtain a true generalist molecular LLM, we introduce a novel +multi-modal training method based on a thorough multi-modal instruction tuning +as well as a molecular structure preference optimization between chosen and +rejected graphs. 
On various molecular benchmarks, the proposed generalist +molecular LLM, called Mol-LLM, achieves state-of-the-art performances among +generalist LLMs on most tasks, at the same time, surpassing or comparable to +state-of-the-art specialist LLMs. Moreover, Mol-LLM also shows superior +generalization performances in reaction prediction tasks, demonstrating the +effect of the molecular structure understanding for generalization perspective. -### LLM -|Publish Date|Title|Authors|Homepage|Code| -| :---: | :---: | :---: | :---: | :---: | -|**2025-02-20**|**LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention**|Shang Yang et.al.|[2502.14866v1](http://arxiv.org/abs/2502.14866v1)|null| -|**2025-02-20**|**Interpretable Text Embeddings and Text Similarity Explanation: A Primer**|Juri Opitz et.al.|[2502.14862v1](http://arxiv.org/abs/2502.14862v1)|null| -|**2025-02-20**|**Aligning LLMs to Ask Good Questions A Case Study in Clinical Reasoning**|Shuyue Stella Li et.al.|[2502.14860v1](http://arxiv.org/abs/2502.14860v1)|null| -|**2025-02-20**|**FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling**|Weilin Zhao et.al.|[2502.14856v1](http://arxiv.org/abs/2502.14856v1)|null| -|**2025-02-20**|**Prompt-to-Leaderboard**|Evan Frick et.al.|[2502.14855v1](http://arxiv.org/abs/2502.14855v1)|null| -|**2025-02-20**|**CLIPPER: Compression enables long-context synthetic data generation**|Chau Minh Pham et.al.|[2502.14854v1](http://arxiv.org/abs/2502.14854v1)|null| -|**2025-02-20**|**GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks**|Jianwen Luo et.al.|[2502.14848v1](http://arxiv.org/abs/2502.14848v1)|null| -|**2025-02-20**|**Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation**|Yue Yang et.al.|[2502.14846v1](http://arxiv.org/abs/2502.14846v1)|null| -|**2025-02-20**|**Revealing and Mitigating Over-Attention in Knowledge Editing**|Pinzheng Wang 
et.al.|[2502.14838v1](http://arxiv.org/abs/2502.14838v1)|null| -|**2025-02-20**|**Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs**|Tao Ji et.al.|[2502.14837v1](http://arxiv.org/abs/2502.14837v1)|null| -|**2025-02-20**|**LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models**|Shangqing Tu et.al.|[2502.14834v1](http://arxiv.org/abs/2502.14834v1)|null| -|**2025-02-20**|**Improving the Diffusability of Autoencoders**|Ivan Skorokhodov et.al.|[2502.14831v1](http://arxiv.org/abs/2502.14831v1)|null| -|**2025-02-20**|**Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs**|Danni Liu et.al.|[2502.14830v1](http://arxiv.org/abs/2502.14830v1)|null| -|**2025-02-20**|**Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps**|Martin Tutek et.al.|[2502.14829v1](http://arxiv.org/abs/2502.14829v1)|null| -|**2025-02-20**|**Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison**|Aiswarya Baby et.al.|[2502.14827v1](http://arxiv.org/abs/2502.14827v1)|null| -|**2025-02-20**|**eC-Tab2Text: Aspect-Based Text Generation from e-Commerce Product Tables**|Luis Antonio Gutiérrez Guanilo et.al.|[2502.14820v1](http://arxiv.org/abs/2502.14820v1)|null| -|**2025-02-20**|**Optimizing Model Selection for Compound AI Systems**|Lingjiao Chen et.al.|[2502.14815v1](http://arxiv.org/abs/2502.14815v1)|null| -|**2025-02-20**|**FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis**|Fadillah Maani et.al.|[2502.14807v1](http://arxiv.org/abs/2502.14807v1)|null| -|**2025-02-20**|**From RAG to Memory: Non-Parametric Continual Learning for Large Language Models**|Bernal Jiménez Gutiérrez et.al.|[2502.14802v1](http://arxiv.org/abs/2502.14802v1)|[link](https://github.com/osu-nlp-group/hipporag)| -|**2025-02-20**|**A Survey on Text-Driven 360-Degree Panorama Generation**|Hai Wang 
et.al.|[2502.14799v1](http://arxiv.org/abs/2502.14799v1)|null| -|**2025-02-20**|**Rapid Word Learning Through Meta In-Context Learning**|Wentao Wang et.al.|[2502.14791v1](http://arxiv.org/abs/2502.14791v1)|null| -|**2025-02-20**|**SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features**|Michael Tschannen et.al.|[2502.14786v1](http://arxiv.org/abs/2502.14786v1)|[link](https://github.com/google-research/big_vision)| -|**2025-02-20**|**ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting**|Abhijit Mishra et.al.|[2502.14780v1](http://arxiv.org/abs/2502.14780v1)|null| -|**2025-02-20**|**Harnessing PDF Data for Improving Japanese Large Multimodal Models**|Jeonghun Baek et.al.|[2502.14778v1](http://arxiv.org/abs/2502.14778v1)|null| -|**2025-02-20**|**Making Universal Policies Universal**|Niklas Höpner et.al.|[2502.14777v1](http://arxiv.org/abs/2502.14777v1)|null| -|**2025-02-20**|**SurveyX: Academic Survey Automation via Large Language Models**|Xun Liang et.al.|[2502.14776v1](http://arxiv.org/abs/2502.14776v1)|null| -|**2025-02-20**|**Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning**|Tian Xie et.al.|[2502.14768v1](http://arxiv.org/abs/2502.14768v1)|null| -|**2025-02-20**|**Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis**|Priyanka Kargupta et.al.|[2502.14767v1](http://arxiv.org/abs/2502.14767v1)|null| -|**2025-02-20**|**Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning**|Juraj Vladika et.al.|[2502.14765v1](http://arxiv.org/abs/2502.14765v1)|null| -|**2025-02-20**|**EquivaMap: Leveraging LLMs for Automatic Equivalence Checking of Optimization Formulations**|Haotian Zhai et.al.|[2502.14760v1](http://arxiv.org/abs/2502.14760v1)|null| -|**2025-02-20**|**On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation 
Systems**|Juraj Vladika et.al.|[2502.14759v1](http://arxiv.org/abs/2502.14759v1)|null| -|**2025-02-20**|**MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders**|Maya Varma et.al.|[2502.14753v1](http://arxiv.org/abs/2502.14753v1)|null| -|**2025-02-20**|**TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators**|Jianling Li et.al.|[2502.14752v1](http://arxiv.org/abs/2502.14752v1)|null| -|**2025-02-20**|**Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of LLMs**|Zongxia Li et.al.|[2502.14748v1](http://arxiv.org/abs/2502.14748v1)|null| -|**2025-02-20**|**HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States**|Yilei Jiang et.al.|[2502.14744v1](http://arxiv.org/abs/2502.14744v1)|null| -|**2025-02-20**|**Multi-Agent Coordination across Diverse Applications: A Survey**|Lijun Sun et.al.|[2502.14743v1](http://arxiv.org/abs/2502.14743v1)|null| -|**2025-02-20**|**YOLOv12: A Breakdown of the Key Architectural Features**|Mujadded Al Rabbani Alif et.al.|[2502.14740v1](http://arxiv.org/abs/2502.14740v1)|null| -|**2025-02-20**|**SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines**|M-A-P Team et.al.|[2502.14739v1](http://arxiv.org/abs/2502.14739v1)|null| -|**2025-02-20**|**EAGER-LLM: Enhancing Large Language Models as Recommenders through Exogenous Behavior-Semantic Integration**|Minjie Hong et.al.|[2502.14735v1](http://arxiv.org/abs/2502.14735v1)|null| -|**2025-02-20**|**Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models**|Hongji Li et.al.|[2502.14734v1](http://arxiv.org/abs/2502.14734v1)|null| -|**2025-02-20**|**WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models**|Yifu Chen et.al.|[2502.14727v1](http://arxiv.org/abs/2502.14727v1)|null| -|**2025-02-20**|**Entity Framing and 
Role Portrayal in the News**|Tarek Mahmoud et.al.|[2502.14718v1](http://arxiv.org/abs/2502.14718v1)|null| -|**2025-02-20**|**From Knowledge Generation to Knowledge Verification: Examining the BioMedical Generative Capabilities of ChatGPT**|Ahmed Abdeen Hamed et.al.|[2502.14714v1](http://arxiv.org/abs/2502.14714v1)|null| -|**2025-02-20**|**Data-Efficient Pretraining with Group-Level Data Influence Modeling**|Zichun Yu et.al.|[2502.14709v1](http://arxiv.org/abs/2502.14709v1)|null| -|**2025-02-20**|**Human Misperception of Generative-AI Alignment: A Laboratory Experiment**|Kevin He et.al.|[2502.14708v1](http://arxiv.org/abs/2502.14708v1)|null| -|**2025-02-20**|**Not All Data are Good Labels: On the Self-supervised Labeling for Time Series Forecasting**|Yuxuan Yang et.al.|[2502.14704v1](http://arxiv.org/abs/2502.14704v1)|null| -|**2025-02-20**|**I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search**|Zujie Liang et.al.|[2502.14693v1](http://arxiv.org/abs/2502.14693v1)|null| -|**2025-02-20**|**Bridging the Gap: Transforming Natural Language Questions into SQL Queries via Abstract Query Pattern and Contextual Schema Markup**|Yonghui Kong et.al.|[2502.14682v1](http://arxiv.org/abs/2502.14682v1)|null| -|**2025-02-20**|**How to Get Your LLM to Generate Challenging Problems for Evaluation**|Arkil Patel et.al.|[2502.14678v1](http://arxiv.org/abs/2502.14678v1)|null| -|**2025-02-20**|**Data-Constrained Synthesis of Training Data for De-Identification**|Thomas Vakili et.al.|[2502.14677v1](http://arxiv.org/abs/2502.14677v1)|null| -|**2025-02-20**|**BP-SGCN: Behavioral Pseudo-Label Informed Sparse Graph Convolution Network for Pedestrian and Heterogeneous Trajectory Prediction**|Ruochen Li et.al.|[2502.14676v1](http://arxiv.org/abs/2502.14676v1)|null| -|**2025-02-20**|**Explanations of Deep Language Models Explain Language Representations in the Brain**|Maryam Rahimi et.al.|[2502.14671v1](http://arxiv.org/abs/2502.14671v1)|null| -|**2025-02-20**|**AlphaMaze: 
Enhancing Large Language Models' Spatial Intelligence via GRPO**|Alan Dao et.al.|[2502.14669v1](http://arxiv.org/abs/2502.14669v1)|null| -|**2025-02-20**|**InstructAgent: Building User Controllable Recommender via LLM Agent**|Wujiang Xu et.al.|[2502.14662v1](http://arxiv.org/abs/2502.14662v1)|null| -|**2025-02-20**|**Edit Once, Update Everywhere: A Simple Framework for Cross-Lingual Knowledge Synchronization in LLMs**|Yuchen Wu et.al.|[2502.14645v1](http://arxiv.org/abs/2502.14645v1)|null| -|**2025-02-20**|**LIFT: Improving Long Context Understanding of Large Language Models through Long Input Fine-Tuning**|Yansheng Mao et.al.|[2502.14644v1](http://arxiv.org/abs/2502.14644v1)|null| -|**2025-02-20**|**Length-Controlled Margin-Based Preference Optimization without Reference Model**|Gengxu Li et.al.|[2502.14643v1](http://arxiv.org/abs/2502.14643v1)|null| -|**2025-02-20**|**How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation**|Rui Li et.al.|[2502.14642v1](http://arxiv.org/abs/2502.14642v1)|null| -|**2025-02-20**|**NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization**|Zheyuan Zhang et.al.|[2502.14638v1](http://arxiv.org/abs/2502.14638v1)|null| -|**2025-02-20**|**ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation**|Angxiao Yue et.al.|[2502.14637v1](http://arxiv.org/abs/2502.14637v1)|[link](https://github.com/AngxiaoYue/ReQFlow)| -|**2025-02-20**|**PEARL: Towards Permutation-Resilient LLMs**|Liang Chen et.al.|[2502.14628v1](http://arxiv.org/abs/2502.14628v1)|null| -|**2025-02-20**|**ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors**|Yuguo Yin et.al.|[2502.14627v1](http://arxiv.org/abs/2502.14627v1)|null| -|**2025-02-20**|**Multi-Record Web Page Information Extraction From News Websites**|Alexander Kustenkov et.al.|[2502.14625v1](http://arxiv.org/abs/2502.14625v1)|null| 
-|**2025-02-20**|**Exploring RWKV for Sentence Embeddings: Layer-wise Analysis and Baseline Comparison for Semantic Similarity**|Xinghan Pan et.al.|[2502.14620v1](http://arxiv.org/abs/2502.14620v1)|[link](https://github.com/PStarH/RWKV-embedding)| -|**2025-02-20**|**Reward Models Identify Consistency, Not Causality**|Yuhui Xu et.al.|[2502.14619v1](http://arxiv.org/abs/2502.14619v1)|null| -|**2025-02-20**|**FIND: Fine-grained Information Density Guided Adaptive Retrieval-Augmented Generation for Disease Diagnosis**|Mingyi Jia et.al.|[2502.14614v1](http://arxiv.org/abs/2502.14614v1)|null| -|**2025-02-20**|**Behavioral Analysis of Information Salience in Large Language Models**|Jan Trienes et.al.|[2502.14613v1](http://arxiv.org/abs/2502.14613v1)|null| -|**2025-02-20**|**A Theory for Conditional Generative Modeling on Multiple Data Sources**|Rongzhen Wang et.al.|[2502.14583v1](http://arxiv.org/abs/2502.14583v1)|null| -|**2025-02-20**|**A Statistical Case Against Empirical Human-AI Alignment**|Julian Rodemann et.al.|[2502.14581v1](http://arxiv.org/abs/2502.14581v1)|null| -|**2025-02-20**|**ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification**|Hyunseok Lee et.al.|[2502.14565v1](http://arxiv.org/abs/2502.14565v1)|null| -|**2025-02-20**|**Plan-over-Graph: Towards Parallelable LLM Agent Schedule**|Shiqi Zhang et.al.|[2502.14563v1](http://arxiv.org/abs/2502.14563v1)|null| -|**2025-02-20**|**Can LLMs Predict Citation Intent? 
An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs**|Paris Koloveas et.al.|[2502.14561v1](http://arxiv.org/abs/2502.14561v1)|null| -|**2025-02-20**|**Less is More: Improving LLM Alignment via Preference Data Selection**|Xun Deng et.al.|[2502.14560v1](http://arxiv.org/abs/2502.14560v1)|null| -|**2025-02-20**|**FUIA: Model Inversion Attack against Federated Unlearning**|Lei Zhou et.al.|[2502.14558v1](http://arxiv.org/abs/2502.14558v1)|null| -|**2025-02-20**|**Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling**|Eric Egli et.al.|[2502.14553v1](http://arxiv.org/abs/2502.14553v1)|null| -|**2025-02-20**|**Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks**|Maya Bechler-Speicher et.al.|[2502.14546v1](http://arxiv.org/abs/2502.14546v1)|null| -|**2025-02-20**|**LLM-based User Profile Management for Recommender System**|Seunghwan Bang et.al.|[2502.14541v1](http://arxiv.org/abs/2502.14541v1)|null| -|**2025-02-20**|**LoRA-GGPO: Mitigating Double Descent in LoRA Fine-Tuning via Gradient-Guided Perturbation Optimization**|Yupeng Chang et.al.|[2502.14538v1](http://arxiv.org/abs/2502.14538v1)|null| -|**2025-02-20**|**CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models**|Zhenhong Zhou et.al.|[2502.14529v1](http://arxiv.org/abs/2502.14529v1)|null| -|**2025-02-20**|**Small Graph Is All You Need: DeepStateGNN for Scalable Traffic Forecasting**|Yannick Wölker et.al.|[2502.14525v1](http://arxiv.org/abs/2502.14525v1)|null| -|**2025-02-20**|**Generative adversarial networks vs large language models: a comparative study on synthetic tabular data generation**|Austin A. 
Barr et.al.|[2502.14523v1](http://arxiv.org/abs/2502.14523v1)|null| -|**2025-02-20**|**MultiSlav: Using Cross-Lingual Knowledge Transfer to Combat the Curse of Multilinguality**|Artur Kot et.al.|[2502.14509v1](http://arxiv.org/abs/2502.14509v1)|null| -|**2025-02-20**|**Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases**|Rena Gao et.al.|[2502.14507v1](http://arxiv.org/abs/2502.14507v1)|null| -|**2025-02-20**|**PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models**|Yu Meng et.al.|[2502.14504v1](http://arxiv.org/abs/2502.14504v1)|null| -|**2025-02-20**|**How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?**|Sergey Pletenev et.al.|[2502.14502v1](http://arxiv.org/abs/2502.14502v1)|null| -|**2025-02-20**|**Towards a Perspectivist Turn in Argument Quality Assessment**|Julia Romberg et.al.|[2502.14501v1](http://arxiv.org/abs/2502.14501v1)|null| -|**2025-02-20**|**MLGym: A New Framework and Benchmark for Advancing AI Research Agents**|Deepak Nathani et.al.|[2502.14499v1](http://arxiv.org/abs/2502.14499v1)|null| -|**2025-02-20**|**Stories that (are) Move(d by) Markets: A Causal Exploration of Market Shocks and Semantic Shifts across Different Partisan Groups**|Felix Drinkall et.al.|[2502.14497v1](http://arxiv.org/abs/2502.14497v1)|null| -|**2025-02-20**|**Enhancing Language Multi-Agent Learning with Multi-Agent Credit Re-Assignment for Interactive Environment Generalization**|Zhitao He et.al.|[2502.14496v1](http://arxiv.org/abs/2502.14496v1)|null| -|**2025-02-20**|**StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following**|Jinnan Li et.al.|[2502.14494v1](http://arxiv.org/abs/2502.14494v1)|null| -|**2025-02-20**|**Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk**|Elija Perrier et.al.|[2502.14491v1](http://arxiv.org/abs/2502.14491v1)|null| -|**2025-02-20**|**Temporal Misalignment and Probabilistic 
Neurons**|Velibor Bojković et.al.|[2502.14487v1](http://arxiv.org/abs/2502.14487v1)|null| -|**2025-02-20**|**How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation**|Zhuohang Long et.al.|[2502.14486v1](http://arxiv.org/abs/2502.14486v1)|null| -|**2025-02-20**|**NLoRA: Nyström-Initiated Low-Rank Adaptation for Large Language Models**|Chenlu Guo et.al.|[2502.14482v1](http://arxiv.org/abs/2502.14482v1)|null| -|**2025-02-20**|**Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression**|Haoyu Wang et.al.|[2502.14477v1](http://arxiv.org/abs/2502.14477v1)|null| -|**2025-02-20**|**Argument-Based Comparative Question Answering Evaluation Benchmark**|Irina Nikishina et.al.|[2502.14476v1](http://arxiv.org/abs/2502.14476v1)|null| -|**2025-02-20**|**Enhancing Smart Environments with Context-Aware Chatbots using Large Language Models**|Aurora Polo-Rodríguez et.al.|[2502.14469v1](http://arxiv.org/abs/2502.14469v1)|null| -|**2025-02-20**|**Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing**|Aviv Bick et.al.|[2502.14458v1](http://arxiv.org/abs/2502.14458v1)|null| -|**2025-02-20**|**Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization**|Ran Ding et.al.|[2502.14456v1](http://arxiv.org/abs/2502.14456v1)|null| -|**2025-02-20**|**Optimal word order for non-causal text generation with Large Language Models: the Spanish case**|Andrea Busto-Castiñeira et.al.|[2502.14451v1](http://arxiv.org/abs/2502.14451v1)|null| +摘要:大型語言模型 (LLM) 的近期進展激勵了針對分子任務開發通用 LLM。雖然多項研究已證明微調 LLM 可實現令人印象深刻的基準效能,但由於缺乏對分子結構的基本理解,它們遠非真正的通才分子 LLM。具體來說,當給予分子任務說明時,使用天真的下一個符號預測訓練訓練的 LLM 會將類似的可能性評分分配給原始分子和負面損壞分子,這顯示出它們缺乏對分子結構的理解,而這對於可靠且通用的分子 LLM 至關重要。為了克服這個限制並獲得真正的通才分子 LLM,我們引入了一種新穎的多模態訓練方法,該方法基於徹底的多模態說明調整以及在所選和拒絕圖形之間的分子結構偏好最佳化。在各種分子基準測試中,所提出的通才分子 LLM(稱為 Mol-LLM)在多數任務中實現了通才 LLM 中的最新效能,同時超越或與最新的專家 LLM 相當。此外,Mol-LLM 在反應預測任務中也展現出優異的泛化效能,證明了分子結構理解對泛化觀點的影響。 -#### Abstracts -##### 
**LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention** -2502.14866v1 by Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han +##### **Leveraging the true depth of LLMs** +2502.02790v1 by Ramón Calvo González, Daniele Paliotta, Matteo Pagliardini, Martin Jaggi, François Fleuret -Large language models (LLMs) have shown remarkable potential in processing -long sequences, yet efficiently serving these long-context models remains -challenging due to the quadratic computational complexity of attention in the -prefilling stage and the large memory footprint of the KV cache in the decoding -stage. To address these issues, we introduce LServe, an efficient system that -accelerates long-sequence LLM serving via hybrid sparse attention. This method -unifies different hardware-friendly, structured sparsity patterns for both -prefilling and decoding attention into a single framework, where computations -on less important tokens are skipped block-wise. LServe demonstrates the -compatibility of static and dynamic sparsity in long-context LLM attention. -This design enables multiplicative speedups by combining these optimizations. -Specifically, we convert half of the attention heads to nearly free streaming -heads in both the prefilling and decoding stages. Additionally, we find that -only a constant number of KV pages is required to preserve long-context -capabilities, irrespective of context length. We then design a hierarchical KV -page selection policy that dynamically prunes KV pages based on query-centric -similarity. On average, LServe accelerates LLM prefilling by up to 2.9x and -decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy. Code is -released at https://github.com/mit-han-lab/omniserve. +Large Language Models demonstrate remarkable capabilities at the cost of high +compute requirements. 
While recent research has shown that intermediate layers +can be removed or have their order shuffled without impacting performance +significantly, these findings have not been employed to reduce the +computational cost of inference. We investigate several potential ways to +reduce the depth of pre-trained LLMs without significantly affecting +performance. Leveraging our insights, we present a novel approach that exploits +this decoupling between layers by grouping some of them into pairs that can be +evaluated in parallel. + This modification of the computational graph -- through better parallelism -- +results in an average improvement of around 1.20x on the number of tokens +generated per second, without re-training nor fine-tuning, while retaining +95%-99% of the original accuracy. Empirical evaluation demonstrates that this +approach significantly improves serving efficiency while maintaining model +performance, offering a practical improvement for large-scale LLM deployment. -摘要:大型語言模型 (LLM) 在處理長序列方面展現出驚人的潛力,但由於預填充階段注意力的二次計算複雜度和解碼階段 KV 快取的大量記憶體使用量,有效提供這些長語境模型服務仍然具有挑戰性。為了解決這些問題,我們引入了 LServe,一個透過混合稀疏注意力加速長序列 LLM 服務的高效系統。此方法將不同的硬體友善的結構化稀疏模式統一到一個單一的架構中,用於預填充和解碼注意力,其中對較不重要的符號的運算會以區塊方式略過。LServe 證明了靜態和動態稀疏性在長語境 LLM 注意力中的相容性。此設計透過結合這些最佳化來實現倍增加速。具體來說,我們將一半的注意力頭轉換為預填充和解碼階段中幾乎免費的串流頭。此外,我們發現僅需要恆定的 KV 頁數來保留長語境功能,而與語境長度無關。然後,我們設計了一個分層式 KV 頁面選擇策略,根據以查詢為中心的相似性動態刪除 KV 頁面。平均而言,LServe 將 LLM 預填充加速了 2.9 倍,將解碼加速了 1.3-2.1 倍,同時維持長語境的準確性。程式碼已發布在 https://github.com/mit-han-lab/omniserve。 +摘要:大型语言模型展示了其强大的功能,但代价是较高的计算需求。虽然最近的研究表明,中间层可以被移除或重新排列其顺序,而不会显著影响性能,但这些发现尚未被用来降低推理的计算成本。我们研究了几种潜在的方法来减少预训练 LLM 的深度,而不会显著影响性能。利用我们的见解,我们提出了一种新颖的方法,该方法通过将其中一些分组为可以并行评估的成对来利用层之间的这种解耦。 +通过更好的并行性对计算图进行修改,平均而言,每秒生成的令牌数量提高了约 1.20 倍,而无需重新训练或微调,同时保留了 95%-99% 的原始准确性。经验评估表明,这种方法显著提高了服务效率,同时保持了模型性能,为大规模 LLM 部署提供了实际改进。 -##### **Interpretable Text Embeddings and Text Similarity Explanation: A Primer** -2502.14862v1 by Juri Opitz, Lucas Möller, Andrianos Michail, Simon Clematide +##### **Modular Training of Neural 
Networks aids Interpretability** +2502.02470v2 by Satvik Golechha, Maheep Chaudhary, Joan Velja, Alessandro Abate, Nandi Schoots -Text embeddings and text embedding models are a backbone of many AI and NLP -systems, particularly those involving search. However, interpretability -challenges persist, especially in explaining obtained similarity scores, which -is crucial for applications requiring transparency. In this paper, we give a -structured overview of interpretability methods specializing in explaining -those similarity scores, an emerging research area. We study the methods' -individual ideas and techniques, evaluating their potential for improving -interpretability of text embeddings and explaining predicted similarities. +An approach to improve neural network interpretability is via clusterability, +i.e., splitting a model into disjoint clusters that can be studied +independently. We define a measure for clusterability and show that pre-trained +models form highly enmeshed clusters via spectral graph clustering. We thus +train models to be more modular using a "clusterability loss" function that +encourages the formation of non-interacting clusters. Using automated +interpretability techniques, we show that our method can help train models that +are more modular and learn different, disjoint, and smaller circuits. We +investigate CNNs trained on MNIST and CIFAR, small transformers trained on +modular addition, and language models. Our approach provides a promising +direction for training neural networks that learn simpler functions and are +easier to interpret. 
-摘要:文字嵌入和文字嵌入模型是許多 AI 和 NLP 系統的骨幹,特別是那些涉及搜尋的系統。然而,可解釋性的挑戰依然存在,特別是在解釋獲得的相似度分數時,這對於需要透明度的應用程式至關重要。在本文中,我們對專門用於解釋這些相似度分數的可解釋性方法給予結構化的概述,這是一個新興的研究領域。我們研究了這些方法的個別想法和技術,評估它們改善文字嵌入的可解釋性和解釋預測相似度的潛力。 +摘要:一種改善神經網路可解釋性的方法是透過群集性, +也就是將模型分割成可獨立研究的不相交群集。我們定義一個群集性的度量,並顯示預訓練的 +模型透過光譜圖形群集形成高度糾纏的群集。因此,我們使用「群集性損失」函數訓練模型,使其更具模組化, +這鼓勵形成非交互群集。使用自動化可解釋性技術,我們顯示我們的模型可以幫助訓練更具模組化的模型,並學習不同、不相交且較小的電路。我們 +研究了在 MNIST 和 CIFAR 上訓練的 CNN,在模組化加法上訓練的小型Transformer,以及語言模型。我們的做法為訓練學習更簡單函數且更容易解釋的神經網路提供了有希望的方向。 -##### **Aligning LLMs to Ask Good Questions A Case Study in Clinical Reasoning** -2502.14860v1 by Shuyue Stella Li, Jimin Mun, Faeze Brahman, Jonathan S. Ilgen, Yulia Tsvetkov, Maarten Sap +##### **Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs** +2502.02362v3 by Sagnik Mukherjee, Abhinav Chinta, Takyoung Kim, Tarun Anoop Sharma, Dilek Hakkani-Tür -Large language models (LLMs) often fail to ask effective questions under -uncertainty, making them unreliable in domains where proactive -information-gathering is essential for decisionmaking. We present ALFA, a -framework that improves LLM question-asking by (i) decomposing the notion of a -"good" question into a set of theory-grounded attributes (e.g., clarity, -relevance), (ii) controllably synthesizing attribute-specific question -variations, and (iii) aligning models via preference-based optimization to -explicitly learn to ask better questions along these fine-grained attributes. -Focusing on clinical reasoning as a case study, we introduce the MediQ-AskDocs -dataset, composed of 17k real-world clinical interactions augmented with 80k -attribute-specific preference pairs of follow-up questions, as well as a novel -expert-annotated interactive healthcare QA task to evaluate question-asking -abilities. Models aligned with ALFA reduce diagnostic errors by 56.6% on -MediQ-AskDocs compared to SOTA instruction-tuned LLMs, with a question-level -win-rate of 64.4% and strong generalizability. 
Our findings suggest that -explicitly guiding question-asking with structured, fine-grained attributes -offers a scalable path to improve LLMs, especially in expert application -domains. +Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large +language models (LLMs) by enabling detailed step-by-step solutions. However, +due to the verbosity of LLMs, the resulting reasoning chains can be long, +making it harder to verify the reasoning steps and trace issues resulting from +dependencies between the steps that may be farther away in the sequence of +steps. Importantly, mathematical reasoning allows each step to be derived from +a small set of premises, which are a subset of the preceding steps in the +reasoning chain. In this paper, we present a framework that identifies the +premises for each step, to improve the evaluation of reasoning. We restructure +conventional linear reasoning chains into Premise Augmented Reasoning Chains +(PARC) by introducing premise links, resulting in a directed acyclic graph +where the nodes are the steps and the edges are the premise links. Through +experiments with a PARC-based dataset that we built, namely PERL (Premises and +ERrors identification in LLMs), we demonstrate that LLMs can reliably identify +premises within complex reasoning chains. In particular, even open-source LLMs +achieve 90% recall in premise identification. We also show that PARC helps to +identify errors in reasoning chains more reliably. The accuracy of error +identification improves by 6% to 16% absolute when step-by-step verification is +carried out in PARC under the premises. Our findings highlight the utility of +premise-centric representations in addressing complex problem-solving tasks and +open new avenues for improving the reliability of LLM-based reasoning +evaluations. 
-摘要:大型語言模型 (LLM) 經常在不確定性下無法提出有效問題,這使得它們在主動收集資訊對於決策制定至關重要的領域中不可靠。我們提出 ALFA,一個透過 (i) 將「良好」問題的概念分解成一組以理論為基礎的屬性(例如,清晰度、相關性),(ii) 可控地合成屬性特定的問題變體,以及 (iii) 透過基於偏好的最佳化調整模型,明確學習沿著這些細緻屬性提出更好的問題,來改善 LLM 提問的架構。專注於臨床推理作為案例研究,我們引入了 MediQ-AskDocs 資料集,由 17k 個真實世界的臨床互動組成,並增加了 80k 個屬性特定的後續問題偏好配對,以及一個由專家註解的互動式醫療保健問答任務來評估提問能力。與 SOTA 指令調整的 LLM 相比,與 ALFA 對齊的模型將 MediQ-AskDocs 上的診斷錯誤減少了 56.6%,問題層級的勝率為 64.4%,並且具有很強的普遍性。我們的研究結果表明,明確地以結構化、細緻的屬性來引導提問,提供了一條可擴充的途徑來改善 LLM,特別是在專家應用領域。 +摘要:思考鏈(CoT)提示透過提供詳細的逐步解法,增強大型語言模型(LLM)的數學推理能力。然而,由於 LLM 的冗長,產生的推理鏈可能很長,這使得驗證推理步驟和追蹤由步驟之間相依關係所產生的問題變得更加困難,而這些步驟可能在步驟順序中相距較遠。重要的是,數學推理允許每個步驟從一組小的前提中推導出來,這些前提是推理鏈中前一個步驟的子集。在本文中,我們提出了一個框架,用於識別每個步驟的前提,以改進推理評估。我們透過引入前提連結,將傳統的線性推理鏈重組為前提擴充推理鏈(PARC),產生一個有向無環圖,其中節點是步驟,而邊緣是前提連結。透過我們建立的基於 PARC 的資料集(即 PERL(LLM 中的前提和錯誤識別))進行的實驗,我們證明 LLM 能夠在複雜的推理鏈中可靠地識別前提。特別是,即使是開源 LLM 在前提識別中也能達到 90% 的召回率。我們還表明,PARC 有助於更可靠地識別推理鏈中的錯誤。在前提下於 PARC 中執行逐步驗證時,錯誤識別的準確度提高了 6% 到 16%。我們的研究結果突顯了以前提為中心的表示在解決複雜問題解決任務中的效用,並為改進基於 LLM 的推理評估的可靠性開闢了新途徑。 -##### **FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling** -2502.14856v1 by Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, Yuxiang Huang, Kaihuo Zhang, Weilun Zhao, Yuxuan Li, Jianyong Wang, Zhiyuan Liu, Maosong Sun +##### **AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement** +2502.02067v1 by Shivam Singh, Karthik Swaminathan, Nabanita Dash, Ramandeep Singh, Snehasis Banerjee, Mohan Sridharan, Madhava Krishna -Speculative sampling has emerged as an important technique for accelerating -the auto-regressive generation process of large language models (LLMs) by -utilizing a draft-then-verify mechanism to produce multiple tokens per forward -pass. 
While state-of-the-art speculative sampling methods use only a single -layer and a language modeling (LM) head as the draft model to achieve -impressive layer compression, their efficiency gains are substantially reduced -for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. -To address this, we present FR-Spec, a frequency-ranked speculative sampling -framework that optimizes draft candidate selection through vocabulary space -compression. By constraining the draft search to a frequency-prioritized token -subset, our method reduces LM Head computation overhead by 75% while ensuring -the equivalence of the final output distribution. Experiments across multiple -datasets demonstrate an average of 1.12$\times$ speedup over the -state-of-the-art speculative sampling method EAGLE-2. +Embodied agents assisting humans are often asked to complete a new task in a +new scenario. An agent preparing a particular dish in the kitchen based on a +known recipe may be asked to prepare a new dish or to perform cleaning tasks in +the storeroom. There may not be sufficient resources, e.g., time or labeled +examples, to train the agent for these new situations. Large Language Models +(LLMs) trained on considerable knowledge across many domains are able to +predict a sequence of abstract actions for such new tasks and scenarios, +although it may not be possible for the agent to execute this action sequence +due to task-, agent-, or domain-specific constraints. Our framework addresses +these challenges by leveraging the generic predictions provided by LLM and the +prior domain-specific knowledge encoded in a Knowledge Graph (KG), enabling an +agent to quickly adapt to new tasks and scenarios. The robot also solicits and +uses human input as needed to refine its existing knowledge. 
Based on +experimental evaluation over cooking and cleaning tasks in simulation domains, +we demonstrate that the interplay between LLM, KG, and human input leads to +substantial performance gains compared with just using the LLM output. -摘要:推測取樣已成為一種重要的技術,可用於透過利用先起草後驗證的機制來加速大型語言模型 (LLM) 的自迴歸生成過程,並在每次前向傳遞中產生多個代幣。儘管最先進的推測取樣方法只使用單一層和語言建模 (LM) 頭作為起草模型,以達成令人印象深刻的層壓縮,但對於大型詞彙表 LLM(例如詞彙表包含 128k 個代幣的 Llama-3-8B),其效率提升會大幅降低。為了解決這個問題,我們提出了 FR-Spec,這是一種頻率排序推測取樣架構,它透過詞彙空間壓縮來最佳化起草候選選取。我們的這個方法透過將起草搜尋限制在優先於頻率的代幣子集中,將 LM 頭部運算開銷減少了 75%,同時確保最終輸出分佈的等效性。透過多個資料集的實驗證明,與最先進的推測取樣方法 EAGLE-2 相比,平均提速了 1.12 倍。 +摘要:具身代理协助人类时,通常需要在新的情境中完成新的任务。基于已知食谱在厨房准备特定菜肴的代理可能会被要求准备新菜肴或在储藏室执行清洁任务。可能没有足够资源(例如时间或标记的示例)来训练代理以应对这些新情况。在许多领域接受大量知识训练的大型语言模型 (LLM) 能够预测此类新任务和情境的抽象动作序列,尽管代理可能无法执行此动作序列,因为任务、代理或特定于域的约束。我们的框架通过利用 LLM 提供的通用预测和知识图 (KG) 中编码的先前特定于域的知识来应对这些挑战,使代理能够快速适应新任务和情境。该机器人还会根据需要征求并使用人类输入来完善其现有知识。基于在模拟域中对烹饪和清洁任务的实验评估,我们证明了 LLM、KG 和人类输入之间的相互作用与仅使用 LLM 输出相比带来了巨大的性能提升。 -##### **Prompt-to-Leaderboard** -2502.14855v1 by Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anastasios N. Angelopoulos, Ion Stoica +##### **On Bob Dylan: A Computational Perspective** +2502.01772v1 by Prashant Garg -Large language model (LLM) evaluations typically rely on aggregated metrics -like accuracy or human preference, averaging across users and prompts. This -averaging obscures user- and prompt-specific variations in model performance. -To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces -leaderboards specific to a prompt. The core idea is to train an LLM taking -natural language prompts as input to output a vector of Bradley-Terry -coefficients which are then used to predict the human preference vote. The -resulting prompt-dependent leaderboards allow for unsupervised task-specific -evaluation, optimal routing of queries to models, personalization, and -automated evaluation of model strengths and weaknesses. 
Data from Chatbot Arena -suggest that P2L better captures the nuanced landscape of language model -performance than the averaged leaderboard. Furthermore, our findings suggest -that P2L's ability to produce prompt-specific evaluations follows a power law -scaling similar to that observed in LLMs themselves. In January 2025, the -router we trained based on this methodology achieved the \#1 spot in the -Chatbot Arena leaderboard. Our code is available at this GitHub link: -https://github.com/lmarena/p2l. +Cass Sunstein's essay 'On Bob Dylan' describes Dylan's 'dishabituating' style +-- a constant refusal to conform to expectation and a penchant for reinventing +his musical and lyrical identity. In this paper, I extend Sunstein's +observations through a large-scale computational analysis of Dylan's lyrics +from 1962 to 2012. Using o3-mini-high (a large language model), I extract +concept-to-concept relationships from the lyrics and construct directed +knowledge graphs that capture Dylan's thematic structure. I then quantify +shifts in sentiment, metaphorical expression, thematic diversity, and network +complexity over time. The results indicate that Dylan's lyrics increasingly +rely on metaphor, display an evolving sentiment profile, and exhibit heightened +dishabituation -- measured here as a growing variance in the network centrality +of key concepts. I also find that references to movement, protest, and mythic +imagery fluctuate in ways that align with well-known phases of Dylan's career, +reflecting the dynamic and unpredictable quality of his art. These findings not +only deepen our empirical understanding of Sunstein's thesis but also introduce +a novel computational method for analyzing an artist's evolution-offering +broader applicability to the study of cultural and creative change. 
-摘要:大型語言模型 (LLM) 評估通常依賴於彙總的指標,例如準確性或人類偏好,平均值跨使用者和提示。此平均值模糊了使用者和提示特定的模型效能變異。為了解決此問題,我們提出提示到排行榜 (P2L),一種產生特定於提示的排行榜的方法。核心概念是訓練 LLM,將自然語言提示作為輸入,以輸出 Bradley-Terry 係數向量,然後用於預測人類偏好投票。產生的提示相關排行榜允許無監督任務特定評估、最佳查詢路由至模型、個人化以及模型優缺點的自動化評估。來自 Chatbot Arena 的資料表明,P2L 比平均排行榜更能捕捉語言模型效能的細微變化。此外,我們的研究結果表明,P2L 產生提示特定評估的能力遵循類似於 LLM 本身觀察到的冪律縮放。2025 年 1 月,我們根據此方法訓練的路由器在 Chatbot Arena 排行榜中獲得了第一名。我們的程式碼可在 GitHub 連結取得:https://github.com/lmarena/p2l。 +摘要:卡斯·桑斯坦的論文「論鮑伯·迪倫」描述了迪倫「去習慣化」的風格 +-- 這種風格不斷拒絕符合預期,並熱衷於重新塑造他的音樂和歌詞認同。在本文中,我透過對迪倫 1962 年至 2012 年歌詞進行大規模的運算分析,來延伸桑斯坦的觀察。使用 o3-mini-high(一個大型語言模型),我從歌詞中提取概念對概念的關係,並建構有向知識圖,以捕捉迪倫的主題結構。然後,我量化情緒、隱喻表達、主題多樣性和網路複雜性隨時間的變化。結果顯示,迪倫的歌詞越來越依賴隱喻,展現出不斷演化的情緒輪廓,並表現出高度的去習慣化 -- 在這裡測量為關鍵概念的網路中心性的變異增加。我也發現,對運動、抗議和神話意象的引用,會以與迪倫職業生涯中眾所周知階段一致的方式波動,反映了他藝術的動態和不可預測的品質。這些發現不僅加深了我們對桑斯坦論文的經驗理解,也引入了分析藝術家演變的新穎運算方法,為文化和創造性變化的研究提供了更廣泛的適用性。 -##### **CLIPPER: Compression enables long-context synthetic data generation** -2502.14854v1 by Chau Minh Pham, Yapei Chang, Mohit Iyyer +##### **VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos** +2502.01549v1 by Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang -LLM developers are increasingly reliant on synthetic data, but generating -high-quality data for complex long-context reasoning tasks remains challenging. -We introduce CLIPPER, a compression-based approach for generating synthetic -data tailored to narrative claim verification - a task that requires reasoning -over a book to verify a given claim. Instead of generating claims directly from -the raw text of the book, which results in artifact-riddled claims, CLIPPER -first compresses the book into chapter outlines and book summaries and then -uses these intermediate representations to generate complex claims and -corresponding chain-of-thoughts. Compared to naive approaches, CLIPPER produces -claims that are more valid, grounded, and complex. 
Using CLIPPER, we construct -a dataset of 19K synthetic book claims paired with their source texts and -chain-of-thought reasoning, and use it to fine-tune three open-weight models. -Our best model achieves breakthrough results on narrative claim verification -(from 28% to 76% accuracy on our test set) and sets a new state-of-the-art for -sub-10B models on the NoCha leaderboard. Further analysis shows that our models -generate more detailed and grounded chain-of-thought reasoning while also -improving performance on other narrative understanding tasks (e.g., -NarrativeQA). +Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in +enhancing Large Language Models (LLMs) through external knowledge integration, +yet its application has primarily focused on textual content, leaving the rich +domain of multi-modal video knowledge predominantly unexplored. This paper +introduces VideoRAG, the first retrieval-augmented generation framework +specifically designed for processing and understanding extremely long-context +videos. Our core innovation lies in its dual-channel architecture that +seamlessly integrates (i) graph-based textual knowledge grounding for capturing +cross-video semantic relationships, and (ii) multi-modal context encoding for +efficiently preserving visual features. This novel design empowers VideoRAG to +process unlimited-length videos by constructing precise knowledge graphs that +span multiple videos while maintaining semantic dependencies through +specialized multi-modal retrieval paradigms. Through comprehensive empirical +evaluation on our proposed LongerVideos benchmark-comprising over 160 videos +totaling 134+ hours across lecture, documentary, and entertainment +categories-VideoRAG demonstrates substantial performance compared to existing +RAG alternatives and long video understanding methods. The source code of +VideoRAG implementation and the benchmark dataset are openly available at: +https://github.com/HKUDS/VideoRAG. 
-摘要:LLM 開發人員越來越依賴合成資料,但為複雜的長語境推理任務生成高品質資料仍然具有挑戰性。我們引入了 CLIPPER,一種基於壓縮的方法,用於生成針對敘事性聲明驗證量身打造的合成資料,這項任務需要對一本書進行推理才能驗證給定的聲明。CLIPPER 沒有直接從書籍的原始文字生成聲明,這會產生充滿人工製品的聲明,而是先將書籍壓縮成章節大綱和書籍摘要,然後使用這些中間表示來生成複雜的聲明和對應的思維鏈。與天真的方法相比,CLIPPER 產生的聲明更有效、更有根據且更複雜。使用 CLIPPER,我們構建了一個包含 19K 個合成書籍聲明及其原始文字和思維鏈推理的資料集,並用於微調三個開放權重模型。我們最好的模型在敘事性聲明驗證方面取得了突破性的結果(在我們的測試集中準確率從 28% 提升到 76%),並在 NoCha 排行榜上為低於 10B 的模型設定了新的技術水準。進一步的分析表明,我們的模型生成了更詳細且有根據的思維鏈推理,同時也提高了其他敘事理解任務(例如 NarrativeQA)的效能。 +摘要:檢索增強生成 (RAG) 已證明在透過外部知識整合增強大型語言模型 (LLM) 方面取得顯著成功,但其應用主要集中在文字內容上,而豐富的多模態影片知識領域則鮮少被探索。本文介紹 VideoRAG,這是第一個檢索增強生成架構,專門設計用於處理和理解極長語境的影片。我們的核心創新在於其雙通道架構,它無縫整合 (i) 基於圖形文字知識基礎,用於擷取跨影片語義關係,以及 (ii) 多模態語境編碼,用於有效保留視覺特徵。這個新穎的設計讓 VideoRAG 能夠透過建構跨越多個影片的精確知識圖譜來處理長度不限的影片,同時透過專門的多模態檢索範例來維持語義依賴性。透過我們提出的 LongerVideos 基準的全面經驗評估,該基準包含超過 160 部影片,總時數超過 134 小時,涵蓋演講、紀錄片和娛樂類別,VideoRAG 與現有的 RAG 替代方案和長影片理解方法相比,展現出顯著的效能。VideoRAG 實作的原始碼和基準資料集已公開於:https://github.com/HKUDS/VideoRAG。 -##### **GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks** -2502.14848v1 by Jianwen Luo, Yiming Huang, Jinxiang Meng, Fangyu Lei, Shizhu He, Xiao Liu, Shanshan Jiang, Bin Dong, Jun Zhao, Kang Liu +##### **Transformers trained on proteins can learn to attend to Euclidean distance** +2502.01533v1 by Isaac Ellmen, Constantin Schneider, Matthew I. J. Raybould, Charlotte M. Deane -Large Language Models (LLMs) have shown great promise in tool-making, yet -existing frameworks often struggle to efficiently construct reliable toolsets -and are limited to single-task settings. To address these challenges, we -propose GATE (Graph-based Adaptive Tool Evolution), an adaptive framework that -dynamically constructs and evolves a hierarchical graph of reusable tools -across multiple scenarios. We evaluate GATE on open-ended tasks (Minecraft), -agent-based tasks (TextCraft, DABench), and code generation tasks (MATH, Date, -TabMWP). 
Our results show that GATE achieves up to 4.3x faster milestone -completion in Minecraft compared to the previous SOTA, and provides an average -improvement of 9.23% over existing tool-making methods in code generation tasks -and 10.03% in agent tasks. GATE demonstrates the power of adaptive evolution, -balancing tool quantity, complexity, and functionality while maintaining high -efficiency. Code and data are available at -\url{https://github.com/ayanami2003/GATE}. +While conventional Transformers generally operate on sequence data, they can +be used in conjunction with structure models, typically SE(3)-invariant or +equivariant graph neural networks (GNNs), for 3D applications such as protein +structure modelling. These hybrids typically involve either (1) +preprocessing/tokenizing structural features as input for Transformers or (2) +taking Transformer embeddings and processing them within a structural +representation. However, there is evidence that Transformers can learn to +process structural information on their own, such as the AlphaFold3 structural +diffusion model. In this work we show that Transformers can function +independently as structure models when passed linear embeddings of coordinates. +We first provide a theoretical explanation for how Transformers can learn to +filter attention as a 3D Gaussian with learned variance. We then validate this +theory using both simulated 3D points and in the context of masked token +prediction for proteins. Finally, we show that pre-training protein Transformer +encoders with structure improves performance on a downstream task, yielding +better performance than custom structural models. Together, this work provides +a basis for using standard Transformers as hybrid structure-language models. 
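
The claim that attention can act as a 3D Gaussian filter can be checked numerically with a small sketch. It relies only on the identity -|x_i - x_j|^2 = 2 x_i·x_j - |x_i|^2 - |x_j|^2 and on softmax ignoring terms that are constant along a row. This is an illustration under stated assumptions, not the paper's construction: `sigma` is fixed here rather than learned, and the key-only term is quadratic in the coordinates, which the abstract does not describe.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
points = rng.normal(size=(5, 3))  # five 3D coordinates
sigma = 0.8                       # assumed fixed; stands in for a learned variance

# Target: attention weights proportional to a Gaussian of pairwise distance.
sq_dist = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
gaussian_attn = softmax(-sq_dist / (2 * sigma**2))

# Same weights from a query-key dot product plus a key-only bias.
# The query-only term -|x_i|^2 / (2 sigma^2) is constant along each row,
# so softmax removes it and it can be left out.
key_bias = (points**2).sum(-1)[None, :] / (2 * sigma**2)
logits = points @ points.T / sigma**2 - key_bias
dot_attn = softmax(logits)
```

Both matrices agree to floating-point precision, which shows why dot-product attention over coordinate embeddings has enough capacity to express distance-based weighting.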
-摘要:大型語言模型 (LLM) 在工具製作方面展現出極大的潛力,然而現有的框架經常難以有效地建構可靠的工具組,並且僅限於單一任務設定。為了應對這些挑戰,我們提出了 GATE(基於圖形的自適應工具演化),這是一個自適應框架,可跨多個場景動態建構和演化可重複使用的工具階層圖。我們在開放式任務(Minecraft)、基於代理的任務(TextCraft、DABench)和程式碼生成任務(MATH、Date、TabMWP)上評估了 GATE。我們的結果顯示,與先前的 SOTA 相比,GATE 在 Minecraft 中實現了高達 4.3 倍的里程碑完成速度,並且在程式碼生成任務中提供了比現有工具製作方法平均提升 9.23%,在代理任務中提升了 10.03%。GATE 展示了自適應演化的力量,在保持高效率的同時,平衡了工具數量、複雜性和功能性。程式碼和資料可在 \url{https://github.com/ayanami2003/GATE} 取得。 +摘要:雖然傳統的 Transformer 通常處理序列資料,但它們可用於結構模型,通常是 SE(3) 不變式或等變式圖神經網路 (GNN),用於蛋白質結構建模等 3D 應用。這些混合模型通常包含 (1) 將結構特徵預處理/標記化為 Transformer 的輸入或 (2) 取用 Transformer 嵌入並在結構表示中處理它們。然而,有證據表明 Transformer 可以自行學習處理結構資訊,例如 AlphaFold3 結構擴散模型。在這項工作中,我們展示了 Transformer 在傳遞座標的線性嵌入時,可以獨立作為結構模型運作。我們首先提供了 Transformer 如何學習將注意力濾波為具有學習變異的 3D 高斯的理論解釋。然後我們使用模擬 3D 點和在蛋白質遮罩標記預測的背景下驗證此理論。最後,我們展示了使用結構預訓練蛋白質 Transformer 編碼器會改善下游任務的效能,產生比自訂結構模型更好的效能。綜合來說,這項工作提供了使用標準 Transformer 作為混合結構語言模型的基礎。 + + +### Medical explainable AI +|Publish Date|Title|Authors|Homepage|Code| +| :---: | :---: | :---: | :---: | :---: | +|**2025-02-19**|**Towards a perturbation-based explanation for medical AI as differentiable programs**|Takeshi Abe et.al.|[2502.14001v1](http://arxiv.org/abs/2502.14001v1)|null| +|**2025-02-14**|**3D ReX: Causal Explanations in 3D Neuroimaging Classification**|Melane Navaratnarajah et.al.|[2502.12181v1](http://arxiv.org/abs/2502.12181v1)|null| +|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null| +|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null| +|**2025-01-27**|**An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases**|Shaheer Ahmad Khan et.al.|[2501.15969v1](http://arxiv.org/abs/2501.15969v1)|null| +|**2025-01-23**|**Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation 
of Spurious Model Behavior and Associated Data**|Frederik Pahde et.al.|[2501.13818v1](http://arxiv.org/abs/2501.13818v1)|[link](https://github.com/frederikpahde/medical-ai-safety)| +|**2025-01-19**|**Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model**|Mohaiminul Islam Bhuiyan et.al.|[2501.11094v1](http://arxiv.org/abs/2501.11094v1)|null| +|**2025-01-17**|**SEANN: A Domain-Informed Neural Network for Epidemiological Insights**|Jean-Baptiste Guimbaud et.al.|[2501.10273v1](http://arxiv.org/abs/2501.10273v1)|null| +|**2025-01-16**|**Artificial Intelligence-Driven Clinical Decision Support Systems**|Muhammet Alkan et.al.|[2501.09628v2](http://arxiv.org/abs/2501.09628v2)|null| +|**2025-01-12**|**MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis**|Sadia Kamal et.al.|[2501.06887v1](http://arxiv.org/abs/2501.06887v1)|null| +|**2025-01-06**|**Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis**|Mary Ogbuka Kenneth et.al.|[2501.02891v1](http://arxiv.org/abs/2501.02891v1)|null| +|**2024-12-28**|**The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support**|Alessandro De Grandi et.al.|[2412.20068v1](http://arxiv.org/abs/2412.20068v1)|null| +|**2024-12-27**|**A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation**|Jana Zakall et.al.|[2412.19688v1](http://arxiv.org/abs/2412.19688v1)|null| +|**2024-12-23**|**Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models**|Badaru I. 
Olumuyiwa et.al.|[2412.17527v1](http://arxiv.org/abs/2412.17527v1)|null| +|**2024-12-20**|**Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG**|Hasan Md Tusfiqur Alam et.al.|[2412.16086v2](http://arxiv.org/abs/2412.16086v2)|[link](https://github.com/tifat58/irr-with-cbm-rag)| +|**2024-12-20**|**Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models**|Shamus Sim et.al.|[2412.15748v1](http://arxiv.org/abs/2412.15748v1)|null| +|**2024-12-18**|**Cognition Chain for Explainable Psychological Stress Detection on Social Media**|Xin Wang et.al.|[2412.14009v1](http://arxiv.org/abs/2412.14009v1)|null| +|**2024-11-30**|**2-Factor Retrieval for Improved Human-AI Decision Making in Radiology**|Jim Solomon et.al.|[2412.00372v1](http://arxiv.org/abs/2412.00372v1)|null| +|**2024-11-28**|**Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance**|Philipp Brauner et.al.|[2411.19356v1](http://arxiv.org/abs/2411.19356v1)|null| +|**2024-11-26**|**Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset**|Yujie Dai et.al.|[2411.17645v2](http://arxiv.org/abs/2411.17645v2)|null| +|**2024-11-18**|**Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care**|Jeffrey N. 
Clark et.al.|[2411.11774v1](http://arxiv.org/abs/2411.11774v1)|null| +|**2024-11-15**|**Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning**|Mohammed Yaseen Jabarulla et.al.|[2411.10255v1](http://arxiv.org/abs/2411.10255v1)|null| +|**2024-11-01**|**Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering**|Mehdi Hosseini Chagahi et.al.|[2411.00916v2](http://arxiv.org/abs/2411.00916v2)|null| +|**2024-10-25**|**A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection**|Muath Alsuhaibani et.al.|[2410.19898v1](http://arxiv.org/abs/2410.19898v1)|null| +|**2024-10-23**|**An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems**|Shruthi Chari et.al.|[2410.17504v1](http://arxiv.org/abs/2410.17504v1)|[link](https://github.com/tetherless-world/metaexplainer)| +|**2024-10-22**|**Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study**|Lukas Hughes-Noehrer et.al.|[2410.16879v1](http://arxiv.org/abs/2410.16879v1)|null| +|**2024-10-19**|**Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer**|Gesa Mittmann et.al.|[2410.15012v1](http://arxiv.org/abs/2410.15012v1)|null| +|**2024-10-15**|**Explainable AI Methods for Multi-Omics Analysis: A Survey**|Ahmad Hussein et.al.|[2410.11910v1](http://arxiv.org/abs/2410.11910v1)|null| +|**2024-10-14**|**Study on the Helpfulness of Explainable Artificial Intelligence**|Tobias Labarta et.al.|[2410.11896v1](http://arxiv.org/abs/2410.11896v1)|[link](https://github.com/tlabarta/helpfulnessofxai)| +|**2024-10-12**|**Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health**|Abdullah Mamun 
et.al.|[2410.09635v1](http://arxiv.org/abs/2410.09635v1)|[link](https://github.com/ab9mamun/aimen)| +|**2024-10-10**|**Artificial intelligence techniques in inherited retinal diseases: A review**|Han Trinh et.al.|[2410.09105v1](http://arxiv.org/abs/2410.09105v1)|null| +|**2024-10-07**|**CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures**|Ekaterina Sviridova et.al.|[2410.05235v2](http://arxiv.org/abs/2410.05235v2)|[link](https://github.com/ixa-ehu/antidote-casimedicos)| +|**2024-10-01**|**Explainable Diagnosis Prediction through Neuro-Symbolic Integration**|Qiuhao Lu et.al.|[2410.01855v2](http://arxiv.org/abs/2410.01855v2)|null| +|**2024-10-01**|**Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare**|Prasenjit Maji et.al.|[2410.00366v1](http://arxiv.org/abs/2410.00366v1)|null| +|**2024-09-20**|**Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study**|Tirtha Chanda et.al.|[2409.13476v1](http://arxiv.org/abs/2409.13476v1)|null| +|**2024-09-19**|**Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data**|Suryansh Vidya et.al.|[2409.15374v1](http://arxiv.org/abs/2409.15374v1)|null| +|**2024-09-19**|**Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition**|Daniel Flores-Araiza et.al.|[2409.12883v1](http://arxiv.org/abs/2409.12883v1)|null| +|**2024-09-18**|**Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques**|Yubo Li et.al.|[2409.12087v3](http://arxiv.org/abs/2409.12087v3)|null| +|**2024-09-13**|**Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases**|Mercy Asiedu et.al.|[2409.09201v3](http://arxiv.org/abs/2409.09201v3)|null| +|**2024-09-09**|**Explainable AI: Definition and attributes of a good explanation 
for health AI**|Evangelia Kyrimi et.al.|[2409.15338v1](http://arxiv.org/abs/2409.15338v1)|null| +|**2024-08-30**|**Exploring the Effect of Explanation Content and Format on User Comprehension and Trust in Healthcare**|Antonio Rago et.al.|[2408.17401v2](http://arxiv.org/abs/2408.17401v2)|null| +|**2024-08-29**|**A Survey for Large Language Models in Biomedicine**|Chong Wang et.al.|[2409.00133v1](http://arxiv.org/abs/2409.00133v1)|null| +|**2024-08-27**|**Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis**|Francesco Sovrano et.al.|[2408.15121v1](http://arxiv.org/abs/2408.15121v1)|null| +|**2024-08-24**|**Towards Case-based Interpretability for Medical Federated Learning**|Laura Latorre et.al.|[2408.13626v1](http://arxiv.org/abs/2408.13626v1)|null| +|**2024-08-22**|**AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines**|Douwe J. Spaanderman et.al.|[2408.12491v1](http://arxiv.org/abs/2408.12491v1)|null| +|**2024-08-14**|**Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy**|Kimji N. 
Pellano et.al.|[2409.00001v1](http://arxiv.org/abs/2409.00001v1)|null| +|**2024-08-06**|**MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy**|Hanchen David Wang et.al.|[2408.11837v1](http://arxiv.org/abs/2408.11837v1)|null| +|**2024-08-05**|**The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development**|Joshua Morriss et.al.|[2408.05239v1](http://arxiv.org/abs/2408.05239v1)|null| +|**2024-08-05**|**Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns**|Chi Him Ng et.al.|[2408.02709v1](http://arxiv.org/abs/2408.02709v1)|null| +|**2024-08-05**|**Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability**|Masoud Muhammed Hassan et.al.|[2408.02706v1](http://arxiv.org/abs/2408.02706v1)|null| +|**2024-07-26**|**MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI**|Shyam Dongre et.al.|[2407.20284v1](http://arxiv.org/abs/2407.20284v1)|null| +|**2024-07-25**|**Introducing δ-XAI: a novel sensitivity-based method for local AI explanations**|Alessandro De Carlo et.al.|[2407.18343v2](http://arxiv.org/abs/2407.18343v2)|null| +|**2024-07-24**|**Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population**|Nikolaos Ntampakis et.al.|[2407.17324v2](http://arxiv.org/abs/2407.17324v2)|null| +|**2024-07-24**|**Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition**|Michele Fiori et.al.|[2408.06352v1](http://arxiv.org/abs/2408.06352v1)|null| +|**2024-07-21**|**Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions**|Naseem Khan 
et.al.|[2408.03335v1](http://arxiv.org/abs/2408.03335v1)|null| +|**2024-07-18**|**A Comparative Study on Automatic Coding of Medical Letters with Explainability**|Jamie Glen et.al.|[2407.13638v1](http://arxiv.org/abs/2407.13638v1)|[link](https://github.com/Glenj01/Medical-Coding)| +|**2024-07-09**|**Explainable AI for Enhancing Efficiency of DL-based Channel Estimation**|Abdul Karim Gizzini et.al.|[2407.07009v1](http://arxiv.org/abs/2407.07009v1)|null| +|**2024-07-07**|**Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification**|P. N. Karthikayan et.al.|[2407.05440v2](http://arxiv.org/abs/2407.05440v2)|null| +|**2024-07-03**|**A Survey on Trustworthiness in Foundation Models for Medical Image Analysis**|Congzhen Shi et.al.|[2407.15851v2](http://arxiv.org/abs/2407.15851v2)|null| +|**2024-07-01**|**The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data**|Ximing Wen et.al.|[2407.06206v1](http://arxiv.org/abs/2407.06206v1)|null| +|**2024-06-28**|**Can GPT-4 Help Detect Quit Vaping Intentions? 
An Exploration of Automatic Data Annotation Approach**|Sai Krishna Revanth Vuruma et.al.|[2407.00167v1](http://arxiv.org/abs/2407.00167v1)|null| +|**2024-06-25**|**Towards Compositional Interpretability for XAI**|Sean Tull et.al.|[2406.17583v1](http://arxiv.org/abs/2406.17583v1)|null| +|**2024-06-17**|**Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods**|Vincent Olesen et.al.|[2406.12142v2](http://arxiv.org/abs/2406.12142v2)|[link](https://github.com/volesen/slicing-through-bias)| +|**2024-06-11**|**Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health**|Fatemeh Ebrahimzadeh et.al.|[2406.07114v2](http://arxiv.org/abs/2406.07114v2)|null| +|**2024-06-10**|**AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI**|K M Tawsik Jawad et.al.|[2406.06728v2](http://arxiv.org/abs/2406.06728v2)|null| +|**2024-06-10**|**Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook**|Yusif Ibrahimov et.al.|[2406.05984v1](http://arxiv.org/abs/2406.05984v1)|null| +|**2024-06-09**|**Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance**|Zhan Zhang et.al.|[2406.05746v1](http://arxiv.org/abs/2406.05746v1)|null| +|**2024-06-07**|**Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability**|Faseela Abdullakutty et.al.|[2406.12897v1](http://arxiv.org/abs/2406.12897v1)|null| +|**2024-06-04**|**Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection**|Dinuka Sandun Udayantha et.al.|[2406.16908v3](http://arxiv.org/abs/2406.16908v3)|[link](https://github.com/dinuka-1999/braineocare)| +|**2024-06-01**|**Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques**|Samita Bai 
et.al.|[2406.00532v1](http://arxiv.org/abs/2406.00532v1)|null| +|**2024-06-01**|**Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition**|Alaa Nfissi et.al.|[2406.01624v2](http://arxiv.org/abs/2406.01624v2)|[link](https://github.com/alaanfissi/unveiling-hidden-factors-explainable-ai-for-feature-boosting-in-speech-emotion-recognition)| +|**2024-05-31**|**The Explanation Necessity for Healthcare AI**|Michail Mamalakis et.al.|[2406.00216v1](http://arxiv.org/abs/2406.00216v1)|null| +|**2024-05-29**|**Interdisciplinary Expertise to Advance Equitable Explainable AI**|Chloe R. Bennett et.al.|[2406.18563v1](http://arxiv.org/abs/2406.18563v1)|null| +|**2024-05-27**|**"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts**|Hubert D. Zając et.al.|[2407.11978v1](http://arxiv.org/abs/2407.11978v1)|null| +|**2024-05-26**|**Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making**|Min Hun Lee et.al.|[2405.16424v1](http://arxiv.org/abs/2405.16424v1)|null| +|**2024-05-26**|**Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach**|Ziming Liu et.al.|[2405.17502v1](http://arxiv.org/abs/2405.17502v1)|null| +|**2024-05-24**|**Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone**|Catalina Gomez et.al.|[2407.11974v1](http://arxiv.org/abs/2407.11974v1)|null| +|**2024-05-23**|**Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery**|Yingying Fang et.al.|[2406.18552v1](http://arxiv.org/abs/2406.18552v1)|null| +|**2024-05-21**|**The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach**|Mohsen Jozani et.al.|[2405.13099v1](http://arxiv.org/abs/2405.13099v1)|null| +|**2024-05-17**|**ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education**|Harris Bin Munawar 
et.al.|[2405.10645v1](http://arxiv.org/abs/2405.10645v1)|null| +|**2024-05-13**|**Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data**|Camelia Oprea et.al.|[2405.07590v1](http://arxiv.org/abs/2405.07590v1)|null| +|**2024-05-10**|**XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare**|Fatemeh Nazary et.al.|[2405.06270v3](http://arxiv.org/abs/2405.06270v3)|null| +|**2024-05-09**|**To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems**|Miquel Miró-Nicolau et.al.|[2405.05766v1](http://arxiv.org/abs/2405.05766v1)|null| +|**2024-05-05**|**Region-specific Risk Quantification for Interpretable Prognosis of COVID-19**|Zhusi Zhong et.al.|[2405.02815v1](http://arxiv.org/abs/2405.02815v1)|[link](https://github.com/zzs95/RSP_COVID)| +|**2024-04-26**|**Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics**|Francesco Prinzi et.al.|[2405.02334v2](http://arxiv.org/abs/2405.02334v2)|null| +|**2024-04-25**|**Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability**|Yunfei Ge et.al.|[2404.16957v1](http://arxiv.org/abs/2404.16957v1)|null| +|**2024-04-19**|**Explainable AI for Fair Sepsis Mortality Predictive Model**|Chia-Hsuan Chang et.al.|[2404.13139v1](http://arxiv.org/abs/2404.13139v1)|null| +|**2024-04-19**|**Multi Class Depression Detection Through Tweets using Artificial Intelligence**|Muhammad Osama Nusrat et.al.|[2404.13104v1](http://arxiv.org/abs/2404.13104v1)|[link](https://github.com/mnusrat786/masters-thesis)| +|**2024-04-19**|**COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images**|Dmytro Shvetsov et.al.|[2404.12832v2](http://arxiv.org/abs/2404.12832v2)|[link](https://github.com/dmytro-shvetsov/counterfactual-search)| +|**2024-04-15**|**Hybrid Intelligence for Digital 
Humanities**|Victor de Boer et.al.|[2406.15374v1](http://arxiv.org/abs/2406.15374v1)|null| +|**2024-04-14**|**Ethical Framework for Responsible Foundational Models in Medical Imaging**|Abhijit Das et.al.|[2406.11868v1](http://arxiv.org/abs/2406.11868v1)|null| +|**2024-04-09**|**Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis**|Milad Yousefi et.al.|[2404.07239v1](http://arxiv.org/abs/2404.07239v1)|null| +|**2024-04-06**|**Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI**|Taminul Islam et.al.|[2404.04686v1](http://arxiv.org/abs/2404.04686v1)|null| +|**2024-04-05**|**Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI**|Maryam Ahmed et.al.|[2404.03892v3](http://arxiv.org/abs/2404.03892v3)|null| +|**2024-03-30**|**Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives**|Xingrui Gu et.al.|[2404.00320v2](http://arxiv.org/abs/2404.00320v2)|null| +|**2024-03-26**|**Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach**|Andrea Ferrario et.al.|[2403.17873v1](http://arxiv.org/abs/2403.17873v1)|null| +|**2024-03-26**|**Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification**|Han Yuan et.al.|[2403.18871v1](http://arxiv.org/abs/2403.18871v1)|[link](https://github.com/han-yuan-med/template-explanation)| +|**2024-03-03**|**Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures**|Séamus Lankford et.al.|[2403.01580v1](http://arxiv.org/abs/2403.01580v1)|null| +|**2024-02-28**|**Cause and Effect: Can Large Language Models Truly Understand Causality?**|Swagata Ashwani et.al.|[2402.18139v3](http://arxiv.org/abs/2402.18139v3)|null| 
+|**2024-02-28**|**Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina**|Yasin Sadeghi Bazargani et.al.|[2402.18600v1](http://arxiv.org/abs/2402.18600v1)|null| -##### **Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation** -2502.14846v1 by Yue Yang, Ajay Patel, Matt Deitke, Tanmay Gupta, Luca Weihs, Andrew Head, Mark Yatskar, Chris Callison-Burch, Ranjay Krishna, Aniruddha Kembhavi, Christopher Clark +#### Abstracts +##### **Towards a perturbation-based explanation for medical AI as differentiable programs** +2502.14001v1 by Takeshi Abe, Yoshiyuki Asai -Reasoning about images with rich text, such as charts and documents, is a -critical application of vision-language models (VLMs). However, VLMs often -struggle in these domains due to the scarcity of diverse text-rich -vision-language data. To address this challenge, we present CoSyn, a framework -that leverages the coding capabilities of text-only large language models -(LLMs) to automatically create synthetic text-rich multimodal data. Given input -text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts -an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic -images. With the underlying code as textual representations of the synthetic -images, CoSyn can generate high-quality instruction-tuning data, again relying -on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K -images and 2.7M rows of vision-language instruction-tuning data. Comprehensive -experiments on seven benchmarks demonstrate that models trained on our -synthetic data achieve state-of-the-art performance among competitive -open-source models, including Llama 3.2, and surpass proprietary models such as -GPT-4V and Gemini 1.5 Flash. 
Furthermore, CoSyn can produce synthetic pointing -data, enabling VLMs to ground information within input images, showcasing its -potential for developing multimodal agents capable of acting in real-world -environments. +Recent advancement in machine learning algorithms reaches a point where +medical devices can be equipped with artificial intelligence (AI) models for +diagnostic support and routine automation in clinical settings. In medicine and +healthcare, there is a particular demand for sufficient and objective +explainability of the outcome generated by AI models. However, AI models are +generally considered as black boxes due to their complexity, and the +computational process leading to their response is often opaque. Although +several methods have been proposed to explain the behavior of models by +evaluating the importance of each feature in discrimination and prediction, +they may suffer from biases and opacities arising from the scale and sampling +protocol of the dataset used for training or testing. To overcome the +shortcomings of existing methods, we explore an alternative approach to provide +an objective explanation of AI models that can be defined independently of the +learning process and does not require additional data. As a preliminary study +for this direction of research, this work examines a numerical availability of +the Jacobian matrix of deep learning models that measures how stably a model +responses against small perturbations added to the input. The indicator, if +available, are calculated from a trained AI model for a given target input. +This is a first step towards a perturbation-based explanation, which will +assist medical practitioners in understanding and interpreting the response of +the AI model in its clinical application. 
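
The abstract above proposes the Jacobian of a trained model at a given input as a stability indicator. A minimal sketch of that idea follows, using a central-difference estimate. The function `toy_model`, its weights, and the step size `eps` are hypothetical stand-ins for illustration; the paper's actual models and numerical procedure are not given in the abstract.

```python
import numpy as np


def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference estimate of the Jacobian of f at the 1D input x."""
    x = np.asarray(x, dtype=float)
    y0 = np.asarray(f(x), dtype=float)
    jac = np.zeros((y0.size, x.size))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        diff = np.asarray(f(x + step)) - np.asarray(f(x - step))
        jac[:, i] = diff.ravel() / (2 * eps)
    return jac


# Hypothetical stand-in for a trained network: one tanh layer with fixed weights.
weights = np.array([[1.0, -2.0], [0.5, 0.0], [0.0, 3.0]])


def toy_model(x):
    return np.tanh(weights @ x)


x0 = np.array([0.1, -0.2])           # target input
J = numerical_jacobian(toy_model, x0)

# One possible scalar indicator: the largest singular value of J, which bounds
# how much a small input perturbation can change the output to first order.
sensitivity = np.linalg.norm(J, ord=2)
```

For real networks an automatic-differentiation framework would normally give the Jacobian directly; finite differences are used here only to keep the sketch dependency-light.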
-摘要:透過豐富文字(例如圖表和文件)對影像進行推理,是視覺語言模型 (VLM) 的重要應用。然而,由於多元化文字豐富的視覺語言資料稀少,VLM 在這些領域中經常會遇到困難。為了應對這個挑戰,我們提出了 CoSyn,一個利用純文字大型語言模型 (LLM) 的編碼能力來自動建立合成文字豐富多模態資料的架構。給定描述目標網域的輸入文字(例如「營養成分標籤」),CoSyn 會提示 LLM 產生用於合成影像渲染的程式碼(Python、HTML、LaTeX 等)。透過將底層程式碼作為合成影像的文字表示,CoSyn 可以產生高品質的指令調整資料,再次依賴純文字 LLM。使用 CoSyn,我們建構了一個包含 40 萬張影像和 270 萬列視覺語言指令調整資料的資料集。在七個基準上的全面實驗證明,在我們的合成資料上訓練的模型在競爭對手的開源模型(包括 Llama 3.2)中達到了最先進的效能,並超越了 GPT-4V 和 Gemini 1.5 Flash 等專有模型。此外,CoSyn 可以產生合成指向資料,讓 VLM 能在輸入影像中建立資訊基礎,展示其在開發能夠在真實世界環境中運作的多模態代理方面的潛力。 +摘要:機器學習演算法的最新進展已達到一個階段,醫療裝置可以配備人工智慧 (AI) 模型,以在臨床環境中提供診斷支援和例行自動化。在醫學和保健領域,對於 AI 模型產生的結果有足夠且客觀的可解釋性有特別的需求。然而,由於 AI 模型的複雜性,它們通常被視為黑盒子,而導致其反應的運算過程通常是不透明的。儘管已經提出多種方法來解釋模型的行為,方法是評估每個特徵在判別和預測中的重要性,但它們可能會受到訓練或測試所用資料集的規模和抽樣協定的偏差和不透明性的影響。為了克服現有方法的缺點,我們探索一種替代方法,以提供 AI 模型的客觀解釋,這種方法可以獨立於學習過程定義,而且不需要額外的資料。作為這個研究方向的初步研究,這項工作探討了深度學習模型的雅可比矩陣的數值可用性,它衡量了模型對輸入中新增的小擾動的穩定反應程度。如果可用,指標會從訓練好的 AI 模型計算得出,以取得給定的目標輸入。這是基於擾動的解釋的第一步,它將協助醫療從業人員了解和詮釋 AI 模型在其臨床應用中的反應。 -##### **Revealing and Mitigating Over-Attention in Knowledge Editing** -2502.14838v1 by Pinzheng Wang, Zecheng Tang, Keyan Zhou, Juntao Li, Qiaoming Zhu, Min Zhang +##### **3D ReX: Causal Explanations in 3D Neuroimaging Classification** +2502.12181v1 by Melane Navaratnarajah, Sophie A. Martin, David A. Kelly, Nathan Blake, Hana Chocker -Large Language Models have demonstrated superior performance across a wide -range of tasks, but they still exhibit undesirable errors due to incorrect -knowledge learned from the training data. To avoid this, knowledge editing -methods emerged to precisely edit the specific model knowledge via efficiently -modifying a very small percentage of parameters. % However, those methods can -lead to the problem of Specificity Failure: when the content related to the -edited knowledge occurs in the context, it can inadvertently corrupt other -pre-existing knowledge. 
However, those methods can lead to the problem of -Specificity Failure, where the existing knowledge and capabilities are severely -degraded due to editing. Our preliminary indicates that Specificity Failure -primarily stems from the model's attention heads assigning excessive attention -scores to entities related to the edited knowledge, thereby unduly focusing on -specific snippets within the context, which we denote as the Attention Drift -phenomenon. To mitigate such Attention Drift issue, we introduce a simple yet -effective method Selective Attention Drift Restriction}(SADR), which introduces -an additional regularization term during the knowledge editing process to -restrict changes in the attention weight distribution, thereby preventing undue -focus on the edited entity. Experiments on five frequently used strong LLMs -demonstrate the effectiveness of our method, where SADR can significantly -mitigate Specificity Failure in the predominant knowledge editing tasks. +Explainability remains a significant problem for AI models in medical +imaging, making it challenging for clinicians to trust AI-driven predictions. +We introduce 3D ReX, the first causality-based post-hoc explainability tool for +3D models. 3D ReX uses the theory of actual causality to generate +responsibility maps which highlight the regions most crucial to the model's +decision. We test 3D ReX on a stroke detection model, providing insight into +the spatial distribution of features relevant to stroke. 
-摘要:大型語言模型已在廣泛任務中展現出卓越的效能,但由於從訓練資料中學習到不正確的知識,它們仍會出現令人不滿意的錯誤。為避免此情況,知識編輯方法應運而生,透過有效修改極少數參數來精準編輯特定模型知識。% 然而,這些方法可能會導致特異性失敗問題:當與已編輯知識相關的內容出現在文中時,可能會無意間損害其他既有知識。然而,這些方法可能會導致特異性失敗問題,因為現有知識和能力會因編輯而嚴重降低。我們的初步研究表明,特異性失敗主要源於模型的注意力權重將過度注意力分數分配給與已編輯知識相關的實體,從而過度關注文中特定的片段,我們將此現象稱為注意力偏移。為減輕這種注意力偏移問題,我們引入了一個簡單但有效的方法選擇性注意力偏移限制}(SADR),在知識編輯過程中引入一個額外的正則化項來限制注意力權重分配的變動,從而防止過度關注已編輯實體。在五個經常使用的強大 LLM 上進行的實驗證明了我們方法的有效性,其中 SADR 可以顯著減輕主要知識編輯任務中的特異性失敗。 +摘要:解釋性仍然是醫療影像中 AI 模型的一大問題,這使得臨床醫生難以信任 AI 驅動的預測。 +我們引入了 3D ReX,這是第一個用於 3D 模型的基於因果關係的事後解釋性工具。3D ReX 使用實際因果關係理論來生成責任圖,該圖突出了對模型決策至關重要的區域。我們在中風檢測模型上測試了 3D ReX,提供了與中風相關特徵的空間分佈的見解。 -##### **Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs** -2502.14837v1 by Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui +##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration** +2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano -Multi-head Latent Attention (MLA) is an innovative architecture proposed by -DeepSeek, designed to ensure efficient and economical inference by -significantly compressing the Key-Value (KV) cache into a latent vector. -Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its -variants such as Grouped-Query Attention (GQA) exhibit significant cost -disadvantages. Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA -without pre-training from scratch is both meaningful and challenging. This -paper proposes the first data-efficient fine-tuning method for transitioning -from MHA to MLA (MHA2MLA), which includes two key components: for partial-RoPE, -we remove RoPE from dimensions of queries and keys that contribute less to the -attention scores, for low-rank approximation, we introduce joint SVD -approximations based on the pre-trained parameters of keys and values. 
These -carefully designed strategies enable MHA2MLA to recover performance using only -a small fraction (0.3% to 0.6%) of the data, significantly reducing inference -costs while seamlessly integrating with compression techniques such as KV cache -quantization. For example, the KV cache size of Llama2-7B is reduced by 92.19%, -with only a 0.5% drop in LongBench performance. +This paper presents a complete explainable system that interprets a set of +data, abstracts the underlying features and describes them in a natural +language of choice. The system relies on two crucial stages: (i) identifying +emerging properties from data and transforming them into abstract concepts, and +(ii) converting these concepts into natural language. Despite the impressive +natural language generation capabilities demonstrated by Large Language Models, +their statistical nature and the intricacy of their internal mechanism still +force us to employ these techniques as black boxes, forgoing trustworthiness. +Developing an explainable pipeline for data interpretation would +facilitate its use in safety-critical environments like processing medical +information and allow non-experts and visually impaired people to access +narrated information. To this end, we believe that the fields of knowledge +representation and automated reasoning research could present a valid +alternative. Expanding on prior research that tackled the first stage (i), we +focus on the second stage, named Concept2Text. Being explainable, data +translation is easily modeled through logic-based rules, once again emphasizing +the role of declarative programming in achieving AI explainability. This paper +explores a Prolog/CLP-based rewriting system to interpret concepts (articulated +in terms of classes and relations, plus common knowledge) derived from a generic +ontology, generating natural language text.
Its main features include +hierarchical tree rewritings, modular multilingual generation, support for +equivalent variants across semantic, grammar, and lexical levels, and a +transparent rule-based system. We outline the architecture and demonstrate its +flexibility through some examples capable of generating numerous diverse and +equivalent rewritings based on the input concept. -摘要:多頭潛在注意力 (MLA) 是 DeepSeek 提出的一種創新架構,旨在通過將鍵值 (KV) 快取大幅壓縮成潛在向量,確保有效率且經濟的推論。與 MLA 相比,採用多頭注意力 (MHA) 及其變體(例如分組查詢注意力 (GQA))的標準 LLM 會出現顯著的成本劣勢。讓訓練完善的 LLM(例如 Llama)能夠快速適應 MLA,而無需從頭開始預訓練,這既有意義又具有挑戰性。本文提出了第一個資料有效微調方法,用於從 MHA 轉換到 MLA (MHA2MLA),其中包含兩個關鍵組成部分:對於部分 RoPE,我們從查詢和鍵的維度中移除對注意力分數貢獻較小的 RoPE,對於低秩近似,我們基於鍵和值的預訓練參數引入聯合 SVD 近似。這些經過仔細設計的策略讓 MHA2MLA 能夠僅使用一小部分資料 (0.3% 至 0.6%) 來恢復效能,大幅降低推論成本,同時與壓縮技術(例如 KV 快取量化)無縫整合。例如,Llama2-7B 的 KV 快取大小減少了 92.19%,而 LongBench 效能僅下降了 0.5%。 +摘要:這篇論文提出了一個完整的可解釋系統,它可以解釋一組資料,抽象出基礎特徵,並以選擇的自然語言描述它們。系統依賴兩個關鍵階段:(i) 從資料中識別新興屬性,並將它們轉換為抽象概念,以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力,但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子,放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它,例如處理醫療資訊,並允許非專家和視障人士存取敘述資訊。為此,我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上,我們專注於第二階段,稱為 Concept2Text。由於具有可解釋性,資料翻譯很容易透過基於邏輯的規則建模,再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統,以解釋概念,這些概念以類別和關係的形式表達,再加上從通用本体衍生的常識,產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體,以及一個透明的基於規則的系統。我們概述了架構,並透過一些範例展示了它的靈活性,這些範例能夠根據輸入概念生成許多不同的等效重寫。 -##### **LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models** -2502.14834v1 by Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li +##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation** +2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. 
Bates, Arkadiusz Sitek -Existing Large Vision-Language Models (LVLMs) can process inputs with context -lengths up to 128k visual and text tokens, yet they struggle to generate -coherent outputs beyond 1,000 words. We find that the primary limitation is the -absence of long output examples during supervised fine-tuning (SFT). To tackle -this issue, we introduce LongWriter-V-22k, a SFT dataset comprising 22,158 -examples, each with multiple input images, an instruction, and corresponding -outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that -maintain high-fidelity to the input images, we employ Direct Preference -Optimization (DPO) to the SFT model. Given the high cost of collecting human -feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which -breaks long outputs into segments and uses iterative corrections to form -preference pairs with the original outputs. Additionally, we develop -MMLongBench-Write, a benchmark featuring six tasks to evaluate the -long-generation capabilities of VLMs. Our 7B parameter model, trained with -LongWriter-V-22k and IterDPO, achieves impressive performance on this -benchmark, outperforming larger proprietary models like GPT-4o. Code and data: -https://github.com/THU-KEG/LongWriter-V +We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS), +an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS +predicts future PHTs using transformer-based architectures. The Adaptive Risk +Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk +probabilities for clinician-defined critical events. ARES incorporates a +personalized explainability module that identifies key clinical factors +influencing risk estimates for individual patients. ARES was evaluated on the +MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its +performance against traditional early warning systems and machine learning +models. 
We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs, +with 60% including hospital admissions. The dataset contained over 357 million +tokens. ETHOS outperformed benchmark models in predicting hospital admissions, +ICU admissions, and prolonged hospital stays, achieving superior AUC scores. +ETHOS-based risk estimates demonstrated robustness across demographic subgroups +with strong model reliability, confirmed via calibration curves. The +personalized explainability module provides insights into patient-specific +factors contributing to risk. ARES, powered by ETHOS, advances predictive +healthcare AI by providing dynamic, real-time, and personalized risk estimation +with patient-specific explainability to enhance clinician trust. Its +adaptability and superior accuracy position it as a transformative tool for +clinical decision-making, potentially improving patient outcomes and resource +allocation in emergency and inpatient settings. We release the full code at +github.com/ipolharvard/ethos-ares to facilitate future research. 
-摘要:現有的大型視覺語言模型 (LVLMs) 能處理長度達 128k 視覺和文字符號的輸入內容,但卻難以產生超過 1,000 字的連貫輸出。我們發現,主要限制在於監督微調 (SFT) 期間缺少長輸出範例。為了解決此問題,我們引入了 LongWriter-V-22k,這是一個 SFT 資料集,包含 22,158 個範例,每個範例都有多個輸入影像、一個說明和對應的輸出,範圍從 0 到 10,000 字。此外,為了產生與輸入影像高度保真的長輸出,我們對 SFT 模型採用直接偏好最佳化 (DPO)。考量到收集人類回饋的成本很高(例如 3,000 字),我們提出 IterDPO,它會將長輸出區分成幾個區塊,並使用反覆修正來形成與原始輸出的偏好配對。此外,我們開發了 MMLongBench-Write,這是一個基準,包含六項任務,用於評估 VLM 的長生成能力。我們的 7B 參數模型使用 LongWriter-V-22k 和 IterDPO 進行訓練,在這個基準上取得令人印象深刻的效能,超越了 GPT-4o 等大型專有模型。程式碼和資料:https://github.com/THU-KEG/LongWriter-V +摘要:我們開發了增強型健康結果模擬轉換器 (ETHOS), +一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS +使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組,可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估,並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT,其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型,並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性,並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估,以及患者特定的可解釋性來增強臨床醫生的信任,從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具,有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼,以利未來的研究。 -##### **Improving the Diffusability of Autoencoders** -2502.14831v1 by Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, Aliaksandr Siarohin +##### **An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases** +2501.15969v1 by Shaheer Ahmad Khan, Muhammad Usamah Shahid, Ahmad Abdullah, Ibrahim Hashmat, Muddassar Farooq -Latent diffusion models have emerged as the leading approach for generating -high-quality images and videos, utilizing compressed latent representations to -reduce the computational burden of the diffusion process. 
While recent -advancements have primarily focused on scaling diffusion backbones and -improving autoencoder reconstruction quality, the interaction between these -components has received comparatively less attention. In this work, we perform -a spectral analysis of modern autoencoders and identify inordinate -high-frequency components in their latent spaces, which are especially -pronounced in the autoencoders with a large bottleneck channel size. We -hypothesize that this high-frequency component interferes with the -coarse-to-fine nature of the diffusion synthesis process and hinders the -generation quality. To mitigate the issue, we propose scale equivariance: a -simple regularization strategy that aligns latent and RGB spaces across -frequencies by enforcing scale equivariance in the decoder. It requires minimal -code changes and only up to 20K autoencoder fine-tuning steps, yet -significantly improves generation quality, reducing FID by 19% for image -generation on ImageNet-1K 256x256 and FVD by at least 44% for video generation -on Kinetics-700 17x256x256. +This study addresses a critical gap in the healthcare system by developing a +clinically meaningful, practical, and explainable disease surveillance system +for multiple chronic diseases, utilizing routine EHR data from multiple U.S. +practices integrated with CureMD's EMR/EHR system. Unlike traditional +systems--using AI models that rely on features from patients' labs--our +approach focuses on routinely available data, such as medical history, vitals, +diagnoses, and medications, to preemptively assess the risks of chronic +diseases in the next year. We trained three distinct models for each chronic +disease: prediction models that forecast the risk of a disease 3, 6, and 12 +months before a potential diagnosis. 
We developed Random Forest models, which +were internally validated using F1 scores and AUROC as performance metrics and +further evaluated by a panel of expert physicians for clinical relevance based +on inferences grounded in medical knowledge. Additionally, we discuss our +implementation of integrating these models into a practical EMR system. Beyond +using Shapley attributes and surrogate models for explainability, we also +introduce a new rule-engineering framework to enhance the intrinsic +explainability of Random Forests. -摘要:潛在擴散模型已成為生成高品質影像和影片的主流方法,利用壓縮潛在表示來降低擴散過程的計算負擔。雖然近期的進展主要集中在擴充擴散主幹並提升自編碼器重建品質,但這些組成之間的交互作用卻鮮少受到關注。在這項研究中,我們對現代自編碼器進行頻譜分析,並在它們的潛在空間中找出不適當的高頻率組成,這在瓶頸通道尺寸較大的自編碼器中特別明顯。我們假設這種高頻率組成會干擾擴散合成過程由粗到細的性質,並阻礙生成品質。為了緩解這個問題,我們提出規模等變性:一種簡單的正則化策略,透過在解碼器中強制執行規模等變性,使潛在空間和 RGB 空間在各個頻率中保持一致。它只需要最小的程式碼變更,且僅需最多 20K 個自編碼器微調步驟,就能顯著提升生成品質,將 ImageNet-1K 256x256 上的影像生成的 FID 降低 19%,並將 Kinetics-700 17x256x256 上的影片生成的 FVD 降低至少 44%。 +摘要:本研究透過開發一個臨床有意義、實用且可解釋的多重慢性疾病疾病監測系統,來解決醫療保健系統中的重大缺口,利用整合 CureMD 的 EMR/EHR 系統,來自多個美國實務的例行 EHR 資料。與傳統系統不同的是,我們的做法著重在例行可得的資料,例如病歷、生命徵象、診斷和藥物,以預先評估未來一年慢性疾病的風險,而非仰賴病患實驗室特徵的 AI 模型。我們針對每種慢性疾病訓練了三個不同的模型:預測模型,用以預測在潛在診斷前 3、6 和 12 個月的疾病風險。我們開發了隨機森林模型,並使用 F1 分數和 AUROC 作為效能指標,進行內部驗證,並進一步由專家醫師小組根據植基於醫學知識的推論,評估其臨床相關性。此外,我們討論了將這些模型整合到實用 EMR 系統中的實作方式。除了使用 Shapley 屬性和代理模型來解釋外,我們還引進了一個新的規則工程架構,以增強隨機森林的內在可解釋性。 -##### **Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs** -2502.14830v1 by Danni Liu, Jan Niehues +##### **Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data** +2501.13818v1 by Frederik Pahde, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek -While large language models demonstrate remarkable capabilities at -task-specific applications through fine-tuning, extending these benefits across -diverse languages is essential for broad accessibility. 
However, effective -cross-lingual transfer is hindered by LLM performance gaps across languages and -the scarcity of fine-tuning data in many languages. Through analysis of LLM -internal representations from over 1,000+ language pairs, we discover that -middle layers exhibit the strongest potential for cross-lingual alignment. -Building on this finding, we propose a middle-layer alignment objective -integrated into task-specific training. Our experiments on slot filling, -machine translation, and structured text generation show consistent -improvements in cross-lingual transfer, especially to lower-resource languages. -The method is robust to the choice of alignment languages and generalizes to -languages unseen during alignment. Furthermore, we show that separately trained -alignment modules can be merged with existing task-specific modules, improving -cross-lingual capabilities without full re-training. Our code is publicly -available (https://github.com/dannigt/mid-align). +Deep neural networks are increasingly employed in high-stakes medical +applications, despite their tendency for shortcut learning in the presence of +spurious correlations, which can have potentially fatal consequences in +practice. Detecting and mitigating shortcut behavior is a challenging task that +often requires significant labeling efforts from domain experts. To alleviate +this problem, we introduce a semi-automated framework for the identification of +spurious behavior from both data and model perspective by leveraging insights +from eXplainable Artificial Intelligence (XAI). This allows the retrieval of +spurious data points and the detection of model circuits that encode the +associated prediction rules. Moreover, we demonstrate how these shortcut +encodings can be used for XAI-based sample- and pixel-level data annotation, +providing valuable information for bias mitigation methods to unlearn the +undesired shortcut behavior. 
We show the applicability of our framework using +four medical datasets across two modalities, featuring controlled and +real-world spurious correlations caused by data artifacts. We successfully +identify and mitigate these biases in VGG16, ResNet50, and contemporary Vision +Transformer models, ultimately increasing their robustness and applicability +for real-world medical tasks. -摘要:儘管大型語言模型在特定任務應用中透過微調展現出卓越的能力,但要讓這些好處擴及各種語言,對於廣泛的可及性來說至關重要。然而,有效的跨語言轉移受到跨語言 LLM 效能差距以及許多語言中微調資料的稀少性所阻礙。透過分析來自 1,000 多種語言對的 LLM 內部表示,我們發現中間層展現出最強的跨語言對齊潛力。根據這個發現,我們提出一個整合到特定任務訓練中的中間層對齊目標。我們在插槽填補、機器翻譯和結構化文字生成方面的實驗顯示,跨語言轉移持續改善,特別是對於低資源語言。此方法對於對齊語言的選擇具有穩健性,並推廣到對齊期間未曾見過的語言。此外,我們展示了單獨訓練的對齊模組可以與現有的特定任務模組合併,在不重新訓練的情況下改善跨語言能力。我們的程式碼已公開(https://github.com/dannigt/mid-align)。 +摘要:深度神经网络越来越多地用于高风险医疗应用中,尽管它们在存在虚假相关性的情况下倾向于捷径学习,这在实践中可能产生致命的后果。检测和缓解捷径行为是一项艰巨的任务,通常需要领域专家的大量标记工作。为了缓解这个问题,我们引入了一个半自动框架,用于从数据和模型的角度识别虚假行为,方法是利用可解释人工智能 (XAI) 的见解。这允许检索虚假数据点并检测对关联预测规则进行编码的模型电路。此外,我们演示了如何使用这些捷径编码进行基于 XAI 的样本和像素级数据注释,为偏差缓解方法提供有价值的信息,以消除不需要的捷径行为。我们使用跨越两种方式的四个医学数据集展示了我们框架的适用性,这些数据集具有由数据伪像引起的受控和真实世界虚假相关性。我们成功地识别并减轻了 VGG16、ResNet50 和当代 Vision Transformer 模型中的这些偏差,最终提高了它们的鲁棒性和在真实世界医疗任务中的适用性。 -##### **Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps** -2502.14829v1 by Martin Tutek, Fateme Hashemi Chaleshtori, Ana Marasović, Yonatan Belinkov +##### **Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model** +2501.11094v1 by Mohaiminul Islam Bhuiyan, Nur Shazwani Kamarudin, Nur Hafieza Ismail -When prompted to think step-by-step, language models (LMs) produce a chain of -thought (CoT), a sequence of reasoning steps that the model supposedly used to -produce its prediction. However, despite much work on CoT prompting, it is -unclear if CoT reasoning is faithful to the models' parameteric beliefs. 
We -introduce a framework for measuring parametric faithfulness of generated -reasoning, and propose Faithfulness by Unlearning Reasoning steps (FUR), an -instance of this framework. FUR erases information contained in reasoning steps -from model parameters. We perform experiments unlearning CoTs of four LMs -prompted on four multi-choice question answering (MCQA) datasets. Our -experiments show that FUR is frequently able to change the underlying models' -prediction by unlearning key steps, indicating when a CoT is parametrically -faithful. Further analysis shows that CoTs generated by models post-unlearning -support different answers, hinting at a deeper effect of unlearning. -Importantly, CoT steps identified as important by FUR do not align well with -human notions of plausbility, emphasizing the need for specialized alignment +Suicidal ideation detection is crucial for preventing suicides, a leading +cause of death worldwide. Many individuals express suicidal thoughts on social +media, offering a vital opportunity for early detection through advanced +machine learning techniques. The identification of suicidal ideation in social +media text is improved by utilising a hybrid framework that integrates +Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory +(BiLSTM), enhanced with an attention mechanism. To enhance the interpretability +of the model's predictions, Explainable AI (XAI) methods are applied, with a +particular focus on SHapley Additive exPlanations (SHAP). At +first, the model managed to reach an accuracy of 92.81%. By applying +fine-tuning and early stopping techniques, the accuracy improved to 94.29%. The +SHAP analysis revealed key features influencing the model's predictions, such +as terms related to mental health struggles. This level of transparency boosts +the model's credibility while helping mental health professionals understand +and trust the predictions.
This work highlights the potential for improving the +accuracy and interpretability of detecting suicidal tendencies, making a +valuable contribution to the progress of mental health monitoring systems. It +emphasizes the significance of blending powerful machine learning methods with +explainability to develop reliable and impactful mental health solutions. -摘要:当提示逐步思考时,语言模型 (LM) 会产生一系列思考 (CoT),这是模型用来产生预测的一系列推理步骤。然而,尽管在 CoT 提示上做了很多工作,但尚不清楚 CoT 推理是否符合模型的参数化信念。我们引入了一个框架来衡量生成推理的参数化保真度,并提出了通过取消学习推理步骤 (FUR) 的保真度,这是该框架的一个实例。FUR 从模型参数中擦除推理步骤中包含的信息。我们执行实验,取消学习提示在四个多项选择问答 (MCQA) 数据集上的四个 LM 的 CoT。我们的实验表明,FUR 经常能够通过取消学习关键步骤来改变底层模型的预测,表明 CoT 在参数上是保真的。进一步的分析表明,模型在取消学习后生成的 CoT 支持不同的答案,暗示取消学习具有更深层次的影响。重要的是,FUR 确定的 CoT 步骤与人类对合理性的概念不太一致,强调了专门对齐的必要性 +摘要:自殺意念偵測對於預防自殺至關重要,而自殺是全球主要的死亡原因。許多人在社群媒體上表達自殺念頭,這提供了透過進階機器學習技術進行早期偵測的重要機會。透過整合卷積神經網路 (CNN) 和雙向長短期記憶 (BiLSTM) 的混合架構,並加入注意力機制,可以提升在社群媒體文字中辨識自殺意念的能力。為了加強模型預測的可解釋性,我們採用可解釋人工智慧 (XAI) 方法,特別著重於 SHapley 加法解釋 (SHAP)。一開始,模型成功達到 92.81% 的準確度。透過套用微調和早期停止技術,準確度提升至 94.29%。SHAP 分析揭露了影響模型預測的關鍵特徵,例如與心理健康困境相關的詞彙。這種透明度提升了模型的可信度,同時協助心理健康專業人員理解和信賴預測結果。這項工作突顯了提升偵測自殺傾向的準確度和可解釋性的潛力,為心理健康監控系統的進展做出寶貴的貢獻。它強調了將強大的機器學習方法與可解釋性相結合以開發可靠且有影響力的心理健康解決方案的重要性。 -##### **Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison** -2502.14827v1 by Aiswarya Baby, Tintu Thankom Koshy +##### **SEANN: A Domain-Informed Neural Network for Epidemiological Insights** +2501.10273v1 by Jean-Baptiste Guimbaud, Marc Plantevit, Léa Maître, Rémy Cazabet -Visual Question Answering (VQA) has emerged as a pivotal task in the -intersection of computer vision and natural language processing, requiring -models to understand and reason about visual content in response to natural -language questions. Analyzing VQA datasets is essential for developing robust -models that can handle the complexities of multimodal reasoning. 
Several -approaches have been developed to examine these datasets, each offering -distinct perspectives on question diversity, answer distribution, and -visual-textual correlations. Despite significant progress, existing VQA models -face challenges related to dataset bias, limited model complexity, commonsense -reasoning gaps, rigid evaluation methods, and generalization to real world -scenarios. This paper presents a comprehensive comparative study of five -advanced VQA models: ABC-CNN, KICNLE, Masked Vision and Language Modeling, -BLIP-2, and OFA, each employing distinct methodologies to address these -challenges. +In epidemiology, traditional statistical methods such as logistic regression, +linear regression, and other parametric models are commonly employed to +investigate associations between predictors and health outcomes. However, +non-parametric machine learning techniques, such as deep neural networks +(DNNs), coupled with explainable AI (XAI) tools, offer new opportunities for +this task. Despite their potential, these methods face challenges due to the +limited availability of high-quality, high-quantity data in this field. To +address these challenges, we introduce SEANN, a novel approach for informed +DNNs that leverages a prevalent form of domain-specific knowledge: Pooled +Effect Sizes (PES). PESs are commonly found in published Meta-Analysis studies, +in different forms, and represent a quantitative form of a scientific +consensus. By direct integration within the learning procedure using a custom +loss, we experimentally demonstrate significant improvements in the +generalizability of predictive performances and the scientific plausibility of +extracted relationships compared to a domain-knowledge agnostic neural network +in a scarce and noisy data setting. 
-摘要:視覺問答 (VQA) 已成為電腦視覺與自然語言處理交會中的關鍵任務,要求模型理解和推理視覺內容以回應自然語言問題。分析 VQA 資料集對於開發健全的模型至關重要,這些模型能夠處理多模態推理的複雜性。已經開發出多種方法來檢驗這些資料集,每種方法都提供有關問題多樣性、答案分佈和視覺文本關聯性的不同觀點。儘管有顯著進展,現有的 VQA 模型仍面臨與資料集偏差、模型複雜性有限、常識推理差距、僵化的評估方法和推廣到現實世界場景相關的挑戰。本文對五個先進的 VQA 模型進行了全面的比較研究:ABC-CNN、KICNLE、Masked Vision and Language Modeling、BLIP-2 和 OFA,每個模型都採用不同的方法來應對這些挑戰。 +摘要:在流行病學中,傳統的統計方法,例如邏輯迴歸、線性迴歸和其他參數模型通常用於調查預測因子與健康結果之間的關聯。然而,非參數機器學習技術,例如深度神經網路 (DNN),結合可解釋的 AI (XAI) 工具,為這項任務提供了新的機會。儘管這些方法具有潛力,但由於該領域缺乏高品質、高數量資料,因此這些方法面臨挑戰。為了應對這些挑戰,我們引入了 SEANN,這是一種新穎的方法,用於獲取知識的 DNN,它利用了一種流行的領域特定知識形式:彙總效應量 (PES)。PES 通常以不同的形式出現在已發表的 Meta 分析研究中,並代表科學共識的量化形式。通過使用自訂損失函數直接整合在學習程序中,我們以實驗方式證明了預測效能的概括性以及與從缺乏領域知識的神經網路中提取的關係相比,科學合理性的顯著提升,且是在稀少且有雜訊的資料設定中。 -##### **eC-Tab2Text: Aspect-Based Text Generation from e-Commerce Product Tables** -2502.14820v1 by Luis Antonio Gutiérrez Guanilo, Mir Tafseer Nayeem, Cristian López, Davood Rafiei +##### **Artificial Intelligence-Driven Clinical Decision Support Systems** +2501.09628v2 by Muhammet Alkan, Idris Zakariyya, Samuel Leighton, Kaushik Bhargav Sivangi, Christos Anagnostopoulos, Fani Deligianni -Large Language Models (LLMs) have demonstrated exceptional versatility across -diverse domains, yet their application in e-commerce remains underexplored due -to a lack of domain-specific datasets. To address this gap, we introduce -eC-Tab2Text, a novel dataset designed to capture the intricacies of e-commerce, -including detailed product attributes and user-specific queries. Leveraging -eC-Tab2Text, we focus on text generation from product tables, enabling LLMs to -produce high-quality, attribute-specific product reviews from structured -tabular data. Fine-tuned models were rigorously evaluated using standard -Table2Text metrics, alongside correctness, faithfulness, and fluency -assessments. 
Our results demonstrate substantial improvements in generating -contextually accurate reviews, highlighting the transformative potential of -tailored datasets and fine-tuning methodologies in optimizing e-commerce -workflows. This work highlights the potential of LLMs in e-commerce workflows -and the essential role of domain-specific datasets in tailoring them to -industry-specific challenges. +As artificial intelligence (AI) becomes increasingly embedded in healthcare +delivery, this chapter explores the critical aspects of developing reliable and +ethical Clinical Decision Support Systems (CDSS). Beginning with the +fundamental transition from traditional statistical models to sophisticated +machine learning approaches, this work examines rigorous validation strategies +and performance assessment methods, including the crucial role of model +calibration and decision curve analysis. The chapter emphasizes that creating +trustworthy AI systems in healthcare requires more than just technical +accuracy; it demands careful consideration of fairness, explainability, and +privacy. The challenge of ensuring equitable healthcare delivery through AI is +stressed, discussing methods to identify and mitigate bias in clinical +predictive models. The chapter then delves into explainability as a cornerstone +of human-centered CDSS. This focus reflects the understanding that healthcare +professionals must not only trust AI recommendations but also comprehend their +underlying reasoning. The discussion advances in an analysis of privacy +vulnerabilities in medical AI systems, from data leakage in deep learning +models to sophisticated attacks against model explanations. The text explores +privacy-preservation strategies such as differential privacy and federated +learning, while acknowledging the inherent trade-offs between privacy +protection and model performance. 
This progression, from technical validation +to ethical considerations, reflects the multifaceted challenges of developing +AI systems that can be seamlessly and reliably integrated into daily clinical +practice while maintaining the highest standards of patient care and data +protection. -摘要:大型語言模型 (LLM) 在各種領域展現出非凡的多功能性,但由於缺乏特定領域的資料集,因此它們在電子商務中的應用仍未得到充分探索。為了解決這個差距,我們引入了 eC-Tab2Text,這是一個新穎的資料集,旨在捕捉電子商務的複雜性,包括詳細的產品屬性和使用者特定的查詢。利用 eC-Tab2Text,我們專注於從產品表格中產生文字,使 LLM 能夠從結構化的表格資料中產生高品質、特定屬性的產品評論。微調模型使用標準的 Table2Text 指標,以及正確性、忠實度和流利度評估進行嚴格評估。我們的結果證明在產生符合語境的準確評論方面有顯著的進步,突顯了客製化資料集和微調方法在最佳化電子商務工作流程中的轉型潛力。這項工作突顯了 LLM 在電子商務工作流程中的潛力,以及特定領域資料集在因應產業特定挑戰中至關重要的角色。 +摘要:隨著人工智慧(AI)在醫療保健服務中日益普及,本章探討了開發可靠且符合道德的臨床決策支援系統 (CDSS) 的關鍵面向。從傳統統計模型轉變到複雜機器學習方法的基本原理開始,這項工作探討了嚴謹的驗證策略和效能評估方法,包括模型校準和決策曲線分析的關鍵角色。本章強調,在醫療保健中建立值得信賴的 AI 系統不僅需要技術準確性;它需要仔細考量公平性、可解釋性和隱私。本章強調了透過 AI 確保公平醫療保健服務的挑戰,並討論了識別和減輕臨床預測模型中偏差的方法。接著,本章深入探討可解釋性作為以人為中心的 CDSS 的基石。這種關注反映了對醫療保健專業人員不僅必須信任 AI 建議,還必須理解其背後推理的理解。討論進展到對醫療 AI 系統中隱私漏洞的分析,從深度學習模型中的資料外洩到針對模型解釋的複雜攻擊。本文探討了隱私保護策略,例如差分隱私和聯合學習,同時承認隱私保護和模型效能之間的固有權衡。從技術驗證到道德考量,這種進展反映了開發 AI 系統的多方面挑戰,這些系統可以無縫且可靠地整合到日常臨床實務中,同時維持最高標準的患者照護和資料保護。 -##### **Optimizing Model Selection for Compound AI Systems** -2502.14815v1 by Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Matei Zaharia, James Zou, Ion Stoica +##### **MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis** +2501.06887v1 by Sadia Kamal, Tim Oates -Compound AI systems that combine multiple LLM calls, such as self-refine and -multi-agent-debate, achieve strong performance on many AI tasks. We address a -core question in optimizing compound systems: for each LLM call or module in -the system, how should one decide which LLM to use? We show that these LLM -choices have a large effect on quality, but the search space is exponential. 
We -propose LLMSelector, an efficient framework for model selection in compound -systems, which leverages two key empirical insights: (i) end-to-end performance -is often monotonic in how well each module performs, with all other modules -held fixed, and (ii) per-module performance can be estimated accurately by an -LLM. Building upon these insights, LLMSelector iteratively selects one module -and allocates to it the model with the highest module-wise performance, as -estimated by an LLM, until no further gain is possible. LLMSelector is -applicable to any compound system with a bounded number of modules, and its -number of API calls scales linearly with the number of modules, achieving -high-quality model allocation both empirically and theoretically. Experiments -with popular compound systems such as multi-agent debate and self-refine using -LLMs such as GPT-4o, Claude 3.5 Sonnet and Gemini 1.5 show that LLMSelector -confers 5%-70% accuracy gains compared to using the same LLM for all modules. +As deep learning models gain traction in medical data, ensuring transparent +and trustworthy decision-making is essential. In skin cancer diagnosis, while +advancements in lesion detection and classification have improved accuracy, the +black-box nature of these methods poses challenges in understanding their +decision processes, leading to trust issues among physicians. This study +leverages the CLIP (Contrastive Language-Image Pretraining) model, trained on +different skin lesion datasets, to capture meaningful relationships between +visual features and diagnostic criteria terms. To further enhance transparency, +we propose a method called MedGrad E-CLIP, which builds on gradient-based +E-CLIP by incorporating a weighted entropy mechanism designed for complex +medical imaging like skin lesions. This approach highlights critical image +regions linked to specific diagnostic descriptions. 
The developed integrated +pipeline not only classifies skin lesions by matching corresponding +descriptions but also adds an essential layer of explainability developed +especially for medical data. By visually explaining how different features in +an image relate to diagnostic criteria, this approach demonstrates the +potential of advanced vision-language models in medical image analysis, +ultimately improving transparency, robustness, and trust in AI-driven +diagnostic systems. -摘要:複合式 AI 系統結合多個 LLM 呼叫,例如自我精煉和多代理辯論,在許多 AI 任務中都能獲得強大的效能。我們解決了最佳化複合式系統中的核心問題:對於系統中的每個 LLM 呼叫或模組,應該如何決定要使用哪個 LLM?我們表明這些 LLM 選擇對品質有很大的影響,但搜尋空間是呈指數增長的。我們提出 LLMSelector,一種用於複合式系統中模型選擇的有效架構,它利用了兩個主要的經驗見解:(i) 端對端效能通常會隨著每個模組執行得有多好而單調變化,而其他所有模組保持固定,以及 (ii) 每個模組的效能都可以由 LLM 精準估計。LLMSelector 建立在這些見解之上,反覆選擇一個模組,並根據 LLM 估計的模組最佳效能,將模型分配給它,直到無法再進一步提升為止。LLMSelector 適用於任何具有有限數量的模組的複合式系統,其 API 呼叫數量與模組數量成線性比例,在經驗和理論上都實現了高品質的模型配置。使用 GPT-4o、Claude 3.5 Sonnet 和 Gemini 1.5 等 LLM,對多代理辯論和自我精煉等熱門複合式系統進行的實驗表明,與對所有模組使用相同的 LLM 相比,LLMSelector 可帶來 5%-70% 的準確度提升。 +摘要:随着深度学习模型在医学数据中获得关注,确保透明且值得信赖的决策至关重要。在皮肤癌诊断中,虽然病灶检测和分类的进步提高了准确性,但这些方法的黑盒性质对理解其决策过程构成了挑战,导致医生之间的信任问题。本研究利用在不同皮肤病变数据集上训练的 CLIP(对比语言图像预训练)模型,以捕捉视觉特征和诊断标准术语之间的有意义关系。为了进一步提高透明度,我们提出了一种名为 MedGrad E-CLIP 的方法,该方法通过结合专为皮肤病变等复杂医学影像设计的加权熵机制,建立在基于梯度的 E-CLIP 之上。此方法突出了与特定诊断描述相关联的关键图像区域。开发的集成管道不仅通过匹配相应的描述对皮肤病变进行分类,还添加了一层专门为医学数据开发的基本可解释性。通过直观地解释图像中不同特征与诊断标准的关系,这种方法展示了高级视觉语言模型在医学图像分析中的潜力,最终提高了透明度、稳健性和对人工智能驱动的诊断系统的信任。 -##### **FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis** -2502.14807v1 by Fadillah Maani, Numan Saeed, Tausifa Saleem, Zaid Farooq, Hussain Alasmawi, Werner Diehl, Ameera Mohammad, Gareth Waring, Saudabi Valappi, Leanne Bricker, Mohammad Yaqub +##### **Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis** +2501.02891v1 by Mary Ogbuka Kenneth, Foaad Khosmood, Abbas Edalat -Foundation models are becoming increasingly effective in the medical domain, -offering pre-trained 
models on large datasets that can be readily adapted for -downstream tasks. Despite progress, fetal ultrasound images remain a -challenging domain for foundation models due to their inherent complexity, -often requiring substantial additional training and facing limitations due to -the scarcity of paired multimodal data. To overcome these challenges, here we -introduce FetalCLIP, a vision-language foundation model capable of generating -universal representation of fetal ultrasound images. FetalCLIP was pre-trained -using a multimodal learning approach on a diverse dataset of 210,035 fetal -ultrasound images paired with text. This represents the largest paired dataset -of its kind used for foundation model development to date. This unique training -approach allows FetalCLIP to effectively learn the intricate anatomical -features present in fetal ultrasound images, resulting in robust -representations that can be used for a variety of downstream applications. In -extensive benchmarking across a range of key fetal ultrasound applications, -including classification, gestational age estimation, congenital heart defect -(CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all -baselines while demonstrating remarkable generalizability and strong -performance even with limited labeled data. We plan to release the FetalCLIP -model publicly for the benefit of the broader scientific community. +Humour styles can have either a negative or a positive impact on well-being. +Given the importance of these styles to mental health, significant research has +been conducted on their automatic identification. However, the automated +machine learning models used for this purpose are black boxes, making their +prediction decisions opaque. Clarity and transparency are vital in the field of +mental health. This paper presents an explainable AI (XAI) framework for +understanding humour style classification, building upon previous work in +computational humour analysis. 
Using the best-performing single model +(ALI+XGBoost) from prior research, we apply comprehensive XAI techniques to +analyse how linguistic, emotional, and semantic features contribute to humour +style classification decisions. Our analysis reveals distinct patterns in how +different humour styles are characterised and misclassified, with particular +emphasis on the challenges in distinguishing affiliative humour from other +styles. Through detailed examination of feature importance, error patterns, and +misclassification cases, we identify key factors influencing model decisions, +including emotional ambiguity, context misinterpretation, and target +identification. The framework demonstrates significant utility in understanding +model behaviour, achieving interpretable insights into the complex interplay of +features that define different humour styles. Our findings contribute to both +the theoretical understanding of computational humour analysis and practical +applications in mental health, content moderation, and digital humanities +research. 
-摘要:基礎模型在醫療領域正變得越來越有效, -提供在大型資料集上預先訓練的模型,可輕鬆適應 -下游任務。儘管有進展,但胎兒超音波影像仍然是 -基礎模型的挑戰領域,因為它們固有的複雜性, -通常需要大量的額外訓練,並且由於配對多模態數據的稀缺而面臨限制。為了克服這些挑戰,我們在此 -介紹 FetalCLIP,一種能夠產生 -胎兒超音波影像通用表示的視覺語言基礎模型。FetalCLIP 使用多模態學習方法在包含 210,035 張胎兒 -超音波影像與文字配對的多樣化資料集上進行預訓練。這代表迄今為止用於基礎模型開發的最大配對資料集。這種獨特的訓練 -方法使 FetalCLIP 能夠有效地學習胎兒超音波影像中存在的複雜解剖特徵,從而產生強大的 -表示,可應用於各種下游應用。在涵蓋一系列關鍵胎兒超音波應用(包括分類、胎齡估算、先天性心臟缺陷 -(CHD) 偵測和胎兒結構分割)的廣泛基準測試中,FetalCLIP 在展現出卓越的泛化能力和強勁的 -效能,即使標記資料有限,也優於所有基準。我們計畫公開發布 FetalCLIP 模型,造福廣大的科學界。 +摘要:幽默風格對幸福感可能產生負面或正面的影響。 +鑑於這些風格對心理健康的重要性,已經對其自動識別進行了大量研究。然而,用於此目的的自動機器學習模型是黑盒子,使得其預測決策不透明。清晰度和透明度在心理健康領域至關重要。本文提出了一個可解釋的 AI (XAI) 框架,用於理解幽默風格分類,建立在計算幽默分析的先前工作之上。使用先前研究中表現最好的單一模型 (ALI+XGBoost),我們應用全面的 XAI 技術來分析語言、情緒和語義特徵如何影響幽默風格分類決策。我們的分析揭示了不同幽默風格如何被表徵和錯誤分類的不同模式,特別強調了區分聯屬幽默與其他風格的挑戰。通過仔細檢查特徵重要性、錯誤模式和錯誤分類案例,我們確定了影響模型決策的關鍵因素,包括情緒模糊、情境誤解和目標識別。該框架展示了在理解模型行為方面的顯著效用,實現了對定義不同幽默風格的特徵之間複雜相互作用的可解釋見解。我們的發現有助於計算幽默分析的理論理解和心理健康、內容審核和數字人文研究中的實際應用。 + +##### **The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support** +2412.20068v1 by Alessandro De Grandi, Federico Ravenda, Andrea Raballo, Fabio Crestani + +The increasing demand for mental health services has highlighted the need for +innovative solutions, particularly in the realm of psychological conversational +AI, where the availability of sensitive data is scarce. In this work, we +explored the development of a system tailored for mental health support with a +novel approach to psychological assessment based on explainable emotional +profiles in combination with empathetic conversational models, offering a +promising tool for augmenting traditional care, particularly where immediate +expertise is unavailable. Our work can be divided into two main parts, +intrinsically connected to each other. 
First, we present RACLETTE, a +conversational system that demonstrates superior emotional accuracy compared to +state-of-the-art benchmarks in both understanding users' emotional states and +generating empathetic responses during conversations, while progressively +building an emotional profile of the user through their interactions. Second, +we show how the emotional profiles of a user can be used as interpretable +markers for mental health assessment. These profiles can be compared with +characteristic emotional patterns associated with different mental disorders, +providing a novel approach to preliminary screening and support. -##### **From RAG to Memory: Non-Parametric Continual Learning for Large Language Models** -2502.14802v1 by Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, Yu Su +摘要:隨著對心理健康服務需求的增加,凸顯了創新解決方案的需求,特別是在心理對話式人工智慧領域,那裡缺乏敏感資料。在這項工作中,我們探索了開發一個針對心理健康支持的系統,採用一種基於可解釋的情緒特徵的新方法進行心理評估,結合同理心對話模式,提供了一個有前途的工具,用於擴充傳統照護,特別是在無法立即獲得專業知識的情況下。我們的工作可以分為兩個主要部分,彼此內在相關。首先,我們展示了 RACLETTE,一個對話系統,與最先進的基準相比,在理解使用者情緒狀態和在對話中產生同理心回應方面表現出優越的情緒準確性,同時透過他們的互動逐漸建立使用者的情緒特徵。其次,我們展示了使用者的情緒特徵如何可用作心理健康評估的可解釋標記。這些特徵可以與與不同心理疾病相關的典型情緒模式進行比較,提供了一種初步篩選和支持的新方法。 -Our ability to continuously acquire, organize, and leverage knowledge is a -key feature of human intelligence that AI systems must approximate to unlock -their full potential. Given the challenges in continual learning with large -language models (LLMs), retrieval-augmented generation (RAG) has become the -dominant way to introduce new information. However, its reliance on vector -retrieval hinders its ability to mimic the dynamic and interconnected nature of -human long-term memory. Recent RAG approaches augment vector embeddings with -various structures like knowledge graphs to address some of these gaps, namely -sense-making and associativity. However, their performance on more basic -factual memory tasks drops considerably below standard RAG. 
We address this -unintended deterioration and propose HippoRAG 2, a framework that outperforms -standard RAG comprehensively on factual, sense-making, and associative memory -tasks. HippoRAG 2 builds upon the Personalized PageRank algorithm used in -HippoRAG and enhances it with deeper passage integration and more effective -online use of an LLM. This combination pushes this RAG system closer to the -effectiveness of human long-term memory, achieving a 7% improvement in -associative memory tasks over the state-of-the-art embedding model while also -exhibiting superior factual knowledge and sense-making memory capabilities. -This work paves the way for non-parametric continual learning for LLMs. Our -code and data will be released at https://github.com/OSU-NLP-Group/HippoRAG. +##### **A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation** +2412.19688v1 by Jana Zakall, Birgit Pohn, Antonia Graf, Daniel Kovatchki, Arezoo Borji, Ragib Shahriar Islam, Hossam Haick, Heinz Strohmer, Sepideh Hatamikia -摘要:我們持續獲取、組織和利用知識的能力是人類智慧的一項關鍵特徵,而人工智慧系統必須近似於此才能發揮其全部潛力。由於大型語言模型 (LLM) 持續學習的挑戰,檢索增強生成 (RAG) 已成為引入新資訊的主流方式。然而,它依賴向量檢索阻礙了它模擬人類長期記憶的動態和相互連結的本質。最近的 RAG 方法用各種結構(如知識圖譜)增強向量嵌入,以解決其中一些差距,即意義建構和聯想性。然而,它們在更基本的實際記憶任務上的表現遠低於標準 RAG。我們解決了這種意外的惡化,並提出了 HippoRAG 2,這是一個在實際、意義建構和聯想記憶任務上全面優於標準 RAG 的框架。HippoRAG 2 建立在 HippoRAG 中使用的 Personalized PageRank 演算法之上,並透過更深入的段落整合和更有效的 LLM 線上使用來增強它。這種組合將此 RAG 系統推向更接近人類長期記憶的效能,在聯想記憶任務上比最先進的嵌入模型提升了 7%,同時也展現出優異的實際知識和意義建構記憶能力。這項工作為 LLM 的非參數持續學習鋪平了道路。我們的程式碼和資料將在 https://github.com/OSU-NLP-Group/HippoRAG 上發布。 +Artificial intelligence (AI) has emerged as a powerful tool to enhance +decision-making and optimize treatment protocols in in vitro fertilization +(IVF). In particular, AI shows significant promise in supporting +decision-making during the ovarian stimulation phase of the IVF process. 
This +review evaluates studies focused on the applications of AI combined with +medical imaging in ovarian stimulation, examining methodologies, outcomes, and +current limitations. Our analysis of 13 studies on this topic reveals that, +while AI algorithms demonstrated notable potential in predicting +optimal hormonal dosages, trigger timing, and oocyte retrieval outcomes, the +medical imaging data utilized predominantly came from two-dimensional (2D) +ultrasound, which mainly involved basic quantifications, such as follicle size +and number, with limited use of direct feature extraction or advanced image +analysis techniques. This points to an underexplored opportunity where advanced +image analysis approaches, such as deep learning, and more diverse imaging +modalities, like three-dimensional (3D) ultrasound, could unlock deeper +insights. Additionally, the lack of explainable AI (XAI) in most studies raises +concerns about the transparency and traceability of AI-driven decisions - key +factors for clinical adoption and trust. Furthermore, many studies relied on +single-center designs and small datasets, which limit the generalizability of +their findings. This review highlights the need for integrating advanced +imaging analysis techniques with explainable AI methodologies, as well as the +importance of leveraging multicenter collaborations and larger datasets. +Addressing these gaps has the potential to enhance ovarian stimulation +management, paving the way for efficient, personalized, and data-driven +treatment pathways that improve IVF outcomes. 
-##### **A Survey on Text-Driven 360-Degree Panorama Generation** -2502.14799v1 by Hai Wang, Xiaoyu Xiang, Weihao Xia, Jing-Hao Xue +摘要:人工智慧(AI)已成為增強體外受精(IVF)決策制定和優化治療方案的強大工具。特別是,AI 在支持 IVF 過程中卵巢刺激階段的決策制定方面顯示出顯著的前景。本綜述評估了專注於 AI 結合卵巢刺激中的醫學影像應用、檢驗方法、結果和當前限制的研究。我們對 13 項關於此主題的研究分析顯示,雖然 AI 演算法在預測最佳荷爾蒙劑量、觸發時機和卵子取出結果方面表現出顯著的潛力,但所利用的醫學影像數據主要來自於二次元(2D)超音波,而二次元超音波主要涉及基本量化,例如濾泡大小和數量,且有限使用直接特徵提取或進階影像分析技術。這指向一個尚未探索的機會,例如深度學習等進階影像分析方法,以及更多元的影像模式,例如三維(3D)超音波,可以解鎖更深入的見解。此外,大多數研究缺乏可解釋 AI(XAI),這引起了人們對 AI 驅動決策的透明度和可追溯性的擔憂,而透明度和可追溯性是臨床採用和信任的關鍵因素。此外,許多研究依賴於單中心設計和小型數據集,這限制了其發現的普遍性。本綜述強調了將進階影像分析技術與可解釋 AI 方法整合起來的必要性,以及利用多中心合作和大型數據集的重要性。解決這些差距有可能增強卵巢刺激管理,為有效、個人化和數據驅動的治療途徑鋪平道路,進而改善 IVF 結果。 -The advent of text-driven 360-degree panorama generation, enabling the -synthesis of 360-degree panoramic images directly from textual descriptions, -marks a transformative advancement in immersive visual content creation. This -innovation significantly simplifies the traditionally complex process of -producing such content. Recent progress in text-to-image diffusion models has -accelerated the rapid development in this emerging field. This survey presents -a comprehensive review of text-driven 360-degree panorama generation, offering -an in-depth analysis of state-of-the-art algorithms and their expanding -applications in 360-degree 3D scene generation. Furthermore, we critically -examine current limitations and propose promising directions for future -research. A curated project page with relevant resources and research papers is -available at https://littlewhitesea.github.io/Text-Driven-Pano-Gen/. +##### **Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models** +2412.17527v1 by Badaru I. Olumuyiwa, The Anh Han, Zia U. 
Shamszaman -摘要:文字驅動 360 度全景圖生成技術的出現,使能從文字描述中直接合成 360 度全景圖像,標誌著沉浸式視覺內容創作的變革性進展。這項創新顯著簡化了傳統上複雜的製作此類內容的過程。最近在文字轉圖像擴散模型方面的進展加速了這個新興領域的快速發展。本調查提供了對文字驅動 360 度全景圖生成的全面回顧,深入分析了最先進的演算法及其在 360 度 3D 場景生成中的擴展應用。此外,我們批判性地審視了當前的限制,並提出了未來研究的有希望的方向。一個精選的專案頁面,其中包含相關資源和研究論文,可在 https://littlewhitesea.github.io/Text-Driven-Pano-Gen/ 獲得。 +This research presents an innovative approach to cancer diagnosis and +prediction using explainable Artificial Intelligence (XAI) and deep learning +techniques. With cancer causing nearly 10 million deaths globally in 2020, +early and accurate diagnosis is crucial. Traditional methods often face +challenges in cost, accuracy, and efficiency. Our study develops an AI model +that provides precise outcomes and clear insights into its decision-making +process, addressing the "black box" problem of deep learning models. By +employing XAI techniques, we enhance interpretability and transparency, +building trust among healthcare professionals and patients. Our approach +leverages neural networks to analyse extensive datasets, identifying patterns +for cancer detection. This model has the potential to revolutionise diagnosis +by improving accuracy, accessibility, and clarity in medical decision-making, +possibly leading to earlier detection and more personalised treatment +strategies. Furthermore, it could democratise access to high-quality +diagnostics, particularly in resource-limited settings, contributing to global +health equity. The model's applications extend beyond cancer diagnosis, +potentially transforming various aspects of medical decision-making and saving +millions of lives worldwide. -##### **Rapid Word Learning Through Meta In-Context Learning** -2502.14791v1 by Wentao Wang, Guangyuan Jiang, Tal Linzen, Brenden M. 
Lake +摘要:本研究提出了一個創新的癌症診斷和預測方法,使用可解釋的人工智慧 (XAI) 和深度學習技術。由於癌症在 2020 年造成全球近 1,000 萬人死亡,因此早期準確的診斷至關重要。傳統方法通常面臨成本、準確性和效率方面的挑戰。我們的研究開發了一個 AI 模型,它提供精確的結果並清楚地了解其決策過程,解決了深度學習模型的「黑箱」問題。通過採用 XAI 技術,我們增強了解釋性和透明度,在醫療專業人員和患者之間建立信任。我們的做法利用神經網路分析廣泛的數據集,識別癌症檢測模式。這個模型有可能通過提高醫療決策的準確性、可及性和清晰度來革新診斷,可能導致更早的檢測和更個性化的治療策略。此外,它可以使更多人獲得高品質的診斷,特別是在資源有限的環境中,有助於全球健康公平。該模型的應用範圍不僅限於癌症診斷,還可能轉變醫療決策的各個方面,並拯救全球數百萬人的生命。 -Humans can quickly learn a new word from a few illustrative examples, and -then systematically and flexibly use it in novel contexts. Yet the abilities of -current language models for few-shot word learning, and methods for improving -these abilities, are underexplored. In this study, we introduce a novel method, -Meta-training for IN-context learNing Of Words (Minnow). This method trains -language models to generate new examples of a word's usage given a few -in-context examples, using a special placeholder token to represent the new -word. This training is repeated on many new words to develop a general -word-learning ability. We find that training models from scratch with Minnow on -human-scale child-directed language enables strong few-shot word learning, -comparable to a large language model (LLM) pre-trained on orders of magnitude -more data. Furthermore, through discriminative and generative evaluations, we -demonstrate that finetuning pre-trained LLMs with Minnow improves their ability -to discriminate between new words, identify syntactic categories of new words, -and generate reasonable new usages and definitions for new words, based on one -or a few in-context examples. These findings highlight the data efficiency of -Minnow and its potential to improve language model performance in word learning -tasks. 
+##### **Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG** +2412.16086v2 by Hasan Md Tusfiqur Alam, Devansh Srivastav, Md Abdul Kadir, Daniel Sonntag -摘要:人類可以從幾個說明性的範例中快速學習一個新字詞,然後系統性且靈活地將其用於新的脈絡中。然而,目前語言模型在少量字詞學習中的能力,以及改善這些能力的方法,尚未得到充分探討。在這項研究中,我們引入了一種新方法,即「用於字詞情境學習的元訓練」(Minnow)。此方法訓練語言模型在給定幾個情境範例的情況下,產生字詞用法的範例,並使用特殊佔位符標記來表示新的字詞。此訓練會在許多新字詞上重複進行,以培養一般的字詞學習能力。我們發現,從頭開始使用 Minnow 在人類規模的兒童導向語言上訓練模型,可以實現強大的少量字詞學習能力,這與預先在大量資料上訓練的大型語言模型 (LLM) 相當。此外,透過區辨性和生成性評估,我們證明使用 Minnow 微調預先訓練的 LLM 可以提升其區辨新字詞、識別新字詞的句法類別,以及根據一個或幾個情境範例產生合理的新用法和定義的能力。這些發現突顯了 Minnow 的資料效率,以及它在字詞學習任務中提升語言模型效能的潛力。 +Deep learning has advanced medical image classification, but interpretability +challenges hinder its clinical adoption. This study enhances interpretability +in Chest X-ray (CXR) classification by using concept bottleneck models (CBMs) +and a multi-agent Retrieval-Augmented Generation (RAG) system for report +generation. By modeling relationships between visual features and clinical +concepts, we create interpretable concept vectors that guide a multi-agent RAG +system to generate radiology reports, enhancing clinical relevance, +explainability, and transparency. Evaluation of the generated reports using an +LLM-as-a-judge confirmed the interpretability and clinical utility of our +model's outputs. On the COVID-QU dataset, our model achieved 81% classification +accuracy and demonstrated robust report generation performance, with five key +metrics ranging between 84% and 90%. This interpretable multi-agent framework +bridges the gap between high-performance AI and the explainability required for +reliable AI-driven CXR analysis in clinical settings. Our code is available at +https://github.com/tifat58/IRR-with-CBM-RAG.git. 
-##### **SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features** -2502.14786v1 by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai +摘要:深度學習已提升醫學影像分類,但可解釋性挑戰阻礙其臨床應用。本研究透過使用概念瓶頸模型 (CBM) 和多代理檢索增強生成 (RAG) 系統進行報告生成,來增強胸部 X 光 (CXR) 分類的可解釋性。透過建模視覺特徵與臨床概念之間的關係,我們建立可解釋的概念向量,引導多代理 RAG 系統生成放射報告,增強臨床相關性、可解釋性和透明度。使用 LLM 作為評審員對生成報告進行評估,確認了我們模型輸出的可解釋性和臨床效用。在 COVID-QU 資料集上,我們的模型達到了 81% 的分類準確率,並展示了穩健的報告生成效能,五項關鍵指標介於 84% 至 90% 之間。這個可解釋的多代理架構彌合了高性能 AI 與臨床環境中可靠的 AI 驅動 CXR 分析所需的解釋性之間的差距。我們的程式碼可於 https://github.com/tifat58/IRR-with-CBM-RAG.git 取得。 -We introduce SigLIP 2, a family of new multilingual vision-language encoders -that build on the success of the original SigLIP. In this second iteration, we -extend the original image-text training objective with several prior, -independently developed techniques into a unified recipe -- this includes -captioning-based pretraining, self-supervised losses (self-distillation, masked -prediction) and online data curation. With these changes, SigLIP 2 models -outperform their SigLIP counterparts at all model scales in core capabilities, -including zero-shot classification, image-text retrieval, and transfer -performance when extracting visual representations for Vision-Language Models -(VLMs). Furthermore, the new training recipe leads to significant improvements -on localization and dense prediction tasks. We also train variants which -support multiple resolutions and preserve the input's native aspect ratio. -Finally, we train on a more diverse data-mixture that includes de-biasing -techniques, leading to much better multilingual understanding and improved -fairness. 
To allow users to trade off inference cost with performance, we -release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), -and g (1B). +##### **Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models** +2412.15748v1 by Shamus Sim, Tyrone Chen -摘要:我們推出了 SigLIP 2,這是一個新的多語言視覺語言編碼器系列,它建立在 SigLIP 的成功基礎上。在這個第二個版本中,我們將原來的圖像文字訓練目標與幾個先前獨立開發的技術擴展到一個統一的配方中,其中包括基於標題的預訓練、自我監督損失(自我蒸餾、遮罩預測)和線上數據策展。有了這些改變,SigLIP 2 模型在所有模型規模上都超越了 SigLIP 的對應模型,包括零次分類、圖像文字檢索和在為視覺語言模型 (VLM) 提取視覺表示時傳輸效能。此外,新的訓練配方也大幅改善了定位和密集預測任務。我們還訓練了支援多種解析度和保留輸入原生長寬比的變體。最後,我們在一個更為多樣化的數據組合上進行訓練,其中包括去偏見技術,從而大幅提升多語言理解力並改善公平性。為了讓使用者權衡推理成本與效能,我們發布了四種大小的模型檢查點:ViT-B (86M)、L (303M)、So400m (400M) 和 g (1B)。 +Background: Despite the current ubiquity of Large Language Models (LLMs) +across the medical domain, there is a surprising lack of studies which address +their reasoning behaviour. We emphasise the importance of understanding +reasoning behaviour as opposed to high-level prediction accuracies, since it is +equivalent to explainable AI (XAI) in this context. In particular, achieving +XAI in medical LLMs used in the clinical domain will have a significant impact +across the healthcare sector. Results: Therefore, we define the concept of +reasoning behaviour in the specific context of medical LLMs. We then categorise +and discuss the current state of the art of methods which evaluate reasoning +behaviour in medical LLMs. Finally, we propose theoretical frameworks which can +empower medical professionals or machine learning engineers to gain insight +into the low-level reasoning operations of these previously obscure models. 
+Conclusion: The subsequent increased transparency and trust in medical machine +learning models by clinicians as well as patients will accelerate the +integration, application as well as further development of medical AI for the +healthcare system as a whole -##### **ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting** -2502.14780v1 by Abhijit Mishra, Richard Noh, Hsiang Fu, Mingda Li, Minji Kim +摘要:背景:儘管大型語言模型 (LLM) 目前在醫療領域無所不在,但令人驚訝的是,探討其推理行為的研究卻相當缺乏。我們強調了解推理行為而非高層級的預測準確度非常重要,因為在這種情況下,這等同於可解釋 AI (XAI)。尤其是在臨床領域中使用的醫療 LLM 中實現 XAI,將對整個醫療保健產業產生重大影響。結果:因此,我們在醫療 LLM 的特定背景下定義了推理行為的概念。接著我們分類並探討當前評估醫療 LLM 中推理行為的方法的最新技術。最後,我們提出理論架構,讓醫療專業人員或機器學習工程師得以深入了解這些先前模糊模型的低層級推理運算。結論:臨床醫生和患者對醫療機器學習模型的透明度和信任度隨之提升,將加速醫療 AI 在整個醫療保健系統中的整合、應用和進一步發展。 -Efficient and privacy-preserving multimodal interaction is essential as AR, -VR, and modern smartphones with powerful cameras become primary interfaces for -human-computer communication. Existing powerful large vision-language models -(VLMs) enabling multimodal interaction often rely on cloud-based processing, -raising significant concerns about (1) visual privacy by transmitting sensitive -vision data to servers, and (2) their limited real-time, on-device usability. -This paper explores Visual Instruction Rewriting, a novel approach that -transforms multimodal instructions into text-only commands, allowing seamless -integration of lightweight on-device instruction rewriter VLMs (250M -parameters) with existing conversational AI systems, enhancing vision data -privacy. To achieve this, we present a dataset of over 39,000 examples across -14 domains and develop a compact VLM, pretrained on image captioning datasets -and fine-tuned for instruction rewriting. 
Experimental results, evaluated -through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic -parsing analysis, demonstrate that even a quantized version of the model -(<500MB storage footprint) can achieve effective instruction rewriting, thus -enabling privacy-focused, multimodal AI applications. +##### **Cognition Chain for Explainable Psychological Stress Detection on Social Media** +2412.14009v1 by Xin Wang, Boyan Gao, Yi Dai, Lei Cao, Liang Zhao, Yibo Yang, David Clifton -摘要:高效且重視隱私的多模態互動至關重要,因為 AR、VR 和配備強大相機的現代智慧型手機已成為人機溝通的主要介面。現有的強大大型視覺語言模型 (VLM) 能支援多模態互動,通常仰賴雲端處理,這引發了重大的疑慮,包括:(1) 將敏感的視覺資料傳輸至伺服器,會造成視覺隱私問題,以及 (2) 其有限的即時、裝置上可用性。本文探討視覺指令改寫,這是一種新穎的方法,可將多模態指令轉換為純文字指令,讓輕量級的裝置上指令改寫 VLM (250M 參數) 與現有的對話式 AI 系統無縫整合,進而強化視覺資料的隱私。為達成此目標,我們提供一個跨越 14 個領域、超過 39,000 個範例的資料集,並開發一個精簡的 VLM,在圖片標題資料集上進行預訓練,並針對指令改寫進行微調。實驗結果透過 NLG 指標(例如 BLEU、METEOR 和 ROUGE)以及語意解析分析進行評估,證明即使是模型的量化版本(<500MB 儲存空間佔用量)也能有效執行指令改寫,進而支援注重隱私的多模態 AI 應用程式。 +Stress is a pervasive global health issue that can lead to severe mental +health problems. Early detection offers timely intervention and prevention of +stress-related disorders. The current early detection models perform "black +box" inference suffering from limited explainability and trust which blocks the +real-world clinical application. Thanks to the generative properties introduced +by the Large Language Models (LLMs), the decision and the prediction from such +models are semi-interpretable through the corresponding description. However, +the existing LLMs are mostly trained for general purposes without the guidance +of psychological cognitive theory. To this end, we first highlight the +importance of prior theory with the observation of performance boosted by the +chain-of-thoughts tailored for stress detection. 
This method termed Cognition +Chain explicates the generation of stress through a step-by-step cognitive +perspective based on cognitive appraisal theory with a progress pipeline: +Stimulus $\rightarrow$ Evaluation $\rightarrow$ Reaction $\rightarrow$ Stress +State, guiding LLMs to provide comprehensive reasoning explanations. We further +study the benefits brought by the proposed Cognition Chain format by utilising +it as a synthetic dataset generation template for LLMs instruction-tuning and +introduce CogInstruct, an instruction-tuning dataset for stress detection. This +dataset is developed using a three-stage self-reflective annotation pipeline +that enables LLMs to autonomously generate and refine instructional data. By +instruction-tuning Llama3 with CogInstruct, we develop CogLLM, an explainable +stress detection model. Evaluations demonstrate that CogLLM achieves +outstanding performance while enhancing explainability. Our work contributes a +novel approach by integrating cognitive theories into LLM reasoning processes, +offering a promising direction for future explainable AI research. 
-##### **Harnessing PDF Data for Improving Japanese Large Multimodal Models** -2502.14778v1 by Jeonghun Baek, Akiko Aizawa, Kiyoharu Aizawa +摘要:壓力是一個普遍的全球性健康問題,可能會導致嚴重的精神 +健康問題。早期發現提供及時的干預和預防 +壓力相關疾病。目前的早期發現模型執行「黑 +盒子」推論,存在可解釋性和信任度有限的問題,阻礙了 +現實世界的臨床應用。多虧了大型語言模型 (LLM) 引入的生成屬性,此類 +模型的決策和預測通過對應描述具有半可解釋性。然而, +現有的 LLM 主要針對一般用途進行訓練,沒有心理認知理論的指導。為此,我們首先強調 +先驗理論的重要性,並觀察到針對壓力檢測量身定制的思想鏈提升了性能。這種方法稱為認知 +鏈通過基於認知評估理論的循序漸進的認知視角闡明了壓力的產生,並具有進度管道: +刺激 $\rightarrow$ 評估 $\rightarrow$ 反應 $\rightarrow$ 壓力 +狀態,指導 LLM 提供全面的推理解釋。我們進一步 +通過將其用作 LLM 指令調整的合成數據集生成模板來研究所提出的認知鏈格式帶來的優點,並介紹 CogInstruct,這是一個針對壓力檢測的指令調整數據集。這個 +數據集是使用一個三階段的自省標註管道開發的,使 LLM 能夠自主生成和優化指令數據。通過 +使用 CogInstruct 對 Llama3 進行指令調整,我們開發了 CogLLM,這是一個可解釋的 +壓力檢測模型。評估表明,CogLLM 在提高可解釋性的同時實現了出色的性能。我們的研究通過將認知理論整合到 LLM 推理過程中,提出了一種新穎的方法, +為未來的可解釋人工智能研究提供了一個有希望的方向。 -Large Multimodal Models (LMMs) have demonstrated strong performance in -English, but their effectiveness in Japanese remains limited due to the lack of -high-quality training data. Current Japanese LMMs often rely on translated -English datasets, restricting their ability to capture Japan-specific cultural -knowledge. To address this, we explore the potential of Japanese PDF data as a -training resource, an area that remains largely underutilized. We introduce a -fully automated pipeline that leverages pretrained models to extract image-text -pairs from PDFs through layout analysis, OCR, and vision-language pairing, -removing the need for manual annotation. Additionally, we construct instruction -data from extracted image-text pairs to enrich the training data. To evaluate -the effectiveness of PDF-derived data, we train Japanese LMMs and assess their -performance on the Japanese LMM Benchmark. Our results demonstrate substantial -improvements, with performance gains ranging from 3.9% to 13.8% on Heron-Bench. 
-Further analysis highlights the impact of PDF-derived data on various factors, -such as model size and language models, reinforcing its value as a multimodal -resource for Japanese LMMs. We plan to make the source code and data publicly -available upon acceptance. +##### **2-Factor Retrieval for Improved Human-AI Decision Making in Radiology** +2412.00372v1 by Jim Solomon, Laleh Jalilian, Alexander Vilesov, Meryl Mathew, Tristan Grogan, Arash Bedayat, Achuta Kadambi + +Human-machine teaming in medical AI requires us to understand to what degree +a trained clinician should weigh AI predictions. While previous work has shown +the potential of AI assistance at improving clinical predictions, existing +clinical decision support systems either provide no explainability of their +predictions or use techniques like saliency and Shapley values, which do not +allow for physician-based verification. To address this gap, this study +compares previously used explainable AI techniques with a newly proposed +technique termed '2-factor retrieval (2FR)', which is a combination of +interface design and search retrieval that returns similarly labeled data +without processing this data. This results in a 2-factor security blanket +where: (a) correct images need to be retrieved by the AI; and (b) humans should +associate the retrieved images with the current pathology under test. We find +that when tested on chest X-ray diagnoses, 2FR leads to increases in clinician +accuracy, with particular improvements when clinicians are radiologists and +have low confidence in their decision. Our results highlight the importance of +understanding how different modes of human-AI decision making may impact +clinician accuracy in clinical decision support systems. 
-摘要:大型多模態模型 (LMM) 已在英語中表現出強勁的效能,但由於缺乏高品質的訓練資料,它們在日語中的效能仍然有限。目前的日語 LMM 通常依賴於翻譯後的英語資料集,限制了它們擷取特定於日本的文化知識的能力。為了解決這個問題,我們探索了日語 PDF 資料作為訓練資源的潛力,這個領域在很大程度上仍然未被充分利用。我們引入了一個全自動的管道,利用預先訓練好的模型透過版面分析、光學字元辨識和視覺語言配對從 PDF 中擷取影像文字對,消除了手動註解的需要。此外,我們從擷取的影像文字對中建構說明資料,以豐富訓練資料。為了評估 PDF 衍生資料的效能,我們訓練了日語 LMM,並在日語 LMM 基準上評估它們的效能。我們的結果證明了顯著的進步,在 Heron-Bench 上的效能提升幅度從 3.9% 到 13.8%。進一步的分析重點說明了 PDF 衍生資料對各種因素的影響,例如模型大小和語言模型,加強了其作為日語 LMM 的多模態資源的價值。我們計畫在接受後公開原始程式碼和資料。 +摘要:人機協作在醫療 AI 中,需要我們理解受過訓練的臨床醫生在多大程度上應重視 AI 預測。雖然先前的研究顯示 AI 輔助在改善臨床預測方面的潛力,但現有的臨床決策支援系統,要不就沒有提供預測的可解釋性,要不就是使用像顯著性和 Shapley 值之類的技術,這些技術不允許基於醫生的驗證。為了解決這個差距,本研究將先前使用的可解釋 AI 技術與一種新提出的稱為「2 因子檢索 (2FR)」的技術進行比較,後者是一種介面設計和搜尋檢索的組合,它會傳回標籤相似的資料,而不會處理這些資料。這會產生一個 2 因子安全機制,其中:(a) 正確的影像需要由 AI 檢索;(b) 人類應將檢索的影像與正在測試中的病理聯想起來。我們發現,當在胸部 X 光診斷上進行測試時,2FR 會提高臨床醫生的準確度,特別是在臨床醫生是放射科醫生且對其決策信心不足時,會有顯著的改善。我們的結果強調了理解人機決策的不同模式如何影響臨床醫生在臨床決策支援系統中的準確性的重要性。 -##### **Making Universal Policies Universal** -2502.14777v1 by Niklas Höpner, David Kuric, Herke van Hoof +##### **Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance** +2411.19356v1 by Philipp Brauner, Felix Glawe, Gian Luca Liehner, Luisa Vervier, Martina Ziefle -The development of a generalist agent capable of solving a wide range of -sequential decision-making tasks remains a significant challenge. We address -this problem in a cross-agent setup where agents share the same observation -space but differ in their action spaces. Our approach builds on the universal -policy framework, which decouples policy learning into two stages: a -diffusion-based planner that generates observation sequences and an inverse -dynamics model that assigns actions to these plans. We propose a method for -training the planner on a joint dataset composed of trajectories from all -agents. 
This method offers the benefit of positive transfer by pooling data -from different agents, while the primary challenge lies in adapting shared -plans to each agent's unique constraints. We evaluate our approach on the -BabyAI environment, covering tasks of varying complexity, and demonstrate -positive transfer across agents. Additionally, we examine the planner's -generalisation ability to unseen agents and compare our method to traditional -imitation learning approaches. By training on a pooled dataset from multiple -agents, our universal policy achieves an improvement of up to $42.20\%$ in task -completion accuracy compared to a policy trained on a dataset from a single -agent. +Understanding public perception of artificial intelligence (AI) and the +tradeoffs between potential risks and benefits is crucial, as these perceptions +might shape policy decisions, influence innovation trajectories for successful +market strategies, and determine individual and societal acceptance of AI +technologies. Using a representative sample of 1100 participants from Germany, +this study examines mental models of AI. Participants quantitatively evaluated +71 statements about AI's future capabilities (e.g., autonomous driving, medical +care, art, politics, warfare, and societal divides), assessing the expected +likelihood of occurrence, perceived risks, benefits, and overall value. We +present rankings of these projections alongside visual mappings illustrating +public risk-benefit tradeoffs. While many scenarios were deemed likely, +participants often associated them with high risks, limited benefits, and low +overall value. Across all scenarios, 96.4% ($r^2=96.4\%$) of the variance in +value assessment can be explained by perceived risks ($\beta=-.504$) and +perceived benefits ($\beta=+.710$), with no significant relation to expected +likelihood. 
Demographics and personality traits influenced perceptions of +risks, benefits, and overall evaluations, underscoring the importance of +increasing AI literacy and tailoring public information to diverse user needs. +These findings provide actionable insights for researchers, developers, and +policymakers by highlighting critical public concerns and individual factors +essential to align AI development with individual values. -摘要:開發一種能夠解決廣泛順序決策任務的通才代理仍然是一項重大挑戰。我們在跨代理設置中解決這個問題,其中代理共享相同的觀察空間,但在其動作空間中有所不同。我們的做法建立在通用策略框架之上,該框架將策略學習解耦為兩個階段:生成觀察序列的基於擴散的規劃器和將動作分配給這些計劃的逆動態模型。我們提出了一種在由所有代理的軌跡組成的聯合數據集上訓練規劃器的方法。這種方法提供了通過彙總來自不同代理的數據來進行正向傳輸的好處,而主要的挑戰在於將共享計劃適應於每個代理的唯一約束。我們在 BabyAI 環境中評估了我們的做法,涵蓋了不同複雜程度的任務,並展示了跨代理的正向傳輸。此外,我們檢查了規劃器對未見代理的概括能力,並將我們的做法與傳統的模仿學習方法進行了比較。通過在來自多個代理的彙總數據集上進行訓練,我們的通用策略在任務完成準確度方面實現了高達 42.20% 的改進,而從單個代理的數據集上訓練的策略。 +摘要:了解公眾對人工智慧 (AI) 的認知以及潛在風險與好處之間的權衡至關重要,因為這些認知可能會影響政策決策、影響成功市場策略的創新軌跡,並決定個人和社會對 AI 技術的接受度。本研究使用來自德國的 1100 名參與者的代表性樣本,探討了 AI 的心智模型。參與者對 71 項關於 AI 未來能力的陳述(例如,自動駕駛、醫療保健、藝術、政治、戰爭和社會分歧)進行了定量評估,評估預期的發生可能性、感知風險、好處和整體價值。我們展示了這些預測的排名,並附上視覺化映射,說明了公眾的風險收益權衡。儘管許多場景被認為是可能的,但參與者通常將它們與高風險、有限的好處和低整體價值聯繫起來。在所有場景中,96.4% ($r^2=96.4\%$) 的價值評估差異可以用感知風險 ($\beta=-.504$) 和感知好處 ($\beta=+.710$) 來解釋,與預期的可能性沒有顯著關係。人口統計和人格特質影響了對風險、好處和整體評估的看法,這凸顯了提高 AI 素養和根據不同的使用者需求調整公共資訊的重要性。這些發現通過強調關鍵的公共關注和與個人價值觀一致的 AI 開發必不可少的個人因素,為研究人員、開發人員和政策制定者提供了可行的見解。 -##### **SurveyX: Academic Survey Automation via Large Language Models** -2502.14776v1 by Xun Liang, Jiawei Yang, Yezhaohui Wang, Chen Tang, Zifan Zheng, Simin Niu, Shichao Song, Hanyu Wang, Bo Tang, Feiyu Xiong, Keming Mao, Zhiyu li +##### **Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset** +2411.17645v2 by Yujie Dai, Brian Sullivan, Axel Montout, Amy Dillon, Chris Waller, Peter Acs, Rachel Denholm, Philip Williams, Alastair D Hay, Raul Santos-Rodriguez, Andrew Dowsey -Large Language Models (LLMs) have demonstrated exceptional comprehension -capabilities and a vast knowledge base, 
suggesting that LLMs can serve as -efficient tools for automated survey generation. However, recent research -related to automated survey generation remains constrained by some critical -limitations like finite context window, lack of in-depth content discussion, -and absence of systematic evaluation frameworks. Inspired by human writing -processes, we propose SurveyX, an efficient and organized system for automated -survey generation that decomposes the survey composing process into two phases: -the Preparation and Generation phases. By innovatively introducing online -reference retrieval, a pre-processing method called AttributeTree, and a -re-polishing process, SurveyX significantly enhances the efficacy of survey -composition. Experimental evaluation results show that SurveyX outperforms -existing automated survey generation systems in content quality (0.259 -improvement) and citation quality (1.76 enhancement), approaching human expert -performance across multiple evaluation dimensions. Examples of surveys -generated by SurveyX are available on www.surveyx.cn +The use of machine learning and AI on electronic health records (EHRs) holds +substantial potential for clinical insight. However, this approach faces +challenges due to data heterogeneity, sparsity, temporal misalignment, and +limited labeled outcomes. In this context, we leverage a linked EHR dataset of +approximately one million de-identified individuals from Bristol, North +Somerset, and South Gloucestershire, UK, to characterize urinary tract +infections (UTIs). We implemented a data pre-processing and curation pipeline +that transforms the raw EHR data into a structured format suitable for +developing predictive models focused on data fairness, accountability and +transparency. Given the limited availability and biases of ground truth UTI +outcomes, we introduce a UTI risk estimation framework informed by clinical +expertise to estimate UTI risk across individual patient timelines. 
Pairwise +XGBoost models are trained using this framework to differentiate UTI risk +categories with explainable AI techniques applied to identify key predictors +and support interpretability. Our findings reveal differences in clinical and +demographic predictors across risk groups. While this study highlights the +potential of AI-driven insights to support UTI clinical decision-making, +further investigation of patient sub-strata and extensive validation are needed +to ensure robustness and applicability in clinical practice. -摘要:大型語言模型 (LLM) 已展現出卓越的理解能力和廣泛的知識庫,表示 LLM 可作為自動調查生成的有用工具。然而,與自動調查生成相關的最新研究仍受到一些關鍵限制的約束,例如有限的上下文視窗、缺乏深入的內容討論以及系統評估架構的缺失。受到人類寫作過程的啟發,我們提出 SurveyX,這是一個用於自動調查生成的有效且有組織的系統,它將調查組成過程分解為兩個階段:準備和生成階段。透過創新地引入線上參考檢索、一種稱為 AttributeTree 的預處理方法和重新潤飾過程,SurveyX 大幅提升了調查組成的效能。實驗評估結果顯示,SurveyX 在內容品質(提升 0.259)和引用品質(提升 1.76)方面優於現有的自動調查生成系統,在多個評估面向中接近人類專家的表現。由 SurveyX 生成的調查範例可在 www.surveyx.cn 取得 +摘要:電子健康紀錄 (EHR) 中機器學習和 AI 的使用對於臨床見解具有相當大的潛力。然而,由於資料異質性、稀疏性、時間錯位和標籤結果有限,此方法面臨挑戰。在此背景下,我們利用來自英國布里斯托、北薩默塞特和南格洛斯特郡約一百萬名去識別個人連結的 EHR 資料集,來描述尿路感染 (UTI)。我們實施了將原始 EHR 資料轉換為結構化格式的資料前處理和整理管線,適合開發專注於資料公平性、問責制和透明度的預測模型。鑑於 UTI 真實結果的可用性有限和偏差,我們引入了由臨床專業知識告知的 UTI 風險評估架構,以估計個別患者時間軸上的 UTI 風險。成對的 XGBoost 模型使用此架構進行訓練,以區分 UTI 風險類別,並應用可解釋的 AI 技術來識別關鍵預測因子並支持可解釋性。我們的研究結果揭示了不同風險群組在臨床和人口統計預測因子上的差異。雖然這項研究強調了 AI 驅動見解在支援 UTI 臨床決策制定方面的潛力,但仍需要進一步調查患者子群體和廣泛驗證,以確保在臨床實務中的穩健性和適用性。 -##### **Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning** -2502.14768v1 by Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, Chong Luo +##### **Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care** +2411.11774v1 by Jeffrey N. Clark, Matthew Wragg, Emily Nielsen, Miquel Perello-Nieto, Nawid Keshtmand, Michael Ambler, Shiv Sharma, Christopher P. 
Bourdeaux, Amberly Brigden, Raul Santos-Rodriguez -Inspired by the success of DeepSeek-R1, we explore the potential of -rule-based reinforcement learning (RL) in large reasoning models. To analyze -reasoning dynamics, we use synthetic logic puzzles as training data due to -their controllable complexity and straightforward answer verification. We make -some key technical contributions that lead to effective and stable RL training: -a system prompt that emphasizes the thinking and answering process, a stringent -format reward function that penalizes outputs for taking shortcuts, and a -straightforward training recipe that achieves stable convergence. Our 7B model -develops advanced reasoning skills-such as reflection, verification, and -summarization-that are absent from the logic corpus. Remarkably, after training -on just 5K logic problems, it demonstrates generalization abilities to the -challenging math benchmarks AIME and AMC. +There is a growing need to understand how digital systems can support +clinical decision-making, particularly as artificial intelligence (AI) models +become increasingly complex and less human-interpretable. This complexity +raises concerns about trustworthiness, impacting safe and effective adoption of +such technologies. Improved understanding of decision-making processes and +requirements for explanations coming from decision support tools is a vital +component in providing effective explainable solutions. This is particularly +relevant in the data-intensive, fast-paced environments of intensive care units +(ICUs). To explore these issues, group interviews were conducted with seven ICU +clinicians, representing various roles and experience levels. Thematic analysis +revealed three core themes: (T1) ICU decision-making relies on a wide range of +factors, (T2) the complexity of patient state is challenging for shared +decision-making, and (T3) requirements and capabilities of AI decision support +systems. 
We include design recommendations from clinical input, providing +insights to inform future AI systems for intensive care. -摘要:在 DeepSeek-R1 成功案例的启发下,我们探索了基于规则的强化学习 (RL) 在大型推理模型中的潜力。为了分析推理动态,我们使用合成逻辑难题作为训练数据,因为它们的可控复杂性和直接的答案验证。我们做出了一些关键的技术贡献,这些贡献导致了有效且稳定的 RL 训练:一个强调思考和回答过程的系统提示、一个严格的格式奖励函数,用于惩罚采取捷径的输出,以及一个实现稳定收敛的直接训练配方。我们的 7B 模型发展了高级推理技能,例如反射、验证和总结,这些技能在逻辑语料库中是不存在的。值得注意的是,在仅对 5K 个逻辑问题进行训练后,它展示了对具有挑战性的数学基准 AIME 和 AMC 的泛化能力。 +摘要:隨著人工智慧 (AI) 模型變得越來越複雜,且越來越難以被人理解,了解數位系統如何支援臨床決策的需求也日益增加。這種複雜性引發了對可信度的疑慮,影響了此類技術的安全且有效採用。改善對決策制定流程的理解,以及對決策支援工具所提供說明的要求,是提供有效可解釋解決方案的重要組成部分。這在資料密集、快節奏的加護病房 (ICU) 環境中特別相關。為了探討這些問題,對七位 ICU 臨床醫師進行了小組訪談,這些醫師代表了不同的角色和經驗層級。主題分析揭露了三個核心主題:(T1) ICU 決策制定依賴於廣泛的因素,(T2) 病患狀態的複雜性對共同決策制定構成挑戰,以及 (T3) AI 決策支援系統的要求和能力。我們納入了臨床輸入的設計建議,提供見解以提供資訊給未來用於加護的 AI 系統。 -##### **Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis** -2502.14767v1 by Priyanka Kargupta, Ishika Agarwal, Tal August, Jiawei Han +##### **Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning** +2411.10255v1 by Mohammed Yaseen Jabarulla, Theodor Uden, Thomas Jack, Philipp Beerbaum, Steffen Oeltze-Jafra -With the exponential growth of research facilitated by modern technology and -improved accessibility, scientific discoveries have become increasingly -fragmented within and across fields. This makes it challenging to assess the -significance, novelty, incremental findings, and equivalent ideas between -related works, particularly those from different research communities. Large -language models (LLMs) have recently demonstrated strong quantitative and -qualitative reasoning abilities, and multi-agent LLM debates have shown promise -in handling complex reasoning tasks by exploring diverse perspectives and -reasoning paths. 
Inspired by this, we introduce Tree-of-Debate (ToD), a -framework which converts scientific papers into LLM personas that debate their -respective novelties. To emphasize structured, critical reasoning rather than -focusing solely on outcomes, ToD dynamically constructs a debate tree, enabling -fine-grained analysis of independent novelty arguments within scholarly -articles. Through experiments on scientific literature across various domains, -evaluated by expert researchers, we demonstrate that ToD generates informative -arguments, effectively contrasts papers, and supports researchers in their -literature review. +Pediatric heart diseases present a broad spectrum of congenital and acquired +diseases. More complex congenital malformations require a differentiated and +multimodal decision-making process, usually including echocardiography as a +central imaging method. Artificial intelligence (AI) offers considerable +promise for clinicians by facilitating automated interpretation of pediatric +echocardiography data. However, adapting AI technologies for pediatric +echocardiography analysis has challenges such as limited public data +availability, data privacy, and AI model transparency. Recently, researchers +have focused on disruptive technologies, such as federated learning (FL) and +explainable AI (XAI), to improve automatic diagnostic and decision support +workflows. This study offers a comprehensive overview of the limitations and +opportunities of AI in pediatric echocardiography, emphasizing the synergistic +workflow and role of XAI and FL, identifying research gaps, and exploring +potential future developments. Additionally, three relevant clinical use cases +demonstrate the functionality of XAI and FL with a focus on (i) view +recognition, (ii) disease classification, (iii) segmentation of cardiac +structures, and (iv) quantitative assessment of cardiac function. 
-摘要:隨著現代科技促進的研究呈指數成長,加上可近性的提升,科學發現已在各領域內外變得越來越分散。這使得評估相關作品之間的重要性、新穎性、漸進式發現和等價概念變得具有挑戰性,特別是來自不同研究社群的作品。大型語言模型 (LLM) 近期已展現出強大的量化和質化推理能力,而多重代理 LLM 辯論已在處理複雜推理任務方面展現出潛力,方法是探索不同的觀點和推理路徑。受到此啟發,我們引入了辯論樹 (ToD),這是一個將科學論文轉換為 LLM 人格的架構,這些人格會辯論各自的新穎性。為了強調結構化、批判性推理,而非僅專注於結果,ToD 會動態建構一個辯論樹,讓使用者能夠深入分析學術文章中獨立的新穎性論點。透過在不同領域的科學文獻上進行實驗,並由專家研究員進行評估,我們證明了 ToD 能產生有見地的論點、有效對比論文,並在研究人員的文獻回顧中提供協助。 +摘要:小兒心臟疾病呈現先天性與後天性疾病的廣泛光譜。較複雜的先天性畸形需要一個差異化且多模式的決策過程,通常包括心臟超音波檢查作為主要的影像方法。人工智慧 (AI) 為臨床醫生提供了相當大的希望,因為它可以促進小兒心臟超音波檢查資料的自動化解讀。然而,將人工智慧技術應用於小兒心臟超音波檢查分析有許多挑戰,例如有限的公開資料可用性、資料隱私和人工智慧模型透明度。最近,研究人員專注於顛覆性技術,例如聯合學習 (FL) 和可解釋人工智慧 (XAI),以改善自動診斷和決策支援工作流程。本研究提供了人工智慧在小兒心臟超音波檢查中的限制和機會的全面概述,強調了 XAI 和 FL 的協同工作流程和角色,找出研究差距並探討潛在的未來發展。此外,三個相關的臨床使用案例展示了 XAI 和 FL 的功能,重點在於 (i) 切面辨識、(ii) 疾病分類、(iii) 心臟結構分割和 (iv) 心臟功能的量化評估。 -##### **Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning** -2502.14765v1 by Juraj Vladika, Ivana Hacajová, Florian Matthes +##### **Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering** +2411.00916v2 by Mehdi Hosseini Chagahi, Saeed Mohammadi Dashtaki, Niloufar Delfan, Nadia Mohammadi, Alireza Samari, Behzad Moshiri, Md. Jalil Piran, Oliver Faust -Fact verification (FV) aims to assess the veracity of a claim based on -relevant evidence. The traditional approach for automated FV includes a -three-part pipeline relying on short evidence snippets and encoder-only -inference models. More recent approaches leverage the multi-turn nature of LLMs -to address FV as a step-by-step problem where questions inquiring additional -context are generated and answered until there is enough information to make a -decision. This iterative method makes the verification process rational and -explainable. While these methods have been tested for encyclopedic claims, -exploration on domain-specific and realistic claims is missing. 
In this work, -we apply an iterative FV system on three medical fact-checking datasets and -evaluate it with multiple settings, including different LLMs, external web -search, and structured reasoning using logic predicates. We demonstrate -improvements in the final performance over traditional approaches and the high -potential of step-by-step FV systems for domain-specific claims. +Osteoporosis is a common condition that increases fracture risk, especially +in older adults. Early diagnosis is vital for preventing fractures, reducing +treatment costs, and preserving mobility. However, healthcare providers face +challenges like limited labeled data and difficulties in processing medical +images. This study presents a novel multi-modal learning framework that +integrates clinical and imaging data to improve diagnostic accuracy and model +interpretability. The model utilizes three pre-trained networks-VGG19, +InceptionV3, and ResNet50-to extract deep features from X-ray images. These +features are transformed using PCA to reduce dimensionality and focus on the +most relevant components. A clustering-based selection process identifies the +most representative components, which are then combined with preprocessed +clinical data and processed through a fully connected network (FCN) for final +classification. A feature importance plot highlights key variables, showing +that Medical History, BMI, and Height were the main contributors, emphasizing +the significance of patient-specific data. While imaging features were +valuable, they had lower importance, indicating that clinical data are crucial +for accurate predictions. This framework promotes precise and interpretable +predictions, enhancing transparency and building trust in AI-driven diagnoses +for clinical integration. 
-摘要:事實驗證 (FV) 旨在根據相關證據評估主張的真實性。自動化 FV 的傳統方法包括依賴於短證據片段和僅編碼器推論模型的三部分管道。最近的方法利用 LLM 的多輪特性,將 FV 視為一個逐步問題,其中會產生問題來詢問額外背景並回答,直到有足夠的資訊可以做出決定。這種迭代方法使驗證過程合理且可解釋。雖然這些方法已針對百科全書式主張進行測試,但缺乏對特定領域和現實主張的探討。在這項工作中,我們在三個醫學事實查核資料集上應用了一個迭代 FV 系統,並使用多種設定對其進行評估,包括不同的 LLM、外部網路搜尋和使用邏輯謂詞的結構化推理。我們展示了傳統方法的最終效能改進,以及逐步 FV 系統對特定領域主張的高潛力。 +摘要:骨質疏鬆症是一種常見的疾病,會增加骨折的風險,特別是老年人。早期診斷對於預防骨折、降低治療成本和維持行動能力至關重要。然而,醫療保健提供者面臨著標記數據有限和處理醫學影像困難等挑戰。本研究提出了一個新穎的多模式學習框架,該框架整合了臨床和影像數據,以提高診斷準確性和模型可解釋性。該模型利用三個預訓練的網路,VGG19、InceptionV3 和 ResNet50,從 X 射線影像中提取深度特徵。這些特徵使用 PCA 轉換以降低維度並專注於最相關的組成部分。基於聚類的選擇過程識別出最具代表性的組成部分,然後將這些組成部分與預處理的臨床數據結合,並通過全連接網路 (FCN) 進行最終分類。特徵重要性圖突出了關鍵變數,表明病史、BMI 和身高是主要貢獻因素,強調了患者特定數據的重要性。雖然影像特徵很有價值,但它們的重要性較低,這表明臨床數據對於準確預測至關重要。此框架促進了準確且可解釋的預測,提高了透明度,並建立了對 AI 驅動診斷在臨床整合中的信任。 -##### **EquivaMap: Leveraging LLMs for Automatic Equivalence Checking of Optimization Formulations** -2502.14760v1 by Haotian Zhai, Connor Lawless, Ellen Vitercik, Liu Leqi +##### **A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection** +2410.19898v1 by Muath Alsuhaibani, Ali Pourramezan Fard, Jian Sun, Farida Far Poor, Peter S. Pressman, Mohammad H. Mahoor -A fundamental problem in combinatorial optimization is identifying equivalent -formulations, which can lead to more efficient solution strategies and deeper -insights into a problem's computational complexity. The need to automatically -identify equivalence between problem formulations has grown as optimization -copilots--systems that generate problem formulations from natural language -descriptions--have proliferated. However, existing approaches to checking -formulation equivalence lack grounding, relying on simple heuristics which are -insufficient for rigorous validation. Inspired by Karp reductions, in this work -we introduce quasi-Karp equivalence, a formal criterion for determining when -two optimization formulations are equivalent based on the existence of a -mapping between their decision variables. 
We propose EquivaMap, a framework -that leverages large language models to automatically discover such mappings, -enabling scalable and reliable equivalence verification. To evaluate our -approach, we construct the first open-source dataset of equivalent optimization -formulations, generated by applying transformations such as adding slack -variables or valid inequalities to existing formulations. Empirically, -EquivaMap significantly outperforms existing methods, achieving substantial -improvements in correctly identifying formulation equivalence. +This review paper explores recent advances in deep learning approaches for +non-invasive cognitive impairment detection. We examine various non-invasive +indicators of cognitive decline, including speech and language, facial, and +motoric mobility. The paper provides an overview of relevant datasets, +feature-extracting techniques, and deep-learning architectures applied to this +domain. We have analyzed the performance of different methods across modalities +and observed that speech and language-based methods generally achieved the +highest detection performance. Studies combining acoustic and linguistic +features tended to outperform those using a single modality. Facial analysis +methods showed promise for visual modalities but were less extensively studied. +Most papers focused on binary classification (impaired vs. non-impaired), with +fewer addressing multi-class or regression tasks. Transfer learning and +pre-trained language models emerged as popular and effective techniques, +especially for linguistic analysis. Despite significant progress, several +challenges remain, including data standardization and accessibility, model +explainability, longitudinal analysis limitations, and clinical adaptation. 
+Lastly, we propose future research directions, such as investigating +language-agnostic speech analysis methods, developing multi-modal diagnostic +systems, and addressing ethical considerations in AI-assisted healthcare. By +synthesizing current trends and identifying key obstacles, this review aims to +guide further development of deep learning-based cognitive impairment detection +systems to improve early diagnosis and ultimately patient outcomes. -摘要:組合優化中的基本問題在於識別等效公式,這可能導致更有效的解決策略,並更深入地了解問題的計算複雜性。隨著優化輔助系統(從自然語言描述中產生問題公式的系統)的普及,自動識別問題公式之間等價性的需求也隨之增加。然而,現有的公式等價性檢查方法缺乏依據,依賴於簡單的啟發法,而這對於嚴格驗證來說是不夠的。受 Karp 遞減啟發,我們在這項工作中引入了準 Karp 等價性,這是一個正式標準,用於根據決策變數之間的映射存在性來確定兩個優化公式何時等效。我們提出了 EquivaMap,一個利用大型語言模型自動發現此類映射的框架,實現可擴充且可靠的等價性驗證。為了評估我們的做法,我們構建了第一個等效優化公式的開源資料集,該資料集是通過對現有公式套用轉換(例如添加鬆弛變數或有效不等式)產生的。根據經驗,EquivaMap 明顯優於現有方法,在正確識別公式等價性方面取得了顯著進展。 +摘要:本篇評論探討了深度學習方法在非侵入式認知功能障礙檢測上的最新進展。我們檢視了各種非侵入式的認知衰退指標,包括語音和語言、面部和運動機能。本文概述了與此領域相關的資料集、特徵提取技術和深度學習架構。我們分析了不同方法在不同模態上的表現,並觀察到基於語音和語言的方法通常能達到最高的檢測表現。結合聲學和語言特徵的研究往往優於使用單一模態的研究。面部分析方法顯示出視覺模態的潛力,但研究較少。大多數論文專注於二元分類(受損與未受損),較少探討多類或回歸任務。遷移學習和預訓練語言模型已成為流行且有效的技術,特別是對於語言分析。儘管取得了重大進展,但仍存在一些挑戰,包括資料標準化和可及性、模型可解釋性、縱向分析限制和臨床適應性。最後,我們提出了未來的研究方向,例如調查與語言無關的語音分析方法、開發多模式診斷系統,以及解決人工智慧輔助醫療保健中的倫理考量。透過綜合目前的趨勢和找出關鍵障礙,本篇評論旨在引導深度學習為基礎的認知功能障礙檢測系統的進一步發展,以改善早期診斷,並最終改善患者的治療結果。 -##### **On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation Systems** -2502.14759v1 by Juraj Vladika, Florian Matthes +##### **An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems** +2410.17504v1 by Shruthi Chari -Retrieval-augmented generation (RAG) has emerged as an approach to augment -large language models (LLMs) by reducing their reliance on static knowledge and -improving answer factuality. RAG retrieves relevant context snippets and -generates an answer based on them. 
Despite its increasing industrial adoption, -systematic exploration of RAG components is lacking, particularly regarding the -ideal size of provided context, and the choice of base LLM and retrieval -method. To help guide development of robust RAG systems, we evaluate various -context sizes, BM25 and semantic search as retrievers, and eight base LLMs. -Moving away from the usual RAG evaluation with short answers, we explore the -more challenging long-form question answering in two domains, where a good -answer has to utilize the entire context. Our findings indicate that final QA -performance improves steadily with up to 15 snippets but stagnates or declines -beyond that. Finally, we show that different general-purpose LLMs excel in the -biomedical domain than the encyclopedic one, and that open-domain evidence -retrieval in large corpora is challenging. +Explainable Artificial Intelligence (AI) focuses on helping humans understand +the working of AI systems or their decisions and has been a cornerstone of AI +for decades. Recent research in explainability has focused on explaining the +workings of AI models or model explainability. There have also been several +position statements and review papers detailing the needs of end-users for +user-centered explainability but fewer implementations. Hence, this thesis +seeks to bridge some gaps between model and user-centered explainability. We +create an explanation ontology (EO) to represent literature-derived explanation +types via their supporting components. We implement a knowledge-augmented +question-answering (QA) pipeline to support contextual explanations in a +clinical setting. Finally, we are implementing a system to combine explanations +from different AI methods and data modalities. Within the EO, we can represent +fifteen different explanation types, and we have tested these representations +in six exemplar use cases. 
We find that knowledge augmentations improve the +performance of base large language models in the contextualized QA, and the +performance is variable across disease groups. In the same setting, clinicians +also indicated that they prefer to see actionability as one of the main foci in +explanations. In our explanations combination method, we plan to use similarity +metrics to determine the similarity of explanations in a chronic disease +detection setting. Overall, through this thesis, we design methods that can +support knowledge-enabled explanations across different use cases, accounting +for the methods in today's AI era that can generate the supporting components +of these explanations and domain knowledge sources that can enhance them. -摘要:檢索增強生成 (RAG) 已成為一種方法,可透過減少大型語言模型 (LLM) 對靜態知識的依賴,並改善答案的真實性,來增強大型語言模型 (LLM)。RAG 會擷取相關的內容片段,並根據這些片段產生答案。儘管其產業採用率不斷提高,但缺乏對 RAG 組成的系統性探討,特別是在提供的內容的理想大小,以及基礎 LLM 和檢索方法的選擇方面。為了協助引導穩健 RAG 系統的開發,我們評估了各種內容大小、BM25 和語意搜尋作為檢索器,以及八個基礎 LLM。我們不再使用簡短答案進行常見的 RAG 評估,而是探討在兩個領域中更具挑戰性的長篇問答,其中一個好的答案必須利用整個內容。我們的研究結果指出,最終的問答效能會隨著多達 15 個片段而穩定提升,但在超過這個數量後就會停滯或下降。最後,我們表明不同的通用 LLM 在生物醫學領域比百科全書領域更為出色,而且在大型語料庫中進行開放領域證據檢索具有挑戰性。 +摘要:可解釋人工智慧(AI)專注於協助人類了解 AI 系統運作或其決策,數十年來一直是 AI 的基石。最近的可解釋性研究專注於解釋 AI 模型或模型可解釋性的運作。也有幾份立場聲明和評論論文詳細說明了最終使用者對以使用者為中心的可解釋性的需求,但實作較少。因此,本論文旨在彌補模型和以使用者為中心的可解釋性之間的一些差距。我們建立一個解釋本體(EO)以透過其支援元件來表示從文獻中衍生的解釋類型。我們實作一個知識增強的問答(QA)管線,以在臨床環境中支援情境解釋。最後,我們正在實作一個系統,以結合來自不同 AI 方法和資料模式的解釋。在 EO 中,我們可以表示 15 種不同的解釋類型,並且我們已在六個範例使用案例中測試這些表示。我們發現,知識增強改善了基礎大型語言模型在情境化 QA 中的效能,並且效能因疾病群組而異。在相同的環境中,臨床醫生也表示他們希望將可操作性視為解釋中的主要焦點之一。在我們的解釋組合方法中,我們計畫使用相似性指標來確定慢性病偵測環境中解釋的相似性。總體而言,透過本論文,我們設計了可以在不同使用案例中支援知識啟用解釋的方法,考量到當今 AI 時代中可以產生這些解釋的支援元件和可以增強這些解釋的領域知識來源的方法。 -##### **MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders** -2502.14753v1 by Maya Varma, Ashwin Kumar, Rogier van der Sluijs, Sophie Ostmeier, Louis Blankemeier, Pierre Chambon, Christian Bluethgen, Jip Prince, Curtis Langlotz, Akshay Chaudhari +##### 
**Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study** +2410.16879v1 by Lukas Hughes-Noehrer, Leda Channer, Gabriel Strain, Gregory Yates, Richard Body, Caroline Jay -Medical images are acquired at high resolutions with large fields of view in -order to capture fine-grained features necessary for clinical decision-making. -Consequently, training deep learning models on medical images can incur large -computational costs. In this work, we address the challenge of downsizing -medical images in order to improve downstream computational efficiency while -preserving clinically-relevant features. We introduce MedVAE, a family of six -large-scale 2D and 3D autoencoders capable of encoding medical images as -downsized latent representations and decoding latent representations back to -high-resolution images. We train MedVAE autoencoders using a novel two-stage -training approach with 1,052,730 medical images. Across diverse tasks obtained -from 20 medical image datasets, we demonstrate that (1) utilizing MedVAE latent -representations in place of high-resolution images when training downstream -models can lead to efficiency benefits (up to 70x improvement in throughput) -while simultaneously preserving clinically-relevant features and (2) MedVAE can -decode latent representations back to high-resolution images with high -fidelity. Our work demonstrates that large-scale, generalizable autoencoders -can help address critical efficiency challenges in the medical domain. Our code -is available at https://github.com/StanfordMIMI/MedVAE. +Objectives: To investigate clinicians' attitudes towards current automated +interpretation of ECG and novel AI technologies and their perception of +computer-assisted interpretation. Materials and Methods: We conducted a series +of interviews with clinicians in the UK. 
Our study: (i) explores the potential +for AI, specifically future 'human-like' computing approaches, to facilitate +ECG interpretation and support clinical decision making, and (ii) elicits their +opinions about the importance of explainability and trustworthiness of AI +algorithms. Results: We performed inductive thematic analysis on interview +transcriptions from 23 clinicians and identified the following themes: (i) a +lack of trust in current systems, (ii) positive attitudes towards future AI +applications and requirements for these, (iii) the relationship between the +accuracy and explainability of algorithms, and (iv) opinions on education, +possible deskilling, and the impact of AI on clinical competencies. Discussion: +Clinicians do not trust current computerised methods, but welcome future 'AI' +technologies. Where clinicians trust future AI interpretation to be accurate, +they are less concerned that it is explainable. They also preferred ECG +interpretation that demonstrated the results of the algorithm visually. Whilst +clinicians do not fear job losses, they are concerned about deskilling and the +need to educate the workforce to use AI responsibly. Conclusion: Clinicians are +positive about the future application of AI in clinical decision-making. +Accuracy is a key factor of uptake and visualisations are preferred over +current computerised methods. This is viewed as a potential means of training +and upskilling, in contrast to the deskilling that automation might be +perceived to bring. 
-摘要:医学影像以高解析度和广阔的视野获取,以便捕捉临床决策所需的细微特征。因此,在医学影像上训练深度学习模型可能会产生巨大的计算成本。在这项工作中,我们解决了缩小医学影像以提高下游计算效率同时保留临床相关特征的挑战。我们介绍了 MedVAE,这是一个由六个大型 2D 和 3D 自动编码器组成的系列,能够将医学影像编码为缩小的潜在表示,并将潜在表示解码回高分辨率影像。我们使用一种新颖的两阶段训练方法,利用 1,052,730 张医学影像来训练 MedVAE 自动编码器。在从 20 个医学影像数据集获得的不同任务中,我们证明了 (1) 在训练下游模型时,利用 MedVAE 潜在表示代替高分辨率影像可以带来效率优势(吞吐量提高高达 70 倍),同时保留临床相关特征;(2) MedVAE 可以将潜在表示解码回高分辨率影像,且保真度高。我们的工作表明,大规模、可推广的自动编码器可以帮助解决医学领域的重大效率挑战。我们的代码可在 https://github.com/StanfordMIMI/MedVAE 获得。 +摘要:目的:調查臨床醫生對目前自動化心電圖解讀和新的人工智慧技術的態度,以及他們對電腦輔助解讀的看法。材料和方法:我們對英國的臨床醫生進行了一系列訪談。我們的研究:(i) 探討人工智慧的潛力,特別是未來的「類人類」運算方法,以促進心電圖解讀並支持臨床決策制定,以及 (ii) 徵求他們對人工智慧演算法的可解釋性和可信度的看法。結果:我們對 23 位臨床醫生的訪談記錄進行了歸納主題分析,並找出以下主題:(i) 對目前系統缺乏信任,(ii) 對未來人工智慧應用和對這些應用的要求持正面態度,(iii) 演算法的準確性和可解釋性之間的關係,以及 (iv) 對教育、可能的技能退化,以及人工智慧對臨床能力的影響的看法。討論:臨床醫生不信任目前的電腦化方法,但歡迎未來的「人工智慧」技術。在臨床醫生相信未來的 AI 解讀準確的情況下,他們不太擔心它是否可解釋。他們也比較喜歡能以視覺方式呈現演算法結果的心電圖解讀。雖然臨床醫生不害怕失業,但他們擔心技能退化,以及需要教育員工負責任地使用人工智慧。結論:臨床醫生對人工智慧在臨床決策制定中的未來應用持正面態度。準確性是採用人工智慧的一個關鍵因素,而視覺化比目前的電腦化方法更受青睞。這被視為一種潛在的培訓和提升技能的方法,與自動化可能帶來的技能退化形成對比。 -##### **TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators** -2502.14752v1 by Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, Zhiyuan Liu, Maosong Sun +##### **Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer** +2410.15012v1 by Gesa Mittmann, Sara Laiouar-Pedari, Hendrik A. Mehrtens, Sarah Haggenmüller, Tabea-Clara Bucher, Tirtha Chanda, Nadine T. Gaisa, Mathias Wagner, Gilbert Georg Klamminger, Tilman T. Rau, Christina Neppl, Eva Maria Compérat, Andreas Gocht, Monika Hämmerle, Niels J. Rupp, Jula Westhoff, Irene Krücken, Maximillian Seidl, Christian M. Schürch, Marcus Bauer, Wiebke Solass, Yu Chun Tam, Florian Weber, Rainer Grobholz, Jaroslaw Augustyniak, Thomas Kalinski, Christian Hörner, Kirsten D. 
Mertz, Constanze Döring, Andreas Erbersdobler, Gabriele Deubler, Felix Bremmer, Ulrich Sommer, Michael Brodhun, Jon Griffin, Maria Sarah L. Lenon, Kiril Trpkov, Liang Cheng, Fei Chen, Angelique Levi, Guoping Cai, Tri Q. Nguyen, Ali Amin, Alessia Cimadamore, Ahmed Shabaik, Varsha Manucha, Nazeel Ahmad, Nidia Messias, Francesca Sanguedolce, Diana Taheri, Ezra Baraban, Liwei Jia, Rajal B. Shah, Farshid Siadat, Nicole Swarbrick, Kyung Park, Oudai Hassan, Siamak Sakhaie, Michelle R. Downes, Hiroshi Miyamoto, Sean R. Williamson, Tim Holland-Letz, Carolin V. Schneider, Jakob Nikolas Kather, Yuri Tolkach, Titus J. Brinker -Triton, a high-level Python-like language designed for building efficient GPU -kernels, is widely adopted in deep learning frameworks due to its portability, -flexibility, and accessibility. However, programming and parallel optimization -still require considerable trial and error from Triton developers. Despite -advances in large language models (LLMs) for conventional code generation, -these models struggle to generate accurate, performance-optimized Triton code, -as they lack awareness of its specifications and the complexities of GPU -programming. More critically, there is an urgent need for systematic -evaluations tailored to Triton. In this work, we introduce TritonBench, the -first comprehensive benchmark for Triton operator generation. TritonBench -features two evaluation channels: a curated set of 184 real-world operators -from GitHub and a collection of operators aligned with PyTorch interfaces. -Unlike conventional code benchmarks prioritizing functional correctness, -TritonBench also profiles efficiency performance on widely deployed GPUs -aligned with industry applications. Our study reveals that current -state-of-the-art code LLMs struggle to generate efficient Triton operators, -highlighting a significant gap in high-performance code generation. TritonBench -will be available at https://github.com/thunlp/TritonBench. 
+The aggressiveness of prostate cancer, the most common cancer in men +worldwide, is primarily assessed based on histopathological data using the +Gleason scoring system. While artificial intelligence (AI) has shown promise in +accurately predicting Gleason scores, these predictions often lack inherent +explainability, potentially leading to distrust in human-machine interactions. +To address this issue, we introduce a novel dataset of 1,015 tissue microarray +core images, annotated by an international group of 54 pathologists. The +annotations provide detailed localized pattern descriptions for Gleason grading +in line with international guidelines. Utilizing this dataset, we develop an +inherently explainable AI system based on a U-Net architecture that provides +predictions leveraging pathologists' terminology. This approach circumvents +post-hoc explainability methods while maintaining or exceeding the performance +of methods trained directly for Gleason pattern segmentation (Dice score: 0.713 +$\pm$ 0.003 trained on explanations vs. 0.691 $\pm$ 0.010 trained on Gleason +patterns). By employing soft labels during training, we capture the intrinsic +uncertainty in the data, yielding strong results in Gleason pattern +segmentation even in the context of high interobserver variability. With the +release of this dataset, we aim to encourage further research into segmentation +in medical tasks with high levels of subjectivity and to advance the +understanding of pathologists' reasoning processes. 
-摘要:Triton 是一種高階的類 Python 語言,專門用於建構高效的 GPU 核心,由於其可移植性、靈活性及可存取性,已廣泛採用於深度學習框架中。然而,編程和並行最佳化仍需要 Triton 開發人員進行大量的試驗和錯誤。儘管大型語言模型 (LLM) 在傳統程式碼產生方面取得了進展,但這些模型在產生準確且效能最佳化的 Triton 程式碼時仍面臨困難,因為它們缺乏對其規格和 GPU 編程複雜性的認識。更重要的是,迫切需要針對 Triton 量身打造的系統性評估。在這項工作中,我們介紹 TritonBench,這是第一個針對 Triton 算子產生進行全面評比的基準。TritonBench 具有兩個評估管道:一組來自 GitHub 的 184 個真實世界算子,以及一組與 PyTorch 介面對齊的算子。與優先考慮功能正確性的傳統程式碼基準不同,TritonBench 還剖析了與產業應用對齊的廣泛部署 GPU 上的效能表現。我們的研究表明,目前最先進的程式碼 LLM 難以產生高效的 Triton 算子,突顯了高性能程式碼產生中的重大差距。TritonBench 將在 https://github.com/thunlp/TritonBench 提供。 +摘要:前列腺癌是全球男性最常見的癌症,其惡性程度主要根據 Gleason 評分系統使用組織病理學數據進行評估。雖然人工智慧 (AI) 在準確預測 Gleason 評分方面已展現潛力,但這些預測通常缺乏內在的可解釋性,可能會導致對人機互動的不信任。為了解決這個問題,我們引進了一個由 54 位病理學家組成的國際團隊註解的 1,015 個組織微陣列核心影像的新穎資料集。這些註解提供了詳細的局部模式描述,用於符合國際準則的 Gleason 分級。利用這個資料集,我們開發了一個基於 U-Net 架構的內在可解釋 AI 系統,該系統提供了利用病理學家術語進行預測。這種方法規避了事後可解釋性方法,同時維持或超越了直接訓練用於 Gleason 模式分割的方法的效能(Dice 分數:0.713 ± 0.003,訓練於解釋,相對於 0.691 ± 0.010,訓練於 Gleason 模式)。透過在訓練期間採用軟標籤,我們捕捉了資料中的內在不確定性,即使在觀察者間變異性高的情況下,也能在 Gleason 模式分割中產生強大的結果。透過釋出這個資料集,我們旨在鼓勵進一步研究主觀性高的醫療任務中的分割,並增進對病理學家推理過程的理解。 -##### **Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of LLMs** -2502.14748v1 by Zongxia Li, Lorena Calvo-Bartolomé, Alexander Hoyle, Paiheng Xu, Alden Dima, Juan Francisco Fung, Jordan Boyd-Graber +##### **Explainable AI Methods for Multi-Omics Analysis: A Survey** +2410.11910v1 by Ahmad Hussein, Mukesh Prasad, Ali Braytee -A common use of NLP is to facilitate the understanding of large document -collections, with a shift from using traditional topic models to Large Language -Models. Yet the effectiveness of using LLM for large corpus understanding in -real-world applications remains under-explored. This study measures the -knowledge users acquire with unsupervised, supervised LLM-based exploratory -approaches or traditional topic models on two datasets. 
While LLM-based methods -generate more human-readable topics and show higher average win probabilities -than traditional models for data exploration, they produce overly generic -topics for domain-specific datasets that do not easily allow users to learn -much about the documents. Adding human supervision to the LLM generation -process improves data exploration by mitigating hallucination and -over-genericity but requires greater human effort. In contrast, traditional -models like Latent Dirichlet Allocation (LDA) remain effective for exploration -but are less user-friendly. We show that LLMs struggle to describe the haystack -of large corpora without human help, particularly domain-specific data, and -face scaling and hallucination limitations due to context length constraints. -Dataset available at https://huggingface.co/datasets/zli12321/Bills. +Advancements in high-throughput technologies have led to a shift from +traditional hypothesis-driven methodologies to data-driven approaches. +Multi-omics refers to the integrative analysis of data derived from multiple +'omes', such as genomics, proteomics, transcriptomics, metabolomics, and +microbiomics. This approach enables a comprehensive understanding of biological +systems by capturing different layers of biological information. Deep learning +methods are increasingly utilized to integrate multi-omics data, offering +insights into molecular interactions and enhancing research into complex +diseases. However, these models, with their numerous interconnected layers and +nonlinear relationships, often function as black boxes, lacking transparency in +decision-making processes. To overcome this challenge, explainable artificial +intelligence (xAI) methods are crucial for creating transparent models that +allow clinicians to interpret and work with complex data more effectively.
This +review explores how xAI can improve the interpretability of deep learning +models in multi-omics research, highlighting its potential to provide +clinicians with clear insights, thereby facilitating the effective application +of such models in clinical settings. -摘要:NLP 的常見用途是促進對大型文件集合的理解,從使用傳統主題模型轉向大型語言模型。然而,在現實世界的應用中使用 LLM 了解大型語料庫的有效性仍未得到充分探索。本研究衡量了使用者在兩個資料集上使用無監督、監督的基於 LLM 的探索性方法或傳統主題模型獲得的知識。雖然基於 LLM 的方法會產生更多人類可讀的主題,並且顯示出比傳統模型更高的平均獲勝機率,但它們會為特定領域的資料集產生過於通用的主題,而這些主題不容易讓使用者對文件有深入了解。在 LLM 生成過程中加入人類監督可透過減輕幻覺和過度泛化來改善資料探索,但需要更多的人力。相反地,傳統模型(如潛在狄利克雷配置 (LDA))仍然有效於探索,但使用者友善度較低。我們表明,LLM 難以在沒有人類幫助的情況下描述大型語料庫的乾草堆,特別是特定領域的資料,並且會因上下文長度限制而面臨擴充性和幻覺限制。資料集可於 https://huggingface.co/datasets/zli12321/Bills 取得。 +摘要:高通量技術的進步導致從傳統的假設驅動方法轉變為資料驅動的方法。多組學是指整合分析來自多個「組學」的資料,例如基因組學、蛋白質組學、轉錄組學、代謝組學和微生物組學。此方法透過擷取生物資訊的不同層面,能全面了解生物系統。深度學習方法愈來愈常被用於整合多組學資料,提供分子交互作用的洞察力,並加強對複雜疾病的研究。然而,這些模型具有許多相互連接的層級和非線性關係,通常會像黑盒子一樣運作,缺乏決策過程的透明度。為了克服此挑戰,可解釋人工智慧 (xAI) 方法對於建立透明模型至關重要,讓臨床醫生可以更有效地解釋和處理複雜資料。此評論探討 xAI 如何能改善多組學研究中深度學習模型的可解釋性,強調其提供臨床醫生明確見解的潛力,進而促進此類模型在臨床環境中的有效應用。 -##### **HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States** -2502.14744v1 by Yilei Jiang, Xinyan Gao, Tianshuo Peng, Yingshui Tan, Xiaoyong Zhu, Bo Zheng, Xiangyu Yue +##### **Study on the Helpfulness of Explainable Artificial Intelligence** +2410.11896v1 by Tobias Labarta, Elizaveta Kulicheva, Ronja Froelian, Christian Geißler, Xenia Melman, Julian von Klitzing -The integration of additional modalities increases the susceptibility of -large vision-language models (LVLMs) to safety risks, such as jailbreak -attacks, compared to their language-only counterparts. While existing research -primarily focuses on post-hoc alignment techniques, the underlying safety -mechanisms within LVLMs remain largely unexplored. In this work, we -investigate whether LVLMs inherently encode safety-relevant signals within -their internal activations during inference.
Our findings reveal that LVLMs -exhibit distinct activation patterns when processing unsafe prompts, which can -be leveraged to detect and mitigate adversarial inputs without requiring -extensive fine-tuning. Building on this insight, we introduce HiddenDetect, a -novel tuning-free framework that harnesses internal model activations to -enhance safety. Experimental results show that HiddenDetect surpasses -state-of-the-art methods in detecting jailbreak attacks against LVLMs. By -utilizing intrinsic safety-aware patterns, our method provides an efficient and -scalable solution for strengthening LVLM robustness against multimodal threats. -Our code will be released publicly at -https://github.com/leigest519/HiddenDetect. +Explainable Artificial Intelligence (XAI) is essential for building advanced +machine learning-powered applications, especially in critical domains such as +medical diagnostics or autonomous driving. Legal, business, and ethical +requirements motivate using effective XAI, but the increasing number of +different methods makes it challenging to pick the right ones. Further, as +explanations are highly context-dependent, measuring the effectiveness of XAI +methods without users can only reveal a limited amount of information, +excluding human factors such as the ability to understand it. We propose to +evaluate XAI methods via the user's ability to successfully perform a proxy +task, designed such that a good performance is an indicator for the explanation +to provide helpful information. In other words, we address the helpfulness of +XAI for human decision-making. Further, a user study on state-of-the-art +methods was conducted, showing differences in their ability to generate trust +and skepticism and the ability to judge the rightfulness of an AI decision +correctly.
Based on the results, we highly recommend using and extending this +approach for more objective-based human-centered user studies to measure XAI +performance in an end-to-end fashion. -摘要:整合其他模态会增加大型视觉语言模型 (LVLMs) 对安全风险的敏感性,例如越狱攻击,与仅语言的对应模型相比。虽然现有的研究主要集中于事后对齐技术,但 LVLMs 内部的基本安全机制在很大程度上仍未得到探索。在这项工作中,我们调查了 LVLMs 在推理过程中是否在其内部激活中固有地编码了与安全相关的信号。我们的研究结果表明,LVLMs 在处理不安全提示时表现出不同的激活模式,这可以用来检测和缓解对抗性输入,而无需进行广泛的微调。基于这一见解,我们引入了 HiddenDetect,这是一个新颖的无调优框架,利用内部模型激活来增强安全性。实验结果表明,HiddenDetect 在检测针对 LVLMs 的越狱攻击方面超越了最先进的方法。通过利用内在的安全感知模式,我们的方法为加强 LVLM 对多模态威胁的鲁棒性提供了一种高效且可扩展的解决方案。我们的代码将在 https://github.com/leigest519/HiddenDetect 公开发布。 +摘要:可解釋人工智慧 (XAI) 對於建構先進的機器學習驅動應用程式至關重要,特別是在醫療診斷或自動駕駛等關鍵領域。法律、商業和倫理要求促使使用有效的 XAI,但數量日益增加的不同方法使得挑選正確的方法具有挑戰性。此外,由於解釋高度依賴於背景,在沒有使用者的情況下衡量 XAI 方法的有效性只能揭示有限的資訊,排除人類因素,例如理解它的能力。我們建議透過使用者成功執行代理任務的能力來評估 XAI 方法,設計使得良好的執行表現是解釋提供有用資訊的指標。換句話說,我們探討 XAI 對人類決策制定的幫助。此外,對最先進的方法進行使用者研究,顯示出它們在產生信任和懷疑的能力以及正確判斷 AI 決策是否正確的能力方面存在差異。根據結果,我們強烈建議使用和擴充這種方法,以進行更多以目標為基礎的人為中心使用者研究,以終端到終端的方式衡量 XAI 效能。 -##### **Multi-Agent Coordination across Diverse Applications: A Survey** -2502.14743v1 by Lijun Sun, Yijun Yang, Qiqi Duan, Yuhui Shi, Chao Lyu, Yu-Cheng Chang, Chin-Teng Lin, Yang Shen +##### **Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health** +2410.09635v1 by Abdullah Mamun, Lawrence D. Devoe, Mark I. Evans, David W. Britt, Judith Klein-Seetharaman, Hassan Ghasemzadeh -Multi-agent coordination studies the underlying mechanism enabling the -trending spread of diverse multi-agent systems (MAS) and has received -increasing attention, driven by the expansion of emerging applications and -rapid AI advances. This survey outlines the current state of coordination -research across applications through a unified understanding that answers four -fundamental coordination questions: (1) what is coordination; (2) why -coordination; (3) who to coordinate with; and (4) how to coordinate.
Our -purpose is to explore existing ideas and expertise in coordination and their -connections across diverse applications, while identifying and highlighting -emerging and promising research directions. First, general coordination -problems that are essential to varied applications are identified and analyzed. -Second, a number of MAS applications are surveyed, ranging from widely studied -domains, e.g., search and rescue, warehouse automation and logistics, and -transportation systems, to emerging fields including humanoid and -anthropomorphic robots, satellite systems, and large language models (LLMs). -Finally, open challenges about the scalability, heterogeneity, and learning -mechanisms of MAS are analyzed and discussed. In particular, we identify the -hybridization of hierarchical and decentralized coordination, human-MAS -coordination, and LLM-based MAS as promising future directions. +Early detection of intrapartum risk enables interventions to potentially +prevent or mitigate adverse labor outcomes such as cerebral palsy. Currently, +there is no accurate automated system to predict such events to assist with +clinical decision-making. To fill this gap, we propose "Artificial Intelligence +(AI) for Modeling and Explaining Neonatal Health" (AIMEN), a deep learning +framework that not only predicts adverse labor outcomes from maternal, fetal, +obstetrical, and intrapartum risk factors but also provides the model's +reasoning behind the predictions made. The latter can provide insights into +what modifications in the input variables of the model could have changed the +predicted outcome. We address the challenges of imbalance and small datasets by +synthesizing additional training data using Adaptive Synthetic Sampling +(ADASYN) and Conditional Tabular Generative Adversarial Networks (CTGAN). AIMEN +uses an ensemble of fully-connected neural networks as the backbone for its +classification with the data augmentation supported by either ADASYN or CTGAN. 
+AIMEN, supported by CTGAN, outperforms AIMEN supported by ADASYN in +classification. AIMEN can predict a high risk for adverse labor outcomes with +an average F1 score of 0.784. It also provides counterfactual explanations that +can be achieved by changing 2 to 3 attributes on average. Resources available: +https://github.com/ab9mamun/AIMEN. -摘要:多智能體協調研究探討了促成各種多智能體系統 (MAS) 流行擴散的底層機制,並隨著新興應用擴展和 AI 快速進展而受到越來越多的關注。這項調查透過統一的理解來概述協調研究的現狀,回答了四個基本的協調問題:(1) 什麼是協調;(2) 為什麼協調;(3) 與誰協調;以及 (4) 如何協調。我們的目的是探索協調中現有的想法和專業知識,以及它們在不同應用中的關聯,同時找出並強調新興且有前景的研究方向。首先,找出並分析了對各種應用至關重要的協調問題。其次,調查了許多 MAS 應用,範圍從廣泛研究的領域(例如搜尋和救援、倉庫自動化和物流,以及運輸系統),到新興領域,包括人形機器人和擬人機器人、衛星系統和大語言模型 (LLM)。最後,分析並討論了有關 MAS 的可擴充性、異質性和學習機制的開放挑戰。特別是,我們將分層協調和分散式協調、人類-MAS 協調和基於 LLM 的 MAS 的混合視為有前景的未來方向。 +摘要:產程中風險的早期偵測有助於進行干預措施,以預防或減輕不利的生產結果,例如腦性麻痺。目前,沒有準確的自動化系統可以預測此類事件,以協助臨床決策。為了填補這一空白,我們提出「用於建模和解釋新生兒健康的人工智慧」(AIMEN),這是一個深度學習架構,它不僅可以根據孕產婦、胎兒、產科和產程風險因素預測不利的生產結果,還能提供模型做出預測背後的原因。後者可以提供見解,說明模型輸入變數中的哪些修改可能會改變預測結果。我們透過使用適應性合成抽樣 (ADASYN) 和條件表格生成對抗網路 (CTGAN) 來合成額外的訓練資料,以解決不平衡和小型資料集的挑戰。AIMEN 使用全連接神經網路的集合作為其分類的骨幹,並透過 ADASYN 或 CTGAN 支援資料擴充。由 CTGAN 支援的 AIMEN 在分類方面優於由 ADASYN 支援的 AIMEN。AIMEN 可以預測不利的生產結果的高風險,平均 F1 分數為 0.784。它還提供反事實解釋,可透過平均變更 2 至 3 個屬性來達成。可用資源:https://github.com/ab9mamun/AIMEN。 -##### **YOLOv12: A Breakdown of the Key Architectural Features** -2502.14740v1 by Mujadded Al Rabbani Alif, Muhammad Hussain +##### **Artificial intelligence techniques in inherited retinal diseases: A review** +2410.09105v1 by Han Trinh, Jordan Vice, Jason Charng, Zahra Tajbakhsh, Khyber Alam, Fred K. Chen, Ajmal Mian -This paper presents an architectural analysis of YOLOv12, a significant -advancement in single-stage, real-time object detection building upon the -strengths of its predecessors while introducing key improvements. The model -incorporates an optimised backbone (R-ELAN), 7x7 separable convolutions, and -FlashAttention-driven area-based attention, improving feature extraction, -enhanced efficiency, and robust detections. 
With multiple model variants, -similar to its predecessors, YOLOv12 offers scalable solutions for both -latency-sensitive and high-accuracy applications. Experimental results manifest -consistent gains in mean average precision (mAP) and inference speed, making -YOLOv12 a compelling choice for applications in autonomous systems, security, -and real-time analytics. By achieving an optimal balance between computational -efficiency and performance, YOLOv12 sets a new benchmark for real-time computer -vision, facilitating deployment across diverse hardware platforms, from edge -devices to high-performance clusters. +Inherited retinal diseases (IRDs) are a diverse group of genetic disorders +that lead to progressive vision loss and are a major cause of blindness in +working-age adults. The complexity and heterogeneity of IRDs pose significant +challenges in diagnosis, prognosis, and management. Recent advancements in +artificial intelligence (AI) offer promising solutions to these challenges. +However, the rapid development of AI techniques and their varied applications +have led to fragmented knowledge in this field. This review consolidates +existing studies, identifies gaps, and provides an overview of AI's potential +in diagnosing and managing IRDs. It aims to structure pathways for advancing +clinical applications by exploring AI techniques like machine learning and deep +learning, particularly in disease detection, progression prediction, and +personalized treatment planning. Special focus is placed on the effectiveness +of convolutional neural networks in these areas. Additionally, the integration +of explainable AI is discussed, emphasizing its importance in clinical settings +to improve transparency and trust in AI-based systems. The review addresses the +need to bridge existing gaps in focused studies on AI's role in IRDs, offering +a structured analysis of current AI techniques and outlining future research +directions. 
It concludes with an overview of the challenges and opportunities +in deploying AI for IRDs, highlighting the need for interdisciplinary +collaboration and the continuous development of robust, interpretable AI models +to advance clinical applications. -摘要:本文提出 YOLOv12 的架構分析,這是在單階段即時物件偵測領域的重大進展,它建立在前任的優勢之上,同時引入了關鍵改進。該模型結合了最佳化的主幹 (R-ELAN)、7x7 可分離卷積和 FlashAttention 驅動的基於區域的注意力,改進了特徵提取、增強了效率和穩健的偵測。與其前身類似,YOLOv12 具有多種模型變體,為低延遲敏感型和高準確度應用程式提供了可擴充的解決方案。實驗結果顯示在平均準確度 (mAP) 和推論速度方面都有顯著的提升,這使得 YOLOv12 成為自動化系統、安全性和即時分析應用程式的理想選擇。透過在運算效率和效能之間取得最佳平衡,YOLOv12 為即時電腦視覺樹立了新的基準,促進了在各種硬體平台(從邊緣裝置到高性能叢集)上的部署。 +摘要:遺傳性視網膜疾病 (IRD) 是一組多樣化的遺傳疾病, +會導致視力逐漸喪失,是工作年齡成人失明的主要原因。IRD 的複雜性和異質性對診斷、預後和管理提出了重大挑戰。最近人工智能 (AI) 的進步為這些挑戰提供了有希望的解決方案。 +然而,AI 技術的快速發展及其多種應用導致了該領域的知識分散。本綜述整合了現有研究,找出差距,並概述了 AI 在診斷和管理 IRD 中的潛力。它旨在通過探索機器學習和深度學習等 AI 技術,特別是在疾病檢測、進程預測和個性化治療計劃中,為推進臨床應用構建途徑。特別關注這些領域中卷積神經網路的有效性。此外,討論了可解釋 AI 的整合,強調了其在臨床環境中提高透明度和對基於 AI 的系統的信任的重要性。該綜述解決了彌合 AI 在 IRD 中作用的重點研究中現有差距的必要性,提供了對當前 AI 技術的結構化分析,並概述了未來的研究方向。最後概述了在 IRD 中部署 AI 的挑戰和機遇,強調了跨學科合作和持續開發強大、可解釋的 AI 模型以推進臨床應用的必要性。 -##### **SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines** -2502.14739v1 by M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixing Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming Xu, Zhenzhu Yang, Zekun Moore Wang, Junting Zhou, Yuelin Bai, Xingyuan Bu, Chenglin Cai, Liang Chen, Yifan Chen, Chengtuo Cheng, Tianhao Cheng, Keyi Ding, Siming Huang, Yun Huang, Yaoru Li, Yizhe Li, Zhaoqun Li, Tianhao Liang, Chengdong Lin, Hongquan Lin, Yinghao Ma, Zhongyuan Peng, Zifan Peng, Qige Qi, Shi Qiu, Xingwei Qu, Yizhou Tan, Zili Wang, Chenqing Wang, Hao Wang, Yiya Wang, Yubo Wang, Jiajun Xu, Kexin Yang, Ruibin Yuan, Yuanhao Yue, Tianyang Zhan, Chun Zhang, Jingyang Zhang, Xiyue Zhang, 
Xingjian Zhang, Yue Zhang, Yongchi Zhao, Xiangyu Zheng, Chenghua Zhong, Yang Gao, Zhoujun Li, Dayiheng Liu, Qian Liu, Tianyu Liu, Shiwen Ni, Junran Peng, Yujia Qin, Wenbo Su, Guoyin Wang, Shi Wang, Jian Yang, Min Yang, Meng Cao, Xiang Yue, Zhaoxiang Zhang, Wangchunshu Zhou, Jiaheng Liu, Qunshu Lin, Wenhao Huang, Ge Zhang +##### **CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures** +2410.05235v2 by Ekaterina Sviridova, Anar Yeginbergen, Ainara Estarrona, Elena Cabrio, Serena Villata, Rodrigo Agerri -Large language models (LLMs) have demonstrated remarkable proficiency in -mainstream academic disciplines such as mathematics, physics, and computer -science. However, human knowledge encompasses over 200 specialized disciplines, -far exceeding the scope of existing benchmarks. The capabilities of LLMs in -many of these specialized fields-particularly in light industry, agriculture, -and service-oriented disciplines-remain inadequately evaluated. To address this -gap, we present SuperGPQA, a comprehensive benchmark that evaluates -graduate-level knowledge and reasoning capabilities across 285 disciplines. Our -benchmark employs a novel Human-LLM collaborative filtering mechanism to -eliminate trivial or ambiguous questions through iterative refinement based on -both LLM responses and expert feedback. Our experimental results reveal -significant room for improvement in the performance of current state-of-the-art -LLMs across diverse knowledge domains (e.g., the reasoning-focused model -DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting -the considerable gap between current model capabilities and artificial general -intelligence. 
Additionally, we present comprehensive insights from our -management of a large-scale annotation process, involving over 80 expert -annotators and an interactive Human-LLM collaborative system, offering valuable -methodological guidance for future research initiatives of comparable scope. +Explaining Artificial Intelligence (AI) decisions is a major challenge +nowadays in AI, in particular when applied to sensitive scenarios like medicine +and law. However, the need to explain the rationale behind decisions is a main +issue also for human-based deliberation as it is important to justify +*why* a certain decision has been taken. Resident medical doctors for +instance are required not only to provide a (possibly correct) diagnosis, but +also to explain how they reached a certain conclusion. Developing new tools to +aid residents to train their explanation skills is therefore a central +objective of AI in education. In this paper, we follow this direction, and we +present, to the best of our knowledge, the first multilingual dataset for +Medical Question Answering where correct and incorrect diagnoses for a clinical +case are enriched with a natural language explanation written by doctors. These +explanations have been manually annotated with argument components (i.e., +premise, claim) and argument relations (i.e., attack, support), resulting in +the Multilingual CasiMedicos-Arg dataset which consists of 558 clinical cases +in four languages (English, Spanish, French, Italian) with explanations, where +we annotated 5021 claims, 2313 premises, 2431 support relations, and 1106 +attack relations. We conclude by showing how competitive baselines perform over +this challenging dataset for the argument mining task.
-摘要:大型語言模型 (LLM) 已展現出在主流學術領域(如數學、物理和電腦科學)的卓越能力。然而,人類知識包含超過 200 個專業領域,遠遠超過現有基準的範圍。LLM 在許多這些專業領域(特別是在輕工業、農業和服務導向領域)的能力仍未得到充分評估。為了解決這個差距,我們提出了 SuperGPQA,這是一個綜合基準,用於評估 285 個領域的研究生級知識和推理能力。我們的基準採用新穎的人類-LLM 協同過濾機制,透過基於 LLM 回應和專家回饋的迭代改進,來消除瑣碎或模稜兩可的問題。我們的實驗結果顯示,當前最先進的 LLM 在不同知識領域的表現仍有很大的改進空間(例如,以推理為重點的模型 DeepSeek-R1 在 SuperGPQA 上達到了 61.82% 的最高準確度),突顯了當前模型能力與人工通用智慧之間的巨大差距。此外,我們從管理大型註釋過程(涉及 80 多位專家註釋者和一個互動式人類-LLM 協作系統)中提出了全面的見解,為未來具有可比規模的研究計畫提供了寶貴的方法論指導。 +摘要:解釋人工智慧 (AI) 的決策是現在 AI 的一項重大挑戰,特別是應用於像醫學和法律等敏感情境時。然而,解釋決策背後理由的需求也是基於人類的考量的一個主要問題,因為有必要證明為什麼做出某個決策。例如,住院醫師不僅需要提供(可能是正確的)診斷,還需要解釋他們如何達成某個結論。因此,開發新的工具來幫助住院醫師訓練他們的解釋技巧是教育中 AI 的一項核心目標。在本文中,我們遵循這個方向,並且根據我們的了解,提出第一個多語言醫學問答資料集,其中臨床病例的正確和不正確診斷都附有由醫生撰寫的自然語言解釋。這些解釋已使用論證組成(即前提、主張)和論證關係(即攻擊、支持)進行手動註解,產生多語言 CasiMedicos-Arg 資料集,其中包含 558 個具有解釋的四種語言(英語、西班牙語、法語、義大利語)的臨床病例,我們註解了 5021 個主張、2313 個前提、2431 個支持關係和 1106 個攻擊關係。我們最後展示了競爭基準如何針對論證探勘任務執行此具挑戰性的資料集。 -##### **EAGER-LLM: Enhancing Large Language Models as Recommenders through Exogenous Behavior-Semantic Integration** -2502.14735v1 by Minjie Hong, Yan Xia, Zehan Wang, Jieming Zhu, Ye Wang, Sihang Cai, Xiaoda Yang, Quanyu Dai, Zhenhua Dong, Zhimeng Zhang, Zhou Zhao +##### **Explainable Diagnosis Prediction through Neuro-Symbolic Integration** +2410.01855v2 by Qiuhao Lu, Rui Li, Elham Sagheb, Andrew Wen, Jinlian Wang, Liwei Wang, Jungwei W. Fan, Hongfang Liu -Large language models (LLMs) are increasingly leveraged as foundational -backbones in the development of advanced recommender systems, offering enhanced -capabilities through their extensive knowledge and reasoning. Existing -llm-based recommender systems (RSs) often face challenges due to the -significant differences between the linguistic semantics of pre-trained LLMs -and the collaborative semantics essential for RSs. These systems use -pre-trained linguistic semantics but learn collaborative semantics from scratch -via the llm-Backbone. 
However, LLMs are not designed for recommendations, -leading to inefficient collaborative learning, weak result correlations, and -poor integration of traditional RS features. To address these challenges, we -propose EAGER-LLM, a decoder-only LLM-based generative recommendation framework -that integrates endogenous and exogenous behavioral and semantic information in -a non-intrusive manner. Specifically, we propose 1) dual-source knowledge-rich -item indices that integrate indexing sequences for exogenous signals, enabling -efficient link-wide processing; 2) non-invasive multiscale alignment -reconstruction tasks that guide the model toward a deeper understanding of both -collaborative and semantic signals; 3) an annealing adapter designed to finely -balance the model's recommendation performance with its comprehension -capabilities. We demonstrate EAGER-LLM's effectiveness through rigorous testing -on three public benchmarks. +Diagnosis prediction is a critical task in healthcare, where timely and +accurate identification of medical conditions can significantly impact patient +outcomes. Traditional machine learning and deep learning models have achieved +notable success in this domain but often lack interpretability, which is a +crucial requirement in clinical settings. In this study, we explore the use of +neuro-symbolic methods, specifically Logical Neural Networks (LNNs), to develop +explainable models for diagnosis prediction. Essentially, we design and +implement LNN-based models that integrate domain-specific knowledge through +logical rules with learnable thresholds. Our models, particularly +$M_{\text{multi-pathway}}$ and $M_{\text{comprehensive}}$, demonstrate superior +performance over traditional models such as Logistic Regression, SVM, and +Random Forest, achieving higher accuracy (up to 80.52%) and AUROC scores (up +to 0.8457) in the case study of diabetes prediction.
The learned weights and +thresholds within the LNN models provide direct insights into feature +contributions, enhancing interpretability without compromising predictive +power. These findings highlight the potential of neuro-symbolic approaches in +bridging the gap between accuracy and explainability in healthcare AI +applications. By offering transparent and adaptable diagnostic models, our work +contributes to the advancement of precision medicine and supports the +development of equitable healthcare solutions. Future research will focus on +extending these methods to larger and more diverse datasets to further validate +their applicability across different medical conditions and populations. -摘要:大型語言模型(LLM)正日益被用作先進推薦系統開發中的基礎主幹,透過其廣泛的知識和推理能力提供增強功能。現有的基於 LLM 的推薦系統(RS)通常會因為預先訓練的 LLM 語言語義與 RS 必備的協作語義之間的顯著差異而面臨挑戰。這些系統使用預先訓練的語言語義,但透過 LLM 主幹從頭學習協作語義。然而,LLM 並非專為推薦而設計,導致協作學習效率低落、結果關聯性薄弱,以及與傳統 RS 功能整合不佳。為了應對這些挑戰,我們提出 EAGER-LLM,這是一種僅解碼器、基於 LLM 的生成推薦架構,能以非侵入性方式整合內生和外生行為和語義資訊。具體來說,我們提出 1) 雙來源、知識豐富的項目索引,它整合了外生訊號的索引序列,實現了高效的鏈路廣泛處理;2) 非侵入式多尺度對齊重建任務引導模型更深入地理解協作和語義訊號;3) 退火適配器旨在精細地平衡模型的推薦效能與其理解能力。我們透過在三個公共基準上的嚴格測試證明了 EAGER-LLM 的有效性。 +摘要:診斷預測是醫療保健中的關鍵任務,及時且準確地識別醫療狀況會顯著影響患者的結果。傳統的機器學習和深度學習模型已在這個領域取得顯著成功,但通常缺乏可解釋性,這在臨床環境中是一項關鍵要求。在本研究中,我們探討了神經符號方法的應用,特別是邏輯神經網路 (LNN),以開發用於診斷預測的可解釋模型。基本上,我們設計並實作了基於 LNN 的模型,這些模型透過具有可學習閾值的邏輯規則整合領域特定知識。我們的模型,特別是 $M_{\text{multi-pathway}}$ 和 $M_{\text{comprehensive}}$,表現出優於傳統模型(例如邏輯迴歸、SVM 和隨機森林)的優異效能,在糖尿病預測的案例研究中達到了更高的準確度(高達 80.52%)和 AUROC 分數(高達 0.8457)。LNN 模型中學習到的權重和閾值提供了對特徵貢獻的直接見解,增強了可解釋性,同時不影響預測能力。這些發現突顯了神經符號方法在彌合醫療保健 AI 應用中準確性和可解釋性差距方面的潛力。透過提供透明且適應性強的診斷模型,我們的研究有助於推進精準醫療,並支援公平醫療保健解決方案的開發。未來的研究將專注於將這些方法擴展到更大且更多樣化的資料集,以進一步驗證其在不同醫療狀況和人群中的適用性。 -##### **Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models** -2502.14734v1 by Hongji Li, Andrianos Michail, Reto Gubelmann, Simon Clematide, Juri Opitz +##### **Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart 
healthcare** +2410.00366v1 by Prasenjit Maji, Amit Kumar Mondal, Hemanta Kumar Mondal, Saraju P. Mohanty -We propose the Sentence Smith framework that enables controlled and specified -manipulation of text meaning. It consists of three main steps: 1. Parsing a -sentence into a semantic graph, 2. Applying human-designed semantic -manipulation rules, and 3. Generating text from the manipulated graph. A final -filtering step (4.) ensures the validity of the applied transformation. To -demonstrate the utility of Sentence Smith in an application study, we use it to -generate hard negative pairs that challenge text embedding models. Since the -controllable generation makes it possible to clearly isolate different types of -semantic shifts, we can gain deeper insights into the specific strengths and -weaknesses of widely used text embedding models, also addressing an issue in -current benchmarking where linguistic phenomena remain opaque. Human validation -confirms that the generations produced by Sentence Smith are highly accurate. +The rapid advancements in artificial intelligence (AI) have revolutionized +smart healthcare, driving innovations in wearable technologies, continuous +monitoring devices, and intelligent diagnostic systems. However, security, +explainability, robustness, and performance optimization challenges remain +critical barriers to widespread adoption in clinical environments. This +research presents an innovative algorithmic method using the Adaptive Feature +Evaluator (AFE) algorithm to improve feature selection in healthcare datasets +and overcome these barriers. By integrating Genetic Algorithms (GA), Explainable +Artificial Intelligence (XAI), and Permutation Combination Techniques (PCT), +the algorithm optimizes Clinical Decision Support Systems (CDSS), thereby +enhancing predictive accuracy and interpretability.
The proposed method is +validated across three diverse healthcare datasets using six distinct machine +learning algorithms, demonstrating its robustness and superiority over +conventional feature selection techniques. The results underscore the +transformative potential of AFE in smart healthcare, enabling personalized and +transparent patient care. Notably, the AFE algorithm, when combined with a +Multi-layer Perceptron (MLP), achieved an accuracy of up to 98.5%, highlighting +its capability to improve clinical decision-making processes in real-world +healthcare applications. -摘要:我們提出 Sentence Smith 框架,它能控制並指定文本含義的處理。它包含三個主要步驟:1. 將句子解析成語義圖形,2. 套用人為設計的語義處理規則,3. 從處理過的圖形生成文本。最後的過濾步驟 (4.) 確保套用轉換的有效性。為了在應用研究中展示 Sentence Smith 的效用,我們使用它來產生挑戰文本嵌入模型的困難負面對。由於可控生成能清楚地隔離不同類型的語義轉移,我們能更深入地了解廣泛使用的文本嵌入模型的具體優點和缺點,同時也解決了語言現象在當前基準測試中仍然不透明的問題。人為驗證確認 Sentence Smith 產生的生成高度準確。 +摘要:人工智慧 (AI) 的快速進展徹底改變了智慧醫療保健,推動了可穿戴技術、持續監控裝置和智慧診斷系統的創新。然而,安全性、可解釋性、穩健性和效能最佳化挑戰仍然是臨床環境中廣泛採用的關鍵障礙。本研究提出一個創新的演算法方法,使用自適應特徵評估器 (AFE) 演算法來改善醫療保健資料集中的特徵選取並克服問題。AFE 整合了遺傳演算法 (GA)、可解釋人工智慧 (XAI) 和排列組合技術 (PCT),該演算法最佳化了臨床決策支援系統 (CDSS),從而提高了預測準確性和可解釋性。所提出的方法使用六種不同的機器學習演算法驗證了三個不同的醫療保健資料集,證明了其穩健性和優於傳統特徵選取技術。結果強調了 AFE 在智慧醫療保健中的轉變潛力,實現了個人化和透明的患者照護。值得注意的是,AFE 演算法與多層感知器 (MLP) 結合使用時,準確度高達 98.5%,突顯了其改善實際醫療保健應用中臨床決策制定流程的能力。 -##### **WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models** -2502.14727v1 by Yifu Chen, Shengpeng Ji, Haoxiao Wang, Ziqing Wang, Siyu Chen, Jinzheng He, Jin Xu, Zhou Zhao +##### **Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study** +2409.13476v1 by Tirtha Chanda, Sarah Haggenmueller, Tabea-Clara Bucher, Tim Holland-Letz, Harald Kittler, Philipp Tschandl, Markus V. Heppt, Carola Berking, Jochen S. Utikal, Bastian Schilling, Claudia Buerger, Cristian Navarrete-Dechent, Matthias Goebeler, Jakob Nikolas Kather, Carolin V. 
Schneider, Benjamin Durani, Hendrike Durani, Martin Jansen, Juliane Wacker, Joerg Wacker, Reader Study Consortium, Titus J. Brinker -Retrieval Augmented Generation (RAG) has gained widespread adoption owing to -its capacity to empower large language models (LLMs) to integrate external -knowledge. However, existing RAG frameworks are primarily designed for -text-based LLMs and rely on Automatic Speech Recognition to process speech -input, which discards crucial audio information, risks transcription errors, -and increases computational overhead. Therefore, we introduce WavRAG, the first -retrieval augmented generation framework with native, end-to-end audio support. -WavRAG offers two key features: 1) Bypassing ASR, WavRAG directly processes raw -audio for both embedding and retrieval. 2) WavRAG integrates audio and text -into a unified knowledge representation. Specifically, we propose the -WavRetriever to facilitate the retrieval from a text-audio hybrid knowledge -base, and further enhance the in-context capabilities of spoken dialogue models -through the integration of chain-of-thought reasoning. In comparison to -state-of-the-art ASR-Text RAG pipelines, WavRAG achieves comparable retrieval -performance while delivering a 10x acceleration. Furthermore, WavRAG's unique -text-audio hybrid retrieval capability extends the boundaries of RAG to the -audio modality. +Artificial intelligence (AI) systems have substantially improved +dermatologists' diagnostic accuracy for melanoma, with explainable AI (XAI) +systems further enhancing clinicians' confidence and trust in AI-driven +decisions. Despite these advancements, there remains a critical need for +objective evaluation of how dermatologists engage with both AI and XAI tools. +In this study, 76 dermatologists participated in a reader study, diagnosing 16 +dermoscopic images of melanomas and nevi using an XAI system that provides +detailed, domain-specific explanations. 
Eye-tracking technology was employed to +assess their interactions. Diagnostic performance was compared with that of a +standard AI system lacking explanatory features. Our findings reveal that XAI +systems improved balanced diagnostic accuracy by 2.8 percentage points relative +to standard AI. Moreover, diagnostic disagreements with AI/XAI systems and +complex lesions were associated with elevated cognitive load, as evidenced by +increased ocular fixations. These insights have significant implications for +clinical practice, the design of AI tools for visual tasks, and the broader +development of XAI in medical diagnostics. -摘要:檢索增強生成 (RAG) 因其賦能大型語言模型 (LLM) 整合外部知識的能力而獲得廣泛採用。然而,現有的 RAG 框架主要設計用於基於文字的 LLM,並依賴自動語音辨識處理語音輸入,這會捨棄重要的音訊資訊、有轉錄錯誤的風險,並增加運算負擔。因此,我們引入了 WavRAG,這是第一個具備原生端對端音訊支援的檢索增強生成框架。WavRAG 提供兩個主要功能:1) 繞過 ASR,WavRAG 直接處理原始音訊以進行嵌入和檢索。2) WavRAG 將音訊和文字整合到統一的知識表示中。具體來說,我們提出了 WavRetriever 以利於從文字音訊混合知識庫中進行檢索,並透過整合思考鏈推理進一步增強對話模型的語境能力。與最先進的 ASR 文字 RAG 管線相比,WavRAG 達到了相當的檢索效能,同時提供了 10 倍的加速。此外,WavRAG 獨特的文字音訊混合檢索能力將 RAG 的界線延伸到音訊模式。 +摘要:人工智慧 (AI) 系統已大幅改善皮膚科醫師對黑色素瘤的診斷準確度,而可解釋 AI (XAI) 系統進一步提升臨床醫師對 AI 驅動決策的信心與信賴。儘管有這些進展,對於皮膚科醫師如何使用 AI 和 XAI 工具,仍有客觀評估的迫切需求。在這項研究中,76 位皮膚科醫師參與了一項讀者研究,使用 XAI 系統診斷 16 張黑色素瘤和痣的皮膚鏡影像,該系統提供詳細的領域特定說明。採用眼球追蹤技術來評估他們的互動。將診斷表現與缺乏說明功能的標準 AI 系統進行比較。我們的研究結果顯示,XAI 系統相較於標準 AI,將平衡診斷準確度提升了 2.8 個百分點。此外,與 AI/XAI 系統的診斷分歧和複雜的病灶與認知負擔升高有關,這由增加的眼睛注視次數所證實。這些見解對臨床實務、視覺任務 AI 工具的設計和醫學診斷中 XAI 的廣泛發展具有重大意義。 -##### **Entity Framing and Role Portrayal in the News** -2502.14718v1 by Tarek Mahmoud, Zhuohan Xie, Dimitar Dimitrov, Nikolaos Nikolaidis, Purificação Silvano, Roman Yangarber, Shivam Sharma, Elisa Sartori, Nicolas Stefanovitch, Giovanni Da San Martino, Jakub Piskorski, Preslav Nakov +##### **Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data** +2409.15374v1 by Suryansh Vidya, Kush Gupta, Amir Aly, Andy Wills, Emmanuel Ifeachor, Rohit Shankar -We introduce a novel multilingual hierarchical corpus annotated for entity -framing and role 
portrayal in news articles. The dataset uses a unique taxonomy -inspired by storytelling elements, comprising 22 fine-grained roles, or -archetypes, nested within three main categories: protagonist, antagonist, and -innocent. Each archetype is carefully defined, capturing nuanced portrayals of -entities such as guardian, martyr, and underdog for protagonists; tyrant, -deceiver, and bigot for antagonists; and victim, scapegoat, and exploited for -innocents. The dataset includes 1,378 recent news articles in five languages -(Bulgarian, English, Hindi, European Portuguese, and Russian) focusing on two -critical domains of global significance: the Ukraine-Russia War and Climate -Change. Over 5,800 entity mentions have been annotated with role labels. This -dataset serves as a valuable resource for research into role portrayal and has -broader implications for news analysis. We describe the characteristics of the -dataset and the annotation process, and we report evaluation results on -fine-tuned state-of-the-art multilingual transformers and hierarchical -zero-shot learning using LLMs at the level of a document, a paragraph, and a -sentence. +Early diagnosis and intervention for Autism Spectrum Disorder (ASD) have been +shown to significantly improve the quality of life of autistic individuals. +However, diagnostic methods for ASD rely on assessments based on clinical +presentation that are prone to bias, which can make it challenging to arrive at an +early diagnosis. There is a need for objective biomarkers of ASD which can help +improve diagnostic accuracy. Deep learning (DL) has achieved outstanding +performance in diagnosing diseases and conditions from medical imaging data. +Extensive research has been conducted on creating models that classify ASD +using resting-state functional Magnetic Resonance Imaging (fMRI) data. However, +existing models lack interpretability. 
This research aims to improve the +accuracy and interpretability of ASD diagnosis by creating a DL model that can +not only accurately classify ASD but also provide explainable insights into its +working. The dataset used is a preprocessed version of the Autism Brain Imaging +Data Exchange (ABIDE) with 884 samples. Our findings show a model that can +accurately classify ASD and highlight critical brain regions differing between +ASD and typical controls, with potential implications for early diagnosis and +understanding of the neural basis of ASD. These findings are validated by +studies in the literature that use different datasets and modalities, +confirming that the model actually learned characteristics of ASD and not just +the dataset. This study advances the field of explainable AI in medical imaging +by providing a robust and interpretable model, thereby contributing to a future +with objective and reliable ASD diagnostics. -摘要:我們引進一個新穎的多語言層級語料庫,其中註解了新聞文章中的實體框架和角色描繪。此資料集使用了一個獨特的分類法,其靈感來自講故事元素,包含 22 個細緻的角色或原型,嵌套在三個主要類別中:主角、對手和無辜者。每個原型都經過仔細定義,捕捉了實體的細微描繪,例如主角的監護人、烈士和弱者;對手的暴君、欺騙者和偏執狂;以及無辜者的受害者、替罪羊和被剝削者。該資料集包括五種語言(保加利亞語、英語、印地語、歐洲葡萄牙語和俄語)中的 1,378 篇近期新聞文章,重點關注兩個具有全球意義的關鍵領域:烏克蘭-俄羅斯戰爭和氣候變遷。超過 5,800 個實體提及已註解為角色標籤。此資料集作為角色描繪研究的寶貴資源,並對新聞分析有更廣泛的影響。我們描述了資料集的特徵和註解過程,並報告了對使用 LLM 在文件、段落和句子層級進行微調的最新多語言轉換器和層級零次學習的評估結果。 +摘要:自閉症譜系障礙 (ASD) 的早期診斷和介入已被證實能顯著改善自閉症患者的生活品質。然而,ASD 的診斷方法依賴於基於臨床表現的評估,容易產生偏見,且可能難以做出早期診斷。有必要找出 ASD 的客觀生物標記,以幫助提高診斷準確性。深度學習 (DL) 在從醫學影像資料診斷疾病和病症方面取得傑出的表現。已經針對建立使用靜態功能性磁振造影 (fMRI) 資料對 ASD 進行分類的模型進行廣泛的研究。然而,現有的模型缺乏可解釋性。本研究旨在透過建立一個不僅能準確分類 ASD,還能提供可解釋見解說明其運作原理的 DL 模型,來改善 ASD 診斷的準確性和可解釋性。所使用的資料集是自閉症大腦影像資料交換 (ABIDE) 的預處理版本,包含 884 個樣本。我們的研究結果顯示,該模型能準確分類 ASD,並強調 ASD 與典型對照組之間存在差異的關鍵腦區,對於 ASD 的早期診斷和神經基礎的理解具有潛在的意義。這些研究結果已由使用不同資料集和方式的文獻研究驗證,證實該模型實際上學習了 ASD 的特徵,而不僅僅是資料集。本研究透過提供一個強健且可解釋的模型,推動了醫學影像中可解釋 AI 的領域,從而為未來提供客觀且可靠的 ASD 診斷做出貢獻。 -##### **From Knowledge Generation to Knowledge Verification: Examining the BioMedical Generative Capabilities of ChatGPT** -2502.14714v1 by 
Ahmed Abdeen Hamed, Byung Suk Lee +##### **Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition** +2409.12883v1 by Daniel Flores-Araiza, Francisco Lopez-Tiro, Clément Larose, Salvador Hinojosa, Andres Mendez-Vazquez, Miguel Gonzalez-Mendoza, Gilberto Ochoa-Ruiz, Christian Daul -The generative capabilities of LLM models present opportunities in -accelerating tasks and concerns with the authenticity of the knowledge they -produce. To address the concerns, we present a computational approach that -systematically evaluates the factual accuracy of biomedical knowledge that an -LLM model has been prompted to generate. Our approach encompasses two -processes: the generation of disease-centric associations and the verification -of them using the semantic knowledge of the biomedical ontologies. Using -ChatGPT as the select LLM model, we designed a set of prompt-engineering -processes to generate linkages between diseases, drugs, symptoms, and genes to -establish grounds for assessments. Experimental results demonstrate high -accuracy in identifying disease terms (88%-97%), drug names (90%-91%), and -genetic information (88%-98%). The symptom term identification accuracy was -notably lower (49%-61%), as verified against the DOID, ChEBI, SYMPTOM, and GO -ontologies, respectively. The verification of associations reveals literature -coverage rates of 89%-91% among disease-drug and disease-gene associations. -The low identification accuracy for symptom terms also contributed to the -verification of symptom-related associations (49%-62%). +The in-vivo identification of the kidney stone types during a ureteroscopy +would be a major medical advance in urology, as it could reduce the time of the +tedious renal calculi extraction process, while diminishing infection risks. +Furthermore, such an automated procedure would make it possible to prescribe +anti-recurrence treatments immediately. 
Nowadays, only a few experienced +urologists are able to recognize the kidney stone types in the images of the +videos displayed on a screen during the endoscopy. Thus, several deep learning +(DL) models have recently been proposed to automatically recognize the kidney +stone types using ureteroscopic images. However, these DL models are black +boxes, which limits their applicability in clinical settings. This +contribution proposes a case-based reasoning DL model which uses prototypical +parts (PPs) and generates local and global descriptors. The PPs encode for each +class (i.e., kidney stone type) visual feature information (hue, saturation, +intensity and textures) similar to that used by biologists. The PPs are +optimally generated due to a new loss function used during the model training. +Moreover, the local and global descriptors of PPs make it possible to explain the +decisions ("what" information, "where in the images") in an understandable way +for biologists and urologists. The proposed DL model has been tested on a +database including images of the six most widespread kidney stone types. The +overall average classification accuracy was 90.37. When comparing these results +with those of the eight other DL models of the kidney stone state-of-the-art, it +can be seen that the valuable gain in explainability was not reached at the +expense of accuracy, which was even slightly increased with respect to that +(88.2) of the best method of the literature. These promising and interpretable +results also encourage urologists to put their trust in AI-based solutions. 
-摘要:LLM 模型的生成能力為加速任務和對其產生的知識真實性的疑慮提供了機會。為了解決這些疑慮,我們提出了計算方法,系統性評估 LLM 模型受提示而產生的生物醫學知識的事實準確性。我們的做法包括兩個過程:生成以疾病為中心的關聯,並使用生物醫學本体的語義知識驗證它們。使用 ChatGPT 作為選定的 LLM 模型,我們設計了一組提示工程流程,以生成疾病、藥物、症狀和基因之間的關聯,作為評估的依據。實驗結果證明在識別疾病術語 (88%-97%)、藥物名稱 (90%-91%) 和遺傳資訊 (88%-98%) 方面具有很高的準確性。症狀術語識別準確性顯著較低 (49%-61%),並根據 DOID、ChEBI、SYMPTOM 和 GO 本体進行驗證。關聯驗證顯示疾病-藥物和疾病-基因關聯的文獻覆蓋率為 (89%-91%)。症狀術語的低識別準確性也影響了症狀相關關聯的驗證 (49%-62%)。 +摘要:尿路鏡檢查中腎結石類型的體內識別將是泌尿科的一項重大進展,因為它可以減少繁瑣的腎結石取出過程的時間,同時降低感染風險。此外,這種自動化程序將使立即開立抗復發治療成為可能。如今,只有少數經驗豐富的泌尿科醫生能夠在內視鏡檢查期間屏幕上顯示的視頻圖像中識別腎結石類型。因此,最近已提出多種深度學習 (DL) 模型,以使用輸尿管鏡圖像自動識別腎結石類型。然而,這些 DL 模型本質上是黑盒子,這限制了它們在臨床環境中的應用性。本文提出了一個基於案例推理的 DL 模型,它使用原型部分 (PP) 並生成局部和全局描述符。PP 為每種類型(即腎結石類型)編碼視覺特徵信息(色調、飽和度、強度和紋理),類似於生物學家使用的信息。由於在模型訓練期間使用的新損失函數,PP 得到了最佳生成。此外,PP 的局部和全局描述符允許以生物學家和泌尿科醫生可以理解的方式解釋決策(“什麼”信息,“圖像中的什麼位置”)。所提出的 DL 模型已在一個包含六種最廣泛的腎結石類型圖像的數據庫上進行了測試。總體平均分類準確率為 90.37。將此結果與腎結石最先進的八個其他 DL 模型的結果進行比較時,可以看出,可解釋性的寶貴增益並未以準確性為代價,甚至略有增加與文獻中最好的方法 (88.2) 相比。這些有希望且可解釋的結果也鼓勵泌尿科醫生相信基於人工智能的解決方案。 -##### **Data-Efficient Pretraining with Group-Level Data Influence Modeling** -2502.14709v1 by Zichun Yu, Fei Peng, Jie Lei, Arnold Overwijk, Wen-tau Yih, Chenyan Xiong +##### **Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques** +2409.12087v3 by Yubo Li, Saba Al-Sayouri, Rema Padman -Data-efficient pretraining has shown tremendous potential to elevate scaling -laws. This paper argues that effective pretraining data should be curated at -the group level, treating a set of data points as a whole rather than as -independent contributors. To achieve that, we propose Group-Level Data -Influence Modeling (Group-MATES), a novel data-efficient pretraining method -that captures and optimizes group-level data utility. Specifically, Group-MATES -collects oracle group-level influences by locally probing the pretraining model -with data sets. 
It then fine-tunes a relational data influence model to -approximate oracles as relationship-weighted aggregations of individual -influences. The fine-tuned model selects the data subset by maximizing its -group-level influence prediction, with influence-aware clustering to enable -efficient inference. Experiments on the DCLM benchmark demonstrate that -Group-MATES achieves a 10% relative core score improvement on 22 downstream -tasks over DCLM-Baseline and 5% over individual-influence-based methods, -establishing a new state-of-the-art. Further analyses highlight the -effectiveness of relational data influence models in capturing intricate -interactions between data points. +This study explores the potential of utilizing administrative claims data, +combined with advanced machine learning and deep learning techniques, to +predict the progression of Chronic Kidney Disease (CKD) to End-Stage Renal +Disease (ESRD). We analyze a comprehensive, 10-year dataset provided by a major +health insurance organization to develop prediction models for multiple +observation windows using traditional machine learning methods such as Random +Forest and XGBoost as well as deep learning approaches such as Long Short-Term +Memory (LSTM) networks. Our findings demonstrate that the LSTM model, +particularly with a 24-month observation window, exhibits superior performance +in predicting ESRD progression, outperforming existing models in the +literature. We further apply SHapley Additive exPlanations (SHAP) analysis to +enhance interpretability, providing insights into the impact of individual +features on predictions at the individual patient level. This study underscores +the value of leveraging administrative claims data for CKD management and +predicting ESRD progression. 
-摘要:資料有效的預訓練已展現出提升規模化定律的巨大潛力。本文認為,有效的預訓練資料應在群組層級中進行策展,將資料點集合視為一個整體,而非獨立的貢獻者。為達成此目的,我們提出群組層級資料影響建模(Group-MATES),這是一種新穎的資料有效預訓練方法,可擷取和最佳化群組層級資料效用。具體而言,Group-MATES 透過使用資料集在區域探測預訓練模型,收集神諭群組層級影響。接著,微調關係資料影響模型,以關係加權聚合個別影響來近似神諭。微調模型透過最大化其群組層級影響預測,選取資料子集,並透過考量影響的群集,啟用有效率的推論。在 DCLM 基準上的實驗證明,與 DCLM-Baseline 相比,Group-MATES 在 22 個下游任務上達成 10% 的相對核心分數提升,並比基於個別影響的方法高出 5%,建立了新的技術水準。進一步的分析強調了關係資料影響模型在擷取資料點之間的複雜互動上的有效性。 +摘要:本研究探討利用行政申報資料,結合先進機器學習與深度學習技術,預測慢性腎臟病 (CKD) 進展至末期腎臟疾病 (ESRD) 的可能性。我們分析一家大型健康保險組織提供的 10 年綜合資料集,使用傳統機器學習方法(例如隨機森林和 XGBoost)以及深度學習方法(例如長期短期記憶 (LSTM) 網路)開發多個觀察視窗的預測模型。我們的研究結果顯示,LSTM 模型(尤其是 24 個月觀察視窗)在預測 ESRD 進展方面表現優異,優於文獻中的現有模型。我們進一步應用 SHapley 可加性解釋 (SHAP) 分析以增強可解釋性,深入了解個別特徵對個別患者層級預測的影響。本研究強調了利用行政申報資料進行 CKD 管理和預測 ESRD 進展的價值。 -##### **Human Misperception of Generative-AI Alignment: A Laboratory Experiment** -2502.14708v1 by Kevin He, Ran Shorrer, Mengjia Xia +##### **Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases** +2409.09201v3 by Mercy Asiedu, Nenad Tomasev, Chintan Ghate, Tiya Tiyasirichokchai, Awa Dieng, Oluwatosin Akande, Geoffrey Siwo, Steve Adudans, Sylvanus Aitkins, Odianosen Ehiakhamen, Eric Ndombi, Katherine Heller -We conduct an incentivized laboratory experiment to study people's perception -of generative artificial intelligence (GenAI) alignment in the context of -economic decision-making. Using a panel of economic problems spanning the -domains of risk, time preference, social preference, and strategic -interactions, we ask human subjects to make choices for themselves and to -predict the choices made by GenAI on behalf of a human user. We find that -people overestimate the degree of alignment between GenAI's choices and human -choices. In every problem, human subjects' average prediction about GenAI's -choice is substantially closer to the average human-subject choice than it is -to the GenAI choice. 
At the individual level, different subjects' predictions -about GenAI's choice in a given problem are highly correlated with their own -choices in the same problem. We explore the implications of people -overestimating GenAI alignment in a simple theoretical model. +While large language models (LLMs) have shown promise for medical question +answering, there is limited work focused on tropical and infectious +disease-specific exploration. We build on an open-source tropical and infectious +diseases (TRINDs) dataset, expanding it to include demographic and semantic +clinical and consumer augmentations, yielding 11000+ prompts. We evaluate LLM +performance on these, comparing generalist and medical LLMs, as well as LLM +outcomes to human experts. We demonstrate, through systematic experimentation, +the benefit of contextual information such as demographics, location, gender, and +risk factors for optimal LLM response. Finally, we develop a prototype of +TRINDs-LM, a research tool that provides a playground to navigate how context +impacts LLM outputs for health. 
-摘要:我們進行一項誘因實驗室實驗,以研究人們對生成式人工智慧 (GenAI) 在經濟決策制定中的對齊認知。使用涵蓋風險、時間偏好、社會偏好和策略性互動領域的經濟問題小組,我們要求受試者為自己做出選擇,並預測 GenAI 代表人類使用者做出的選擇。我們發現人們高估了 GenAI 選擇和人類選擇之間的對齊程度。在每個問題中,受試者對 GenAI 選擇的平均預測都比對 GenAI 選擇的預測更接近於平均人類受試者選擇。在個人層面上,不同受試者對特定問題中 GenAI 選擇的預測與他們在同一個問題中的選擇高度相關。我們在一個簡單的理論模型中探討了人們高估 GenAI 對齊的影響。 +摘要:儘管大型語言模型 (LLM) 在醫療問題解答方面展現出前景,但專注於熱帶和傳染病特定探索的研究有限。我們建立在一個開放原始碼熱帶和傳染病 (TRINDs) 資料集上,並將其擴展為納入人口統計和語義臨床和消費者擴充,產生超過 11000 個提示。我們評估了 LLM 在這些方面的效能,比較了通才和醫療 LLM,以及 LLM 結果與人類專家的比較。我們透過系統性實驗證明了背景資訊(例如人口統計、位置、性別、最佳 LLM 回應的風險因素)的好處。最後,我們開發了 TRINDs-LM 的原型,這是一個研究工具,提供一個探索背景如何影響 LLM 健康輸出的平台。 -##### **Not All Data are Good Labels: On the Self-supervised Labeling for Time Series Forecasting** -2502.14704v1 by Yuxuan Yang, Dalin Zhang, Yuxuan Liang, Hua Lu, Huan Li, Gang Chen +##### **Explainable AI: Definition and attributes of a good explanation for health AI** +2409.15338v1 by Evangelia Kyrimi, Scott McLachlan, Jared M Wohlgemut, Zane B Perkins, David A. Lagnado, William Marsh, the ExAIDSS Expert Group -Time Series Forecasting (TSF) is a crucial task in various domains, yet -existing TSF models rely heavily on high-quality data and insufficiently -exploit all available data. This paper explores a novel self-supervised -approach to re-label time series datasets by inherently constructing candidate -datasets. During the optimization of a simple reconstruction network, -intermediates are used as pseudo labels in a self-supervised paradigm, -improving generalization for any predictor. We introduce the Self-Correction -with Adaptive Mask (SCAM), which discards overfitted components and selectively -replaces them with pseudo labels generated from reconstructions. Additionally, -we incorporate Spectral Norm Regularization (SNR) to further suppress -overfitting from a loss landscape perspective. Our experiments on eleven -real-world datasets demonstrate that SCAM consistently improves the performance -of various backbone models. 
This work offers a new perspective on constructing -datasets and enhancing the generalization of TSF models through self-supervised -learning. +Proposals of artificial intelligence (AI) solutions based on increasingly +complex and accurate predictive models are becoming ubiquitous across many +disciplines. As the complexity of these models grows, transparency and users' +understanding often diminish. This suggests that accurate prediction alone is +insufficient for making an AI-based solution truly useful. In the development +of healthcare systems, this introduces new issues related to accountability and +safety. Understanding how and why an AI system makes a recommendation may +require complex explanations of its inner workings and reasoning processes. +Although research on explainable AI (XAI) has significantly increased in recent +years and there is high demand for XAI in medicine, defining what constitutes a +good explanation remains ad hoc, and providing adequate explanations continues +to be challenging. To fully realize the potential of AI, it is critical to +address two fundamental questions about explanations for safety-critical AI +applications, such as health-AI: (1) What is an explanation in health-AI? and +(2) What are the attributes of a good explanation in health-AI? In this study, +we examined published literature and gathered expert opinions through a +two-round Delphi study. The research outputs include (1) a definition of what +constitutes an explanation in health-AI and (2) a comprehensive list of +attributes that characterize a good explanation in health-AI. 
-摘要:時間序列預測 (TSF) 在各個領域中都是一項重要的任務,但現有的 TSF 模型極度依賴高品質的資料,且無法充分利用所有可用的資料。本文探討了一種新穎的自監督方法,藉由內建地建構候選資料集來重新標記時間序列資料集。在最佳化一個簡單的重建網路過程中,中間產物會在自監督範例中作為偽標籤,進而改善任何預測器的概化能力。我們引入了帶有自適應遮罩 (SCAM) 的自我修正,它會捨棄過度擬合的組成,並選擇性地以從重建產生的偽標籤取代它們。此外,我們納入了頻譜範數正規化 (SNR) 來進一步抑制從損失景觀觀點來看產生的過度擬合。我們在 11 個真實世界的資料集上進行的實驗,證明 SCAM 持續改善各種主幹模型的效能。這項工作提供了建構資料集和透過自監督學習來提升 TSF 模型概化能力的新觀點。 +摘要:隨著越來越複雜且準確的預測模型,基於人工智慧 (AI) 解決方案的提案在許多領域中變得無處不在。隨著這些模型複雜性的增加,透明度和使用者的理解力往往會降低。這表示僅有準確的預測並不足以讓 AI 解決方案真正有用。在醫療保健系統的開發中,這引入了與問責制和安全性相關的新問題。瞭解 AI 系統如何以及為何提出建議可能需要對其內部運作和推理過程進行複雜的說明。儘管近年來對可解釋 AI (XAI) 的研究已大幅增加,且醫學領域對 XAI 有很高的需求,但定義什麼構成一個好的解釋仍是臨時性的,而提供適當的解釋仍然具有挑戰性。為了充分發揮 AI 的潛力,對於安全關鍵型 AI 應用(例如健康 AI)的解釋,探討兩個基本問題至關重要:(1) 什麼是健康 AI 中的解釋?以及 (2) 健康 AI 中一個好的解釋有哪些屬性?在本研究中,我們檢視了已發表的文獻,並透過兩輪德爾菲研究收集了專家意見。研究成果包括:(1) 健康 AI 中什麼構成解釋的定義,以及 (2) 健康 AI 中一個好解釋的屬性清單。 -##### **I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search** -2502.14693v1 by Zujie Liang, Feng Wei, Wujiang Xu, Lin Chen, Yuxi Qian, Xinhui Wu +##### **Exploring the Effect of Explanation Content and Format on User Comprehension and Trust in Healthcare** +2408.17401v2 by Antonio Rago, Bence Palfi, Purin Sukpanichnant, Hannibal Nabli, Kavyesh Vivek, Olga Kostopoulou, James Kinross, Francesca Toni -Recent advancements in large language models (LLMs) have shown remarkable -potential in automating machine learning tasks. However, existing LLM-based -agents often struggle with low-diversity and suboptimal code generation. While -recent work has introduced Monte Carlo Tree Search (MCTS) to address these -issues, limitations persist in the quality and diversity of thoughts generated, -as well as in the scalar value feedback mechanisms used for node selection. In -this study, we introduce Introspective Monte Carlo Tree Search (I-MCTS), a -novel approach that iteratively expands tree nodes through an introspective -process that meticulously analyzes solutions and results from parent and -sibling nodes. 
This facilitates a continuous refinement of the node in the -search tree, thereby enhancing the overall decision-making process. Furthermore, -we integrate a Large Language Model (LLM)-based value model to facilitate -direct evaluation of each node's solution prior to conducting comprehensive -computational rollouts. A hybrid rewarding mechanism is implemented to -seamlessly transition the Q-value from LLM-estimated scores to actual -performance scores. This allows higher-quality nodes to be traversed -earlier. Applied to various ML tasks, our approach demonstrates a 6\% -absolute improvement in performance compared to the strong open-source AutoML -agents, showcasing its effectiveness in enhancing agentic AutoML systems. +AI-driven tools for healthcare are widely acknowledged as potentially +beneficial to health practitioners and patients, e.g. the QCancer regression +tool for cancer risk prediction. However, for these tools to be trusted, they +need to be supplemented with explanations. We examine how explanations' content +and format affect user comprehension and trust when explaining QCancer's +predictions. Regarding content, we deploy SHAP and Occlusion-1. Regarding +format, we present SHAP explanations, conventionally, as charts (SC) and +Occlusion-1 explanations as charts (OC) as well as text (OT), to which their +simpler nature lends itself. We conduct experiments with two sets of +stakeholders: the general public (representing patients) and medical students +(representing healthcare practitioners). Our experiments showed higher +subjective comprehension and trust for Occlusion-1 over SHAP explanations based +on content. However, when controlling for format, only OT outperformed SC, +suggesting this trend is driven by preferences for text. Other findings +corroborated that explanation format, rather than content, is often the +critical factor. 
-摘要:大型語言模型 (LLM) 的最新進展已展現出自動化機器學習任務的顯著潛力。然而,現有的基於 LLM 的代理通常會遇到低多樣性和次優代碼生成的問題。雖然最近的工作已引入蒙地卡羅樹搜尋 (MCTS) 來解決這些問題,但仍存在於所產生想法的品質和多樣性,以及用於節點選擇的標量值回饋機制中。在本研究中,我們介紹了內省蒙地卡羅樹搜尋 (I-MCTS),這是一種透過內省過程反覆擴展樹節點的新方法,該過程會細緻地分析來自父節點和同層節點的解決方案和結果。這有助於持續改善搜尋樹中的節點,進而增強整體決策制定過程。此外,我們整合了一個基於大型語言模型 (LLM) 的值模型,以便在進行全面運算展開之前直接評估每個節點的解決方案。實作了一種混合獎勵機制,以無縫地將 Q 值從 LLM 估計分數轉換為實際效能分數。這允許較高品質的節點更早被遍歷。應用於各種 ML 任務,我們的做法展示出比強大的開源 AutoML 代理高出 6% 的絕對效能提升,證明了其在增強代理式 AutoML 系統方面的有效性。 +摘要:由 AI 驅動的醫療保健工具被廣泛認為對醫療從業者和患者有潛在好處,例如用於癌症風險預測的 QCancer 回歸工具。然而,對於這些工具,如果要讓人們信賴,就需要補充說明。我們研究了說明的內容和格式如何影響使用者在解釋 QCancer 預測時的理解和信任。關於內容,我們部署了 SHAP 和 Occlusion-1。關於格式,我們以圖表 (SC) 的形式呈現 SHAP 說明,以圖表 (OC) 和文字 (OT) 的形式呈現 Occlusion-1 說明,因為它們的性質較為簡單。我們對兩組利害關係人進行了實驗:一般民眾(代表患者)和醫學生(代表醫療從業者)。我們的實驗結果顯示,基於內容,Occlusion-1 比 SHAP 說明具有更高的主觀理解和信任。然而,在控制格式時,只有 OT 優於 SC,這表明這種趨勢是由對文字的偏好所驅動的。其他發現證實了說明格式,而不是內容,通常是關鍵因素。 -##### **Bridging the Gap: Transforming Natural Language Questions into SQL Queries via Abstract Query Pattern and Contextual Schema Markup** -2502.14682v1 by Yonghui Kong, Hongbing Hu, Dan Zhang, Siyuan Chai, Fan Zhang, Wei Wang +##### **A Survey for Large Language Models in Biomedicine** +2409.00133v1 by Chong Wang, Mengyao Li, Junjun He, Zhongruo Wang, Erfan Darzi, Zan Chen, Jin Ye, Tianbin Li, Yanzhou Su, Jing Ke, Kaili Qu, Shuxin Li, Yi Yu, Pietro Liò, Tianyun Wang, Yu Guang Wang, Yiqing Shen -Large language models have demonstrated excellent performance in many tasks, -including Text-to-SQL, due to their powerful in-context learning capabilities. -They are becoming the mainstream approach for Text-to-SQL. However, these -methods still have a significant gap compared to human performance, especially -on complex questions. As the complexity of questions increases, the gap between -questions and SQLs increases. We identify two important gaps: the structural -mapping gap and the lexical mapping gap. 
To tackle these two gaps, we propose -PAS-SQL, an efficient SQL generation pipeline based on LLMs, which alleviates -gaps through Abstract Query Pattern (AQP) and Contextual Schema Markup (CSM). -AQP aims to obtain the structural pattern of the question by removing -database-related information, which enables us to find structurally similar -demonstrations. CSM aims to associate database-related text span in the -question with specific tables or columns in the database, which alleviates the -lexical mapping gap. Experimental results on the Spider and BIRD datasets -demonstrate the effectiveness of our proposed method. Specifically, PAS-SQL + -GPT-4o sets a new state-of-the-art on the Spider benchmark with an execution -accuracy of 87.9\%, and achieves leading results on the BIRD dataset with an -execution accuracy of 64.67\%. +Recent breakthroughs in large language models (LLMs) offer unprecedented +natural language understanding and generation capabilities. However, existing +surveys on LLMs in biomedicine often focus on specific applications or model +architectures, lacking a comprehensive analysis that integrates the latest +advancements across various biomedical domains. This review, based on an +analysis of 484 publications sourced from databases including PubMed, Web of +Science, and arXiv, provides an in-depth examination of the current landscape, +applications, challenges, and prospects of LLMs in biomedicine, distinguishing +itself by focusing on the practical implications of these models in real-world +biomedical contexts. Firstly, we explore the capabilities of LLMs in zero-shot +learning across a broad spectrum of biomedical tasks, including diagnostic +assistance, drug discovery, and personalized medicine, among others, with +insights drawn from 137 key studies. 
Then, we discuss adaptation strategies of +LLMs, including fine-tuning methods for both uni-modal and multi-modal LLMs to +enhance their performance in specialized biomedical contexts where zero-shot +fails to achieve, such as medical question answering and efficient processing +of biomedical literature. Finally, we discuss the challenges that LLMs face in +the biomedicine domain including data privacy concerns, limited model +interpretability, issues with dataset quality, and ethics due to the sensitive +nature of biomedical data, the need for highly reliable model outputs, and the +ethical implications of deploying AI in healthcare. To address these +challenges, we also identify future research directions of LLM in biomedicine +including federated learning methods to preserve data privacy and integrating +explainable AI methodologies to enhance the transparency of LLMs. + +摘要:大型語言模型 (LLM) 的最新突破提供了前所未有的自然語言理解和生成能力。然而,現有關於生物醫學中 LLM 的調查通常專注於特定應用或模型架構,缺乏整合各種生物醫學領域最新進展的全面分析。本綜述基於對來自 PubMed、Web of Science 和 arXiv 等數據庫的 484 篇出版物的分析,深入探討了生物醫學中 LLM 的當前現況、應用、挑戰和前景,其特點是關注這些模型在現實世界生物醫學背景中的實際應用。首先,我們探討了 LLM 在廣泛的生物醫學任務中的零次學習能力,包括診斷輔助、藥物發現和個性化醫療等,並從 137 項關鍵研究中汲取見解。然後,我們討論了 LLM 的適應策略,包括單模態和多模態 LLM 的微調方法,以增強它們在零次學習無法實現的專業生物醫學背景中的性能,例如醫療問題解答和生物醫學文獻的有效處理。最後,我們討論了 LLM 在生物醫學領域面臨的挑戰,包括數據隱私問題、模型可解釋性有限、數據集質量問題以及由於生物醫學數據的敏感性、對高度可靠模型輸出的需求以及在醫療保健中部署 AI 的倫理影響而產生的倫理問題。為了應對這些挑戰,我們還確定了生物醫學中 LLM 未來的研究方向,包括用於保護數據隱私的聯合學習方法以及整合可解釋 AI 方法以增強 LLM 的透明度。 + +##### **Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis** +2408.15121v1 by Francesco Sovrano, Michael Lognoul, Giulia Vilone + +Significant investment and development have gone into integrating Artificial +Intelligence (AI) in medical and healthcare applications, leading to advanced +control systems in medical technology. However, the opacity of AI systems +raises concerns about essential characteristics needed in such sensitive +applications, like transparency and trustworthiness. 
Our study addresses these +concerns by investigating a process for selecting the most adequate Explainable +AI (XAI) methods to comply with the explanation requirements of key EU +regulations in the context of smart bioelectronics for medical devices. The +adopted methodology starts with categorising smart devices by their control +mechanisms (open-loop, closed-loop, and semi-closed-loop systems) and delving +into their technology. Then, we analyse these regulations to define their +explainability requirements for the various devices and related goals. +Simultaneously, we classify XAI methods by their explanatory objectives. This +allows for matching legal explainability requirements with XAI explanatory +goals and determining the suitable XAI algorithms for achieving them. Our +findings provide a nuanced understanding of which XAI algorithms align better +with EU regulations for different types of medical devices. We demonstrate this +through practical case studies on different neural implants, from chronic +disease management to advanced prosthetics. This study fills a crucial gap in +aligning XAI applications in bioelectronics with stringent provisions of EU +regulations. It provides a practical framework for developers and researchers, +ensuring their AI innovations advance healthcare technology and adhere to legal +and ethical standards. 
-摘要:大型語言模型在許多任務中表現出色,包括文字轉 SQL,這歸功於它們強大的情境學習能力。它們正成為文字轉 SQL 的主流方法。然而,這些方法與人類的表現仍有顯著差距,特別是在複雜的問題上。隨著問題的複雜性增加,問題和 SQL 之間的差距也隨之增加。我們找出兩個重要的差距:結構對應差距和詞彙對應差距。為了解決這兩個差距,我們提出 PAS-SQL,一種基於 LLM 的高效 SQL 產生管道,它透過抽象查詢模式 (AQP) 和情境架構標記 (CSM) 來縮小差距。AQP 旨在透過移除與資料庫相關的資訊來取得問題的結構模式,這使我們能夠找到結構上相似的範例。CSM 旨在將問題中與資料庫相關的文字範圍與資料庫中的特定表格或欄位關聯起來,這可以縮小詞彙對應差距。在 Spider 和 BIRD 資料集上的實驗結果證明了我們所提出的方法的有效性。具體來說,PAS-SQL + GPT-4o 在 Spider 基準測試中設定了一個新的技術水準,執行準確度為 87.9%,並在 BIRD 資料集上取得領先的結果,執行準確度為 64.67%。 +摘要:人工智慧(AI)在醫療和保健應用中投入了大量的投資和開發,進而導致醫療技術中的先進控制系統。然而,AI 系統的不透明性引發了對此類敏感應用中所需基本特性的擔憂,例如透明度和可信度。我們的研究透過調查一個程序來解決這些問題,用於選擇最充分的可解釋 AI(XAI)方法,以符合歐盟法規在醫療器材的智慧型生物電子學中的說明要求。採用的方法從透過其控制機制(開迴路、閉迴路和半閉迴路系統)對智慧型裝置進行分類,並深入探討其技術開始。然後,我們分析這些法規以定義其對各種裝置和相關目標的可解釋性要求。同時,我們透過其說明目標對 XAI 方法進行分類。這允許將法律可解釋性要求與 XAI 說明目標相匹配,並確定適當的 XAI 演算法來達成它們。我們的研究結果提供了對哪些 XAI 演算法更符合歐盟法規以適用於不同類型的醫療器材的細緻理解。我們透過不同神經植入物的實際案例研究來證明這一點,從慢性疾病管理到先進的義肢。這項研究填補了將生物電子學中的 XAI 應用與歐盟法規的嚴格規定相符的重要空白。它為開發人員和研究人員提供了一個實用的架構,確保其 AI 創新能促進醫療技術並遵守法律和道德標準。 -##### **How to Get Your LLM to Generate Challenging Problems for Evaluation** -2502.14678v1 by Arkil Patel, Siva Reddy, Dzmitry Bahdanau +##### **Towards Case-based Interpretability for Medical Federated Learning** +2408.13626v1 by Laura Latorre, Liliana Petrychenko, Regina Beets-Tan, Taisiya Kopytova, Wilson Silva -The pace of evolution of Large Language Models (LLMs) necessitates new -approaches for rigorous and comprehensive evaluation. Traditional human -annotation is increasingly impracticable due to the complexities and costs -involved in generating high-quality, challenging problems. In this work, we -introduce CHASE, a unified framework to synthetically generate challenging -problems using LLMs without human involvement. For a given task, our approach -builds a hard problem in a bottom-up manner from simpler components. Moreover, -our framework decomposes the generation process into independently verifiable -sub-tasks, thereby ensuring a high level of quality and correctness. 
We -implement CHASE to create evaluation benchmarks across three diverse domains: -(1) document-based question answering, (2) repository-level code completion, -and (3) math reasoning. The performance of state-of-the-art LLMs on these -synthetic benchmarks lies in the range of 40-60% accuracy, thereby -demonstrating the effectiveness of our framework at generating challenging -problems. We publicly release our benchmarks and code. +We explore deep generative models to generate case-based explanations in a +medical federated learning setting. Explaining AI model decisions through +case-based interpretability is paramount to increasing trust and allowing +widespread adoption of AI in clinical practice. However, medical AI training +paradigms are shifting towards federated learning settings in order to comply +with data protection regulations. In a federated scenario, past data is +inaccessible to the current user. Thus, we use a deep generative model to +generate synthetic examples that protect privacy and explain decisions. Our +proof-of-concept focuses on pleural effusion diagnosis and uses publicly +available Chest X-ray data. 
-摘要:大型語言模型 (LLM) 的演化速度需要新的方法來進行嚴謹且全面的評估。由於產生高品質、具挑戰性的問題所涉及的複雜性和成本,傳統的人工標註正變得越來越不可行。在這項工作中,我們介紹了 CHASE,一個統一的框架,用於使用 LLM 合成產生具有挑戰性的問題,而無需人工參與。對於給定的任務,我們的做法是以自下而上的方式從更簡單的組成部分來建立一個困難的問題。此外,我們的框架將生成過程分解為獨立可驗證的子任務,從而確保高品質和正確性。我們實作 CHASE 來建立三個不同領域的評估基準:(1) 基於文件的問答、(2) 儲存庫層級的程式碼完成,以及 (3) 數學推理。最先進的 LLM 在這些合成基準上的效能落在 40-60% 的準確度範圍內,從而證明了我們的框架在產生具有挑戰性的問題上的有效性。我們公開發布我們的基準和程式碼。 +摘要:我們探索深度生成模型,在醫療聯邦學習設置中生成基於案例的說明。透過基於案例的可解釋性來解釋 AI 模型決策,對於增加信任並允許 AI 在臨床實務中廣泛採用至關重要。然而,醫療 AI 訓練範例正轉向聯邦學習設置,以符合資料保護法規。在聯邦情境中,過去的資料對目前的使用者而言是無法取得的。因此,我們使用深度生成模型來產生保護隱私和解釋決策的合成範例。我們的概念驗證著重於胸腔積液診斷,並使用公開可取得的胸部 X 光資料。 -##### **Data-Constrained Synthesis of Training Data for De-Identification** -2502.14677v1 by Thomas Vakili, Aron Henriksson, Hercules Dalianis +##### **AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines** +2408.12491v1 by Douwe J. Spaanderman, Matthew Marzetti, Xinyi Wan, Andrew F. Scarsbrook, Philip Robinson, Edwin H. G. Oei, Jacob J. Visser, Robert Hemke, Kirsten van Langevelde, David F. Hanff, Geert J. L. H. van Leenders, Cornelis Verhoef, Dirk J. Grünhagen, Wiro J. Niessen, Stefan Klein, Martijn P. A. Starmans -Many sensitive domains -- such as the clinical domain -- lack widely -available datasets due to privacy risks. The increasing generative capabilities -of large language models (LLMs) have made synthetic datasets a viable path -forward. In this study, we domain-adapt LLMs to the clinical domain and -generate synthetic clinical texts that are machine-annotated with tags for -personally identifiable information using capable encoder-based NER models. The -synthetic corpora are then used to train synthetic NER models. The results show -that training NER models using synthetic corpora incurs only a small drop in -predictive performance. The limits of this process are investigated in a -systematic ablation study -- using both Swedish and Spanish data.
Our analysis -shows that smaller datasets can be sufficient for domain-adapting LLMs for data -synthesis. Instead, the effectiveness of this process is almost entirely -contingent on the performance of the machine-annotating NER models trained -using the original data. +Soft-tissue and bone tumours (STBT) are rare, diagnostically challenging +lesions with variable clinical behaviours and treatment approaches. This +systematic review provides an overview of Artificial Intelligence (AI) methods +using radiological imaging for diagnosis and prognosis of these tumours, +highlighting challenges in clinical translation, and evaluating study alignment +with the Checklist for AI in Medical Imaging (CLAIM) and the FUTURE-AI +international consensus guidelines for trustworthy and deployable AI to promote +the clinical translation of AI methods. The review covered literature from +several bibliographic databases, including papers published before 17/07/2024. +Original research in peer-reviewed journals focused on radiology-based AI for +diagnosing or prognosing primary STBT was included. Exclusion criteria were +animal, cadaveric, or laboratory studies, and non-English papers. Abstracts +were screened by two of three independent reviewers for eligibility. Eligible +papers were assessed against guidelines by one of three independent reviewers. +The search identified 15,015 abstracts, from which 325 articles were included +for evaluation. Most studies performed moderately on CLAIM, averaging a score +of 28.9$\pm$7.5 out of 53, but poorly on FUTURE-AI, averaging 5.1$\pm$2.1 out +of 30. Imaging-AI tools for STBT remain at the proof-of-concept stage, +indicating significant room for improvement. Future efforts by AI developers +should focus on design (e.g. define unmet clinical need, intended clinical +setting and how AI would be integrated in clinical workflow), development (e.g. +build on previous work, explainability), evaluation (e.g. 
evaluating and +addressing biases, evaluating AI against best practices), and data +reproducibility and availability (making documented code and data publicly +available). Following these recommendations could improve clinical translation +of AI methods. -摘要:許多敏感領域(例如臨床領域)由於隱私風險而缺乏廣泛可用的資料集。大型語言模型 (LLM) 不斷增強的生成能力已使合成資料集成為可行的途徑。在這項研究中,我們將領域適應 LLM 應用於臨床領域,並生成使用具備編碼器功能的 NER 模型以個人可識別資訊標籤進行機器標註的合成臨床文本。然後使用合成語料庫來訓練合成 NER 模型。結果顯示,使用合成語料庫訓練 NER 模型僅會導致預測效能略微下降。在系統消融研究中調查此程序的限制,同時使用瑞典語和西班牙語資料。我們的分析顯示,較小的資料集足以用於領域適應 LLM 以進行資料合成。相反地,此程序的有效性幾乎完全取決於使用原始資料訓練的機器標註 NER 模型的效能。 +摘要:軟組織和骨骼腫瘤(STBT)是罕見、診斷具有挑戰性的病灶,其臨床行為和治療方法各不相同。這篇系統性回顧提供了使用放射影像進行診斷和預後的人工智慧 (AI) 方法的概觀,重點說明了臨床轉譯的挑戰,並評估研究與醫療影像 AI 核查表 (CLAIM) 和 FUTURE-AI 可信賴且可部署 AI 的國際共識準則的一致性,以促進 AI 方法的臨床轉譯。這篇回顧涵蓋了幾個書目資料庫中的文獻,包括在 2024 年 7 月 17 日之前發表的論文。納入了以放射為基礎的 AI 診斷或預後原發性 STBT 的同行評審期刊中的原始研究。排除標準是動物、屍體或實驗室研究,以及非英文論文。摘要由三位獨立審查員中的兩位篩選資格。合格的論文由三位獨立審查員中的一位根據準則進行評估。搜索識別出 15,015 篇摘要,其中 325 篇文章被納入評估。大多數研究在 CLAIM 中表現中等,平均得分為 53 分中的 28.9±7.5 分,但在 FUTURE-AI 中表現不佳,平均得分為 30 分中的 5.1±2.1 分。STBT 的影像 AI 工具仍處於概念驗證階段,表明有顯著的改進空間。AI 開發人員未來的努力應集中在設計(例如定義未滿足的臨床需求、預期的臨床環境以及 AI 如何整合到臨床工作流程中)、開發(例如建立在先前的工作、可解釋性)、評估(例如評估和解決偏差、評估 AI 與最佳實務)、以及數據可複製性和可用性(公開提供文件化的代碼和數據)。遵循這些建議可以改善 AI 方法的臨床轉譯。 -##### **BP-SGCN: Behavioral Pseudo-Label Informed Sparse Graph Convolution Network for Pedestrian and Heterogeneous Trajectory Prediction** -2502.14676v1 by Ruochen Li, Stamos Katsigiannis, Tae-Kyun Kim, Hubert P. H. Shum +##### **Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy** +2409.00001v1 by Kimji N. Pellano, Inga Strümke, Daniel Groos, Lars Adde, Espen Alexander F. Ihlen -Trajectory prediction allows better decision-making in applications of -autonomous vehicles or surveillance by predicting the short-term future -movement of traffic agents. It is classified into pedestrian or heterogeneous -trajectory prediction. 
The former exploits the relatively consistent behavior -of pedestrians, but is limited in real-world scenarios with heterogeneous -traffic agents such as cyclists and vehicles. The latter typically relies on -extra class label information to distinguish the heterogeneous agents, but such -labels are costly to annotate and cannot be generalized to represent different -behaviors within the same class of agents. In this work, we introduce the -behavioral pseudo-labels that effectively capture the behavior distributions of -pedestrians and heterogeneous agents solely based on their motion features, -significantly improving the accuracy of trajectory prediction. To implement the -framework, we propose the Behavioral Pseudo-Label Informed Sparse Graph -Convolution Network (BP-SGCN) that learns pseudo-labels and informs to a -trajectory predictor. For optimization, we propose a cascaded training scheme, -in which we first learn the pseudo-labels in an unsupervised manner, and then -perform end-to-end fine-tuning on the labels in the direction of increasing the -trajectory prediction accuracy. Experiments show that our pseudo-labels -effectively model different behavior clusters and improve trajectory -prediction. Our proposed BP-SGCN outperforms existing methods using both -pedestrian (ETH/UCY, pedestrian-only SDD) and heterogeneous agent datasets -(SDD, Argoverse 1). +Early detection of Cerebral Palsy (CP) is crucial for effective intervention +and monitoring. This paper tests the reliability and applicability of +Explainable AI (XAI) methods using a deep learning method that predicts CP by +analyzing skeletal data extracted from video recordings of infant movements. +Specifically, we use XAI evaluation metrics -- namely faithfulness and +stability -- to quantitatively assess the reliability of Class Activation +Mapping (CAM) and Gradient-weighted Class Activation Mapping (Grad-CAM) in this +specific medical application. 
We utilize a unique dataset of infant movements +and apply skeleton data perturbations without distorting the original dynamics +of the infant movements. Our CP prediction model utilizes an ensemble approach, +so we evaluate the XAI metrics performances for both the overall ensemble and +the individual models. Our findings indicate that both XAI methods effectively +identify key body points influencing CP predictions and that the explanations +are robust against minor data perturbations. Grad-CAM significantly outperforms +CAM in the RISv metric, which measures stability in terms of velocity. In +contrast, CAM performs better in the RISb metric, which relates to bone +stability, and the RRS metric, which assesses internal representation +robustness. Individual models within the ensemble show varied results, and +neither CAM nor Grad-CAM consistently outperform the other, with the ensemble +approach providing a representation of outcomes from its constituent models. -摘要:軌跡預測允許在自動駕駛車輛或監視應用中做出更好的決策,藉由預測交通代理的短期未來移動。它被分類為行人或異質軌跡預測。前者利用行人相對一致的行為,但受限於與自行車騎士和車輛等異質交通代理的真實世界場景。後者通常依賴額外的類別標籤資訊來區分異質代理,但此類標籤的註解成本很高,且無法概括為表示同一類別代理中的不同行為。在這項工作中,我們引入了行為偽標籤,它僅根據行人和異質代理的運動特徵有效捕捉行為分佈,顯著提升軌跡預測的準確度。為實作架構,我們提出了行為偽標籤告知稀疏圖形卷積網路 (BP-SGCN),它學習偽標籤並告知軌跡預測器。針對最佳化,我們提出了一種串聯訓練方案,其中我們首先以非監督的方式學習偽標籤,然後在標籤上執行端到端微調,朝著提升軌跡預測準確度的方向進行。實驗顯示我們的偽標籤有效建模不同的行為叢集,並提升軌跡預測。我們提出的 BP-SGCN 使用行人 (ETH/UCY,僅限行人的 SDD) 和異質代理資料集 (SDD,Argoverse 1) 都優於現有方法。 +摘要:腦性麻痺 (CP) 的早期偵測對於有效的介入和監測至關重要。本文測試了可解釋 AI (XAI) 方法的可靠性和適用性,使用深度學習方法,透過分析從嬰兒動作影片記錄中提取的骨骼資料來預測 CP。具體來說,我們使用 XAI 評估指標(即忠實度和穩定性)來量化評估類別激活映射 (CAM) 和梯度加權類別激活映射 (Grad-CAM) 在這個特定醫療應用中的可靠性。我們利用一個獨特的嬰兒動作資料集,並應用骨骼資料擾動,而不會扭曲嬰兒動作的原始動力。我們的 CP 預測模型利用整體方法,因此我們評估了整體整體和個別模型的 XAI 指標表現。我們的研究結果表明,兩種 XAI 方法都能有效識別影響 CP 預測的關鍵身體部位,並且這些解釋對於微小的資料擾動具有魯棒性。Grad-CAM 在 RISv 指標中顯著優於 CAM,該指標衡量速度方面的穩定性。相比之下,CAM 在 RISb 指標中表現得更好,該指標與骨骼穩定性有關,而 RRS 指標則評估內部表示的魯棒性。整體中的個別模型顯示出不同的結果,CAM 和 Grad-CAM 都不一致地優於另一種,整體方法提供了其組成模型結果的表示。 -##### **Explanations of Deep Language Models Explain Language Representations in 
the Brain** -2502.14671v1 by Maryam Rahimi, Yadollah Yaghoobzadeh, Mohammad Reza Daliri +##### **MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy** +2408.11837v1 by Hanchen David Wang, Nibraas Khan, Anna Chen, Nilanjan Sarkar, Pamela Wisniewski, Meiyi Ma -Recent advances in artificial intelligence have given rise to large language -models (LLMs) that not only achieve human-like performance but also share -computational principles with the brain's language processing mechanisms. While -previous research has primarily focused on aligning LLMs' internal -representations with neural activity, we introduce a novel approach that -leverages explainable AI (XAI) methods to forge deeper connections between the -two domains. Using attribution methods, we quantified how preceding words -contribute to an LLM's next-word predictions and employed these explanations to -predict fMRI recordings from participants listening to the same narratives. Our -findings demonstrate that attribution methods robustly predict brain activity -across the language network, surpassing traditional internal representations in -early language areas. This alignment is hierarchical: early-layer explanations -correspond to the initial stages of language processing in the brain, while -later layers align with more advanced stages. Moreover, the layers more -influential on LLM next-word prediction$\unicode{x2014}$those with higher -attribution scores$\unicode{x2014}$exhibited stronger alignment with neural -activity. This work establishes a bidirectional bridge between AI and -neuroscience. First, we demonstrate that attribution methods offer a powerful -lens for investigating the neural mechanisms of language comprehension, -revealing how meaning emerges from preceding context. Second, we propose using -brain alignment as a metric to evaluate the validity of attribution methods, -providing a framework for assessing their biological plausibility. 
+Recent global estimates suggest that as many as 2.41 billion individuals have +health conditions that would benefit from rehabilitation services. Home-based +Physical Therapy (PT) faces significant challenges in providing interactive +feedback and meaningful observation for therapists and patients. To fill this +gap, we present MicroXercise, which integrates micro-motion analysis with +wearable sensors, providing therapists and patients with a comprehensive +feedback interface, including video, text, and scores. Crucially, it employs +multi-dimensional Dynamic Time Warping (DTW) and attribution-based explainable +methods to analyze the existing deep learning neural networks in monitoring +exercises, focusing on a high granularity of exercise. This synergistic +approach is pivotal, providing output matching the input size to precisely +highlight critical subtleties and movements in PT, thus transforming complex AI +analysis into clear, actionable feedback. By highlighting these micro-motions +in different metrics, such as stability and range of motion, MicroXercise +significantly enhances the understanding and relevance of feedback for +end-users. Comparative performance metrics underscore its effectiveness over +traditional methods, such as a 39% and 42% improvement in Feature Mutual +Information (FMI) and Continuity. MicroXercise is a step ahead in home-based +physical therapy, providing a technologically advanced and intuitively helpful +solution to enhance patient care and outcomes. 
-摘要:最近的人工智能的進展產生了大型語言模型 (LLM),它不僅達到類似人類的表現,還與大腦的語言處理機制共享計算原理。雖然先前的研究主要集中於將 LLM 的內部表徵與神經活動對齊,但我們引入了一種新穎的方法,該方法利用可解釋 AI (XAI) 方法在兩個域之間建立更深層的聯繫。使用歸因方法,我們量化了前一個單詞如何促成 LLM 的下一個單詞預測,並利用這些解釋來預測參與者在聆聽相同敘述時的大腦功能性磁共振造影 (fMRI) 記錄。我們的發現表明,歸因方法可以穩健地預測整個語言網路中的大腦活動,超越了早期語言區域中的傳統內部表徵。這種對齊是分層的:早期層次解釋對應於大腦中語言處理的初始階段,而後續層次則與更進階的階段對齊。此外,對 LLM 下一個單詞預測影響力較大的層次(即歸因分數較高的層次)表現出與神經活動更強的對齊。這項工作在 AI 與神經科學之間建立了一個雙向橋樑。首先,我們證明歸因方法提供了一個強大的視角,用於研究語言理解的神經機制,揭示意義如何從先前的脈絡中產生。其次,我們建議使用大腦對齊作為評估歸因方法有效性的指標,提供了一個評估其生物學合理性的框架。 +摘要:最近的全球估計表明,多達 24.1 億人有 +健康狀況可從復健服務中受益。居家 +物理治療 (PT) 在提供互動式 +回饋和有意義的觀察方面面臨重大挑戰,供治療師和患者使用。為了填補這 +個缺口,我們提出 MicroXercise,它將微動作分析與 +可穿戴式感測器整合在一起,為治療師和患者提供一個全面的 +回饋介面,包括影片、文字和分數。至關重要的是,它採用 +多維動態時間規整 (DTW) 和基於歸因的可解釋 +方法來分析監控運動中現有的深度學習神經網路,專注於運動的高粒度。這種協同 +方法至關重要,提供與輸入大小匹配的輸出,以精確地 +突出 PT 中關鍵的細微差別和動作,從而將複雜的 AI +分析轉換為清晰、可操作的回饋。透過在不同指標中突顯這些微動作,例如穩定性和動作範圍,MicroXercise +顯著提升最終使用者對回饋的理解和相關性。比較效能指標強調其優於 +傳統方法的有效性,例如特徵互惠資訊 (FMI) 和連續性分別提升了 39% 和 42%。MicroXercise 在居家 +物理治療方面更進一步,提供技術先進且直覺有用的 +解決方案,以提升患者照護和結果。 -##### **AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO** -2502.14669v1 by Alan Dao, Dinh Bach Vu +##### **The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development** +2408.05239v1 by Joshua Morriss, Tod Brindle, Jessica Bah Rösman, Daniel Reibsamen, Andreas Enz -Large Language Models (LLMs) have demonstrated impressive capabilities in -language processing, yet they often struggle with tasks requiring genuine -visual spatial reasoning. In this paper, we introduce a novel two-stage -training framework designed to equip standard LLMs with visual reasoning -abilities for maze navigation. First, we leverage Supervised Fine Tuning (SFT) -on a curated dataset of tokenized maze representations to teach the model to -predict step-by-step movement commands. 
Next, we apply Group Relative Policy -Optimization (GRPO)-a technique used in DeepSeekR1-with a carefully crafted -reward function to refine the model's sequential decision-making and encourage -emergent chain-of-thought behaviors. Experimental results on synthetically -generated mazes show that while a baseline model fails to navigate the maze, -the SFT-trained model achieves 86% accuracy, and further GRPO fine-tuning -boosts accuracy to 93%. Qualitative analyses reveal that GRPO fosters more -robust and self-corrective reasoning, highlighting the potential of our -approach to bridge the gap between language models and visual spatial tasks. -These findings offer promising implications for applications in robotics, -autonomous navigation, and other domains that require integrated visual and -sequential reasoning. +Systematic literature reviews are the highest quality of evidence in +research. However, the review process is hindered by significant resource and +data constraints. The Literature Review Network (LRN) is the first of its kind +explainable AI platform adhering to PRISMA 2020 standards, designed to automate +the entire literature review process. LRN was evaluated in the domain of +surgical glove practices using 3 search strings developed by experts to query +PubMed. A non-expert trained all LRN models. Performance was benchmarked +against an expert manual review. Explainability and performance metrics +assessed LRN's ability to replicate the experts' review. Concordance was +measured with the Jaccard index and confusion matrices. Researchers were +blinded to the other's results until study completion. Overlapping studies were +integrated into an LRN-generated systematic review. LRN models demonstrated +superior classification accuracy without expert training, achieving 84.78% and +85.71% accuracy. 
The highest performance model achieved high interrater +reliability (k = 0.4953) and explainability metrics, linking 'reduce', +'accident', and 'sharp' with 'double-gloving'. Another LRN model covered 91.51% +of the relevant literature despite diverging from the non-expert's judgments (k += 0.2174), with the terms 'latex', 'double' (gloves), and 'indication'. LRN +outperformed the manual review (19,920 minutes over 11 months), reducing the +entire process to 288.6 minutes over 5 days. This study demonstrates that +explainable AI does not require expert training to successfully conduct +PRISMA-compliant systematic literature reviews like an expert. LRN summarized +the results of surgical glove studies and identified themes that were nearly +identical to the clinical researchers' findings. Explainable AI can accurately +expedite our understanding of clinical practices, potentially revolutionizing +healthcare research. -摘要:大型語言模型(LLM)在語言處理方面展現出令人印象深刻的能力,但它們經常難以應付需要真正視覺空間推理的任務。在本文中,我們介紹了一種新穎的兩階段訓練架構,旨在為標準 LLM 提供迷宮導航的視覺推理能力。首先,我們在標記化迷宮表示的策展資料集上利用監督微調(SFT)來教導模型預測逐步移動指令。接下來,我們使用 DeepSeekR1 中使用的技術,即群體相對策略最佳化(GRPO),並搭配精心設計的獎勵函數來優化模型的順序決策制定,並鼓勵出現連貫的思考行為。在合成產生的迷宮上進行的實驗結果顯示,雖然基準模型無法導航迷宮,但經過 SFT 訓練的模型達到 86% 的準確度,而進一步的 GRPO 微調將準確度提升至 93%。定性分析顯示,GRPO 促進更強健且自我修正的推理,凸顯了我們的方法在彌合語言模型與視覺空間任務之間差距的潛力。這些發現為機器人、自主導航和其他需要整合視覺和順序推理的領域的應用提供了有希望的啟示。 +摘要:系統性文獻回顧是研究中證據品質最高的。然而,回顧過程受到顯著資源和資料限制的阻礙。文獻回顧網路 (LRN) 是第一個遵循 PRISMA 2020 標準的可解釋 AI 平台,旨在自動化整個文獻回顧過程。LRN 在外科手套實務領域中進行評估,使用專家開發的 3 個搜尋字串來查詢 PubMed。非專家訓練所有 LRN 模型。效能以專家手動回顧作為基準。可解釋性和效能指標評估 LRN 複製專家回顧的能力。一致性以 Jaccard 指數和混淆矩陣測量。研究人員在研究完成前對彼此的結果保密。重疊的研究整合到 LRN 生成的系統性回顧中。LRN 模型在沒有專家訓練的情況下展現出優異的分類準確率,達到 84.78% 和 85.71% 的準確率。效能最高的模型達到了高評分者間信賴度 (k = 0.4953) 和可解釋性指標,將「減少」、「意外」和「銳利」與「雙重戴手套」連結在一起。另一個 LRN 模型涵蓋了 91.51% 的相關文獻,儘管與非專家的判斷不同 (k = 0.2174),但包含了「乳膠」、「雙重」(手套)和「適應症」等詞彙。LRN 優於手動回顧(11 個月超過 19,920 分鐘),將整個過程縮短為 5 天超過 288.6 分鐘。這項研究顯示,可解釋的 AI 不需要專家訓練即可成功進行專家等級的 PRISMA 相容系統性文獻回顧。LRN 總結了外科手套研究的結果,並找出與臨床研究人員發現幾乎相同的主题。可解釋的 AI 可以準確地加快我們對臨床實務的理解,有潛力革新醫療保健研究。 
-##### **InstructAgent: Building User Controllable Recommender via LLM Agent** -2502.14662v1 by Wujiang Xu, Yunxiao Shi, Zujie Liang, Xuying Ning, Kai Mei, Kun Wang, Xi Zhu, Min Xu, Yongfeng Zhang +##### **Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns** +2408.02709v1 by Chi Him Ng -Traditional recommender systems usually take the user-platform paradigm, -where users are directly exposed under the control of the platform's -recommendation algorithms. However, the defect of recommendation algorithms may -put users in very vulnerable positions under this paradigm. First, many -sophisticated models are often designed with commercial objectives in mind, -focusing on the platform's benefits, which may hinder their ability to protect -and capture users' true interests. Second, these models are typically optimized -using data from all users, which may overlook individual user's preferences. -Due to these shortcomings, users may experience several disadvantages under the -traditional user-platform direct exposure paradigm, such as lack of control -over the recommender system, potential manipulation by the platform, echo -chamber effects, or lack of personalization for less active users due to the -dominance of active users during collaborative learning. Therefore, there is an -urgent need to develop a new paradigm to protect user interests and alleviate -these issues. Recently, some researchers have introduced LLM agents to simulate -user behaviors, these approaches primarily aim to optimize platform-side -performance, leaving core issues in recommender systems unresolved. To address -these limitations, we propose a new user-agent-platform paradigm, where agent -serves as the protective shield between user and recommender system that -enables indirect exposure. To this end, we first construct four recommendation -datasets, denoted as $\dataset$, along with user instructions for each record. 
+This study analyzes hybrid AI systems' design patterns and their +effectiveness in clinical decision-making using the boxology framework. It +categorizes and compares various architectures combining machine learning and +rule-based reasoning to provide insights into their structural foundations and +healthcare applications. Addressing two main questions, how to categorize these +systems against established design patterns and how to extract insights through +comparative analysis, the study uses design patterns from software engineering +to understand and optimize healthcare AI systems. Boxology helps identify +commonalities and create reusable solutions, enhancing these systems' +scalability, reliability, and performance. Five primary architectures are +examined: REML, MLRB, RBML, RMLT, and PERML. Each has unique strengths and +weaknesses, highlighting the need for tailored approaches in clinical tasks. +REML excels in high-accuracy prediction for datasets with limited data; MLRB in +handling large datasets and complex data integration; RBML in explainability +and trustworthiness; RMLT in managing high-dimensional data; and PERML, though +limited in analysis, shows promise in urgent care scenarios. The study +introduces four new patterns, creates five abstract categorization patterns, +and refines those five further to specific systems. These contributions enhance +Boxology's taxonomical organization and offer novel approaches to integrating +expert knowledge with machine learning. Boxology's structured, modular approach +offers significant advantages in developing and analyzing hybrid AI systems, +revealing commonalities, and promoting reusable solutions. In conclusion, this +study underscores hybrid AI systems' crucial role in advancing healthcare and +Boxology's potential to drive further innovation in AI integration, ultimately +improving clinical decision support and patient outcomes.
-摘要:傳統推薦系統通常採用使用者-平台範例, -其中使用者直接暴露在平台推薦演算法的控制之下。然而,推薦演算法的缺陷可能會讓使用者在這個範例中處於非常脆弱的位置。首先,許多精密的模型通常在設計時就考慮到商業目標,專注於平台的利益,這可能會阻礙它們保護和掌握使用者真正興趣的能力。其次,這些模型通常使用所有使用者的資料進行最佳化,這可能會忽略個別使用者的偏好。由於這些缺點,使用者可能會在傳統使用者-平台直接暴露範例中遇到一些缺點,例如缺乏對推薦系統的控制、平台的潛在操縱、同溫層效應,或由於活躍使用者在協作學習中的主導地位而缺乏針對較不活躍使用者的個人化。因此,迫切需要開發一種新的範例來保護使用者利益並緩解這些問題。最近,一些研究人員引入了 LLM 代理程式來模擬使用者行為,這些方法主要旨在最佳化平台端的效能,而未解決推薦系統中的核心問題。為了解決這些限制,我們提出了一種新的使用者-代理程式-平台範例,其中代理程式作為使用者和推薦系統之間的保護盾,實現間接暴露。為此,我們首先構建了四個推薦資料集,表示為 $\dataset$,以及每條記錄的使用者說明。 +摘要:本研究使用盒子學框架分析混合人工智慧系統的設計模式及其在臨床決策中的有效性。它分類並比較結合機器學習和基於規則的推理的各種架構,以深入了解其結構基礎和醫療保健應用。針對兩個主要問題,如何根據既定的設計模式對這些系統進行分類,以及如何通過比較分析提取見解,本研究使用軟體工程中的設計模式來了解和優化醫療保健人工智慧系統。盒子學有助於識別共性並建立可重複使用的解決方案,從而增強這些系統的可擴充性、可靠性和效能。檢查了五種主要的架構:REML、MLRB、RBML、RMLT 和 PERML。每種架構都有獨特的優缺點,強調了在臨床任務中需要量身打造的方法。REML 在資料有限的資料集中表現出高精度的預測;MLRB 在處理大型資料集和複雜資料整合方面表現出色;RBML 在可解釋性和可信度方面表現出色;RMLT 在管理高維資料方面表現出色;而 PERML 儘管在分析方面有限,但在緊急照護場景中表現出潛力。本研究引入了四種新模式,建立了五種抽象分類模式,並進一步將這五種模式細化為具體的系統。這些貢獻增強了盒子學的分類組織,並提供了將專家知識與機器學習整合的新方法。盒子學的結構化、模組化方法在開發和分析混合人工智慧系統、揭示共性以及推廣可重複使用的解決方案方面具有顯著優勢。總之,本研究強調了混合人工智慧系統在推進醫療保健中的關鍵作用,以及盒子學在推動人工智慧整合進一步創新方面的潛力,最終改善臨床決策支援和患者的治療成果。 -##### **Edit Once, Update Everywhere: A Simple Framework for Cross-Lingual Knowledge Synchronization in LLMs** -2502.14645v1 by Yuchen Wu, Liang Ding, Li Shen, Dacheng Tao +##### **Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability** +2408.02706v1 by Masoud Muhammed Hassan -Knowledge editing allows for efficient adaptation of large language models -(LLMs) to new information or corrections without requiring full retraining. -However, prior methods typically focus on either single-language editing or -basic multilingual editing, failing to achieve true cross-linguistic knowledge -synchronization. To address this, we present a simple and practical -state-of-the-art (SOTA) recipe Cross-Lingual Knowledge Democracy Edit (X-KDE), -designed to propagate knowledge from a dominant language to other languages -effectively. 
Our X-KDE comprises two stages: (i) Cross-lingual Edition -Instruction Tuning (XE-IT), which fine-tunes the model on a curated parallel -dataset to modify in-scope knowledge while preserving unrelated information, -and (ii) Target-language Preference Optimization (TL-PO), which applies -advanced optimization techniques to ensure consistency across languages, -fostering the transfer of updates. Additionally, we contribute a high-quality, -cross-lingual dataset, specifically designed to enhance knowledge transfer -across languages. Extensive experiments on the Bi-ZsRE and MzsRE benchmarks -show that X-KDE significantly enhances cross-lingual performance, achieving an -average improvement of +8.19%, while maintaining high accuracy in monolingual -settings. +Because of its strong predictive skills, deep learning has emerged as an +essential tool in many industries, including healthcare. Traditional deep +learning models, on the other hand, frequently lack interpretability and omit +to take prediction uncertainty into account two crucial components of clinical +decision making. In order to produce explainable and uncertainty aware +predictions, this study presents a novel framework called Bayesian Kolmogorov +Arnold Networks (BKANs), which combines the expressive capacity of Kolmogorov +Arnold Networks with Bayesian inference. We employ BKANs on two medical +datasets, which are widely used benchmarks for assessing machine learning +models in medical diagnostics: the Pima Indians Diabetes dataset and the +Cleveland Heart Disease dataset. Our method provides useful insights into +prediction confidence and decision boundaries and outperforms traditional deep +learning models in terms of prediction accuracy. Moreover, BKANs' capacity to +represent aleatoric and epistemic uncertainty guarantees doctors receive more +solid and trustworthy decision support. 
Our Bayesian strategy improves the +interpretability of the model and considerably minimises overfitting, which is +important for tiny and imbalanced medical datasets, according to experimental +results. We present possible expansions to further use BKANs in more +complicated multimodal datasets and address the significance of these +discoveries for future research in building reliable AI systems for healthcare. +This work paves the way for a new paradigm in deep learning model deployment in +vital sectors where transparency and reliability are crucial. -摘要:知識編輯允許大語言模型 (LLM) 有效地適應新資訊或修正,而無需進行完整的再訓練。 -然而,先前的做法通常專注於單一語言編輯或基本的語音編輯,未能實現真正的跨語言知識同步。為了解決這個問題,我們提出了一個簡單且實用的最先進 (SOTA) 配方,即跨語言知識民主編輯 (X-KDE),旨在有效地從主導語言傳播知識到其他語言。我們的 X-KDE 包含兩個階段:(i) 跨語言版本指令調整 (XE-IT),它微調模型,在經過整理的平行資料集上修改範圍內的知識,同時保留不相關的資訊,以及 (ii) 目標語言偏好最佳化 (TL-PO),它應用先進的最佳化技術,以確保跨語言的一致性,促進更新的傳輸。此外,我們貢獻了一個高品質的跨語言資料集,特別設計用於增強跨語言的知識傳輸。在 Bi-ZsRE 和 MzsRE 基準上的廣泛實驗表明,X-KDE 大幅提升了跨語言效能,在單語言設定中維持高準確度的同時,平均提升了 +8.19%。 +摘要:由於其強大的預測能力,深度學習已成為許多產業中不可或缺的工具,包括醫療保健。然而,傳統的深度學習模型通常缺乏可解釋性,並且忽略了將預測不確定性納入考量,而這兩個因素是臨床決策制定的關鍵組成部分。為了產生可解釋且具有不確定性意識的預測,本研究提出了一個名為貝氏柯爾莫哥洛夫阿諾德網路 (BKAN) 的新架構,它結合了柯爾莫哥洛夫阿諾德網路的表達能力與貝氏推論。我們在兩個醫學資料集上使用 BKAN,這些資料集是評估機器學習模型在醫學診斷中的廣泛使用基準:皮馬印第安人糖尿病資料集和克里夫蘭心臟病資料集。我們的模型提供了對預測信心和決策邊界的有益見解,並且在預測準確度方面優於傳統的深度學習模型。此外,BKAN 表現隨機和認識不確定性的能力,可確保醫生獲得更可靠且值得信賴的決策支援。根據實驗結果,我們的貝氏策略提高了模型的可解釋性,並大幅減少了過度擬合,這對於小型且不平衡的醫學資料集非常重要。我們提出了可能的擴充功能,以進一步將 BKAN 用於更複雜的多模式資料集,並探討這些發現對於未來建立可靠的醫療保健 AI 系統研究的重要性。這項工作為深度學習模型部署在透明度和可靠性至關重要的重要領域中開啟了一個新的典範。 -##### **LIFT: Improving Long Context Understanding of Large Language Models through Long Input Fine-Tuning** -2502.14644v1 by Yansheng Mao, Yufei Xu, Jiaqi Li, Fanxu Meng, Haotong Yang, Zilong Zheng, Xiyuan Wang, Muhan Zhang +##### **MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI** +2407.20284v1 by Shyam Dongre, Ritesh Chandra, Sonali Agarwal -Long context understanding remains challenging for large language models due 
-to their limited context windows. This paper presents Long Input Fine-Tuning -(LIFT), a novel framework for long-context modeling that can improve the -long-context performance of arbitrary (short-context) LLMs by dynamically -adapting model parameters based on the long input. Importantly, LIFT, rather -than endlessly extending the context window size to accommodate increasingly -longer inputs in context, chooses to store and absorb the long input in -parameter. By fine-tuning the long input into model parameters, LIFT allows -short-context LLMs to answer questions even when the required information is -not provided in the context during inference. Furthermore, to enhance LIFT -performance while maintaining the original in-context learning (ICL) -capabilities, we introduce Gated Memory, a specialized attention adapter that -automatically balances long input memorization and ICL. We provide a -comprehensive analysis of the strengths and limitations of LIFT on long context -understanding, offering valuable directions for future research. +In modern healthcare, addressing the complexities of accurate disease +prediction and personalized recommendations is both crucial and challenging. +This research introduces MLtoGAI, which integrates Semantic Web technology with +Machine Learning (ML) to enhance disease prediction and offer user-friendly +explanations through ChatGPT. The system comprises three key components: a +reusable disease ontology that incorporates detailed knowledge about various +diseases, a diagnostic classification model that uses patient symptoms to +detect specific diseases accurately, and the integration of Semantic Web Rule +Language (SWRL) with ontology and ChatGPT to generate clear, personalized +health advice. This approach significantly improves prediction accuracy and +ensures results that are easy to understand, addressing the complexity of +diseases and diverse symptoms. 
The MLtoGAI system demonstrates substantial +advancements in accuracy and user satisfaction, contributing to developing more +intelligent and accessible healthcare solutions. This innovative approach +combines the strengths of ML algorithms with the ability to provide +transparent, human-understandable explanations through ChatGPT, achieving +significant improvements in prediction accuracy and user comprehension. By +leveraging semantic technology and explainable AI, the system enhances the +accuracy of disease prediction and ensures that the recommendations are +relevant and easily understood by individual patients. Our research highlights +the potential of integrating advanced technologies to overcome existing +challenges in medical diagnostics, paving the way for future developments in +intelligent healthcare systems. Additionally, the system is validated using 200 +synthetic patient data records, ensuring robust performance and reliability. -摘要:由於大型語言模型的上下文視窗有限,因此對於它們而言,長語境理解仍然具有挑戰性。本文提出了長輸入微調 (LIFT),這是一個用於長語境建模的新穎架構,它可以通過根據長輸入動態調整模型參數來改善任意(短語境)LLM 的長語境效能。重要的是,LIFT 沒有無限擴充上下文視窗大小以容納語境中越來越長的輸入,而是選擇將長輸入儲存在參數中並吸收它。通過將長輸入微調到模型參數中,LIFT 允許短語境 LLM 回答問題,即使在推理期間語境中沒有提供所需資訊也是如此。此外,為了在保持原始語境中學習 (ICL) 能力的同時增強 LIFT 效能,我們引入了閘控記憶體,這是一個自動平衡長輸入記憶和 ICL 的特殊注意力適配器。我們對 LIFT 在長語境理解方面的優缺點進行了全面的分析,為未來的研究提供了有價值的方向。 +摘要:在現代醫療保健中,解決準確疾病預測和個性化建議的複雜性既至關重要又具有挑戰性。本研究引入了 MLtoGAI,它將語義網路技術與機器學習 (ML) 相結合,以增強疾病預測並透過 ChatGPT 提供使用者友善的說明。該系統包含三個關鍵組成部分:一個可重複使用的疾病本体,其中包含有關各種疾病的詳細知識;一個診斷分類模型,它使用患者症狀來準確檢測特定疾病;以及語義網路規則語言 (SWRL) 與本体和 ChatGPT 的整合,以產生清晰、個性化的健康建議。這種方法顯著提高了預測準確性,並確保了易於理解的結果,解決了疾病和不同症狀的複雜性。MLtoGAI 系統展示了準確性和使用者滿意度的實質性進步,有助於開發更智慧且更易於取得的醫療保健解決方案。這種創新的方法結合了 ML 演算法的優點,以及透過 ChatGPT 提供透明且人類可以理解的說明的能力,在預測準確性和使用者理解方面取得了顯著的進步。透過利用語義技術和可解釋的 AI,該系統提高了疾病預測的準確性,並確保了建議與個別患者相關且易於理解。我們的研究強調了整合先進技術以克服醫療診斷中現有挑戰的潛力,為智慧醫療保健系統的未來發展鋪路。此外,該系統使用 200 個合成患者資料記錄進行驗證,確保了穩健的效能和可靠性。 -##### **Length-Controlled Margin-Based Preference Optimization without Reference Model** -2502.14643v1 by Gengxu Li, Tingyu Xia, Yi Chang, Yuan 
Wu +##### **Introducing δ-XAI: a novel sensitivity-based method for local AI explanations** +2407.18343v2 by Alessandro De Carlo, Enea Parimbelli, Nicola Melillo, Giovanna Nicora -Direct Preference Optimization (DPO) is a widely adopted offline algorithm -for preference-based reinforcement learning from human feedback (RLHF), -designed to improve training simplicity and stability by redefining reward -functions. However, DPO is hindered by several limitations, including length -bias, memory inefficiency, and probability degradation. To address these -challenges, we propose Length-Controlled Margin-Based Preference Optimization -(LMPO), a more efficient and robust alternative. LMPO introduces a uniform -reference model as an upper bound for the DPO loss, enabling a more accurate -approximation of the original optimization objective. Additionally, an average -log-probability optimization strategy is employed to minimize discrepancies -between training and inference phases. A key innovation of LMPO lies in its -Length-Controlled Margin-Based loss function, integrated within the -Bradley-Terry framework. This loss function regulates response length while -simultaneously widening the margin between preferred and rejected outputs. By -doing so, it mitigates probability degradation for both accepted and discarded -responses, addressing a significant limitation of existing methods. We evaluate -LMPO against state-of-the-art preference optimization techniques on two -open-ended large language models, Mistral and LLaMA3, across six conditional -benchmarks. Our experimental results demonstrate that LMPO effectively controls -response length, reduces probability degradation, and outperforms existing -approaches. The code is available at \url{https://github.com/gengxuli/LMPO}. +Explainable Artificial Intelligence (XAI) is central to the debate on +integrating Artificial Intelligence (AI) and Machine Learning (ML) algorithms +into clinical practice. 
High-performing AI/ML models, such as ensemble learners +and deep neural networks, often lack interpretability, hampering clinicians' +trust in their predictions. To address this, XAI techniques are being developed +to describe AI/ML predictions in human-understandable terms. One promising +direction is the adaptation of sensitivity analysis (SA) and global sensitivity +analysis (GSA), which inherently rank model inputs by their impact on +predictions. Here, we introduce a novel delta-XAI method that provides local +explanations of ML model predictions by extending the delta index, a GSA +metric. The delta-XAI index assesses the impact of each feature's value on the +predicted output for individual instances in both regression and classification +problems. We formalize the delta-XAI index and provide code for its +implementation. The delta-XAI method was evaluated on simulated scenarios using +linear regression models, with Shapley values serving as a benchmark. Results +showed that the delta-XAI index is generally consistent with Shapley values, +with notable discrepancies in models with highly impactful or extreme feature +values. The delta-XAI index demonstrated higher sensitivity in detecting +dominant features and handling extreme feature values. Qualitatively, the +delta-XAI provides intuitive explanations by leveraging probability density +functions, making feature rankings clearer and more explainable for +practitioners. Overall, the delta-XAI method appears promising for robustly +obtaining local explanations of ML model predictions. Further investigations in +real-world clinical settings will be conducted to evaluate its impact on +AI-assisted clinical workflows. 
-摘要:直接偏好優化 (DPO) 是一種廣泛採用的離線演算法,用於從人類回饋 (RLHF) 中進行基於偏好的強化學習,旨在透過重新定義獎勵函數來提升訓練的簡潔性和穩定性。然而,DPO 受到若干限制的阻礙,包括長度偏差、記憶體效率低下和機率下降。為了解決這些挑戰,我們提出長度控制邊際偏好優化 (LMPO),一種更有效率且穩健的替代方案。LMPO 引入統一參考模型作為 DPO 損失的上限,能夠更準確地近似原始最佳化目標。此外,採用平均對數機率最佳化策略來最小化訓練和推論階段之間的差異。LMPO 的一項關鍵創新在於其長度控制邊際損失函數,整合在 Bradley-Terry 架構中。此損失函數調節回應長度,同時擴大偏好和拒絕輸出之間的邊際。藉由這麼做,它減輕了已接受和已捨棄回應的機率下降,解決了現有方法的重大限制。我們在兩個開放式大型語言模型 Mistral 和 LLaMA3 上,針對六個條件基準,評估 LMPO 與最先進的偏好優化技術。我們的實驗結果證明,LMPO 有效控制回應長度,減少機率下降,並優於現有方法。程式碼可在 \url{https://github.com/gengxuli/LMPO} 取得。 +摘要:可解釋人工智慧 (XAI) 是將人工智慧 (AI) 和機器學習 (ML) 演算法整合到臨床實務中的辯論核心。高執行效能的 AI/ML 模型,例如整體學習器和深度神經網路,通常缺乏可解釋性,阻礙臨床醫生對其預測的信任。為了解決這個問題,正在開發 XAI 技術,以人類可以理解的術語描述 AI/ML 預測。一個有希望的方向是採用敏感度分析 (SA) 和全球敏感度分析 (GSA),它們本質上會依據模型輸入對預測的影響來對其進行排名。在此,我們介紹一種新的 delta-XAI 方法,透過擴充 GSA 指標 delta 指數來提供 ML 模型預測的局部解釋。delta-XAI 指數評估每個特徵值對回歸和分類問題中個別例項的預測輸出之影響。我們將 delta-XAI 指數形式化,並提供其實作的程式碼。使用線性回歸模型對模擬情境評估 delta-XAI 方法,並以 Shapley 值作為基準。結果顯示 delta-XAI 指數通常與 Shapley 值一致,但在具有高度影響力或極端特徵值的模型中存在顯著差異。delta-XAI 指數在偵測主要特徵和處理極端特徵值方面表現出更高的敏感度。定性地來說,delta-XAI 透過利用機率密度函數提供直觀的解釋,使特徵排名更清晰且對從業人員來說更具可解釋性。總體而言,delta-XAI 方法對於穩健地取得 ML 模型預測的局部解釋似乎很有希望。將在真實世界的臨床環境中進行進一步調查,以評估其對 AI 輔助臨床工作流程的影響。 -##### **How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation** -2502.14642v1 by Rui Li, Heming Xia, Xinfeng Yuan, Qingxiu Dong, Lei Sha, Wenjie Li, Zhifang Sui +##### **Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population** +2407.17324v2 by Nikolaos Ntampakis, Konstantinos Diamantaras, Ioanna Chouvarda, Vasileios Argyriou, Panagiotis Sarigianndis -Recently, LLMs have garnered increasing attention across academic disciplines -for their potential as human digital twins, virtual proxies designed to -replicate individuals and autonomously perform tasks such as decision-making, -problem-solving, and reasoning on their behalf. 
However, current evaluations of -LLMs primarily emphasize dialogue simulation while overlooking human behavior -simulation, which is crucial for digital twins. To address this gap, we -introduce BehaviorChain, the first benchmark for evaluating LLMs' ability to -simulate continuous human behavior. BehaviorChain comprises diverse, -high-quality, persona-based behavior chains, totaling 15,846 distinct behaviors -across 1,001 unique personas, each with detailed history and profile metadata. -For evaluation, we integrate persona metadata into LLMs and employ them to -iteratively infer contextually appropriate behaviors within dynamic scenarios -provided by BehaviorChain. Comprehensive evaluation results demonstrated that -even state-of-the-art models struggle with accurately simulating continuous -human behavior. +Dementia, a debilitating neurological condition affecting millions worldwide, +presents significant diagnostic challenges. In this work, we introduce a novel +methodology for the classification of demented and non-demented elderly +patients using 3D brain Magnetic Resonance Imaging (MRI) scans. Our approach +features a unique technique for selectively processing MRI slices, focusing on +the most relevant brain regions and excluding less informative sections. This +methodology is complemented by a confidence-based classification committee +composed of three custom deep learning models: Dem3D ResNet, Dem3D CNN, and +Dem3D EfficientNet. These models work synergistically to enhance +decision-making accuracy, leveraging their collective strengths. Tested on the +Open Access Series of Imaging Studies (OASIS) dataset, our method achieved an +impressive accuracy of 94.12%, surpassing existing methodologies. Furthermore, +validation on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset +confirmed the robustness and generalizability of our approach.
The use of +explainable AI (XAI) techniques and comprehensive ablation studies further +substantiate the effectiveness of our techniques, providing insights into the +decision-making process and the importance of our methodology. This research +offers a significant advancement in dementia diagnosis, providing a highly +accurate and efficient tool for clinical applications. + +摘要:失智症是一種影響全球數百萬人的衰弱性神經疾病,在診斷上具有重大挑戰。在這項工作中,我們提出了一種新的方法,用於對失智和非失智老年患者進行分類,使用 3D 大腦磁振造影 (MRI) 掃描。我們的做法採用了一種獨特技術,用於選擇性處理 MRI 切片,重點關注最相關的大腦區域,並排除信息量較少的部分。這種方法由一個基於信心的分類委員會補充,該委員會由三個自定義深度學習模型組成:Dem3D ResNet、Dem3D CNN 和 Dem3D EfficientNet。這些模型協同工作以增強決策的準確性,利用它們的集體優勢。在影像研究開放存取系列 (OASIS) 資料集上進行測試,我們的模型達到了 94.12% 的驚人準確度,超過了現有方法。此外,在阿茲海默症神經影像倡議 (ADNI) 資料集上的驗證證實了我們方法的穩健性和普遍性。可解釋 AI (XAI) 技術和全面的消融研究進一步證實了我們技術的有效性,提供了對決策過程和我們方法重要性的見解。這項研究為失智症診斷提供了重大進展,為臨床應用提供了一個高度準確且高效的工具。 + +##### **Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition** +2408.06352v1 by Michele Fiori, Gabriele Civitarese, Claudio Bettini + +Recognizing daily activities with unobtrusive sensors in smart environments +enables various healthcare applications. Monitoring how subjects perform +activities at home and their changes over time can reveal early symptoms of +health issues, such as cognitive decline. Most approaches in this field use +deep learning models, which are often seen as black boxes mapping sensor data +to activities. However, non-expert users like clinicians need to trust and +understand these models' outputs. Thus, eXplainable AI (XAI) methods for Human +Activity Recognition have emerged to provide intuitive natural language +explanations from these models. Different XAI methods generate different +explanations, and their effectiveness is typically evaluated through user +surveys, that are often challenging in terms of costs and fairness. 
This paper +proposes an automatic evaluation method using Large Language Models (LLMs) to +identify, in a pool of candidates, the best XAI approach for non-expert users. +Our preliminary results suggest that LLM evaluation aligns with user surveys. -摘要:最近,LLM 在各個學科中備受關注,因為它們具有作為人類數位雙胞胎的潛力,也就是虛擬代理人,旨在複製個人並自主執行任務,例如代表他們進行決策、解決問題和推理。然而,LLM 目前的評估主要強調對話模擬,同時忽視了人類行為模擬,這對數位雙胞胎至關重要。為了解決這個差距,我們引入了 BehaviorChain,這是第一個用於評估 LLM 模擬連續人類行為能力的基準。BehaviorChain 包含多樣化、高品質、基於角色的行為鏈,總共涵蓋 1,001 個獨特角色的 15,846 種不同行為,每個角色都有詳細的歷史和個人資料元數據。在評估中,我們將角色元數據整合到 LLM 中,並使用它們在 BehaviorChain 提供的動態場景中反覆推斷出在情境中適當的行為。全面的評估結果表明,即使是最先進的模型在準確模擬連續人類行為方面也存在困難。 +摘要:藉由智慧環境中不引人注目的感測器辨識日常活動,能啟用各種醫療保健應用。監控受試者在家中如何執行活動,以及其隨著時間的變化,可以揭示健康問題的早期症狀,例如認知能力下降。此領域中的大多數方法都使用深度學習模型,這些模型通常被視為將感測器資料對應至活動的黑盒子。然而,非專家使用者(例如臨床醫師)需要信任並了解這些模型的輸出。因此,人類活動辨識的可解釋 AI (XAI) 方法應運而生,以提供來自這些模型的直覺自然語言說明。不同的 XAI 方法會產生不同的說明,而其有效性通常透過使用者調查來評估,這在成本和公平性方面通常具有挑戰性。本文提出使用大型語言模型 (LLM) 的自動評估方法,以在候選者中找出最適合非專家使用者的 XAI 方法。我們的初步結果表明,LLM 評估與使用者調查一致。 -##### **NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization** -2502.14638v1 by Zheyuan Zhang, Runze Li, Tasnim Kabir, Jordan Boyd-Graber +##### **Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions** +2408.03335v1 by Naseem Khan, Kashif Ahmad, Aref Al Tamimi, Mohammed M. Alani, Amine Bermak, Issa Khalil -Image geo-localization is the task of predicting the specific location of an -image and requires complex reasoning across visual, geographical, and cultural -contexts. While prior Vision Language Models (VLMs) have the best accuracy at -this task, there is a dearth of high-quality datasets and models for analytical -reasoning. We first create NaviClues, a high-quality dataset derived from -GeoGuessr, a popular geography game, to supply examples of expert reasoning -from language. 
Using this dataset, we present Navig, a comprehensive image -geo-localization framework integrating global and fine-grained image -information. By reasoning with language, Navig reduces the average distance -error by 14% compared to previous state-of-the-art models while requiring fewer -than 1000 training samples. Our dataset and code are available at -https://github.com/SparrowZheyuan18/Navig/. +Industry 5.0, which focuses on human and Artificial Intelligence (AI) +collaboration for performing different tasks in manufacturing, involves a +higher number of robots, Internet of Things (IoT) devices and +interconnections, Augmented/Virtual Reality (AR/VR), and other smart devices. The +huge involvement of these devices and interconnection in various critical +areas, such as economy, health, education and defense systems, poses several +types of potential security flaws. AI itself has been proven a very effective +and powerful tool in different areas of cybersecurity, such as intrusion +detection, malware detection, and phishing detection, among others. Just as in +many application areas, cybersecurity professionals were reluctant to accept +black-box ML solutions for cybersecurity applications. This reluctance pushed +forward the adoption of eXplainable Artificial Intelligence (XAI) as a tool +that helps explain how decisions are made in ML-based systems. In this survey, +we present a comprehensive study of different XAI-based intrusion detection +systems for industry 5.0, and we also examine the impact of explainability and +interpretability on Cybersecurity practices through the lens of Adversarial +XIDS (Adv-XIDS) approaches. Furthermore, we analyze the possible opportunities +and challenges in XAI cybersecurity systems for industry 5.0 that elicit future +research toward XAI-based solutions to be adopted by high-stakes industry 5.0 +applications.
We believe this rigorous analysis will establish a foundational +framework for subsequent research endeavors within the specified domain. -摘要:影像地理定位是預測影像特定位置的任務,需要跨視覺、地理和文化脈絡進行複雜的推理。雖然先前的視覺語言模型 (VLM) 在此任務中擁有最佳準確度,但缺乏高品質的資料集和分析推理模型。我們首先建立 NaviClues,這是一個源自 GeoGuessr 的高品質資料集,GeoGuessr 是一款流行的地理遊戲,可提供來自語言的專家推理範例。使用此資料集,我們提出 Navig,這是一個綜合性的影像地理定位架構,整合了全球和細緻的影像資訊。透過語言推理,Navig 將平均距離誤差減少了 14%,與先前的最先進模型相比,同時只需要不到 1000 個訓練樣本。我們的資料集和程式碼可在 https://github.com/SparrowZheyuan18/Navig/ 取得。 +摘要:工業 5.0 著重於人類與人工智慧 (AI) 合作執行製造中的不同任務,涉及更多機器人、物聯網 (IoT) 裝置和互連、擴增/虛擬實境 (AR) 和其他智慧裝置。這些裝置和互連在經濟、醫療保健、教育和國防系統等各種關鍵領域的廣泛參與,引發了多種類型的潛在安全漏洞。AI 本身已被證明是網路安全不同領域中非常有效且強大的工具,例如入侵偵測、惡意軟體偵測和網路釣魚偵測等。就像在許多應用領域一樣,網路安全專業人員不願意接受黑盒 ML 解決方案來應用於網路安全。這種不願意促使可解釋人工智慧 (XAI) 作為一種工具被採用,有助於說明在基於 ML 的系統中如何做出決策。在這項調查中,我們對工業 5.0 的不同基於 XAI 的入侵偵測系統進行了全面的研究,並且我們也透過對抗式 XIDS (Adv-XIDS) 方法的觀點來探討可解釋性和可詮釋性對網路安全實務的影響。此外,我們分析了工業 5.0 的 XAI 網路安全系統中可能存在的機會和挑戰,引發了未來針對 XAI 基礎解決方案的研究,以供高風險的工業 5.0 應用採用。我們相信這項嚴謹的分析將為指定領域內的後續研究工作建立基礎架構。 -##### **ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation** -2502.14637v1 by Angxiao Yue, Zichong Wang, Hongteng Xu +##### **A Comparative Study on Automatic Coding of Medical Letters with Explainability** +2407.13638v1 by Jamie Glen, Lifeng Han, Paul Rayson, Goran Nenadic -Protein backbone generation plays a central role in de novo protein design -and is significant for many biological and medical applications. Although -diffusion and flow-based generative models provide potential solutions to this -challenging task, they often generate proteins with undesired designability and -suffer computational inefficiency. In this study, we propose a novel rectified -quaternion flow (ReQFlow) matching method for fast and high-quality protein -backbone generation. 
In particular, our method generates a local translation -and a 3D rotation from random noise for each residue in a protein chain, which -represents each 3D rotation as a unit quaternion and constructs its flow by -spherical linear interpolation (SLERP) in an exponential format. We train the -model by quaternion flow (QFlow) matching with guaranteed numerical stability -and rectify the QFlow model to accelerate its inference and improve the -designability of generated protein backbones, leading to the proposed ReQFlow -model. Experiments show that ReQFlow achieves state-of-the-art performance in -protein backbone generation while requiring much fewer sampling steps and -significantly less inference time (e.g., being 37x faster than RFDiffusion and -62x faster than Genie2 when generating a backbone of length 300), demonstrating -its effectiveness and efficiency. The code is available at -https://github.com/AngxiaoYue/ReQFlow. +This study aims to explore the implementation of Natural Language Processing +(NLP) and machine learning (ML) techniques to automate the coding of medical +letters with visualised explainability and lightweight local computer +settings. Currently in clinical settings, coding is a manual process that +involves assigning codes to each condition, procedure, and medication in a +patient's paperwork (e.g., 56265001 heart disease using SNOMED CT code). There +is preliminary research on automatic coding in this field using +state-of-the-art ML models; however, due to the complexity and size of the +models, real-world deployment has not yet been achieved. To further facilitate the +possibility of automatic coding practice, we explore some solutions in a local +computer setting; in addition, we explore the function of explainability for +transparency of AI models. We used the publicly available MIMIC-III database +and the HAN/HLAN network models for ICD code prediction purposes.
We also +experimented with the mapping between ICD and SNOMED CT knowledge bases. In our +experiments, the models provided useful information for 97.98\% of codes. The +result of this investigation can shed some light on implementing automatic +clinical coding in practice, such as in hospital settings, on the local +computers used by clinicians , project page +\url{https://github.com/Glenj01/Medical-Coding}. -摘要:蛋白骨架生成在從頭蛋白質設計中扮演核心角色,且對於許多生物和醫學應用來說意義重大。儘管擴散和基於流的生成模型提供了解決此項挑戰性任務的潛在方案,但它們經常生成具有不受歡迎的可設計性的蛋白質,且遭受運算效率不彰之苦。在本研究中,我們提出了一種新穎的修正四元數流 (ReQFlow) 匹配方法,用於快速且高品質的蛋白質骨架生成。特別是,我們的模型會為蛋白質鏈中的每個殘基從隨機雜訊中生成一個局部平移和一個 3D 旋轉,將每個 3D 旋轉表示為單位四元數,並以指數格式透過球面線性插值 (SLERP) 建構其流。我們透過四元數流 (QFlow) 匹配訓練模型,並保證數值穩定性,並修正 QFlow 模型以加速其推論並改善生成蛋白質骨架的可設計性,進而提出建議的 ReQFlow 模型。實驗顯示,ReQFlow 在蛋白質骨架生成中達成最先進的效能,同時所需採樣步驟少得多,且推論時間大幅減少(例如,在生成長度為 300 的骨架時比 RFDiffusion 快 37 倍,比 Genie2 快 62 倍),證明其有效性和效率。程式碼可在 https://github.com/AngxiaoYue/ReQFlow 取得。 +摘要:本研究旨在探討將自然語言處理 (NLP) 和機器學習 (ML) 技術實作於醫療信函編碼自動化,並具備視覺化說明能力和輕量化的本地電腦設定。目前在臨床環境中,編碼是一種手動流程,涉及為病患文件中的每項病症、程序和藥物指派代碼 (例如,使用 SNOMED CT 代碼 56265001 表示心臟病)。此領域有使用最新 ML 模型進行自動編碼的初步研究;然而,由於模型的複雜性和大小,並未實現實際部署。為了進一步促進自動編碼實務的可能性,我們在本地電腦設定中探討了一些解決方案;此外,我們探討了說明功能在 AI 模型透明度中的功能。我們使用公開的 MIMIC-III 資料庫和 HAN/HLAN 網路模型進行 ICD 代碼預測。我們還試驗了 ICD 和 SNOMED CT 知識庫之間的對應。在我們的實驗中,這些模型提供了 97.98% 代碼的有用資訊。這項調查結果可以為實務中的自動臨床編碼實作提供一些見解,例如在醫院環境中,由臨床醫生使用的本地電腦,專案頁面 \url{https://github.com/Glenj01/Medical-Coding}。 -##### **PEARL: Towards Permutation-Resilient LLMs** -2502.14628v1 by Liang Chen, Li Shen, Yang Deng, Xiaoyan Zhao, Bin Liang, Kam-Fai Wong +##### **Explainable AI for Enhancing Efficiency of DL-based Channel Estimation** +2407.07009v1 by Abdul Karim Gizzini, Yahia Medjahdi, Ali J. Ghandour, Laurent Clavier -The in-context learning (ICL) capability of large language models (LLMs) -enables them to perform challenging tasks using provided demonstrations. -However, ICL is highly sensitive to the ordering of demonstrations, leading to -instability in predictions. 
This paper shows that this vulnerability can be -exploited to design a natural attack - difficult for model providers to detect -- that achieves nearly 80% success rate on LLaMA-3 by simply permuting the -demonstrations. Existing mitigation methods primarily rely on post-processing -and fail to enhance the model's inherent robustness to input permutations, -raising concerns about safety and reliability of LLMs. To address this issue, -we propose Permutation-resilient learning (PEARL), a novel framework based on -distributionally robust optimization (DRO), which optimizes model performance -against the worst-case input permutation. Specifically, PEARL consists of a -permutation-proposal network (P-Net) and the LLM. The P-Net generates the most -challenging permutations by treating it as an optimal transport problem, which -is solved using an entropy-constrained Sinkhorn algorithm. Through minimax -optimization, the P-Net and the LLM iteratively optimize against each other, -progressively improving the LLM's robustness. Experiments on synthetic -pre-training and real-world instruction tuning tasks demonstrate that PEARL -effectively mitigates permutation attacks and enhances performance. Notably, -despite being trained on fewer shots and shorter contexts, PEARL achieves -performance gains of up to 40% when scaled to many-shot and long-context -scenarios, highlighting its efficiency and generalization capabilities. +The support of artificial intelligence (AI) based decision-making is a key +element in future 6G networks, where the concept of native AI will be +introduced. Moreover, AI is widely employed in different critical applications +such as autonomous driving and medical diagnosis. In such applications, using +AI as black-box models is risky and challenging. Hence, it is crucial to +understand and trust the decisions taken by these models. 
Tackling this issue +can be achieved by developing explainable AI (XAI) schemes that aim to explain +the logic behind the black-box model behavior, and thus, ensure its efficient +and safe deployment. Recently, we proposed a novel perturbation-based XAI-CHEST +framework that is oriented toward channel estimation in wireless +communications. The core idea of the XAI-CHEST framework is to identify the +relevant model inputs by inducing high noise on the irrelevant ones. This +manuscript provides the detailed theoretical foundations of the XAI-CHEST +framework. In particular, we derive the analytical expressions of the XAI-CHEST +loss functions and the noise threshold fine-tuning optimization problem. Hence +the designed XAI-CHEST delivers a smart input feature selection methodology +that can further improve the overall performance while optimizing the +architecture of the employed model. Simulation results show that the XAI-CHEST +framework provides valid interpretations, where it offers an improved bit error +rate performance while reducing the required computational complexity in +comparison to the classical DL-based channel estimation. 
-摘要:大型語言模型 (LLM) 的語境學習 (ICL) 能力使其能夠透過提供的示範來執行具有挑戰性的任務。然而,ICL 對示範的排序非常敏感,導致預測不穩定。本文顯示,可以利用此漏洞來設計一種自然攻擊,讓模型提供者難以偵測,透過簡單地排列示範,在 LLaMA-3 上達到近 80% 的成功率。現有的緩解方法主要依賴後處理,且無法增強模型對輸入排列的固有穩健性,引發了對 LLM 的安全性與可靠性的疑慮。為了解決此問題,我們提出了一種基於分配穩健最佳化 (DRO) 的新型架構,稱為排列彈性學習 (PEARL),它針對最差情況的輸入排列來最佳化模型效能。具體來說,PEARL 包含排列建議網路 (P-Net) 和 LLM。P-Net 將其視為最優傳輸問題來產生最具挑戰性的排列,並使用熵約束 Sinkhorn 演算法來解決。透過極小極大最佳化,P-Net 和 LLM 迭代地相互最佳化,逐步改善 LLM 的穩健性。在合成預訓練和真實世界指令調整任務上的實驗證明,PEARL 有效地減輕了排列攻擊並增強了效能。值得注意的是,儘管在較少的次數和較短的語境中進行訓練,但 PEARL 在擴展到多重次數和長語境場景時仍可獲得高達 40% 的效能提升,突顯了其效率和泛化能力。 +摘要:人工智能 (AI) 支持的決策制定是未來 6G 網路中的關鍵元素,其中將引入原生 AI 的概念。此外,AI 廣泛用於不同的關鍵應用中,例如自動駕駛和醫療診斷。在這些應用中,使用 AI 作為黑盒模型是有風險且具有挑戰性的。因此,理解和信任這些模型做出的決策至關重要。解決此問題的方法是開發可解釋 AI (XAI) 架構,旨在解釋黑盒模型行為背後的邏輯,從而確保其有效且安全的部署。最近,我們提出了一個新的基於擾動的 XAI-CHEST 框架,該框架面向無線通信中的信道估計。XAI-CHEST 框架的核心思想是通過在無關輸入上引入高噪聲來識別相關模型輸入。這份手稿提供了 XAI-CHEST 框架的詳細理論基礎。特別是,我們推導了 XAI-CHEST 損失函數和噪聲閾值微調優化問題的解析表達式。因此,設計的 XAI-CHEST 提供了一種智能輸入特徵選擇方法,可以在優化所用模型的架構的同時進一步提高整體性能。模擬結果表明,XAI-CHEST 框架提供了有效的解釋,在降低所需的計算複雜度的同時,提供了改進的比特錯誤率性能,而這與基於傳統 DL 的信道估計相比。 -##### **ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors** -2502.14627v1 by Yuguo Yin, Yuxin Xie, Wenyuan Yang, Dongchao Yang, Jinghan Ru, Xianwei Zhuang, Liming Liang, Yuexian Zou +##### **Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification** +2407.05440v2 by P. N. Karthikayan, Yoga Sri Varshan V, Hitesh Gupta Kattamuri, Umarani Jayaraman -Multilingual audio-text retrieval (ML-ATR) is a challenging task that aims to -retrieve audio clips or multilingual texts from databases. However, existing -ML-ATR schemes suffer from inconsistencies for instance similarity matching -across languages. We theoretically analyze the inconsistency in terms of both -multilingual modal alignment direction error and weight error, and propose the -theoretical weight error upper bound for quantifying the inconsistency. 
Based -on the analysis of the weight error upper bound, we find that the inconsistency -problem stems from the data distribution error caused by random sampling of -languages. We propose a consistent ML-ATR scheme using 1-to-k contrastive -learning and audio-English co-anchor contrastive learning, aiming to mitigate -the negative impact of data distribution error on recall and consistency in -ML-ATR. Experimental results on the translated AudioCaps and Clotho datasets -show that our scheme achieves state-of-the-art performance on recall and -consistency metrics for eight mainstream languages, including English. Our code -will be available at https://github.com/ATRI-ACL/ATRI-ACL. +This paper presents dilated Residual Network (ResNet) models for disease +classification from retinal fundus images. Dilated convolution filters are used +to replace normal convolution filters in the higher layers of the ResNet model +(dilated ResNet) in order to improve the receptive field compared to the normal +ResNet model for disease classification. This study introduces +computer-assisted diagnostic tools that employ deep learning, enhanced with +explainable AI techniques. These techniques aim to make the tool's +decision-making process transparent, thereby enabling medical professionals to +understand and trust the AI's diagnostic decision. They are particularly +relevant in today's healthcare landscape, where there is a growing demand for +transparency in AI applications to ensure their reliability and ethical use. +The dilated ResNet is used as a replacement for the normal ResNet to enhance +the classification accuracy of retinal eye diseases and reduce the required +computing time. The dataset used in this work is the Ocular Disease Intelligent +Recognition (ODIR) dataset which is a structured ophthalmic database with eight +classes covering most of the common retinal eye diseases. The evaluation +metrics used in this work include precision, recall, accuracy, and F1 score. 
In +this work, a comparative study has been made between normal ResNet models and +dilated ResNet models on five variants namely ResNet-18, ResNet-34, ResNet-50, +ResNet-101, and ResNet-152. The dilated ResNet model shows promising results as +compared to normal ResNet with an average F1 score of 0.71, 0.70, 0.69, 0.67, +and 0.70 respectively for the above respective variants in ODIR multiclass +disease classification. -摘要:多模態多語言音訊文字檢索 (ML-ATR) 是一項具有挑戰性的任務,旨在從資料庫中檢索音訊片段或多語言文字。然而,現有的 ML-ATR 架構存在不一致的情況,例如跨語言的相似性比對。我們在理論上分析了不一致性,包括多模態多語言對齊方向誤差和權重誤差,並提出理論權重誤差上限以量化不一致性。根據權重誤差上限的分析,我們發現不一致性問題源於由語言隨機取樣造成的資料分佈誤差。我們提出一個一致的 ML-ATR 架構,採用 1 對 k 對比學習和音訊-英語共同錨點對比學習,旨在減輕資料分佈誤差對 ML-ATR 中召回率和一致性的負面影響。在已翻譯的 AudioCaps 和 Clotho 資料集上的實驗結果顯示,我們的架構在包括英語在內的八種主流語言的召回率和一致性指標上達到了最先進的效能。我們的程式碼將在 https://github.com/ATRI-ACL/ATRI-ACL 中提供。 +摘要:这篇论文提出了用于从视网膜眼底图像进行疾病分类的扩张残差网络 (ResNet) 模型。扩张卷积滤波器用于替换 ResNet 模型较高层中的正常卷积滤波器(扩张 ResNet),以改善感知场,从而针对疾病分类对正常 ResNet 模型进行改进。本研究引入了采用深度学习的计算机辅助诊断工具,并通过可解释的 AI 技术进行了增强。这些技术旨在使该工具的决策过程透明化,从而使医学专业人士能够理解和信任 AI 的诊断决策。它们与当今的医疗保健领域尤为相关,在该领域,对 AI 应用的透明度需求不断增长,以确保其可靠性和合乎道德的使用。扩张 ResNet 用作正常 ResNet 的替代品,以提高视网膜眼部疾病的分类准确性并减少所需的计算时间。本工作中使用的数据集是眼科疾病智能识别 (ODIR) 数据集,这是一个结构化的眼科数据库,包含八类涵盖大多数常见视网膜眼部疾病。本工作中使用的评估指标包括精确度、召回率、准确度和 F1 得分。在这项工作中,对 ResNet-18、ResNet-34、ResNet-50、ResNet-101 和 ResNet-152 五个变体的正常 ResNet 模型和扩张 ResNet 模型进行了比较研究。与正常 ResNet 相比,扩张 ResNet 模型显示出有希望的结果,在 ODIR 多类疾病分类中,上述各个变体的平均 F1 得分为 0.71、0.70、0.69、0.67 和 0.70。 -##### **Multi-Record Web Page Information Extraction From News Websites** -2502.14625v1 by Alexander Kustenkov, Maksim Varlamov, Alexander Yatskov +##### **A Survey on Trustworthiness in Foundation Models for Medical Image Analysis** +2407.15851v2 by Congzhen Shi, Ryan Rezai, Jiaxi Yang, Qi Dou, Xiaoxiao Li -In this paper, we focused on the problem of extracting information from web -pages containing many records, a task of growing importance in the era of -massive web data. 
Recently, the development of neural network methods has -improved the quality of information extraction from web pages. Nevertheless, -most of the research and datasets are aimed at studying detailed pages. This -has left multi-record "list pages" relatively understudied, despite their -widespread presence and practical significance. - To address this gap, we created a large-scale, open-access dataset -specifically designed for list pages. This is the first dataset for this task -in the Russian language. Our dataset contains 13,120 web pages with news lists, -significantly exceeding existing datasets in both scale and complexity. Our -dataset contains attributes of various types, including optional and -multi-valued, providing a realistic representation of real-world list pages. -These features make our dataset a valuable resource for studying information -extraction from pages containing many records. - Furthermore, we proposed our own multi-stage information extraction methods. -In this work, we explore and demonstrate several strategies for applying -MarkupLM to the specific challenges of multi-record web pages. Our experiments -validate the advantages of our methods. - By releasing our dataset to the public, we aim to advance the field of -information extraction from multi-record pages. +The rapid advancement of foundation models in medical imaging represents a +significant leap toward enhancing diagnostic accuracy and personalized +treatment. However, the deployment of foundation models in healthcare +necessitates a rigorous examination of their trustworthiness, encompassing +privacy, robustness, reliability, explainability, and fairness. The current +body of survey literature on foundation models in medical imaging reveals +considerable gaps, particularly in the area of trustworthiness. 
Additionally, +existing surveys on the trustworthiness of foundation models do not adequately +address their specific variations and applications within the medical imaging +domain. This survey aims to fill that gap by presenting a novel taxonomy of +foundation models used in medical imaging and analyzing the key motivations for +ensuring their trustworthiness. We review current research on foundation models +in major medical imaging applications, focusing on segmentation, medical report +generation, medical question and answering (Q\&A), and disease diagnosis. These +areas are highlighted because they have seen a relatively mature and +substantial number of foundation models compared to other applications. We +focus on literature that discusses trustworthiness in medical image analysis +manuscripts. We explore the complex challenges of building trustworthy +foundation models for each application, summarizing current concerns and +strategies for enhancing trustworthiness. Furthermore, we examine the potential +of these models to revolutionize patient care. Our analysis underscores the +imperative for advancing towards trustworthy AI in medical image analysis, +advocating for a balanced approach that fosters innovation while ensuring +ethical and equitable healthcare delivery. 
-摘要:在本文中,我們專注於從包含大量記錄的網頁中提取資訊的問題,這項任務在海量網路資料的時代中越來越重要。最近,神經網路方法的發展已改善從網頁中提取資訊的品質。儘管如此,大多數的研究和資料集都旨在研究詳細的網頁。儘管多記錄「清單網頁」廣泛存在且具有實用意義,但它們相對來說研究較少。 -為了解決這個差距,我們建立了一個專門針對清單網頁設計的大規模、開放存取的資料集。這是俄語中第一個針對此任務的資料集。我們的資料集包含 13,120 個包含新聞清單的網頁,在規模和複雜度上都遠遠超過現有的資料集。我們的資料集包含各種類型的屬性,包括可選和多值,提供真實世界清單網頁的實際表示。這些特點使我們的資料集成為研究從包含大量記錄的網頁中提取資訊的寶貴資源。 -此外,我們提出了我們自己的多階段資訊提取方法。在這項工作中,我們探討並展示了將 MarkupLM 應用於多記錄網頁特定挑戰的幾種策略。我們的實驗驗證了我們方法的優點。 -透過向公眾發布我們的資料集,我們旨在推進從多記錄網頁中提取資訊的領域。 +摘要:基礎模型在醫學影像方面的快速進展,代表著在加強診斷準確性和個人化治療方面邁出一大步。然而,基礎模型在醫療保健中的部署需要對其可信度進行嚴格的審查,包括隱私、穩健性、可靠性、可解釋性和公平性。目前關於醫學影像中基礎模型的調查文獻中顯示出相當大的差距,特別是在可信度方面。此外,現有關於基礎模型可信度的調查並未充分解決其在醫學影像領域中的特定變化和應用。本調查旨在通過提出醫學影像中使用的基礎模型的新分類法並分析確保其可信度的關鍵動機,來填補這一空白。我們回顧了基礎模型在主要醫學影像應用中的當前研究,重點關注分割、醫療報告生成、醫療問題和回答 (Q&A) 以及疾病診斷。這些領域之所以被強調,是因為與其他應用相比,它們已經看到相對成熟且大量的基礎模型。我們專注於探討醫學影像分析手稿中可信度的文獻。我們探討了為每個應用構建可信基礎模型的複雜挑戰,總結了當前關注點和增強可信度的策略。此外,我們探討了這些模型在革新患者護理方面的潛力。我們的分析強調了在醫學影像分析中朝著可信賴的人工智慧邁進的必要性,並倡導一種平衡的方法,既能促進創新,又能確保道德和公平的醫療保健服務。 + +##### **The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data** +2407.06206v1 by Ximing Wen, Rosina O. Weber, Anik Sen, Darryl Hannan, Steven C. Nesbit, Vincent Chan, Alberto Goffi, Michael Morris, John C. Hunninghake, Nicholas E. Villalobos, Edward Kim, Christopher J. MacLellan + +Point-of-Care Ultrasound (POCUS) is the practice of clinicians conducting and +interpreting ultrasound scans right at the patient's bedside. However, the +expertise needed to interpret these images is considerable and may not always +be present in emergency situations. This reality makes algorithms such as +machine learning classifiers extremely valuable to augment human decisions. +POCUS devices are becoming available at a reasonable cost in the size of a +mobile phone. The challenge of turning POCUS devices into life-saving tools is +that interpretation of ultrasound images requires specialist training and +experience. 
Unfortunately, the difficulty to obtain positive training images +represents an important obstacle to building efficient and accurate +classifiers. Hence, the problem we try to investigate is how to explore +strategies to increase accuracy of classifiers trained with scarce data. We +hypothesize that training with a few data instances may not suffice for +classifiers to generalize causing them to overfit. Our approach uses an +Explainable AI-Augmented approach to help the algorithm learn more from less +and potentially help the classifier better generalize. -##### **Exploring RWKV for Sentence Embeddings: Layer-wise Analysis and Baseline Comparison for Semantic Similarity** -2502.14620v1 by Xinghan Pan +摘要:床邊超音波 (POCUS) 是臨床醫師在患者床邊進行和解讀超音波掃描的實務。然而,解讀這些影像所需的專業知識相當可觀,而且在緊急情況下可能並非隨時具備。這種現實情況使得機器學習分類器等演算法對於加強人類決策變得極為有價值。POCUS 裝置正以合理成本推出,尺寸為手機大小。將 POCUS 裝置轉變為救生工具的挑戰在於,解讀超音波影像需要專門訓練和經驗。不幸的是,取得正向訓練影像的困難度代表著建置有效率且準確的分類器的一大障礙。因此,我們嘗試探討的問題是如何探索策略,以提高使用稀疏資料訓練的分類器的準確度。我們假設使用少數資料實例進行訓練可能不足以讓分類器概括,導致它們過度擬合。我們的做法使用可解釋 AI 增強方法,以協助演算法從較少的資料中學習更多,並潛在協助分類器更好地概括。 -This paper investigates the efficacy of RWKV, a novel language model -architecture known for its linear attention mechanism, for generating sentence -embeddings in a zero-shot setting. I conduct a layer-wise analysis to evaluate -the semantic similarity captured by embeddings from different hidden layers of -a pre-trained RWKV model. The performance is assessed on the Microsoft Research -Paraphrase Corpus (MRPC) dataset using Spearman correlation and compared -against a GloVe-based baseline. My results indicate that while RWKV embeddings -capture some semantic relatedness, they underperform compared to the GloVe -baseline in terms of Spearman correlation. I also analyze the inference time -and GPU memory usage, highlighting the computational trade-offs associated with -RWKV embeddings. 
The findings suggest that while RWKV offers potential -advantages in terms of linear scaling, its zero-shot sentence embedding quality -for semantic similarity tasks requires further investigation and potential -task-specific fine-tuning to match or exceed simpler baselines. +##### **Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach** +2407.00167v1 by Sai Krishna Revanth Vuruma, Dezhi Wu, Saborny Sen Gupta, Lucas Aust, Valerie Lookingbill, Wyatt Bellamy, Yang Ren, Erin Kasson, Li-Shiun Chen, Patricia Cavazos-Rehg, Dian Hu, Ming Huang -摘要:本文探討 RWKV 的效能,這是一種以線性注意力機制聞名的語言模型架構,可用於在零次學習設定中產生句子嵌入。我進行逐層分析,以評估預先訓練的 RWKV 模型中不同隱藏層的嵌入所擷取的語義相似性。效能評估使用 Microsoft Research Paraphrase Corpus (MRPC) 資料集,採用 Spearman 相關係數,並與基於 GloVe 的基準進行比較。我的結果顯示,雖然 RWKV 嵌入可以擷取一些語義相關性,但與 GloVe 基準相比,在 Spearman 相關係數方面表現不佳。我也分析了推論時間和 GPU 記憶體使用量,強調與 RWKV 嵌入相關的運算折衷。這些發現表明,雖然 RWKV 在線性縮放方面具有潛在優勢,但其在語義相似性任務中的零次學習句子嵌入品質需要進一步探討,並需要潛在的特定任務微調,才能達到或超越較簡單的基準。 +In recent years, the United States has witnessed a significant surge in the +popularity of vaping or e-cigarette use, leading to a notable rise in cases of +e-cigarette and vaping use-associated lung injury (EVALI) that caused +hospitalizations and fatalities during the EVALI outbreak in 2019, highlighting +the urgency to comprehend vaping behaviors and develop effective strategies for +cessation. Due to the ubiquity of social media platforms, over 4.7 billion +users worldwide use them for connectivity, communications, news, and +entertainment with a significant portion of the discourse related to health, +thereby establishing social media data as an invaluable organic data resource +for public health research. In this study, we extracted a sample dataset from +one vaping sub-community on Reddit to analyze users' quit-vaping intentions. 
+Leveraging OpenAI's latest large language model GPT-4 for sentence-level quit +vaping intention detection, this study compares the outcomes of this model +against layman and clinical expert annotations. Using different prompting +strategies such as zero-shot, one-shot, few-shot and chain-of-thought +prompting, we developed 8 prompts with varying levels of detail to explain the +task to GPT-4 and also evaluated the performance of the strategies against each +other. These preliminary findings emphasize the potential of GPT-4 in social +media data analysis, especially in identifying users' subtle intentions that +may elude human detection. -##### **Reward Models Identify Consistency, Not Causality** -2502.14619v1 by Yuhui Xu, Hanze Dong, Lei Wang, Caiming Xiong, Junnan Li +摘要:近年來,美國見證了電子煙或電子香菸使用率大幅激增,導致電子煙和電子煙使用相關肺損傷 (EVALI) 病例顯著增加,在 2019 年 EVALI 爆發期間造成住院和死亡,凸顯了理解電子煙行為和制定有效戒菸策略的迫切性。由於社群媒體平台的普及,全球超過 47 億使用者使用它們進行連結、溝通、新聞和娛樂,其中很大一部分與健康相關,因此將社群媒體資料建立為公共衛生研究中無價的有機資料資源。在本研究中,我們從 Reddit 上一個電子煙子社群中提取一個範例資料集,以分析使用者的戒電子煙意圖。利用 OpenAI 最新的大型語言模型 GPT-4 進行句子層級的戒電子煙意圖偵測,本研究比較了此模型的結果與外行人和臨床專家註解。使用不同的提示策略,例如零次學習、一次學習、少次學習和思考鏈提示,我們開發了 8 個提示,詳細程度不同,向 GPT-4 解釋任務,並評估這些策略彼此之間的效能。這些初步發現強調了 GPT-4 在社群媒體資料分析中的潛力,特別是在識別人類偵測可能無法察覺的使用者微妙意圖方面。 -Reward models (RMs) play a crucial role in aligning large language models -(LLMs) with human preferences and enhancing reasoning quality. Traditionally, -RMs are trained to rank candidate outputs based on their correctness and -coherence. However, in this work, we present several surprising findings that -challenge common assumptions about RM behavior. Our analysis reveals that -state-of-the-art reward models prioritize structural consistency over causal -correctness. Specifically, removing the problem statement has minimal impact on -reward scores, whereas altering numerical values or disrupting the reasoning -flow significantly affects RM outputs. 
Furthermore, RMs exhibit a strong -dependence on complete reasoning trajectories: truncated or incomplete steps -lead to significant variations in reward assignments, indicating that RMs -primarily rely on learned reasoning patterns rather than explicit problem -comprehension. These findings hold across multiple architectures, datasets, and -tasks, leading to three key insights: (1) RMs primarily assess coherence rather -than true reasoning quality; (2) The role of explicit problem comprehension in -reward assignment is overstated; (3) Current RMs may be more effective at -ranking responses than verifying logical validity. Our results suggest a -fundamental limitation in existing reward modeling approaches, emphasizing the -need for a shift toward causality-aware reward models that go beyond -consistency-driven evaluation. +##### **Towards Compositional Interpretability for XAI** +2406.17583v1 by Sean Tull, Robin Lorenz, Stephen Clark, Ilyas Khan, Bob Coecke -摘要:獎勵模型 (RM) 在將大型語言模型 (LLM) 與人類偏好對齊並提升推理品質方面扮演至關重要的角色。傳統上,RM 會訓練來根據候選輸出的正確性和一致性進行排名。然而,在這項工作中,我們提出幾個令人驚訝的發現,挑戰了關於 RM 行為的常見假設。我們的分析顯示,最先進的獎勵模型優先考慮結構一致性,而不是因果正確性。具體來說,移除問題陳述對獎勵分數的影響很小,而改變數值或中斷推理流程則會顯著影響 RM 輸出。此外,RM 表現出對完整推理軌跡的強烈依賴性,截斷或不完整的步驟會導致獎勵分配產生重大變化,這表示 RM 主要依賴於學習到的推理模式,而不是明確的問題理解。這些發現適用於多種架構、資料集和任務,得出三個關鍵見解:(1) RM 主要評估一致性,而不是真正的推理品質;(2) 在獎勵分配中,明確問題理解的角色被誇大了;(3) 目前的 RM 在排名回應方面可能比驗證邏輯有效性更有效。我們的結果表明現有獎勵建模方法存在根本限制,強調需要轉向因果感知獎勵模型,超越以一致性為導向的評估。 +Artificial intelligence (AI) is currently based largely on black-box machine +learning models which lack interpretability. The field of eXplainable AI (XAI) +strives to address this major concern, being critical in high-stakes areas such +as the finance, legal and health sectors. + We present an approach to defining AI models and their interpretability based +on category theory. For this we employ the notion of a compositional model, +which sees a model in terms of formal string diagrams which capture its +abstract structure together with its concrete implementation.
This +comprehensive view incorporates deterministic, probabilistic and quantum +models. We compare a wide range of AI models as compositional models, including +linear and rule-based models, (recurrent) neural networks, transformers, VAEs, +and causal and DisCoCirc models. + Next we give a definition of interpretation of a model in terms of its +compositional structure, demonstrating how to analyse the interpretability of a +model, and using this to clarify common themes in XAI. We find that what makes +the standard 'intrinsically interpretable' models so transparent is brought out +most clearly diagrammatically. This leads us to the more general notion of +compositionally-interpretable (CI) models, which additionally include, for +instance, causal, conceptual space, and DisCoCirc models. + We next demonstrate the explainability benefits of CI models. Firstly, their +compositional structure may allow the computation of other quantities of +interest, and may facilitate inference from the model to the modelled +phenomenon by matching its structure. Secondly, they allow for diagrammatic +explanations for their behaviour, based on influence constraints, diagram +surgery and rewrite explanations. Finally, we discuss many future directions +for the approach, raising the question of how to learn such meaningfully +structured models in practice. 
-##### **FIND: Fine-grained Information Density Guided Adaptive Retrieval-Augmented Generation for Disease Diagnosis** -2502.14614v1 by Mingyi Jia, Junwen Duan, Yan Song, Jianxin Wang +摘要:人工智慧(AI)目前在很大程度上依賴於缺乏可解釋性的黑盒機器學習模型。可解釋性人工智慧(XAI)領域致力於解決這個主要問題,這在金融、法律和健康等高風險領域至關重要。 +我們提出了一種基於範疇論定義 AI 模型及其可解釋性的方法。為此,我們採用組合模型的概念,它以形式弦圖的形式看待模型,這些弦圖捕獲了模型的抽象結構及其具體實現。這種綜合觀點包含了確定性、概率性和量子模型。我們將各種 AI 模型作為組合模型進行比較,包括線性和基於規則的模型、(遞迴)神經網路、Transformer、VAE,以及因果和 DisCoCirc 模型。 +接下來,我們根據模型的組合結構給出模型解釋的定義,展示如何分析模型的可解釋性,並使用它來澄清 XAI 中的常見主題。我們發現,讓標準的「內在可解釋」模型如此透明的原因在圖表中表現得最為清楚。這引導我們得出更一般的組合可解釋(CI)模型概念,它另外還包括因果、概念空間和 DisCoCirc 模型。 +接下來,我們展示了 CI 模型的可解釋性優勢。首先,它們的組合結構允許計算其他感興趣的量,並可能通過匹配模型的結構來促進從模型到被建模現象的推理。其次,它們允許對其行為進行圖解說明,這些說明基於影響約束、圖解手術和重寫說明。最後,我們討論了這種方法的許多未來方向,提出了如何在實踐中學習這種有意義的結構化模型的問題。 -Retrieval-Augmented Large Language Models (LLMs), which integrate external -knowledge into LLMs, have shown remarkable performance in various medical -domains, including clinical diagnosis. However, existing RAG methods struggle -to effectively assess task difficulty to make retrieval decisions, thereby -failing to meet the clinical requirements for balancing efficiency and -accuracy. So in this paper, we propose FIND (\textbf{F}ine-grained -\textbf{In}formation \textbf{D}ensity Guided Adaptive RAG), a novel framework -that improves the reliability of RAG in disease diagnosis scenarios. FIND -incorporates a fine-grained adaptive control module to determine whether -retrieval is necessary based on the information density of the input. By -optimizing the retrieval process and implementing a knowledge filtering module, -FIND ensures that the retrieval is better suited to clinical scenarios. -Experiments on three Chinese electronic medical record datasets demonstrate -that FIND significantly outperforms various baseline methods, highlighting its -effectiveness in clinical diagnosis tasks. 
+##### **Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods** +2406.12142v2 by Vincent Olesen, Nina Weng, Aasa Feragen, Eike Petersen -摘要:檢索增強大型語言模型 (LLM),將外部知識整合至 LLM,已於各種醫療領域展現出卓越效能,包括臨床診斷。然而,現有的 RAG 方法難以有效評估任務難度以做出檢索決策,因此無法滿足平衡效率和精確度的臨床需求。因此,我們在本文中提出 FIND(**F**ine-grained **In**formation **D**ensity Guided Adaptive RAG),一種新穎架構,可提升 RAG 在疾病診斷場景中的可靠性。FIND 整合一個細緻化的自適應控制模組,根據輸入的資訊密度判斷是否需要檢索。透過最佳化檢索程序並實作一個知識過濾模組,FIND 確保檢索更適合臨床場景。在三個中文電子病歷資料集上的實驗顯示,FIND 明顯優於各種基線方法,突顯其在臨床診斷任務中的有效性。 +Machine learning models have achieved high overall accuracy in medical image +analysis. However, performance disparities on specific patient groups pose +challenges to their clinical utility, safety, and fairness. This can affect +known patient groups - such as those based on sex, age, or disease subtype - as +well as previously unknown and unlabeled groups. Furthermore, the root cause of +such observed performance disparities is often challenging to uncover, +hindering mitigation efforts. In this paper, to address these issues, we +leverage Slice Discovery Methods (SDMs) to identify interpretable +underperforming subsets of data and formulate hypotheses regarding the cause of +observed performance disparities. We introduce a novel SDM and apply it in a +case study on the classification of pneumothorax and atelectasis from chest +x-rays. Our study demonstrates the effectiveness of SDMs in hypothesis +formulation and yields an explanation of previously observed but unexplained +performance disparities between male and female patients in widely used chest +X-ray datasets and models. Our findings indicate shortcut learning in both +classification tasks, through the presence of chest drains and ECG wires, +respectively. 
Sex-based differences in the prevalence of these shortcut +features appear to cause the observed classification performance gap, +representing a previously underappreciated interaction between shortcut +learning and model fairness analyses. -##### **Behavioral Analysis of Information Salience in Large Language Models** -2502.14613v1 by Jan Trienes, Jörg Schlötterer, Junyi Jessy Li, Christin Seifert +摘要:機器學習模型在醫學影像分析中已達到整體高準確度。然而,特定患者群體的效能差異對其臨床效用、安全性與公平性構成挑戰。這可能會影響已知的患者群體(例如基於性別、年齡或疾病亞型)以及先前未知且未標籤的群體。此外,此類觀察到的效能差異的根本原因通常難以發現,阻礙了緩解措施。在本文中,為了解決這些問題,我們利用切片發現方法 (SDM) 來識別可解釋的資料效能不佳子集,並針對觀察到的效能差異原因制定假設。我們引入一種新的 SDM,並在胸部 X 光片中氣胸和肺不張分類的案例研究中應用它。我們的研究證明了 SDM 在假設制定中的有效性,並對廣泛使用的胸部 X 光片資料集和模型中先前觀察到但無法解釋的男性和女性患者之間的效能差異提供了解釋。我們的發現表明,在這兩項分類任務中,分別透過胸腔引流管和心電圖導線的存在,出現了捷徑學習。這些捷徑特徵的盛行率存在基於性別的差異,似乎會導致觀察到的分類效能差距,這代表捷徑學習和模型公平性分析之間先前未受到重視的交互作用。 -Large Language Models (LLMs) excel at text summarization, a task that -requires models to select content based on its importance. However, the exact -notion of salience that LLMs have internalized remains unclear. To bridge this -gap, we introduce an explainable framework to systematically derive and -investigate information salience in LLMs through their summarization behavior. -Using length-controlled summarization as a behavioral probe into the content -selection process, and tracing the answerability of Questions Under Discussion -throughout, we derive a proxy for how models prioritize information. Our -experiments on 13 models across four datasets reveal that LLMs have a nuanced, -hierarchical notion of salience, generally consistent across model families and -sizes. While models show highly consistent behavior and hence salience -patterns, this notion of salience cannot be accessed through introspection, and -only weakly correlates with human perceptions of information salience.
+##### **Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health** +2406.07114v2 by Fatemeh Ebrahimzadeh, Ramin Safa -摘要:大型語言模型 (LLM) 在文字摘要方面表現出色,這項任務需要模型根據重要性來選擇內容。然而,LLM 內化的顯著性準確概念仍不清楚。為了彌補這個差距,我們引入了一個可解釋的架構,透過摘要行為系統性地推導和調查 LLM 中的資訊顯著性。使用長度控制摘要作為行為探測來探討內容選擇過程,並追蹤討論中問題的可回答性,我們推導出一個模型優先處理資訊的方式代理。我們針對四個資料集中的 13 個模型進行的實驗揭示,LLM 具有細緻入微、階層式的顯著性概念,通常在模型系列和大小之間保持一致。雖然模型表現出高度一致的行為,因此具有顯著性模式,但這個顯著性概念無法透過內省來存取,而且與人類對資訊顯著性的認知僅有微弱相關性。 +The concept of Metaverse has attracted a lot of attention in various fields +and one of its important applications is health and treatment. The Metaverse +has enormous potential to transform healthcare by changing patient care, +medical education, and the way teaching/learning and research are done. The +purpose of this research is to provide an introduction to the basic concepts +and fundamental technologies of the Metaverse. This paper examines the pros and +cons of the Metaverse in healthcare context and analyzes its potential from the +technology and AI perspective. In particular, the role of machine learning +methods is discussed; We will explain how machine learning algorithms can be +applied to the Metaverse generated data to gain better insights in healthcare +applications. Additionally, we examine the future visions of the Metaverse in +health delivery, by examining emerging technologies such as blockchain and also +addressing privacy concerns. The findings of this study contribute to a deeper +understanding of the applications of Metaverse in healthcare and its potential +to revolutionize the delivery of medical services. 
-##### **A Theory for Conditional Generative Modeling on Multiple Data Sources** -2502.14583v1 by Rongzhen Wang, Yan Zhang, Chenyu Zheng, Chongxuan Li, Guoqiang Wu +摘要:元宇宙的概念在各個領域都備受關注,其重要應用之一便是醫療保健。元宇宙有巨大的潛力透過改變病患照護、醫學教育,以及教學/學習和研究的方式來轉型醫療保健。本研究的目的是提供元宇宙基本概念和基礎技術的介紹。本文探討了元宇宙在醫療保健背景下的優缺點,並從技術和 AI 的角度分析其潛力。特別是,討論了機器學習方法的角色;我們將說明如何將機器學習演算法應用於元宇宙產生的資料,以獲得醫療保健應用方面的更佳見解。此外,我們透過探討區塊鏈等新興技術,並解決隱私問題,來探討元宇宙在醫療保健方面的未來願景。本研究的發現有助於更深入地了解元宇宙在醫療保健中的應用,以及其在醫療服務提供方面發揮革命性變革的潛力。 -The success of large generative models has driven a paradigm shift, -leveraging massive multi-source data to enhance model capabilities. However, -the interaction among these sources remains theoretically underexplored. This -paper takes the first step toward a rigorous analysis of multi-source training -in conditional generative modeling, where each condition represents a distinct -data source. Specifically, we establish a general distribution estimation error -bound in average total variation distance for conditional maximum likelihood -estimation based on the bracketing number. Our result shows that when source -distributions share certain similarities and the model is expressive enough, -multi-source training guarantees a sharper bound than single-source training. -We further instantiate the general theory on conditional Gaussian estimation -and deep generative models including autoregressive and flexible energy-based -models, by characterizing their bracketing numbers. The results highlight that -the number of sources and similarity among source distributions improve the -advantage of multi-source training. Simulations and real-world experiments -validate our theory. Code is available at: -\url{https://github.com/ML-GSAI/Multi-Source-GM}. 
+##### **AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI** +2406.06728v2 by K M Tawsik Jawad, Anusha Verma, Fathi Amsaad, Lamia Ashraf -摘要:大型生成模型的成功推動了範例轉移,利用大量多來源資料來增強模型功能。然而,這些來源之間的互動在理論上仍未得到充分探討。本文踏出了嚴謹分析條件生成模型中多來源訓練的第一步,其中每個條件代表一個不同的資料來源。具體來說,我們建立了一個基於括號數的條件最大似然估計的平均總變異距離中的通用分佈估計誤差界限。我們的結果表明,當來源分佈具有一定的相似性且模型具有足夠的表達力時,多來源訓練保證了比單來源訓練更嚴格的界限。我們進一步在條件高斯估計和深度生成模型(包括自迴歸和靈活的基於能量的模型)上例證了通用理論,通過表徵它們的括號數。結果強調了來源數和來源分佈之間的相似性提高了多來源訓練的優勢。模擬和真實世界的實驗驗證了我們的理論。程式碼可在以下網址取得:\url{https://github.com/ML-GSAI/Multi-Source-GM}。 +Chronic Kidney Disease (CKD) is one of the widespread Chronic diseases with +no known ultimo cure and high morbidity. Research demonstrates that progressive +Chronic Kidney Disease (CKD) is a heterogeneous disorder that significantly +impacts kidney structure and functions, eventually leading to kidney failure. +With the progression of time, chronic kidney disease has moved from a +life-threatening disease affecting few people to a common disorder of varying +severity. The goal of this research is to visualize dominating features, +feature scores, and values exhibited for early prognosis and detection of CKD +using ensemble learning and explainable AI. For that, an AI-driven predictive +analytics approach is proposed to aid clinical practitioners in prescribing +lifestyle modifications for individual patients to reduce the rate of +progression of this disease. Our dataset is collected on body vitals from +individuals with CKD and healthy subjects to develop our proposed AI-driven +solution accurately. In this regard, blood and urine test results are provided, +and ensemble tree-based machine-learning models are applied to predict unseen +cases of CKD. Our research findings are validated after lengthy consultations +with nephrologists. Our experiments and interpretation results are compared +with existing explainable AI applications in various healthcare domains, +including CKD. 
The comparison shows that our developed AI models, particularly +the Random Forest model, have identified more features as significant +contributors than XgBoost. Interpretability (I), which measures the ratio of +important to masked features, indicates that our XgBoost model achieved a +higher score, specifically a Fidelity of 98\%, in this metric and naturally in +the FII index compared to competing models. -##### **A Statistical Case Against Empirical Human-AI Alignment** -2502.14581v1 by Julian Rodemann, Esteban Garces Arias, Christoph Luther, Christoph Jansen, Thomas Augustin +摘要:慢性腎臟病 (CKD) 是一種廣泛的慢性疾病,目前尚未找到最終的治療方法,且發病率很高。研究表明,進行性慢性腎臟病 (CKD) 是一種異質性疾病,會顯著影響腎臟結構和功能,最終導致腎衰竭。隨著時間的推移,慢性腎臟病已從影響少數人的致命疾病演變成一種嚴重程度不一的常見疾病。本研究的目標是使用整體學習和可解釋的 AI 來視覺化支配性特徵、特徵分數和值,以進行 CKD 的早期預後和檢測。為此,提出了一種 AI 驅動的預測分析方法,以幫助臨床醫生為個別患者開具生活方式的修改建議,以降低此疾病的進展速度。我們的數據集是從 CKD 患者和健康受試者的身體生命徵象中收集的,以準確開發我們提出的 AI 驅動的解決方案。在這方面,提供了血液和尿液檢測結果,並應用基於集成樹的機器學習模型來預測未見的 CKD 病例。我們的研究結果在與腎臟科醫師進行長時間諮詢後得到驗證。我們的實驗和解釋結果與各種醫療保健領域中現有的可解釋 AI 應用進行了比較,包括 CKD。比較表明,我們開發的 AI 模型,特別是隨機森林模型,已經確定了比 XgBoost 更多的特徵作為顯著的貢獻者。可解釋性 (I) 衡量重要特徵與被遮蔽特徵的比率,表明我們的 XgBoost 模型在此指標中取得了更高的分數,特別是 98% 的保真度,並且在 FII 指數中自然高於競爭模型。 -Empirical human-AI alignment aims to make AI systems act in line with -observed human behavior. While noble in its goals, we argue that empirical -alignment can inadvertently introduce statistical biases that warrant caution. -This position paper thus advocates against naive empirical alignment, offering -prescriptive alignment and a posteriori empirical alignment as alternatives. We -substantiate our principled argument by tangible examples like human-centric -decoding of language models. 
+##### **Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook** +2406.05984v1 by Yusif Ibrahimov, Tarique Anwar, Tommy Yuan -摘要:經驗主義的人工智慧校準旨在使人工智慧系統根據觀察到的人類行為採取行動。儘管目標崇高,我們認為經驗主義校準可能會無意中引入需要謹慎對待的統計偏差。因此,本立場文件主張反對天真的經驗主義校準,提供規範性校準和後驗經驗主義校準作為替代方案。我們以具體的例子(例如以人為中心的語言模型解碼)來證明我們的原則性論點。 +Mental health constitutes a complex and pervasive global challenge, affecting +millions of lives and often leading to severe consequences. In this paper, we +conduct a thorough survey to explore the intersection of data science, +artificial intelligence, and mental healthcare, focusing on the recent +developments of mental disorder detection through online social media (OSM). A +significant portion of the population actively engages in OSM platforms, +creating a vast repository of personal data that holds immense potential for +mental health analytics. The paper navigates through traditional diagnostic +methods, state-of-the-art data- and AI-driven research studies, and the +emergence of explainable AI (XAI) models for mental healthcare. We review +state-of-the-art machine learning methods, particularly those based on modern +deep learning, while emphasising the need for explainability in healthcare AI +models. The experimental design section provides insights into prevalent +practices, including available datasets and evaluation approaches. We also +identify key issues and challenges in the field and propose promising future +research directions. As mental health decisions demand transparency, +interpretability, and ethical considerations, this paper contributes to the +ongoing discourse on advancing XAI in mental healthcare through social media. +The comprehensive overview presented here aims to guide researchers, +practitioners, and policymakers in developing the area of mental disorder +detection. 
-##### **ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification** -2502.14565v1 by Hyunseok Lee, Seunghyuk Oh, Jaehyung Kim, Jinwoo Shin, Jihoon Tack +摘要:心理健康構成了一項複雜且普遍的全球挑戰,影響了數百萬人的生活,並經常導致嚴重的後果。在本文中,我們進行了一項徹底的調查,以探索數據科學、人工智慧和心理保健的交集,重點關注通過線上社交媒體 (OSM) 進行心理疾病檢測的最新發展。很大一部分人口積極參與 OSM 平台,創造了一個龐大的人員資料庫,對心理健康分析具有巨大的潛力。本文探討了傳統的診斷方法、最先進的資料和 AI 驅動的研究,以及心理保健中可解釋 AI (XAI) 模型的出現。我們回顧了最先進的機器學習方法,特別是那些基於現代深度學習的方法,同時強調了醫療保健 AI 模型中可解釋性的必要性。實驗設計部分提供了對普遍做法的見解,包括可用的資料集和評估方法。我們還找出該領域的主要問題和挑戰,並提出了有希望的未來研究方向。由於心理健康決策需要透明度、可解釋性和道德考量,本文有助於推進心理保健中透過社交媒體推進 XAI 的持續討論。這裡提出的全面概述旨在引導研究人員、從業人員和政策制定者發展心理疾病檢測領域。 -Self-awareness, i.e., the ability to assess and correct one's own generation, -is a fundamental aspect of human intelligence, making its replication in large -language models (LLMs) an important yet challenging task. Previous works tackle -this by employing extensive reinforcement learning or rather relying on large -external verifiers. In this work, we propose Refine via Intrinsic -Self-Verification (ReVISE), an efficient and effective framework that enables -LLMs to self-correct their outputs through self-verification. The core idea of -ReVISE is to enable LLMs to verify their reasoning processes and continually -rethink reasoning trajectories based on its verification. We introduce a -structured curriculum based upon online preference learning to implement this -efficiently. Specifically, as ReVISE involves two challenging tasks (i.e., -self-verification and reasoning correction), we tackle each task sequentially -using curriculum learning, collecting both failed and successful reasoning -paths to construct preference pairs for efficient training. During inference, -our approach enjoys natural test-time scaling by integrating self-verification -and correction capabilities, further enhanced by our proposed confidence-aware -decoding mechanism. 
Our experiments on various reasoning tasks demonstrate that -ReVISE achieves efficient self-correction and significantly improves reasoning -performance. +##### **Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance** +2406.05746v1 by Zhan Zhang, Qin Zhang, Yang Jiao, Lin Lu, Lin Ma, Aihua Liu, Xiao Liu, Juan Zhao, Yajun Xue, Bing Wei, Mingxia Zhang, Ru Gao, Hong Zhao, Jie Lu, Fan Li, Yang Zhang, Yiming Wang, Lei Zhang, Fengwei Tian, Jie Hu, Xin Gou -摘要:自我覺察,亦即評估和修正自身產出的能力,是人類智慧的基本面向,使其能在大型語言模型 (LLM) 中複製,是一項重要且具挑戰性的任務。先前的研究透過採用廣泛的強化學習或依賴大型外部驗證器來解決這個問題。在這項研究中,我們提出透過內在自我驗證 (ReVISE) 進行精煉,一個有效率且有效的架構,使 LLM 能透過自我驗證來自我修正其產出。ReVISE 的核心概念是讓 LLM 能驗證其推理過程,並根據驗證結果持續重新思考推理軌跡。我們導入一個建構於線上偏好學習的結構化課程,以有效率地實作這項功能。具體來說,由於 ReVISE 涉及兩項具有挑戰性的任務(即自我驗證和推理修正),我們使用課程學習循序漸進地處理每一項任務,收集失敗和成功的推理路徑,以建構偏好對,進行有效率的訓練。在推論期間,我們的作法透過整合自我驗證和修正功能,享有自然的測試時間擴充,並進一步透過我們提出的具備信心感知的解碼機制進行強化。我們在各種推理任務上的實驗顯示,ReVISE 達到有效率的自我修正,並顯著提升推理效能。 +AI-aided clinical diagnosis is desired in medical care. Existing deep +learning models lack explainability and mainly focus on image analysis. The +recently developed Dynamic Uncertain Causality Graph (DUCG) approach is +causality-driven, explainable, and invariant across different application +scenarios, without problems of data collection, labeling, fitting, privacy, +bias, generalization, high cost and high energy consumption. Through close +collaboration between clinical experts and DUCG technicians, 46 DUCG models +covering 54 chief complaints were constructed. Over 1,000 diseases can be +diagnosed without triage. Before being applied in real-world, the 46 DUCG +models were retrospectively verified by third-party hospitals. The verified +diagnostic precisions were no less than 95%, in which the diagnostic precision +for every disease including uncommon ones was no less than 80%. After +verifications, the 46 DUCG models were applied in the real-world in China. 
Over +one million real diagnosis cases have been performed, with only 17 incorrect +diagnoses identified. Due to DUCG's transparency, the mistakes causing the +incorrect diagnoses were found and corrected. The diagnostic abilities of the +clinicians who applied DUCG frequently were improved significantly. Following +the introduction to the earlier presented DUCG methodology, the recommendation +algorithm for potential medical checks is presented and the key idea of DUCG is +extracted. -##### **Plan-over-Graph: Towards Parallelable LLM Agent Schedule** -2502.14563v1 by Shiqi Zhang, Xinbei Ma, Zouying Cao, Zhuosheng Zhang, Hai Zhao +摘要:醫療照護中需要 AI 輔助的臨床診斷。現有的深度學習模型缺乏可解釋性,並且主要專注於影像分析。最近開發的動態不確定因果關係圖 (DUCG) 方法是因果驅動的、可解釋的,並且在不同的應用場景中是不變的,沒有資料收集、標記、擬合、隱私、偏見、概化、高成本和高能耗的問題。通過臨床專家和 DUCG 技術人員之間的密切合作,構建了涵蓋 54 個主訴的 46 個 DUCG 模型。可以在沒有分流的情況下診斷出 1,000 多種疾病。在應用於實際世界之前,46 個 DUCG 模型已由第三方醫院回溯性驗證。驗證的診斷精度不低於 95%,其中包括罕見疾病在內的每種疾病的診斷精度不低於 80%。驗證後,46 個 DUCG 模型已在中國實際應用。已經執行了超過一百萬個真實診斷案例,僅發現 17 個不正確的診斷。由於 DUCG 的透明性,發現並糾正了導致不正確診斷的錯誤。頻繁應用 DUCG 的臨床醫生的診斷能力得到了顯著提高。在介紹了前面提出的 DUCG 方法論之後,提出了潛在健康檢查的推薦演算法,並提取了 DUCG 的關鍵思想。 -Large Language Models (LLMs) have demonstrated exceptional abilities in -reasoning for task planning. However, challenges remain under-explored for -parallel schedules. This paper introduces a novel paradigm, plan-over-graph, in -which the model first decomposes a real-life textual task into executable -subtasks and constructs an abstract task graph. The model then understands this -task graph as input and generates a plan for parallel execution. To enhance the -planning capability of complex, scalable graphs, we design an automated and -controllable pipeline to generate synthetic graphs and propose a two-stage -training scheme. Experimental results show that our plan-over-graph method -significantly improves task performance on both API-based LLMs and trainable -open-sourced LLMs. 
By normalizing complex tasks as graphs, our method naturally -supports parallel execution, demonstrating global efficiency. The code and data -are available at https://github.com/zsq259/Plan-over-Graph. +##### **Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability** +2406.12897v1 by Faseela Abdullakutty, Younes Akbari, Somaya Al-Maadeed, Ahmed Bouridane, Rifat Hamoudi -摘要:大型語言模型 (LLM) 已展現出在任務規劃推理方面的非凡能力。然而,對於並行時程表的挑戰仍未充分探討。本文介紹了一個新穎的範例,即圖形規劃,其中模型首先將現實生活中的文字任務分解為可執行的子任務,並建構一個抽象任務圖。然後,模型將此任務圖理解為輸入,並產生一個並行執行的計畫。為了增強複雜、可擴充圖形的規劃能力,我們設計了一個自動化且可控的管道來產生合成圖形,並提出了一個兩階段訓練方案。實驗結果表明,我們的圖形規劃方法顯著提升了基於 API 的 LLM 和可訓練的開源 LLM 的任務效能。透過將複雜任務標準化為圖形,我們的模型自然支援並行執行,展現出整體效率。程式碼和資料可在 https://github.com/zsq259/Plan-over-Graph 取得。 +It is imperative that breast cancer is detected precisely and timely to +improve patient outcomes. Diagnostic methodologies have traditionally relied on +unimodal approaches; however, medical data analytics is integrating diverse +data sources beyond conventional imaging. Using multi-modal techniques, +integrating both image and non-image data, marks a transformative advancement +in breast cancer diagnosis. The purpose of this review is to explore the +burgeoning field of multimodal techniques, particularly the fusion of +histopathology images with non-image data. Further, Explainable AI (XAI) will +be used to elucidate the decision-making processes of complex algorithms, +emphasizing the necessity of explainability in diagnostic processes. This +review utilizes multi-modal data and emphasizes explainability to enhance +diagnostic accuracy, clinician confidence, and patient engagement, ultimately +fostering more personalized treatment strategies for breast cancer, while also +identifying research gaps in multi-modality and explainability, guiding future +studies, and contributing to the strategic direction of the field. -##### **Can LLMs Predict Citation Intent? 
An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs** -2502.14561v1 by Paris Koloveas, Serafeim Chatzopoulos, Thanasis Vergoulis, Christos Tryfonopoulos +摘要:精確且及時地偵測乳癌對於改善患者預後至關重要。診斷方法傳統上依賴於單一模式方法;然而,醫療資料分析正在整合超越傳統影像的各種資料來源。使用整合影像和非影像資料的多模式技術,標誌著乳癌診斷的變革性進展。本篇綜述的目的是探討多模式技術的新興領域,特別是將組織病理學影像與非影像資料融合。此外,可解釋人工智慧 (XAI) 將用於闡明複雜演算法的決策過程,強調診斷過程中可解釋性的必要性。本綜述利用多模式資料並強調可解釋性,以提高診斷準確性、臨床醫師的信心和患者參與度,最終促進乳癌更個人化的治療策略,同時也找出多模式和可解釋性的研究差距,引導未來的研究,並為該領域的策略方向做出貢獻。 -This work investigates the ability of open Large Language Models (LLMs) to -predict citation intent through in-context learning and fine-tuning. Unlike -traditional approaches that rely on pre-trained models like SciBERT, which -require extensive domain-specific pretraining and specialized architectures, we -demonstrate that general-purpose LLMs can be adapted to this task with minimal -task-specific data. We evaluate twelve model variations across five prominent -open LLM families using zero, one, few, and many-shot prompting to assess -performance across scenarios. Our experimental study identifies the -top-performing model through extensive experimentation of in-context -learning-related parameters, which we fine-tune to further enhance task -performance. The results highlight the strengths and limitations of LLMs in -recognizing citation intents, providing valuable insights for model selection -and prompt engineering. Additionally, we make our end-to-end evaluation -framework and models openly available for future use. +##### **Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection** +2406.16908v3 by Dinuka Sandun Udayantha, Kavindu Weerasinghe, Nima Wickramasinghe, Akila Abeyratne, Kithmin Wickremasinghe, Jithangi Wanigasinghe, Anjula De Silva, Chamira U. S. 
Edussooriya -摘要:本研究探討開放式大型語言模型 (LLM) 透過情境學習和微調來預測引文意圖的能力。與依賴於預訓練模型(例如 SciBERT)的傳統方法不同,後者需要廣泛的特定領域預訓練和專業架構,我們證明了通用 LLM 可以使用最少的特定任務數據來適應此任務。我們使用零次、一次、少次和多次提示評估五個著名的開放式 LLM 家族中的十二個模型變體,以評估不同場景的效能。我們的實驗研究透過廣泛的實驗來識別情境學習相關參數中效能最佳的模型,我們微調這些參數以進一步增強任務效能。結果突顯了 LLM 在識別引文意圖方面的優點和限制,為模型選擇和提示工程提供了有價值的見解。此外,我們將端到端評估架構和模型公開供未來使用。 +The neonatal period is the most vulnerable time for the development of +seizures. Seizures in the immature brain lead to detrimental consequences, +therefore require early diagnosis. The gold-standard for neonatal seizure +detection currently relies on continuous video-EEG monitoring; which involves +recording multi-channel electroencephalogram (EEG) alongside real-time video +monitoring within a neonatal intensive care unit (NICU). However, video-EEG +monitoring technology requires clinical expertise and is often limited to +technologically advanced and resourceful settings. Cost-effective new +techniques could help the medical fraternity make an accurate diagnosis and +advocate treatment without delay. In this work, a novel explainable deep +learning model to automate the neonatal seizure detection process with a +reduced EEG montage is proposed, which employs convolutional nets, graph +attention layers, and fully connected layers. Beyond its ability to detect +seizures in real-time with a reduced montage, this model offers the unique +advantage of real-time interpretability. By evaluating the performance on the +Zenodo dataset with 10-fold cross-validation, the presented model achieves an +absolute improvement of 8.31% and 42.86% in area under curve (AUC) and recall, +respectively. 
-##### **Less is More: Improving LLM Alignment via Preference Data Selection** -2502.14560v1 by Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, Xiangnan He +摘要:新生兒期是大腦發育最脆弱的時期,容易出現癲癇發作。大腦發育不成熟時出現癲癇發作會造成不良後果,因此需要及早診斷。目前新生兒癲癇發作的黃金標準依賴於連續的視訊腦電圖 (EEG) 監測;其中包括在新生兒加護病房 (NICU) 內同時進行多頻道腦電圖 (EEG) 記錄和即時視訊監控。然而,視訊腦電圖監控技術需要臨床專業知識,而且通常僅限於技術先進且資源豐富的環境。具成本效益的新技術可以幫助醫療界準確診斷並立即提倡治療。在這項工作中,提出了一個新穎的可解釋深度學習模型,以自動化新生兒癲癇發作偵測過程,並採用減少的腦電圖裝置,其中採用了卷積神經網路、圖形注意力層和全連接層。除了能夠使用減少的裝置即時偵測癲癇發作外,此模型還提供了即時可解釋性的獨特優勢。透過在 Zenodo 資料集上使用 10 倍交叉驗證評估效能,所提出的模型在曲線下面積 (AUC) 和召回率方面分別達到了 8.31% 和 42.86% 的絕對改善。 -Direct Preference Optimization (DPO) has emerged as a promising approach for -aligning large language models with human preferences. While prior work mainly -extends DPO from the aspect of the objective function, we instead improve DPO -from the largely overlooked but critical aspect of data selection. -Specifically, we address the issue of parameter shrinkage caused by noisy data -by proposing a novel margin-maximization principle for dataset curation in DPO -training. To accurately estimate margins for data selection, we propose a -dual-margin guided approach that considers both external reward margins and -implicit DPO reward margins. Extensive experiments demonstrate that our method -reduces computational cost dramatically while improving performance. -Remarkably, by using just 10\% of the Ultrafeedback dataset, our approach -achieves 3\% to 8\% improvements across various Llama and Mistral series models -on the AlpacaEval 2.0 benchmark. Furthermore, our approach seamlessly extends -to iterative DPO, yielding a roughly 3\% improvement with 25\% online data, -while further reducing training time. These results highlight the potential of -data selection strategies for advancing preference optimization. 
+##### **Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques** +2406.00532v1 by Samita Bai, Sidra Nasir, Rizwan Ahmed Khan, Sheeraz Arif, Alexandre Meyer, Hubert Konik -摘要:直接偏好最佳化 (DPO) 已成為一種有希望的方法,可將大型語言模型與人類偏好保持一致。雖然先前的研究主要從目標函數的角度延伸 DPO,但我們反而從資料選擇這個極易被忽略但至關重要的角度改進 DPO。 -具體來說,我們透過提出一個用於 DPO 訓練中資料集整理的新邊際最大化原則,來解決由雜訊資料造成的參數收縮問題。為了準確估計資料選擇的邊際,我們提出一個雙邊際引導方法,它同時考慮外部獎勵邊際和隱含 DPO 獎勵邊際。大規模的實驗證明,我們的這種方法大幅降低了運算成本,同時改善了效能。 -值得注意的是,我們的這種方法僅使用 Ultrafeedback 資料集的 10%,便在 AlpacaEval 2.0 基準上,在各種 Llama 和 Mistral 系列模型中取得了 3% 到 8% 的改進。此外,我們的這種方法可以無縫地延伸到迭代 DPO,在使用 25% 線上資料的情況下產生了大約 3% 的改進,同時進一步減少了訓練時間。這些結果突顯了資料選擇策略在推進偏好最佳化方面的潛力。 +Breast cancer (BC) stands as one of the most common malignancies affecting +women worldwide, necessitating advancements in diagnostic methodologies for +better clinical outcomes. This article provides a comprehensive exploration of +the application of Explainable Artificial Intelligence (XAI) techniques in the +detection and diagnosis of breast cancer. As Artificial Intelligence (AI) +technologies continue to permeate the healthcare sector, particularly in +oncology, the need for transparent and interpretable models becomes imperative +to enhance clinical decision-making and patient care. This review discusses the +integration of various XAI approaches, such as SHAP, LIME, Grad-CAM, and +others, with machine learning and deep learning models utilized in breast +cancer detection and classification. By investigating the modalities of breast +cancer datasets, including mammograms, ultrasounds and their processing with +AI, the paper highlights how XAI can lead to more accurate diagnoses and +personalized treatment plans. It also examines the challenges in implementing +these techniques and the importance of developing standardized metrics for +evaluating XAI's effectiveness in clinical settings. 
Through detailed analysis +and discussion, this article aims to highlight the potential of XAI in bridging +the gap between complex AI models and practical healthcare applications, +thereby fostering trust and understanding among medical professionals and +improving patient outcomes. -##### **FUIA: Model Inversion Attack against Federated Unlearning** -2502.14558v1 by Lei Zhou, Youwen Zhu +摘要:乳癌 (BC) 是影響全球女性最常見的惡性腫瘤之一,因此需要進步的診斷方法,以改善臨床結果。本文全面探討了可解釋人工智慧 (XAI) 技術在乳癌偵測和診斷中的應用。隨著人工智慧 (AI) 技術持續滲透醫療保健領域,特別是在腫瘤學中,透明且可解釋的模型需求變得勢在必行,以增強臨床決策制定和患者照護。此篇評論探討了各種 XAI 方法的整合,例如 SHAP、LIME、Grad-CAM 等,以及用於乳癌偵測和分類的機器學習和深度學習模型。透過探討乳癌資料集的模式,包括乳房攝影、超音波及其在 AI 中的處理,本文重點說明 XAI 如何能導致更準確的診斷和個人化治療計畫。它也探討了實施這些技術的挑戰,以及制定標準化評量指標以評估 XAI 在臨床環境中的有效性的重要性。透過詳細的分析和討論,本文旨在強調 XAI 在縮小複雜 AI 模型與實務醫療保健應用之間差距的潛力,進而促進醫療專業人員之間的信任與理解,並改善患者的結果。 -With the introduction of regulations related to the ``right to be forgotten", -federated learning (FL) is facing new privacy compliance challenges. To address -these challenges, researchers have proposed federated unlearning (FU). However, -existing FU research has primarily focused on improving the efficiency of -unlearning, with less attention paid to the potential privacy vulnerabilities -inherent in these methods. To address this gap, we draw inspiration from -gradient inversion attacks in FL and propose the federated unlearning inversion -attack (FUIA). The FUIA is specifically designed for the three types of FU -(sample unlearning, client unlearning, and class unlearning), aiming to provide -a comprehensive analysis of the privacy leakage risks associated with FU. In -FUIA, the server acts as an honest-but-curious attacker, recording and -exploiting the model differences before and after unlearning to expose the -features and labels of forgotten data. FUIA significantly leaks the privacy of -forgotten data and can target all types of FU. 
This attack contradicts the goal -of FU to eliminate specific data influence, instead exploiting its -vulnerabilities to recover forgotten data and expose its privacy flaws. -Extensive experimental results show that FUIA can effectively reveal the -private information of forgotten data. To mitigate this privacy leakage, we -also explore two potential defense methods, although these come at the cost of -reduced unlearning effectiveness and the usability of the unlearned model. +##### **Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition** +2406.01624v2 by Alaa Nfissi, Wassim Bouachir, Nizar Bouguila, Brian Mishara -摘要:隨著「被遺忘權」相關法規的推出, -聯盟學習 (FL) 面臨新的隱私合規挑戰。為了應對 -這些挑戰,研究人員提出了聯盟取消學習 (FU)。然而, -現有的 FU 研究主要集中在提高取消學習的效率,較少關注這些方法中固有的潛在隱私漏洞。為了解決這個差距,我們從 -FL 中的梯度反演攻擊中汲取靈感,並提出聯盟取消學習反演 -攻擊 (FUIA)。FUIA 專門設計用於三種類型的 FU -(樣本取消學習、客戶端取消學習和類別取消學習),旨在提供 -對與 FU 相關的隱私洩露風險的全面分析。在 -FUIA 中,伺服器充當誠實但好奇的攻擊者,記錄並 -利用取消學習前後的模型差異來揭露遺忘資料的功能和標籤。FUIA 大幅洩露遺忘資料的隱私,並且可以針對所有類型的 FU。此攻擊與 FU 消除特定資料影響的目標相矛盾,而是利用其 -漏洞來恢復遺忘資料並揭露其隱私缺陷。廣泛的實驗結果表明 FUIA 可以有效揭露遺忘資料的私人資訊。為了減輕這種隱私洩露,我們 -還探索了兩種潛在的防禦方法,儘管這些方法以降低取消學習的有效性和已取消學習模型的可用性為代價。 +Speech emotion recognition (SER) has gained significant attention due to its +several application fields, such as mental health, education, and +human-computer interaction. However, the accuracy of SER systems is hindered by +high-dimensional feature sets that may contain irrelevant and redundant +information. To overcome this challenge, this study proposes an iterative +feature boosting approach for SER that emphasizes feature relevance and +explainability to enhance machine learning model performance. Our approach +involves meticulous feature selection and analysis to build efficient SER +systems. In addressing our main problem through model explainability, we employ +a feature evaluation loop with Shapley values to iteratively refine feature +sets. 
This process strikes a balance between model performance and +transparency, which enables a comprehensive understanding of the model's +predictions. The proposed approach offers several advantages, including the +identification and removal of irrelevant and redundant features, leading to a +more effective model. Additionally, it promotes explainability, facilitating +comprehension of the model's predictions and the identification of crucial +features for emotion determination. The effectiveness of the proposed method is +validated on the SER benchmarks of the Toronto emotional speech set (TESS), +Berlin Database of Emotional Speech (EMO-DB), Ryerson Audio-Visual Database of +Emotional Speech and Song (RAVDESS), and Surrey Audio-Visual Expressed Emotion +(SAVEE) datasets, outperforming state-of-the-art methods. To the best of our +knowledge, this is the first work to incorporate model explainability into an +SER framework. The source code of this paper is publicly available via this +https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition. 
-##### **Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling** -2502.14553v1 by Eric Egli, Matteo Manica, Jannis Born +摘要:語音情緒辨識 (SER) 由於其在心理健康、教育和人機互動等多個應用領域而備受關注。然而,SER 系統的準確性受到高維特徵集的阻礙,這些特徵集可能包含不相關和冗餘的資訊。為了克服這個挑戰,本研究提出了一種用於 SER 的迭代特徵提升方法,該方法強調特徵相關性和可解釋性,以增強機器學習模型的效能。我們的做法涉及仔細的特徵選擇和分析,以建立高效的 SER 系統。為了透過模型可解釋性解決我們的核心問題,我們採用了具有 Shapley 值的特徵評估迴圈,以反覆改善特徵集。這個過程在模型效能和透明度之間取得平衡,這使得我們能夠全面了解模型的預測。所提出的方法提供了多項優點,包括識別和移除不相關和冗餘的特徵,從而建立更有效的模型。此外,它促進了可解釋性,有助於理解模型的預測以及識別情緒決定的關鍵特徵。所提出的方法的有效性已在多倫多情緒語音集 (TESS)、柏林情緒語音資料庫 (EMO-DB)、賴爾森音訊視覺情緒語音和歌曲資料庫 (RAVDESS) 和薩里音訊視覺表達情緒 (SAVEE) 資料集的 SER 基準上得到驗證,其效能優於現有方法。據我們所知,這是第一個將模型可解釋性納入 SER 架構的研究。本文的原始碼可透過此連結公開取得:https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition。 -Bytes form the basis of the digital world and thus are a promising building -block for multimodal foundation models. Recently, Byte Language Models (BLMs) -have emerged to overcome tokenization, yet the excessive length of bytestreams -requires new architectural paradigms. Therefore, we present the Multiscale Byte -Language Model (MBLM), a model-agnostic hierarchical decoder stack that allows -training with context windows of $5$M bytes on single GPU in full model -precision. We thoroughly examine MBLM's performance with Transformer and Mamba -blocks on both unimodal and multimodal tasks. Our experiments demonstrate that -hybrid architectures are efficient in handling extremely long byte sequences -during training while achieving near-linear generational efficiency. To the -best of our knowledge, we present the first evaluation of BLMs on visual Q\&A -tasks and find that, despite serializing images and the absence of an encoder, -a MBLM with pure next token prediction can match custom CNN-LSTM architectures -with designated classification heads. 
We show that MBLMs exhibit strong -adaptability in integrating diverse data representations, including pixel and -image filestream bytes, underlining their potential toward omnimodal foundation -models. Source code is publicly available at: -https://github.com/ai4sd/multiscale-byte-lm +##### **The Explanation Necessity for Healthcare AI** +2406.00216v1 by Michail Mamalakis, Héloïse de Vareilles, Graham Murray, Pietro Lio, John Suckling + +Explainability is often critical to the acceptable implementation of +artificial intelligence (AI). Nowhere is this more important than healthcare +where decision-making directly impacts patients and trust in AI systems is +essential. This trust is often built on the explanations and interpretations +the AI provides. Despite significant advancements in AI interpretability, there +remains the need for clear guidelines on when and to what extent explanations +are necessary in the medical context. We propose a novel categorization system +with four distinct classes of explanation necessity, guiding the level of +explanation required: patient or sample (local) level, cohort or dataset +(global) level, or both levels. We introduce a mathematical formulation that +distinguishes these categories and offers a practical framework for researchers +to determine the necessity and depth of explanations required in medical AI +applications. Three key factors are considered: the robustness of the +evaluation protocol, the variability of expert observations, and the +representation dimensionality of the application. In this perspective, we +address the question: When does an AI medical application need to be explained, +and at what level of detail? 
-摘要:位元組構成數位世界的基礎,因此是多模態基礎模型的一個有前途的建構模組。最近,位元組語言模型 (BLM) 已應運而生,以克服標記化,但位元組串流的過長需要新的架構範例。因此,我們提出多尺度位元組語言模型 (MBLM),這是一個與模型無關的分層解碼器堆疊,允許在單一 GPU 上以完整的模型精度訓練 500 萬位元組的內容視窗。我們徹底檢驗了 MBLM 在單模態和多模態任務上使用 Transformer 和 Mamba 區塊的效能。我們的實驗證明,混合架構在處理訓練期間極長的位元組序列時很有效率,同時達到近乎線性的生成效率。據我們所知,我們提出在視覺問答任務上對 BLM 的首次評估,並發現,儘管序列化影像且沒有編碼器,但具有純粹下一個標記預測的 MBLM 可以匹配具有指定分類標頭的客製化 CNN-LSTM 架構。我們表明,MBLM 在整合各種資料表示形式方面表現出強大的適應性,包括像素和影像檔案串流位元組,強調它們朝向全模態基礎模型的潛力。原始碼已公開於: -https://github.com/ai4sd/multiscale-byte-lm +摘要:可解释性通常对于人工智能 (AI) 的可接受实施至关重要。在医疗保健领域,这一点尤为重要,因为决策直接影响患者,并且对 AI 系统的信任至关重要。这种信任通常建立在 AI 提供的解释和诠释之上。尽管 AI 可解释性取得了重大进展,但仍然需要明确的指导方针,说明在医疗环境中何时以及在多大程度上需要解释。我们提出了一种新颖的分类系统,该系统具有四种不同的解释必要性类别,指导所需的解释级别:患者或样本(局部)级别、队列或数据集(全局)级别,或两个级别。我们引入了一个数学公式,该公式区分了这些类别,并为研究人员提供了一个实用框架,以确定医疗 AI 应用中所需的解释的必要性和深度。考虑了三个关键因素:评估协议的稳健性、专家观察的可变性以及应用程序的表示维数。从这个角度来看,我们解决了这个问题:AI 医疗应用何时需要解释,以及需要解释到何种程度? -##### **Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks** -2502.14546v1 by Maya Bechler-Speicher, Ben Finkelshtein, Fabrizio Frasca, Luis Müller, Jan Tönshoff, Antoine Siraudin, Viktor Zaverkin, Michael M. Bronstein, Mathias Niepert, Bryan Perozzi, Mikhail Galkin, Christopher Morris +##### **Interdisciplinary Expertise to Advance Equitable Explainable AI** +2406.18563v1 by Chloe R. Bennett, Heather Cole-Lewis, Stephanie Farquhar, Naama Haamel, Boris Babenko, Oran Lang, Mat Fleck, Ilana Traynis, Charles Lau, Ivor Horn, Courtney Lyles -While machine learning on graphs has demonstrated promise in drug design and -molecular property prediction, significant benchmarking challenges hinder its -further progress and relevance. Current benchmarking practices often lack focus -on transformative, real-world applications, favoring narrow domains like -two-dimensional molecular graphs over broader, impactful areas such as -combinatorial optimization, relational databases, or chip design. 
Additionally, -many benchmark datasets poorly represent the underlying data, leading to -inadequate abstractions and misaligned use cases. Fragmented evaluations and an -excessive focus on accuracy further exacerbate these issues, incentivizing -overfitting rather than fostering generalizable insights. These limitations -have prevented the development of truly useful graph foundation models. This -position paper calls for a paradigm shift toward more meaningful benchmarks, -rigorous evaluation protocols, and stronger collaboration with domain experts -to drive impactful and reliable advances in graph learning research, unlocking -the potential of graph learning. +The field of artificial intelligence (AI) is rapidly influencing health and +healthcare, but bias and poor performance persist for populations who face +widespread structural oppression. Previous work has clearly outlined the need +for more rigorous attention to data representativeness and model performance to +advance equity and reduce bias. However, there is an opportunity to also +improve the explainability of AI by leveraging best practices of social +epidemiology and health equity to help us develop hypotheses for associations +found. In this paper, we focus on explainable AI (XAI) and describe a framework +for interdisciplinary expert panel review to discuss and critically assess AI +model explanations from multiple perspectives and identify areas of bias and +directions for future research. We emphasize the importance of the +interdisciplinary expert panel to produce more accurate, equitable +interpretations which are historically and contextually informed. +Interdisciplinary panel discussions can help reduce bias, identify potential +confounders, and identify opportunities for additional research where there are +gaps in the literature. In turn, these insights can suggest opportunities for +AI model improvement.
-摘要:儘管圖形上的機器學習在藥物設計和分子屬性預測方面已展現潛力,但顯著的基準挑戰阻礙了其進一步進展和相關性。目前的基準實務往往缺乏對轉型性、真實世界應用的關注,偏好於狹窄的領域,例如二維分子圖形,而不是組合最佳化、關係資料庫或晶片設計等更廣泛、更有影響力的領域。此外,許多基準資料集無法充分表示基礎資料,導致抽象化不充分和使用案例錯位。支離破碎的評估和過度關注準確性進一步加劇了這些問題,激勵過度擬合,而不是培養可概括的見解。這些限制阻礙了真正有用的圖形基礎模型的開發。這篇立場文件呼籲將範例轉變為更有意義的基準、嚴格的評估協定,以及與領域專家的更強大合作,以推動圖形學習研究中具有影響力和可靠性的進展,釋放圖形學習的潛力。 +摘要:人工智慧 (AI) 領域正快速影響著健康與醫療保健,但對於面臨廣泛結構性壓迫的人群來說,偏見和不良表現依然存在。先前的研究已清楚說明,需要更嚴格地注意資料代表性和模型效能,以促進公平性並減少偏見。然而,我們有機會透過運用社會流行病學和健康公平的最佳實務,來改善 AI 的可解釋性,以幫助我們針對發現的關聯性,發展假設。在本文中,我們專注於可解釋 AI (XAI),並描述一個跨領域專家小組審查架構,以從多重觀點討論和批判性評估 AI 模型的解釋,並找出偏見領域和未來研究的方向。我們強調跨領域專家小組對於產生更準確、公平的詮釋至關重要,而這些詮釋是根據歷史和脈絡而來的。跨領域小組討論有助於減少偏見、找出潛在的混淆因素,並在文獻中有缺口時找出額外研究的機會。反過來,這些見解可以建議 AI 模型改進的機會。 -##### **LLM-based User Profile Management for Recommender System** -2502.14541v1 by Seunghwan Bang, Hwanjun Song +##### **"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts** +2407.11978v1 by Hubert D. Zając, Jorge M. N. Ribeiro, Silvia Ingala, Simona Gentile, Ruth Wanjohi, Samuel N. Gitau, Jonathan F. Carlsen, Michael B. Nielsen, Tariq O. Andersen -The rapid advancement of Large Language Models (LLMs) has opened new -opportunities in recommender systems by enabling zero-shot recommendation -without conventional training. Despite their potential, most existing works -rely solely on users' purchase histories, leaving significant room for -improvement by incorporating user-generated textual data, such as reviews and -product descriptions. Addressing this gap, we propose PURE, a novel LLM-based -recommendation framework that builds and maintains evolving user profiles by -systematically extracting and summarizing key information from user reviews. -PURE consists of three core components: a Review Extractor for identifying user -preferences and key product features, a Profile Updater for refining and -updating user profiles, and a Recommender for generating personalized -recommendations using the most current profile. 
To evaluate PURE, we introduce -a continuous sequential recommendation task that reflects real-world scenarios -by adding reviews over time and updating predictions incrementally. Our -experimental results on Amazon datasets demonstrate that PURE outperforms -existing LLM-based methods, effectively leveraging long-term user information -while managing token limitations. +Artificial Intelligence (AI) repeatedly matches or outperforms radiologists in +lab experiments. However, real-world implementations of radiological AI-based +systems are found to provide little to no clinical value. This paper explores +how to design AI for clinical usefulness in different contexts. We conducted 19 +design sessions and design interventions with 13 radiologists from 7 clinical +sites in Denmark and Kenya, based on three iterations of a functional AI-based +prototype. Ten sociotechnical dependencies were identified as crucial for the +design of AI in radiology. We conceptualised four technical dimensions that +must be configured to the intended clinical context of use: AI functionality, +AI medical focus, AI decision threshold, and AI Explainability. We present four +design recommendations on how to address dependencies pertaining to the medical +knowledge, clinic type, user expertise level, patient context, and user +situation that condition the configuration of these technical dimensions.
-摘要:大型語言模型 (LLM) 的快速進步為推薦系統開啟了新的機會,它能實現零次學習推薦,而無需傳統訓練。儘管有潛力,但現有的大部分工作僅依賴於使用者的購買記錄,透過納入使用者產生的文字資料,例如評論和產品說明,仍有很大的改進空間。針對此差距,我們提出 PURE,一個新穎的基於 LLM 的推薦架構,透過系統性地從使用者評論中提取和總結關鍵資訊,建立並維護不斷演進的使用者檔案。PURE 由三個核心組成部分組成:一個評論萃取器,用於識別使用者的喜好和產品主要功能;一個檔案更新器,用於精煉和更新使用者檔案;一個推薦器,用於使用最新的檔案產生個人化推薦。為了評估 PURE,我們引入一個連續順序推薦任務,透過隨著時間新增評論和遞增更新預測,反映真實世界的場景。我們在 Amazon 資料集上的實驗結果證明,PURE 優於現有的基於 LLM 的方法,在管理符號限制的同時,有效地利用長期使用者資訊。 +摘要:人工智慧(AI)在實驗室實驗中不斷地與放射科醫師匹敵或表現得更出色。然而,發現放射科 AI 為基礎系統的實際執行幾乎沒有提供臨床價值。本文探討如何為 AI 設計在不同情境中臨床上的效用。我們根據功能性 AI 為基礎原型的三次迭代,在丹麥和肯亞的 7 個臨床場域與 13 位放射科醫師進行了 19 次設計會議和設計介入。十個社會技術依賴關係被認為對於放射科中 AI 的設計至關重要。我們概念化了四個技術面向,必須根據預期的臨床使用情境進行設定:AI 功能、AI 醫療重點、AI 決策門檻,以及 AI 可解釋性。我們提出四項設計建議,說明如何處理與醫療知識、診所類型、使用者專業知識等級、患者情境,以及影響這些技術面向設定的使用者情境相關的依賴關係。 -##### **LoRA-GGPO: Mitigating Double Descent in LoRA Fine-Tuning via Gradient-Guided Perturbation Optimization** -2502.14538v1 by Yupeng Chang, Chenlu Guo, Yi Chang, Yuan Wu +##### **Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making** +2405.16424v1 by Min Hun Lee, Silvana Xin Yi Choo, Shamala D/O Thilarajah -Large Language Models (LLMs) have achieved remarkable success in natural -language processing, but their full fine-tuning remains resource-intensive. -Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation -(LoRA), have emerged as a practical solution by approximating parameter updates -with low-rank matrices. However, LoRA often exhibits a "double descent" -phenomenon during fine-tuning, where model performance degrades due to -overfitting and limited expressiveness caused by low-rank constraints. To -address this issue, we propose LoRA-GGPO (Gradient-Guided Perturbation -Optimization), a novel method that leverages gradient and weight norms to -generate targeted perturbations. By optimizing the sharpness of the loss -landscape, LoRA-GGPO guides the model toward flatter minima, mitigating the -double descent problem and improving generalization. 
Extensive experiments on -natural language understanding (NLU) and generation (NLG) tasks demonstrate -that LoRA-GGPO outperforms LoRA and its state-of-the-art variants. Furthermore, -extended experiments specifically designed to analyze the double descent -phenomenon confirm that LoRA-GGPO effectively alleviates this issue, producing -more robust and generalizable models. Our work provides a robust and efficient -solution for fine-tuning LLMs, with broad applicability in real-world -scenarios. The code is available at https://github.com/llm172/LoRA-GGPO. +With advanced AI/ML, there has been growing research on explainable AI (XAI) +and studies on how humans interact with AI and XAI for effective human-AI +collaborative decision-making. However, we still have a lack of understanding +of how AI systems and XAI should be first presented to users without technical +backgrounds. In this paper, we present the findings of semi-structured +interviews with health professionals (n=12) and students (n=4) majoring in +medicine and health to study how to improve onboarding with AI and XAI. For the +interviews, we built upon human-AI interaction guidelines to create onboarding +materials of an AI system for stroke rehabilitation assessment and AI +explanations and introduce them to the participants. Our findings reveal that +beyond presenting traditional performance metrics on AI, participants desired +benchmark information, the practical benefits of AI, and interaction trials to +better contextualize AI performance, and refine the objectives and performance +of AI. Based on these findings, we highlight directions for improving +onboarding with AI and XAI and human-AI collaborative decision-making. 
-摘要:大型語言模型 (LLM) 在自然語言處理方面取得了顯著的成功,但它們的完全微調仍然需要大量資源。參數高效微調 (PEFT) 方法(例如低秩適應 (LoRA))已成為一種實用的解決方案,它通過低秩矩陣近似參數更新。然而,LoRA 在微調過程中經常表現出「雙重下降」現象,其中模型性能會因過度擬合和低秩約束導致的表達能力有限而下降。為了解決這個問題,我們提出了 LoRA-GGPO(梯度引導擾動優化),這是一種利用梯度和權重範數來產生目標擾動的新方法。通過優化損失函數曲面的陡度,LoRA-GGPO 引導模型朝向更平坦的最小值,從而減輕雙重下降問題並改善泛化能力。在自然語言理解 (NLU) 和生成 (NLG) 任務中進行的廣泛實驗表明,LoRA-GGPO 優於 LoRA 及其最先進的變體。此外,專門設計用於分析雙重下降現象的延伸實驗證實,LoRA-GGPO 有效地緩解了這個問題,產生了更強大且更具泛化能力的模型。我們的研究為微調 LLM 提供了一個強大且高效的解決方案,在現實世界場景中具有廣泛的適用性。代碼可在 https://github.com/llm172/LoRA-GGPO 獲得。 +摘要:隨著先進的 AI/ML,對可解釋 AI (XAI) 的研究不斷增加,以及關於人類如何與 AI 和 XAI 互動以進行有效的人工智慧協作決策制定。然而,我們仍然缺乏對 AI 系統和 XAI 應如何首先呈現給沒有技術背景的用戶的了解。在本文中,我們展示了與醫療專業人員 (n=12) 和主修醫學和健康的學生 (n=4) 進行半結構化訪談的結果,以研究如何改善 AI 和 XAI 的入門。對於訪談,我們建立在人機互動準則之上,為中風康復評估和 AI 解釋的 AI 系統創建入門材料,並將它們介紹給參與者。我們的研究結果表明,除了呈現傳統的 AI 性能指標外,參與者還希望基准信息、AI 的實際好處以及交互試驗,以更好地將 AI 性能情境化,並完善 AI 的目標和性能。根據這些發現,我們強調了改進 AI 和 XAI 以及人機協作決策制定的入門方向。 -##### **CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models** -2502.14529v1 by Zhenhong Zhou, Zherui Li, Jie Zhang, Yuanhe Zhang, Kun Wang, Yang Liu, Qing Guo +##### **Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach** +2405.17502v1 by Ziming Liu, Longjian Liu, Robert E. Heidel, Xiaopeng Zhao -Large Language Model-based Multi-Agent Systems (LLM-MASs) have demonstrated -remarkable real-world capabilities, effectively collaborating to complete -complex tasks. While these systems are designed with safety mechanisms, such as -rejecting harmful instructions through alignment, their security remains -largely unexplored. This gap leaves LLM-MASs vulnerable to targeted -disruptions. In this paper, we introduce Contagious Recursive Blocking Attacks -(Corba), a novel and simple yet highly effective attack that disrupts -interactions between agents within an LLM-MAS. 
Corba leverages two key -properties: its contagious nature allows it to propagate across arbitrary -network topologies, while its recursive property enables sustained depletion of -computational resources. Notably, these blocking attacks often involve -seemingly benign instructions, making them particularly challenging to mitigate -using conventional alignment methods. We evaluate Corba on two widely-used -LLM-MASs, namely, AutoGen and Camel across various topologies and commercial -models. Additionally, we conduct more extensive experiments in open-ended -interactive LLM-MASs, demonstrating the effectiveness of Corba in complex -topology structures and open-source models. Our code is available at: -https://github.com/zhrli324/Corba. +This article uses machine learning (ML) and explainable artificial +intelligence (XAI) techniques to investigate the relationship between +nutritional status and mortality rates associated with Alzheimer's disease (AD). +The Third National Health and Nutrition Examination Survey (NHANES III) +database is employed for analysis. The random forest model is selected as the +base model for XAI analysis, and the Shapley Additive Explanations (SHAP) +method is used to assess feature importance. The results highlight significant +nutritional factors such as serum vitamin B12 and glycated hemoglobin. The +study demonstrates the effectiveness of random forests in predicting AD +mortality compared to other diseases. This research provides insights into the +impact of nutrition on AD and contributes to a deeper understanding of disease +progression.
-摘要:基於大型語言模型的多主體系統(LLM-MAS)已展現出卓越的真實世界能力,有效地協作以完成複雜任務。儘管這些系統設計有安全機制,例如透過對齊拒絕有害指令,但其安全性仍未得到充分探討。此一缺口讓 LLM-MAS 易受針對性的破壞。在本文中,我們介紹了傳染性遞迴封鎖攻擊(Corba),這是一種新穎且簡單但極為有效的攻擊,會破壞 LLM-MAS 中主體之間的互動。Corba 利用了兩個關鍵特性:其傳染性使其能夠在任意網路拓撲中傳播,而其遞迴特性則能持續耗盡運算資源。值得注意的是,這些封鎖攻擊通常涉及看似良性的指令,這使得使用傳統對齊方法來減輕攻擊特別具有挑戰性。我們在兩個廣泛使用的 LLM-MAS,即 AutoGen 和 Camel 上評估了 Corba,涵蓋了各種拓撲和商業模型。此外,我們在開放式互動 LLM-MAS 中進行了更廣泛的實驗,證明了 Corba 在複雜拓撲結構和開源模型中的有效性。我們的程式碼可在以下網址取得:https://github.com/zhrli324/Corba。 +摘要:本文使用機器學習 (ML) 和可解釋人工智慧 (XAI) 技術來探討營養狀況與阿茲海默症 (AD) 相關的死亡率之間的關係。採用第三次全國健康與營養檢查調查 (NHANES III) 資料庫進行分析。選擇隨機森林模型作為 XAI 分析的基礎模型,並使用 Shapley Additive Explanations (SHAP) 方法來評估特徵重要性。結果突顯了重要的營養因素,例如血清維生素 B12 和糖化血紅蛋白。該研究證明了隨機森林在預測 AD 死亡率方面相較於其他疾病的有效性。本研究提供了營養對 AD 的影響的見解,並有助於更深入地了解疾病的進展。 -##### **Small Graph Is All You Need: DeepStateGNN for Scalable Traffic Forecasting** -2502.14525v1 by Yannick Wölker, Arash Hajisafi, Cyrus Shahabi, Matthias Renz +##### **Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone** +2407.11974v1 by Catalina Gomez, Ruolin Wang, Katharina Breininger, Corinne Casey, Chris Bradley, Mitchell Pavlak, Alex Pham, Jithin Yohannan, Mathias Unberath -We propose a novel Graph Neural Network (GNN) model, named DeepStateGNN, for -analyzing traffic data, demonstrating its efficacy in two critical tasks: -forecasting and reconstruction. Unlike typical GNN methods that treat each -traffic sensor as an individual graph node, DeepStateGNN clusters sensors into -higher-level graph nodes, dubbed Deep State Nodes, based on various similarity -criteria, resulting in a fixed number of nodes in a Deep State graph. The term -"Deep State" nodes is a play on words, referencing hidden networks of power -that, like these nodes, secretly govern traffic independently of visible -sensors. 
These Deep State Nodes are defined by several similarity factors, -including spatial proximity (e.g., sensors located nearby in the road network), -functional similarity (e.g., sensors on similar types of freeways), and -behavioral similarity under specific conditions (e.g., traffic behavior during -rain). This clustering approach allows for dynamic and adaptive node grouping, -as sensors can belong to multiple clusters and clusters may evolve over time. -Our experimental results show that DeepStateGNN offers superior scalability and -faster training, while also delivering more accurate results than competitors. -It effectively handles large-scale sensor networks, outperforming other methods -in both traffic forecasting and reconstruction accuracy. +Primary care providers are vital for initial triage and referrals to +specialty care. In glaucoma, asymptomatic and fast progression can lead to +vision loss, necessitating timely referrals to specialists. However, primary +eye care providers may not identify urgent cases, potentially delaying care. +Artificial Intelligence (AI) offering explanations could enhance their referral +decisions. We investigate how various AI explanations help providers +distinguish between patients needing immediate or non-urgent specialist +referrals. We built explainable AI algorithms to predict glaucoma surgery needs +from routine eyecare data as a proxy for identifying high-risk patients. We +incorporated intrinsic and post-hoc explainability and conducted an online +study with optometrists to assess human-AI team performance, measuring referral +accuracy and analyzing interactions with AI, including agreement rates, task +time, and user experience perceptions. AI support enhanced referral accuracy +among 87 participants (59.9%/50.8% with/without AI), though Human-AI teams +underperformed compared to AI alone. Participants believed they included AI +advice more when using the intrinsic model, and perceived it more useful and +promising. 
Without explanations, deviations from AI recommendations increased. +AI support did not increase workload, confidence, and trust, but reduced +challenges. On a separate test set, our black-box and intrinsic models achieved +an accuracy of 77% and 71%, respectively, in predicting surgical outcomes. We +identify opportunities of human-AI teaming for glaucoma management in primary +eye care, noting that while AI enhances referral accuracy, it also shows a +performance gap compared to AI alone, even with explanations. Human involvement +remains essential in medical decision making, underscoring the need for future +research to optimize collaboration, ensuring positive experiences and safe AI +use. -摘要:我們提出一個名為 DeepStateGNN 的新穎圖形神經網路 (GNN) 模型,用於分析交通數據,並展示其在兩個關鍵任務中的效能:預測和重建。與將每個交通感測器視為個別圖形節點的典型 GNN 方法不同,DeepStateGNN 會根據各種相似性準則將感測器群集到較高層級的圖形節點中,稱為 Deep State 節點,這會在 Deep State 圖形中產生固定數量的節點。「Deep State」節點這個術語是文字遊戲,指的是隱藏的權力網路,就像這些節點一樣,秘密地獨立於可見感測器管理交通。這些 Deep State 節點由幾個相似性因素定義,包括空間接近性(例如,位於道路網路中附近的感測器)、功能相似性(例如,位於類似類型高速公路上的感測器)以及特定條件下的行為相似性(例如,雨中的交通行為)。這種群集方法允許動態和自適應節點分組,因為感測器可以屬於多個群集,而且群集可能會隨著時間演變。我們的實驗結果顯示,DeepStateGNN 提供了卓越的可擴充性和更快的訓練速度,同時也比競爭對手提供了更準確的結果。它有效地處理了大規模感測器網路,在交通預測和重建準確度方面都優於其他方法。 +摘要:初級保健提供者對於最初的分流和轉診到專科照護至關重要。在青光眼的情況下,無症狀且快速惡化可能導致視力喪失,因此需要及時轉診給專家。然而,初級眼科保健提供者可能無法識別緊急情況,可能會延誤照護。提供解釋的人工智慧 (AI) 可以加強他們的轉診決策。我們研究各種 AI 解釋如何幫助提供者區分需要立即或非緊急專科轉診的患者。我們建立了解釋性 AI 演算法,以從例行眼科護理資料預測青光眼手術需求,作為識別高風險患者的代理。我們納入了內在和事後解釋性,並與驗光師進行了一項線上研究,以評估人機團隊的表現,衡量轉診準確度並分析與 AI 的互動,包括同意率、任務時間和使用者體驗感知。在 87 名參與者中,AI 支援提高了轉診準確度(使用 AI/未使用的比例為 59.9%/50.8%),儘管人機團隊的表現不如單獨使用 AI。參與者認為他們在使用內在模型時更多地納入了 AI 建議,並認為它更有用且更有希望。沒有解釋,AI 建議的偏差會增加。AI 支援並未增加工作量、信心和信任,但減少了挑戰。在一個單獨的測試集中,我們的黑盒子和內在模型在預測手術結果方面分別達到了 77% 和 71% 的準確度。我們找出在初級眼科保健中,人機團隊合作管理青光眼的機會,並注意到雖然 AI 提高了轉診準確度,但即使有解釋,它也顯示出與單獨使用 AI 相比的效能差距。人類參與在醫療決策中仍然至關重要,這強調了未來研究優化協作、確保正面經驗和安全使用 AI 的必要性。 -##### **Generative adversarial networks vs large language models: a comparative study on synthetic tabular data generation** -2502.14523v1 by Austin A. 
Barr, Robert Rozman, Eddie Guo +##### **Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery** +2406.18552v1 by Yingying Fang, Zihao Jin, Xiaodan Xing, Simon Walsh, Guang Yang -We propose a new framework for zero-shot generation of synthetic tabular -data. Using the large language model (LLM) GPT-4o and plain-language prompting, -we demonstrate the ability to generate high-fidelity tabular data without -task-specific fine-tuning or access to real-world data (RWD) for pre-training. -To benchmark GPT-4o, we compared the fidelity and privacy of LLM-generated -synthetic data against data generated with the conditional tabular generative -adversarial network (CTGAN), across three open-access datasets: Iris, Fish -Measurements, and Real Estate Valuation. Despite the zero-shot approach, GPT-4o -outperformed CTGAN in preserving means, 95% confidence intervals, bivariate -correlations, and data privacy of RWD, even at amplified sample sizes. Notably, -correlations between parameters were consistently preserved with appropriate -direction and strength. However, refinement is necessary to better retain -distributional characteristics. These findings highlight the potential of LLMs -in tabular data synthesis, offering an accessible alternative to generative -adversarial networks and variational autoencoders. +In medical imaging, particularly in early disease detection and prognosis +tasks, discerning the rationale behind an AI model's predictions is crucial for +evaluating the reliability of its decisions. Conventional explanation methods +face challenges in identifying discernible decisive features in medical image +classifications, where discriminative features are subtle or not immediately +apparent. To bridge this gap, we propose an explainable model that is equipped +with both decision reasoning and feature identification capabilities. 
Our +approach not only detects influential image patterns but also uncovers the +decisive features that drive the model's final predictions. By implementing our +method, we can efficiently identify and visualise class-specific features +leveraged by the data-driven model, providing insights into the decision-making +processes of deep learning models. We validated our model in the demanding +realm of medical prognosis task, demonstrating its efficacy and potential in +enhancing the reliability of AI in healthcare and in discovering new knowledge +in diseases where prognostic understanding is limited. + +摘要:在醫學影像中,特別是在早期疾病檢測和預後任務中,辨別 AI 模型預測背後的原理對於評估其決策的可靠性至關重要。傳統的解釋方法在識別醫學影像分類中可識別的決定性特徵時面臨挑戰,其中區別性特徵很微妙或並不明顯。為了彌合這一差距,我們提出了一個可解釋的模型,該模型具備決策推理和特徵識別能力。我們的做法不僅檢測有影響力的影像模式,還揭示了推動模型最終預測的決定性特徵。通過實施我們的模型,我們可以有效識別和視覺化由數據驅動模型利用的類特定特徵,從而深入了解深度學習模型的決策過程。我們在要求嚴格的醫學預後任務領域驗證了我們的模型,展示了其在提高 AI 在醫療保健中的可靠性和發現預後理解受限疾病的新知識方面的功效和潛力。 + +##### **The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach** +2405.13099v1 by Mohsen Jozani, Jason A. Williams, Ahmed Aleroud, Sarbottam Bhagat + +This study explores the relationship between informational support seeking +questions, responses, and helpfulness ratings in online health communities. We +created a labeled data set of question-response pairs and developed multimodal +machine learning and deep learning models to reliably predict informational +support questions and responses. We employed explainable AI to reveal the +emotions embedded in informational support exchanges, demonstrating the +importance of emotion in providing informational support. This complex +interplay between emotional and informational support has not been previously +researched. The study refines social support theory and lays the groundwork for +the development of user decision aids. Further implications are discussed. 
-摘要:我們提出一個新的架構,用於合成表格資料的零次學習產生。利用大型語言模型 (LLM) GPT-4o 和自然語言提示,我們證明了在沒有特定任務微調或取得真實世界資料 (RWD) 進行預訓練的情況下,產生高保真表格資料的能力。為了對 GPT-4o 進行基準測試,我們比較了 LLM 生成的合成資料與使用條件表格生成對抗網路 (CTGAN) 生成的資料在保真度和隱私性方面的表現,比較對象是三個開放取用的資料集:鳶尾花、魚類測量和房地產估價。儘管採用零次學習方法,GPT-4o 在保留平均值、95% 信賴區間、二元關聯和 RWD 的資料隱私方面都優於 CTGAN,即使在擴增的樣本大小下也是如此。值得注意的是,參數之間的關聯始終保持適當的方向和強度。然而,需要進行改進以更好地保留分佈特徵。這些發現突顯了 LLM 在表格資料合成中的潛力,為生成對抗網路和變異自動編碼器提供了可行的替代方案。 +摘要:本研究探討線上健康社群中尋求資訊支持的問題、回應,以及有幫助的評分之間的關係。我們建立了一組標記的問答配對資料集,並開發了多模態機器學習和深度學習模型,以可靠地預測資訊支持問題和回應。我們採用可解釋的 AI 來揭示資訊支持交流中蘊含的情緒,證明情緒在提供資訊支持中的重要性。這種情緒支持和資訊支持之間的複雜交互作用以前並未被研究過。本研究改進了社會支持理論,並為使用者決策輔助工具的開發奠定了基礎。討論了進一步的影響。 -##### **MultiSlav: Using Cross-Lingual Knowledge Transfer to Combat the Curse of Multilinguality** -2502.14509v1 by Artur Kot, Mikołaj Koszowski, Wojciech Chojnowski, Mieszko Rutkowski, Artur Nowakowski, Kamil Guttmann, Mikołaj Pokrywka +##### **ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education** +2405.10645v1 by Harris Bin Munawar, Nikolaos Misirlis -Does multilingual Neural Machine Translation (NMT) lead to The Curse of the -Multlinguality or provides the Cross-lingual Knowledge Transfer within a -language family? In this study, we explore multiple approaches for extending -the available data-regime in NMT and we prove cross-lingual benefits even in -0-shot translation regime for low-resource languages. With this paper, we -provide state-of-the-art open-source NMT models for translating between -selected Slavic languages. We released our models on the HuggingFace Hub -(https://hf.co/collections/allegro/multislav-6793d6b6419e5963e759a683) under -the CC BY 4.0 license. Slavic language family comprises morphologically rich -Central and Eastern European languages. Although counting hundreds of millions -of native speakers, Slavic Neural Machine Translation is under-studied in our -opinion. 
Recently, most NMT research focuses either on: high-resource languages -like English, Spanish, and German - in WMT23 General Translation Task 7 out of -8 task directions are from or to English; massively multilingual models -covering multiple language groups; or evaluation techniques. +In the era of exponential technology growth, one unexpected guest has claimed +a seat in classrooms worldwide, Artificial Intelligence. Generative AI, such as +ChatGPT, promises a revolution in education, yet it arrives with a double-edged +sword. Its potential for personalized learning is offset by issues of cheating, +inaccuracies, and educators struggling to incorporate it effectively into their +lesson design. We are standing on the brink of this educational frontier, and +it is clear that we need to navigate this terrain with a lot of care. This is a +major challenge that could undermine the integrity and value of our educational +process. So, how can we turn these challenges into opportunities? When used +inappropriately, AI tools can become the perfect tool for the cut copy paste +mentality, and quickly begin to corrode critical thinking, creativity, and deep +understanding, the most important skills in our rapidly changing world. +Teachers feel that they are not equipped to leverage this technology, widening +the digital divide among educators and institutions. Addressing these concerns +calls for an in depth research approach. We will employ empirical research, +drawing on the Technology Acceptance Model, to assess the attitudes toward +generative AI among educators and students. Understanding their perceptions, +usage patterns, and hurdles is the first crucial step in creating an effective +solution. 
The present study will be used as a process manual for future +researchers to apply, running their own data, based on the steps explained here -摘要:多語言神經機器翻譯 (NMT) 是否會導致多語言的詛咒,或在語言家族中提供跨語言知識轉移?在這項研究中,我們探討了多種擴展 NMT 中可用資料範圍的方法,並證明了即使在低資源語言的零次學習翻譯中也有跨語言的優點。透過這篇論文,我們提供了最先進的開源 NMT 模型,用於翻譯選定的斯拉夫語。我們在 HuggingFace Hub (https://hf.co/collections/allegro/multislav-6793d6b6419e5963e759a683) 下根據 CC BY 4.0 授權發布我們的模型。斯拉夫語系包含形態豐富的中歐和東歐語言。儘管擁有數億母語人士,但我們認為斯拉夫神經機器翻譯的研究不足。最近,大多數 NMT 研究都專注於:高資源語言,例如英語、西班牙語和德語 - 在 WMT23 一般翻譯任務中,8 個任務方向中有 7 個來自英語或翻譯成英語;涵蓋多個語言群組的大規模多語言模型;或評估技術。 +摘要:在科技飛速發展的時代,一位意外的訪客已在全球教室中佔有一席之地,那就是人工智慧。生成式 AI,例如 ChatGPT,承諾在教育領域掀起一場革命,但它卻是一把雙面刃。它在個人化學習方面的潛力,卻因作弊、不準確以及教育工作者難以將其有效融入教學設計等問題而抵銷。我們正站在這教育前沿的邊緣,顯然我們需要非常小心地探索這片領域。這是一個重大的挑戰,可能會損害我們教育過程的完整性和價值。那麼,我們如何將這些挑戰轉化為機遇?當不適當地使用時,AI 工具可能會成為複製貼上心態的完美工具,並迅速腐蝕批判性思維、創造力和深入理解,這些都是我們快速變化的世界中最重要的技能。教師們覺得他們沒有能力利用這項技術,這擴大了教育工作者和機構之間的數位鴻溝。解決這些問題需要深入的研究方法。我們將採用實證研究,借鑑技術接受模型,來評估教育工作者和學生對生成式 AI 的態度。了解他們的看法、使用模式和障礙是創造有效解決方案的第一個關鍵步驟。本研究將作為未來研究人員應用的流程手冊,根據此處說明的步驟運行他們自己的數據 -##### **Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases** -2502.14507v1 by Rena Gao, Xuetong Wu, Tatsuki Kuribayashi, Mingrui Ye, Siya Qi, Carsten Roever, Yuanxing Liu, Zheng Yuan, Jey Han Lau +##### **Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data** +2405.07590v1 by Camelia Oprea, Mike Grüne, Mateusz Buglowski, Lena Olivier, Thorsten Orlikowsky, Stefan Kowalewski, Mark Schoberer, André Stollenwerk -This study evaluates Large Language Models' (LLMs) ability to simulate -non-native-like English use observed in human second language (L2) learners -interfered with by their native first language (L1). In dialogue-based -interviews, we prompt LLMs to mimic L2 English learners with specific L1s -(e.g., Japanese, Thai, Urdu) across seven languages, comparing their outputs to -real L2 learner data. 
Our analysis examines L1-driven linguistic biases, such -as reference word usage and avoidance behaviors, using information-theoretic -and distributional density measures. Results show that modern LLMs (e.g., -Qwen2.5, LLAMA3.3, DeepseekV3, GPT-4o) replicate L1-dependent patterns observed -in human L2 data, with distinct influences from various languages (e.g., -Japanese, Korean, and Mandarin significantly affect tense agreement, and Urdu -influences noun-verb collocations). Our results reveal the potential of LLMs -for L2 dialogue generation and evaluation for future educational applications. +With the digitalization of health care systems, artificial intelligence +becomes more present in medicine. Especially machine learning shows great +potential for complex tasks such as time series classification, usually at the +cost of transparency and comprehensibility. This leads to a lack of trust by +humans and thus hinders its active usage. Explainable artificial intelligence +tries to close this gap by providing insight into the decision-making process, +the actual usefulness of its different methods is however unclear. This paper +proposes a user study based evaluation of the explanation method Grad-CAM with +application to a neural network for the classification of breaths in time +series neonatal ventilation data. We present the perceived usefulness of the +explainability method by different stakeholders, exposing the difficulty to +achieve actual transparency and the wish for more in-depth explanations by many +of the participants. 
-摘要:本研究評估大型語言模型 (LLM) 模擬非母語英語使用者的能力,這些使用者會受到母語 (L1) 干擾,而母語是第二語言 (L2) 學習者。在基於對話的訪談中,我們提示 LLM 模仿具有特定 L1(例如日語、泰語、烏爾都語)的 L2 英語學習者,並比較七種語言的輸出與真實的 L2 學習者資料。我們的分析使用資訊理論和分佈密度測量來檢視 L1 驅動的語言偏差,例如參考詞使用和避免行為。結果顯示,現代 LLM(例如 Qwen2.5、LLAMA3.3、DeepseekV3、GPT-4o)複製了在人類 L2 資料中觀察到的 L1 相依模式,並受到各種語言的明顯影響(例如,日語、韓語和普通話顯著影響時態一致性,而烏爾都語影響名詞動詞搭配)。我們的結果揭示了 LLM 在 L2 對話產生和評估方面的潛力,可供未來教育應用使用。 +摘要:隨著醫療保健系統的數位化,人工智慧在醫學領域中變得更加普及。特別是機器學習在時間序列分類等複雜任務中展現出極大的潛力,但通常是以透明度和可理解性為代價。這導致人類缺乏信任,從而阻礙了其積極使用。可解釋的人工智慧試圖通過提供對決策過程的洞察來彌補這一差距,但其不同方法的實際效用尚不清楚。本文提出了一個基於使用者研究的評估,其中包含了 Grad-CAM 解釋方法,並將其應用於神經網路以分類時間序列新生兒呼吸數據中的呼吸。我們展示了不同利益相關者對可解釋性方法的感知效用,揭示了實現實際透明度的難度,以及許多參與者希望獲得更深入的解釋。 -##### **PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models** -2502.14504v1 by Yu Meng, Kaiyuan Li, Chenran Huang, Chen Gao, Xinlei Chen, Yong Li, Xiaoping Zhang +##### **XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare** +2405.06270v3 by Fatemeh Nazary, Yashar Deldjoo, Tommaso Di Noia, Eugenio di Sciascio -Large Vision-Language Models (LVLMs) have demonstrated remarkable -capabilities across a range of multimodal tasks. However, their inference -efficiency is constrained by the large number of visual tokens processed during -decoding. To address this challenge, we propose Per-Layer Per-Head Vision Token -Pruning (PLPHP), a two-level fine-grained pruning method including Layer-Level -Retention Rate Allocation and Head-Level Vision Token Pruning. Motivated by the -Vision Token Re-attention phenomenon across decoder layers, we dynamically -adjust token retention rates layer by layer. Layers that exhibit stronger -attention to visual information preserve more vision tokens, while layers with -lower vision attention are aggressively pruned. Furthermore, PLPHP applies -pruning at the attention head level, enabling different heads within the same -layer to independently retain critical context. 
Experiments on multiple -benchmarks demonstrate that PLPHP delivers an 18% faster decoding speed and -reduces the Key-Value Cache (KV Cache) size by over 50%, all at the cost of -0.46% average performance drop, while also achieving notable performance -improvements in multi-image tasks. These results highlight the effectiveness of -fine-grained token pruning and contribute to advancing the efficiency and -scalability of LVLMs. Our source code will be made publicly available. +The integration of Large Language Models (LLMs) into healthcare diagnostics +offers a promising avenue for clinical decision-making. This study outlines the +development of a novel method for zero-shot/few-shot in-context learning (ICL) +by integrating medical domain knowledge using a multi-layered structured +prompt. We also explore the efficacy of two communication styles between the +user and LLMs: the Numerical Conversational (NC) style, which processes data +incrementally, and the Natural Language Single-Turn (NL-ST) style, which +employs long narrative prompts. + Our study systematically evaluates the diagnostic accuracy and risk factors, +including gender bias and false negative rates, using a dataset of 920 patient +records in various few-shot scenarios. Results indicate that traditional +clinical machine learning (ML) models generally outperform LLMs in zero-shot +and few-shot settings. However, the performance gap narrows significantly when +employing few-shot examples alongside effective explainable AI (XAI) methods as +sources of domain knowledge. Moreover, with sufficient time and an increased +number of examples, the conversational style (NC) nearly matches the +performance of ML models. Most notably, LLMs demonstrate comparable or superior +cost-sensitive accuracy relative to ML models. + This research confirms that, with appropriate domain knowledge and tailored +communication strategies, LLMs can significantly enhance diagnostic processes. 
+The findings highlight the importance of optimizing the number of training +examples and communication styles to improve accuracy and reduce biases in LLM +applications. -摘要:大型視覺語言模型 (LVLMs) 已在各種多模態任務中展現出非凡的能力。然而,其推理效率受到解碼過程中處理的大量視覺符號的限制。為了應對這一挑戰,我們提出逐層逐頭視覺符號剪枝 (PLPHP),這是一種包括層級保留率分配和頭級視覺符號剪枝的兩級細粒度剪枝方法。受解碼器層中視覺符號重新關注現象的啟發,我們動態地逐層調整符號保留率。對視覺資訊表現出更強關注力的層保留更多視覺符號,而視覺關注力較低的層則被積極剪枝。此外,PLPHP 在關注頭級別應用剪枝,使同一層中的不同頭部可以獨立保留關鍵上下文。在多個基準測試上的實驗表明,PLPHP 的解碼速度提高了 18%,且將鍵值快取 (KV 快取) 大小減少了 50% 以上,而代價僅為平均效能下降 0.46%,同時還在多影像任務中實現了顯著的效能提升。這些結果突顯了細粒度符號剪枝的有效性,並有助於提升 LVLMs 的效率和可擴充性。我們的原始碼將公開提供。 +摘要:大型語言模型 (LLM) 與醫療診斷整合 +為臨床決策提供了一個有前景的途徑。本研究概述了一種新穎方法的開發,用於零次學習/少量學習情境學習 (ICL),方法是使用多層結構化提示整合醫療領域知識。我們還探討了使用者與 LLM 之間兩種溝通方式的功效:數值對話 (NC) 方式,它會逐步處理資料,以及自然語言單回合 (NL-ST) 方式,它會使用長篇敘事提示。 +我們的研究系統性地評估了診斷準確性和風險因子,包括性別偏見和假陰性率,使用了一個包含 920 個患者記錄的資料集,採用各種少量學習情境。結果表明,傳統的臨床機器學習 (ML) 模型通常在零次學習和少量學習設定中表現優於 LLM。然而,當使用少量學習範例以及有效的可解釋 AI (XAI) 方法作為領域知識來源時,效能差距會顯著縮小。此外,隨著時間充足和範例數量增加,對話方式 (NC) 幾乎可以媲美 ML 模型的效能。最值得注意的是,LLM 相對於 ML 模型展現出相當或更佳的成本敏感準確度。 +本研究證實,透過適當的領域知識和量身打造的溝通策略,LLM 可以顯著增強診斷程序。這些發現突顯了最佳化訓練範例數量和溝通方式的重要性,以提高準確度並減少 LLM 應用中的偏差。 -##### **How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?** -2502.14502v1 by Sergey Pletenev, Maria Marina, Daniil Moskovskiy, Vasily Konovalov, Pavel Braslavski, Alexander Panchenko, Mikhail Salnikov +##### **To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems** +2405.05766v1 by Miquel Miró-Nicolau, Gabriel Moyà-Alcover, Antoni Jaume-i-Capó, Manuel González-Hidalgo, Maria Gemma Sempere Campello, Juan Antonio Palmer Sancho -The performance of Large Language Models (LLMs) on many tasks is greatly -limited by the knowledge learned during pre-training and stored in the model's -parameters. Low-rank adaptation (LoRA) is a popular and efficient training -technique for updating or domain-specific adaptation of LLMs. 
In this study, we -investigate how new facts can be incorporated into the LLM using LoRA without -compromising the previously learned knowledge. We fine-tuned -Llama-3.1-8B-instruct using LoRA with varying amounts of new knowledge. Our -experiments have shown that the best results are obtained when the training -data contains a mixture of known and new facts. However, this approach is still -potentially harmful because the model's performance on external -question-answering benchmarks declines after such fine-tuning. When the -training data is biased towards certain entities, the model tends to regress to -few overrepresented answers. In addition, we found that the model becomes more -confident and refuses to provide an answer in only few cases. These findings -highlight the potential pitfalls of LoRA-based LLM updates and underscore the -importance of training data composition and tuning parameters to balance new -knowledge integration and general model capabilities. +The increasing reliance on Deep Learning models, combined with their inherent +lack of transparency, has spurred the development of a novel field of study +known as eXplainable AI (XAI) methods. These methods seek to enhance the trust +of end-users in automated systems by providing insights into the rationale +behind their decisions. This paper presents a novel approach for measuring user +trust in XAI systems, allowing their refinement. Our proposed metric combines +both performance metrics and trust indicators from an objective perspective. To +validate this novel methodology, we conducted a case study in a realistic +medical scenario: the usage of XAI system for the detection of pneumonia from +x-ray images. 
-摘要:大型語言模型 (LLM) 在許多任務上的表現受到預訓練期間學到的知識和儲存在模型參數中的知識的極大限制。低階適應 (LoRA) 是一種流行且有效的訓練技術,用於更新或 LLM 的特定領域適應。在這項研究中,我們探討如何使用 LoRA 將新事實納入 LLM,同時不損害先前學到的知識。我們使用不同數量的知識微調 Llama-3.1-8B-instruct。我們的實驗表明,當訓練資料包含已知和新事實的混合時,會獲得最佳結果。然而,這種方法仍然具有潛在的危害性,因為模型在外部問答基準上的表現會在這種微調後下降。當訓練資料偏向於某些實體時,模型傾向於回歸到少數過度表示的答案。此外,我們發現模型變得更有信心,並且在極少數情況下拒絕提供答案。這些發現突顯了基於 LoRA 的 LLM 更新的潛在缺點,並強調了訓練資料組成和調整參數以平衡新知識整合和一般模型能力的重要性。 +摘要:隨著對深度學習模型依賴性的增加,加上其固有的透明度不足,促使一個新的研究領域發展,稱為可解釋 AI (XAI) 方法。這些方法旨在透過深入了解決策背後的原理,來提升最終使用者對自動化系統的信賴。本文提出了一種衡量使用者對 XAI 系統信賴度的新穎方法,允許對其進行改進。我們提出的指標結合了客觀觀點下的效能指標和信賴指標。為了驗證這個新穎的方法,我們在一個真實的醫療場景中進行了一個案例研究:使用 XAI 系統從 X 光影像中偵測肺炎。 -##### **Towards a Perspectivist Turn in Argument Quality Assessment** -2502.14501v1 by Julia Romberg, Maximilian Maurer, Henning Wachsmuth, Gabriella Lapesa +##### **Region-specific Risk Quantification for Interpretable Prognosis of COVID-19** +2405.02815v1 by Zhusi Zhong, Jie Li, Zhuoqi Ma, Scott Collins, Harrison Bai, Paul Zhang, Terrance Healey, Xinbo Gao, Michael K. Atalay, Zhicheng Jiao -The assessment of argument quality depends on well-established logical, -rhetorical, and dialectical properties that are unavoidably subjective: -multiple valid assessments may exist, there is no unequivocal ground truth. -This aligns with recent paths in machine learning, which embrace the -co-existence of different perspectives. However, this potential remains largely -unexplored in NLP research on argument quality. One crucial reason seems to be -the yet unexplored availability of suitable datasets. We fill this gap by -conducting a systematic review of argument quality datasets. We assign them to -a multi-layered categorization targeting two aspects: (a) What has been -annotated: we collect the quality dimensions covered in datasets and -consolidate them in an overarching taxonomy, increasing dataset comparability -and interoperability. 
(b) Who annotated: we survey what information is given -about annotators, enabling perspectivist research and grounding our -recommendations for future actions. To this end, we discuss datasets suitable -for developing perspectivist models (i.e., those containing individual, -non-aggregated annotations), and we showcase the importance of a controlled -selection of annotators in a pilot study. +The COVID-19 pandemic has strained global public health, necessitating +accurate diagnosis and intervention to control disease spread and reduce +mortality rates. This paper introduces an interpretable deep survival +prediction model designed specifically for improved understanding and trust in +COVID-19 prognosis using chest X-ray (CXR) images. By integrating a large-scale +pretrained image encoder, Risk-specific Grad-CAM, and anatomical region +detection techniques, our approach produces regional interpretable outcomes +that effectively capture essential disease features while focusing on rare but +critical abnormal regions. Our model's predictive results provide enhanced +clarity and transparency through risk area localization, enabling clinicians to +make informed decisions regarding COVID-19 diagnosis with better understanding +of prognostic insights. We evaluate the proposed method on a multi-center +survival dataset and demonstrate its effectiveness via quantitative and +qualitative assessments, achieving superior C-indexes (0.764 and 0.727) and +time-dependent AUCs (0.799 and 0.691). These results suggest that our +explainable deep survival prediction model surpasses traditional survival +analysis methods in risk prediction, improving interpretability for clinical +decision making and enhancing AI system trustworthiness. 
-摘要:論證品質的評估取決於根深蒂固的邏輯、修辭和辯證屬性,這些屬性難免具有主觀性:可能存在多種有效的評估,沒有明確的真實依據。這與機器學習中最近的途徑一致,這些途徑接受了不同觀點的共存。然而,這種潛力在論證品質的 NLP 研究中仍然很大程度上未被探索。一個關鍵原因似乎是尚未探索合適的資料集的可用性。我們通過對論證品質資料集進行系統性回顧來填補這一空白。我們將它們分配到一個多層次分類,針對兩個方面:(a) 已註釋的內容:我們收集資料集中涵蓋的品質維度,並將它們整合到一個總體分類法中,提高資料集的可比性和互操作性。(b) 誰做了註釋:我們調查了關於註釋者的哪些資訊,使觀點主義研究成為可能,並為我們對未來行動的建議奠定基礎。為此,我們討論了適合開發觀點主義模型的資料集(即那些包含個別、非聚合註釋的資料集),並在試驗研究中展示了受控選擇註釋者的重要性。 +摘要:COVID-19 疫情對全球公共衛生造成壓力,必須進行準確的診斷和干預,以控制疾病傳播並降低死亡率。本文介紹了一個可解釋的深度生存預測模型,專門設計用於透過胸部 X 光 (CXR) 影像改善對 COVID-19 預後的理解和信賴。透過整合大規模預訓練影像編碼器、風險特定 Grad-CAM 和解剖區域偵測技術,我們的做法產生區域可解釋的結果,有效捕捉必要的疾病特徵,同時專注於罕見但關鍵的異常區域。我們的模型預測結果透過風險區域定位提供增強的清晰度和透明度,讓臨床醫生能夠在更了解預後見解的情況下,就 COVID-19 診斷做出明智的決策。我們在多中心生存資料集上評估所提出的方法,並透過量化和質化評估證明其有效性,達到優異的 C 指數(0.764 和 0.727)和時間相關 AUC(0.799 和 0.691)。這些結果表明,我們可解釋的深度生存預測模型在風險預測方面超越傳統的生存分析方法,提升臨床決策的解釋性,並增強 AI 系統的信賴度。 -##### **MLGym: A New Framework and Benchmark for Advancing AI Research Agents** -2502.14499v1 by Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, Roberta Raileanu +##### **Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics** +2405.02334v2 by Francesco Prinzi, Carmelo Militello, Calogero Zarcaro, Tommaso Vincenzo Bartolotta, Salvatore Gaglio, Salvatore Vitabile -We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for -evaluating and developing LLM agents on AI research tasks. This is the first -Gym environment for machine learning (ML) tasks, enabling research on -reinforcement learning (RL) algorithms for training such agents. MLGym-bench -consists of 13 diverse and open-ended AI research tasks from diverse domains -such as computer vision, natural language processing, reinforcement learning, -and game theory. 
Solving these tasks requires real-world AI research skills -such as generating new ideas and hypotheses, creating and processing data, -implementing ML methods, training models, running experiments, analyzing the -results, and iterating through this process to improve on a given task. We -evaluate a number of frontier large language models (LLMs) on our benchmarks -such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 -Pro. Our MLGym framework makes it easy to add new tasks, integrate and evaluate -models or agents, generate synthetic data at scale, as well as develop new -learning algorithms for training agents on AI research tasks. We find that -current frontier models can improve on the given baselines, usually by finding -better hyperparameters, but do not generate novel hypotheses, algorithms, -architectures, or substantial improvements. We open-source our framework and -benchmark to facilitate future research in advancing the AI research -capabilities of LLM agents. +In recent years, machine learning-based clinical decision support systems +(CDSS) have played a key role in the analysis of several medical conditions. +Despite their promising capabilities, the lack of transparency in AI models +poses significant challenges, particularly in medical contexts where +reliability is a mandatory aspect. However, it appears that explainability is +inversely proportional to accuracy. For this reason, achieving transparency +without compromising predictive accuracy remains a key challenge. This paper +presents a novel method, namely Rad4XCNN, to enhance the predictive power of +CNN-derived features with the inherent interpretability of radiomic features. +Rad4XCNN diverges from conventional methods based on saliency maps, by +associating intelligible meaning to CNN-derived features by means of Radiomics, +offering new perspectives on explanation methods beyond visualization maps. 
+Using a breast cancer classification task as a case study, we evaluated +Rad4XCNN on ultrasound imaging datasets, including an online dataset and two +in-house datasets for internal and external validation. Some key results are: +i) CNN-derived features guarantee more robust accuracy when compared against +ViT-derived and radiomic features; ii) conventional visualization map methods +for explanation present several pitfalls; iii) Rad4XCNN does not sacrifice +model accuracy for their explainability; iv) Rad4XCNN provides a global +explanation enabling the physician to extract global insights and findings. Our +method can mitigate some concerns related to the explainability-accuracy +trade-off. This study highlighted the importance of proposing new methods for +model explanation without affecting their accuracy. + +摘要:近年来,基于机器学习的临床决策支持系统 (CDSS) 在多种疾病的分析中扮演了关键角色。尽管它们具有广阔的前景,但 AI 模型缺乏透明度,尤其在医疗领域,可靠性是强制性方面,这带来了重大挑战。然而,解释性似乎与准确性成反比。因此,在不影响预测准确性的情况下实现透明度仍然是一个关键挑战。本文提出了一种新方法,即 Rad4XCNN,以通过放射组学的内在可解释性来增强 CNN 衍生特征的预测能力。Rad4XCNN 通过放射组学将可理解的含义与 CNN 衍生特征关联起来,从而偏离了基于显着性图的传统方法,为超越可视化图的解释方法提供了新的视角。使用乳腺癌分类任务作为案例研究,我们在超声成像数据集上评估了 Rad4XCNN,包括一个在线数据集和两个用于内部和外部验证的内部数据集。一些关键结果是:i) 与 ViT 衍生和放射组学特征相比,CNN 衍生特征保证了更稳健的准确性;ii) 用于解释的传统可视化图方法存在一些缺陷;iii) Rad4XCNN 不会为了可解释性而牺牲模型准确性;iv) Rad4XCNN 提供全局解释,使医生能够提取全局见解和发现。我们的方法可以减轻一些与可解释性-准确性权衡相关的担忧。本研究强调了提出新方法来解释模型而不影响其准确性的重要性。 + +##### **Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability** +2404.16957v1 by Yunfei Ge, Quanyan Zhu + +The pervasive integration of Artificial Intelligence (AI) has introduced +complex challenges in the responsibility and accountability in the event of +incidents involving AI-enabled systems. The interconnectivity of these systems, +ethical concerns of AI-induced incidents, coupled with uncertainties in AI +technology and the absence of corresponding regulations, have made traditional +responsibility attribution challenging. 
To this end, this work proposes a +Computational Reflective Equilibrium (CRE) approach to establish a coherent and +ethically acceptable responsibility attribution framework for all stakeholders. +The computational approach provides a structured analysis that overcomes the +limitations of conceptual approaches in dealing with dynamic and multifaceted +scenarios, showcasing the framework's explainability, coherence, and adaptivity +properties in the responsibility attribution process. We examine the pivotal +role of the initial activation level associated with claims in equilibrium +computation. Using an AI-assisted medical decision-support system as a case +study, we illustrate how different initializations lead to diverse +responsibility distributions. The framework offers valuable insights into +accountability in AI-induced incidents, facilitating the development of a +sustainable and resilient system through continuous monitoring, revision, and +reflection. -摘要:我們推出 Meta MLGym 和 MLGym-Bench,一個用於評估和開發 AI 研究任務中 LLM 代理的新架構和基準。這是第一個用於機器學習 (ML) 任務的 Gym 環境,可針對訓練此類代理的強化學習 (RL) 演算法進行研究。MLGym-bench 包含 13 項來自不同領域的開放式 AI 研究任務,例如電腦視覺、自然語言處理、強化學習和博弈論。解決這些任務需要實際的 AI 研究技能,例如產生新想法和假設、建立和處理資料、實作 ML 方法、訓練模型、執行實驗、分析結果,並透過此流程反覆運算來改善特定任務。我們在基準上評估許多前沿大型語言模型 (LLM),例如 Claude-3.5-Sonnet、Llama-3.1 405B、GPT-4o、o1-preview 和 Gemini-1.5 Pro。我們的 MLGym 架構讓新增任務、整合和評估模型或代理、大規模產生合成資料,以及開發新的學習演算法以訓練 AI 研究任務中的代理變得容易。我們發現目前的邊界模型可以改善既定的基準,通常是透過尋找更好的超參數,但不會產生新穎的假設、演算法、架構或實質性的改進。我們開放原始碼架構和基準,以促進未來在提升 LLM 代理的 AI 研究能力方面的研究。 +摘要:隨著人工智慧 (AI) 的普及整合,在涉及 AI 驅動系統的事故中,責任和義務歸屬產生了複雜的挑戰。這些系統的互連性、AI 引發事故的倫理問題,加上 AI 技術的不確定性和缺乏相應法規,使得傳統責任歸屬面臨挑戰。為此,本研究提出了一種計算反思均衡 (CRE) 方法,以建立一個連貫且在倫理上可接受的責任歸屬架構,適用於所有利害關係人。計算方法提供了結構化的分析,克服了概念方法在處理動態且多面向情境時的限制,展示了該架構在責任歸屬過程中具備的可解釋性、連貫性和適應性。我們探討了與均衡計算中索賠相關的初始啟動層級的關鍵作用。我們以 AI 輔助醫療決策支援系統為案例研究,說明不同的初始化如何導致不同的責任分配。該架構提供了對 AI 引發事故中問責制的寶貴見解,透過持續監控、修訂和反思,促進了永續且有韌性的系統發展。 -##### **Stories that (are) Move(d by) Markets: A Causal Exploration of Market Shocks and Semantic Shifts across Different 
Partisan Groups** -2502.14497v1 by Felix Drinkall, Stefan Zohren, Michael McMahon, Janet B. Pierrehumbert +##### **Explainable AI for Fair Sepsis Mortality Predictive Model** +2404.13139v1 by Chia-Hsuan Chang, Xiaoyang Wang, Christopher C. Yang -Macroeconomic fluctuations and the narratives that shape them form a mutually -reinforcing cycle: public discourse can spur behavioural changes leading to -economic shifts, which then result in changes in the stories that propagate. We -show that shifts in semantic embedding space can be causally linked to -financial market shocks -- deviations from the expected market behaviour. -Furthermore, we show how partisanship can influence the predictive power of -text for market fluctuations and shape reactions to those same shocks. We also -provide some evidence that text-based signals are particularly salient during -unexpected events such as COVID-19, highlighting the value of language data as -an exogenous variable in economic forecasting. Our findings underscore the -bidirectional relationship between news outlets and market shocks, offering a -novel empirical approach to studying their effect on each other. +Artificial intelligence supports healthcare professionals with predictive +modeling, greatly transforming clinical decision-making. This study addresses +the crucial need for fairness and explainability in AI applications within +healthcare to ensure equitable outcomes across diverse patient demographics. By +focusing on the predictive modeling of sepsis-related mortality, we propose a +method that learns a performance-optimized predictive model and then employs +the transfer learning process to produce a model with better fairness. Our +method also introduces a novel permutation-based feature importance algorithm +aiming at elucidating the contribution of each feature in enhancing fairness on +predictions. 
Unlike existing explainability methods concentrating on explaining +feature contribution to predictive performance, our proposed method uniquely +bridges the gap in understanding how each feature contributes to fairness. This +advancement is pivotal, given sepsis's significant mortality rate and its role +in one-third of hospital deaths. Our method not only aids in identifying and +mitigating biases within the predictive model but also fosters trust among +healthcare stakeholders by improving the transparency and fairness of model +predictions, thereby contributing to more equitable and trustworthy healthcare +delivery. -摘要:宏觀經濟波動與形塑它們的敘事形成一個相互強化的循環:公共論述可能激發導致經濟變化的行為改變,進而導致宣傳故事的改變。我們表明,語義嵌入空間的轉變可能與金融市場震盪(與預期的市場行為的偏差)有因果關係。此外,我們展示了黨派立場如何影響文字對市場波動的預測能力,以及如何形塑對這些震盪的反應。我們還提供了一些證據,證明在 COVID-19 等意外事件期間,基於文字的信號特別顯著,突顯了語言資料在經濟預測中作為外生變數的價值。我們的研究結果強調了新聞媒體與市場震盪之間的雙向關係,提供了一種研究它們對彼此影響的新穎實證方法。 +摘要:人工智慧透過預測模型協助醫療專業人員,大幅轉變了臨床決策制定。本研究探討了在醫療保健中使用人工智慧應用程式時公平性和可解釋性的關鍵需求,以確保在不同的患者人口統計資料中獲得公平的結果。透過專注於敗血症相關死亡率的預測模型,我們提出了一種方法,該方法會學習一個效能最佳化的預測模型,然後採用轉移學習過程來產生一個具有更好公平性的模型。我們的模型還引入了一種新穎的基於排列的特徵重要性演算法,旨在闡明每個特徵在增強預測公平性方面的貢獻。與現有的可解釋性方法專注於解釋特徵對預測效能的貢獻不同,我們提出的方法獨特地彌補了理解每個特徵如何有助於公平性的差距。這項進展至關重要,因為敗血症的死亡率很高,且在三分之一的醫院死亡中扮演著角色。我們的模型不僅有助於識別和減輕預測模型中的偏差,還能透過提高模型預測的透明度和公平性來培養醫療保健利益相關者之間的信任,進而有助於提供更公平且值得信賴的醫療保健服務。 -##### **Enhancing Language Multi-Agent Learning with Multi-Agent Credit Re-Assignment for Interactive Environment Generalization** -2502.14496v1 by Zhitao He, Zijun Liu, Peng Li, May Fung, Ming Yan, Ji Zhang, Fei Huang, Yang Liu +##### **Multi Class Depression Detection Through Tweets using Artificial Intelligence** +2404.13104v1 by Muhammad Osama Nusrat, Waseem Shahzad, Saad Ahmed Jamal -LLM-based agents have made significant advancements in interactive -environments, such as mobile operations and web browsing, and other domains -beyond computer using. 
Current multi-agent systems universally excel in -performance, compared to single agents, but struggle with generalization across -environments due to predefined roles and inadequate strategies for generalizing -language agents. The challenge of achieving both strong performance and good -generalization has hindered the progress of multi-agent systems for interactive -environments. To address these issues, we propose CollabUIAgents, a multi-agent -reinforcement learning framework with a novel multi-agent credit re-assignment -(CR) strategy, assigning process rewards with LLMs rather than -environment-specific rewards and learning with synthesized preference data, in -order to foster generalizable, collaborative behaviors among the role-free -agents' policies. Empirical results show that our framework improves both -performance and cross-environment generalizability of multi-agent systems. -Moreover, our 7B-parameter system achieves results on par with or exceed strong -closed-source models, and the LLM that guides the CR. We also provide insights -in using granular CR rewards effectively for environment generalization, and -accommodating trained LLMs in multi-agent systems. +Depression is a significant issue nowadays. As per the World Health +Organization (WHO), in 2023, over 280 million individuals are grappling with +depression. This is a huge number; if not taken seriously, these numbers will +increase rapidly. About 4.89 billion individuals are social media users. People +express their feelings and emotions on platforms like Twitter, Facebook, +Reddit, Instagram, etc. These platforms contain valuable information which can +be used for research purposes. Considerable research has been conducted across +various social media platforms. However, certain limitations persist in these +endeavors. Particularly, previous studies were only focused on detecting +depression and the intensity of depression in tweets. Also, there existed +inaccuracies in dataset labeling. 
In this research work, five types of +depression (Bipolar, major, psychotic, atypical, and postpartum) were predicted +using tweets from the Twitter database based on lexicon labeling. Explainable +AI was used to provide reasoning by highlighting the parts of tweets that +represent type of depression. Bidirectional Encoder Representations from +Transformers (BERT) was used for feature extraction and training. Machine +learning and deep learning methodologies were used to train the model. The BERT +model presented the most promising results, achieving an overall accuracy of +0.96. -摘要:基於 LLM 的代理在互動式環境中取得重大進展,例如行動運算和網頁瀏覽,以及電腦使用以外的其他領域。與單一代理相比,目前的 Multi-Agent 系統在效能上普遍表現出色,但由於預先定義的角色和不適當的語言代理概化策略,導致難以跨環境概化。在互動式環境中,同時達成強大效能和良好概化的挑戰,阻礙了 Multi-Agent 系統的進展。為了解決這些問題,我們提出 CollabUIAgents,這是一個 Multi-Agent 強化學習架構,具備創新的 Multi-Agent 信用重新分配 (CR) 策略,使用 LLM 而不是特定於環境的獎勵來分配程序獎勵,並透過綜合偏好資料進行學習,以促進無角色代理政策之間可概化的協作行為。經驗結果顯示,我們的架構同時改善了 Multi-Agent 系統的效能和跨環境概化能力。此外,我們的 7B 參數系統在效能上與強大的閉源模型和引導 CR 的 LLM 相當或超越它們。我們也提供見解,說明如何有效地使用細粒化的 CR 獎勵來進行環境概化,以及如何在 Multi-Agent 系統中容納受過訓練的 LLM。 +摘要:現今,憂鬱症是一個重要的議題。根據世界衛生組織 (WHO) 的資料,在 2023 年,超過 2.8 億人正在與憂鬱症搏鬥。這是一個龐大的數字;如果不認真看待,這些數字將會快速增加。大約有 48.9 億人是社群媒體使用者。人們在 Twitter、Facebook、Reddit、Instagram 等平台上表達自己的感受和情緒。這些平台包含有價值的資訊,可用於研究目的。已經在各種社群媒體平台上進行了大量的研究。然而,這些努力仍存在某些限制。特別是,先前的研究僅專注於偵測推文中的憂鬱症和憂鬱症的強度。此外,資料集標籤中存在不準確的情況。在這項研究工作中,使用基於詞彙標籤的 Twitter 資料庫中的推文預測了五種類型的憂鬱症(雙極型、重度、精神病型、非典型和產後)。可解釋的 AI 用於透過強調代表憂鬱症類型的推文部分來提供推理。從 Transformers(BERT)中提取的雙向編碼器表示用於特徵提取和訓練。機器學習和深度學習方法用於訓練模型。BERT 模型呈現出最有希望的結果,達到 0.96 的整體準確度。 -##### **StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following** -2502.14494v1 by Jinnan Li, Jinzhe Li, Yue Wang, Yi Chang, Yuan Wu +##### **COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images** +2404.12832v2 by Dmytro Shvetsov, Joonas Ariva, Marharyta Domnich, Raul Vicente, Dmytro Fishman -Multi-turn instruction following capability constitutes a core competency of -large language models (LLMs) 
in real-world applications. Existing evaluation -benchmarks predominantly focus on fine-grained constraint satisfaction and -domain-specific capability assessment, yet overlook the crucial structural -dependency between dialogue turns that distinguishes multi-turn from -single-turn interactions. This structural dependency not only reflects user -intent but also establishes a second dimension for instruction following -evaluation beyond constraint satisfaction. To address this gap, we propose -StructFlowBench, a multi-turn instruction following benchmark with structural -flow modeling. The benchmark innovatively defines a structural flow framework -comprising six fundamental inter-turn relationships, which not only introduces -novel structural constraints for model evaluation but also serves as generation -parameters for creating customized dialogue flows tailored to specific -scenarios. Adopting established LLM-based automatic evaluation methodologies, -we conduct systematic evaluations of 13 leading open-source and closed-source -LLMs. Experimental results reveal significant deficiencies in current models' -comprehension of multi-turn dialogue structures. The code is available at -\url{https://github.com/MLGroupJLU/StructFlowBench}. +Deep learning is dramatically transforming the field of medical imaging and +radiology, enabling the identification of pathologies in medical images, +including computed tomography (CT) and X-ray scans. However, the performance of +deep learning models, particularly in segmentation tasks, is often limited by +the need for extensive annotated datasets. To address this challenge, the +capabilities of weakly supervised semantic segmentation are explored through +the lens of Explainable AI and the generation of counterfactual explanations. +The scope of this research is development of a novel counterfactual inpainting +approach (COIN) that flips the predicted classification label from abnormal to +normal by using a generative model. 
For instance, if the classifier deems an +input medical image X as abnormal, indicating the presence of a pathology, the +generative model aims to inpaint the abnormal region, thus reversing the +classifier's original prediction label. The approach enables us to produce +precise segmentations for pathologies without depending on pre-existing +segmentation masks. Crucially, image-level labels are utilized, which are +substantially easier to acquire than creating detailed segmentation masks. The +effectiveness of the method is demonstrated by segmenting synthetic targets and +actual kidney tumors from CT images acquired from Tartu University Hospital in +Estonia. The findings indicate that COIN greatly surpasses established +attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an +alternative counterfactual explanation method introduced by Singla et al. This +evidence suggests that COIN is a promising approach for semantic segmentation +of tumors in CT images, and presents a step forward in making deep learning +applications more accessible and effective in healthcare, where annotated data +is scarce. 
-摘要:多輪指令遵循能力構成大型語言模型 (LLM) 在現實世界應用中的核心能力。現有的評估基準主要專注於細粒度的約束滿足和特定領域的能力評估,卻忽略了多輪與單輪互動之間區別的關鍵結構依賴性。這種結構依賴性不僅反映了使用者的意圖,也為指令遵循評估建立了超越約束滿足的第二個維度。為了解決這個差距,我們提出了 StructFlowBench,一個具有結構流建模的多輪指令遵循基準。該基準創新地定義了一個結構流框架,包含六個基本的回合間關係,這不僅引入了模型評估的新結構約束,還可用作生成參數,用於創建針對特定場景定制的對話流。採用已建立的基於 LLM 的自動評估方法,我們對 13 個領先的開源和閉源 LLM 進行了系統評估。實驗結果揭示了當前模型在理解多輪對話結構方面存在顯著缺陷。程式碼可在 \url{https://github.com/MLGroupJLU/StructFlowBench} 取得。 +摘要:深度学习正大幅轉變醫學影像和放射線學領域,能辨識醫學影像中的病理,包括電腦斷層掃描 (CT) 和 X 光掃描。然而,深度學習模型的效能,特別是在分割任務中,常常受到廣泛註解資料集需求的限制。為了應對此挑戰,透過可解釋 AI 和反事實解釋的產生,探索弱監督語意分割的能力。本研究的範圍是開發一種新的反事實內插方法 (COIN),該方法使用生成模型將預測的分類標籤從異常翻轉為正常。例如,如果分類器將輸入的醫學影像 X 視為異常,表示存在病理,則生成模型旨在內插異常區域,從而逆轉分類器的原始預測標籤。此方法使我們能夠產生病理的精確分割,而無需依賴於預先存在的分割遮罩。至關重要的是,利用影像層級標籤,這比建立詳細的分割遮罩容易取得。該方法的有效性透過分割合成目標和從愛沙尼亞塔爾圖大學醫院取得的 CT 影像中的實際腎臟腫瘤來證明。研究結果表明,COIN 遠遠超過已建立的歸因方法,例如 RISE、ScoreCAM 和 LayerCAM,以及 Singla 等人提出的另一種反事實解釋方法。此證據表明,COIN 是一種很有前途的 CT 影像中腫瘤語意分割方法,並在醫療保健中讓深度學習應用更易於取得和更有效率邁進一步,其中註解資料很稀少。 -##### **Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk** -2502.14491v1 by Elija Perrier +##### **Hybrid Intelligence for Digital Humanities** +2406.15374v1 by Victor de Boer, Lise Stork -Evaluating AI safety requires statistically rigorous methods and risk metrics -for understanding how the use of AI affects aggregated risk. However, much AI -safety literature focuses upon risks arising from AI models in isolation, -lacking consideration of how modular use of AI affects risk distribution of -workflow components or overall risk metrics. There is also a lack of -statistical grounding enabling sensitisation of risk models in the presence of -absence of AI to estimate causal contributions of AI. This is in part due to -the dearth of AI impact data upon which to fit distributions. In this work, we -address these gaps in two ways. First, we demonstrate how scenario modelling -(grounded in established statistical techniques such as Markov chains, copulas -and Monte Carlo simulation) can be used to model AI risk holistically. 
Second, -we show how lookalike distributions from phenomena analogous to AI can be used -to estimate AI impacts in the absence of directly observable data. We -demonstrate the utility of our methods for benchmarking cumulative AI risk via -risk analysis of a logistic scenario simulations. +In this paper, we explore the synergies between Digital Humanities (DH) as a +discipline and Hybrid Intelligence (HI) as a research paradigm. In DH research, +the use of digital methods and specifically that of Artificial Intelligence is +subject to a set of requirements and constraints. We argue that these are +well-supported by the capabilities and goals of HI. Our contribution includes +the identification of five such DH requirements: Successful AI systems need to +be able to 1) collaborate with the (human) scholar; 2) support data criticism; +3) support tool criticism; 4) be aware of and cater to various perspectives and +5) support distant and close reading. We take the CARE principles of Hybrid +Intelligence (collaborative, adaptive, responsible and explainable) as +theoretical framework and map these to the DH requirements. In this mapping, we +include example research projects. We finally address how insights from DH can +be applied to HI and discuss open challenges for the combination of the two +disciplines. 
-摘要:評估 AI 安全性需要嚴格的統計方法和風險指標,以了解 AI 的使用如何影響累積風險。然而,許多 AI 安全性文獻著重於 AI 模型孤立產生的風險,缺乏考量 AI 的模組化使用如何影響工作流程組件的風險分佈或整體風險指標。在有或沒有 AI 的情況下,統計基礎也缺乏讓風險模型敏感化的能力,以估計 AI 的因果關係貢獻。這部分是因為缺乏 AI 影響資料來擬合分佈。在這項研究中,我們以兩種方式解決這些差距。首先,我們展示情境建模(建立在已建立的統計技術上,例如馬可夫鏈、copula 和蒙地卡羅模擬)如何用於整體建模 AI 風險。其次,我們展示如何使用類似於 AI 現象的相似分佈來估計在沒有直接可觀察資料的情況下 AI 的影響。我們透過後勤情境模擬的風險分析,展示了我們的方法對於評量累積 AI 風險的效用。 +摘要:在本文中,我們探討數位人文學科 (DH) 作為一門學科與混合智能 (HI) 作為一個研究典範之間的協同作用。在 DH 研究中,數位方法的使用,特別是人工智慧的使用,受到一系列要求和限制。我們認為這些要求和限制獲得 HI 的能力和目標的充分支持。我們的貢獻包括找出五個這樣的 DH 要求:成功的 AI 系統需要能夠 1) 與(人類)學者合作;2) 支援資料批評;3) 支援工具批評;4) 察覺並迎合各種觀點;5) 支援遠距和近距離閱讀。我們將混合智能的 CARE 原則(協作、適應、負責和可解釋)作為理論架構,並將這些原則對應到 DH 要求。在此對應中,我們納入範例研究專案。最後,我們探討如何將 DH 的見解應用於 HI,並討論結合這兩個學科的開放挑戰。 -##### **Temporal Misalignment and Probabilistic Neurons** -2502.14487v1 by Velibor Bojković, Xiaofeng Wu, Bin Gu +##### **Ethical Framework for Responsible Foundational Models in Medical Imaging** +2406.11868v1 by Abhijit Das, Debesh Jha, Jasmer Sanjotra, Onkar Susladkar, Suramyaa Sarkar, Ashish Rauniyar, Nikhil Tomar, Vanshali Sharma, Ulas Bagci -Spiking Neural Networks (SNNs) offer a more energy-efficient alternative to -Artificial Neural Networks (ANNs) by mimicking biological neural principles, -establishing them as a promising approach to mitigate the increasing energy -demands of large-scale neural models. However, fully harnessing the -capabilities of SNNs remains challenging due to their discrete signal -processing and temporal dynamics. ANN-SNN conversion has emerged as a practical -approach, enabling SNNs to achieve competitive performance on complex machine -learning tasks. In this work, we identify a phenomenon in the ANN-SNN -conversion framework, termed temporal misalignment, in which random spike -rearrangement across SNN layers leads to performance improvements. Based on -this observation, we introduce biologically plausible two-phase probabilistic -(TPP) spiking neurons, further enhancing the conversion process. 
We demonstrate -the advantages of our proposed method both theoretically and empirically -through comprehensive experiments on CIFAR-10/100, CIFAR10-DVS, and ImageNet -across a variety of architectures, achieving state-of-the-art results. +Foundational models (FMs) have tremendous potential to revolutionize medical +imaging. However, their deployment in real-world clinical settings demands +extensive ethical considerations. This paper aims to highlight the ethical +concerns related to FMs and propose a framework to guide their responsible +development and implementation within medicine. We meticulously examine ethical +issues such as privacy of patient data, bias mitigation, algorithmic +transparency, explainability and accountability. The proposed framework is +designed to prioritize patient welfare, mitigate potential risks, and foster +trust in AI-assisted healthcare. -摘要:脈衝神經網路 (SNN) 模仿生物神經原理,提供了一種比人工神經網路 (ANN) 更省能的替代方案,確立了它們作為緩解大型神經模型日益增長能耗需求的一種有前途的方法。然而,由於 SNN 的離散訊號處理和時間動態,要充分利用 SNN 的功能仍然具有挑戰性。ANN-SNN 轉換已經成為一種實用的方法,使 SNN 能夠在複雜機器學習任務中實現競爭性能。在這項工作中,我們在 ANN-SNN 轉換框架中發現了一種現象,稱為時間錯位,其中隨機脈衝在 SNN 層之間重新排列會導致性能提升。基於這一觀察,我們引入了生物學上合理的兩階段機率 (TPP) 脈衝神經元,進一步增強了轉換過程。我們通過在 CIFAR-10/100、CIFAR10-DVS 和 ImageNet 上對各種架構進行綜合實驗,從理論和經驗上證明了我們提出的方法的優點,取得了最先進的結果。 +摘要:基礎模型 (FM) 具有徹底改變醫學影像的巨大潛力。然而,它們在現實世界臨床環境中的部署需要廣泛的倫理考量。本文旨在強調與 FM 相關的倫理問題,並提出一個框架來指導它們在醫學中的負責任開發和實施。我們仔細審查了倫理問題,例如患者數據隱私、偏差緩解、演算法透明度、可解釋性和問責制。所提出的框架旨在優先考慮患者福利、減輕潛在風險,並培養對 AI 輔助醫療保健的信任。 -##### **How Jailbreak Defenses Work and Ensemble? 
A Mechanistic Investigation** -2502.14486v1 by Zhuohang Long, Siyuan Wang, Shujun Liu, Yuhang Lai, Xuanjing Huang, Zhongyu Wei +##### **Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis** +2404.07239v1 by Milad Yousefi, Shadi Farabi Maleki, Ali Jafarizadeh, Mahya Ahmadpour Youshanlui, Aida Jafari, Siamak Pedrammehr, Roohallah Alizadehsani, Ryszard Tadeusiewicz, Pawel Plawiak -Jailbreak attacks, where harmful prompts bypass generative models' built-in -safety, raise serious concerns about model vulnerability. While many defense -methods have been proposed, the trade-offs between safety and helpfulness, and -their application to Large Vision-Language Models (LVLMs), are not well -understood. This paper systematically examines jailbreak defenses by reframing -the standard generation task as a binary classification problem to assess model -refusal tendencies for both harmful and benign queries. We identify two key -defense mechanisms: safety shift, which increases refusal rates across all -queries, and harmfulness discrimination, which improves the model's ability to -distinguish between harmful and benign inputs. Using these mechanisms, we -develop two ensemble defense strategies-inter-mechanism ensembles and -intra-mechanism ensembles-to balance safety and helpfulness. Experiments on the -MM-SafetyBench and MOSSBench datasets with LLaVA-1.5 models show that these -strategies effectively improve model safety or optimize the trade-off between -safety and helpfulness. +Thyroid cancer is an increasing global health concern that requires advanced +diagnostic methods. The application of AI and radiomics to thyroid cancer +diagnosis is examined in this review. A review of multiple databases was +conducted in compliance with PRISMA guidelines until October 2023. A +combination of keywords led to the discovery of an English academic publication +on thyroid cancer and related subjects. 
267 papers were returned from the +original search after 109 duplicates were removed. Relevant studies were +selected according to predetermined criteria after 124 articles were eliminated +based on an examination of their abstract and title. After the comprehensive +analysis, an additional six studies were excluded. Among the 28 included +studies, radiomics analysis, which incorporates ultrasound (US) images, +demonstrated its effectiveness in diagnosing thyroid cancer. Various results +were noted, some of the studies presenting new strategies that outperformed the +status quo. The literature has emphasized various challenges faced by AI +models, including interpretability issues, dataset constraints, and operator +dependence. The synthesized findings of the 28 included studies mentioned the +need for standardization efforts and prospective multicenter studies to address +these concerns. Furthermore, approaches to overcome these obstacles were +identified, such as advances in explainable AI technology and personalized +medicine techniques. The review focuses on how AI and radiomics could transform +the diagnosis and treatment of thyroid cancer. Despite challenges, future +research on multidisciplinary cooperation, clinical applicability validation, +and algorithm improvement holds the potential to improve patient outcomes and +diagnostic precision in the treatment of thyroid cancer. 
-摘要:越獄攻擊,其中有害提示繞過生成模型內建的安全機制,引發了對模型漏洞的嚴重疑慮。雖然已提出許多防禦方法,但安全性與有益性之間的取捨,以及它們在大型視覺語言模型 (LVLMs) 中的應用,尚未得到充分理解。本文透過將標準生成任務重新定義為二元分類問題,系統性地檢視越獄防禦,以評估模型對有害和良性查詢的拒絕傾向。我們找出兩種關鍵的防禦機制:安全轉移,這會提高所有查詢的拒絕率,以及危害區分,這會提升模型區分有害和良性輸入的能力。使用這些機制,我們開發出兩種整體防禦策略,機制間整體和機制內整體,以平衡安全性與有益性。在使用 LLaVA-1.5 模型的 MM-SafetyBench 和 MOSSBench 資料集上進行的實驗顯示,這些策略有效地提升了模型安全性,或最佳化了安全性與有益性之間的取捨。 +摘要:甲狀腺癌是一種日益嚴重的全球健康問題,需要先進的診斷方法。本篇評論探討了人工智能與放射特徵分析在甲狀腺癌診斷中的應用。在符合 PRISMA 指南的情況下,對多個資料庫進行了回顧,直到 2023 年 10 月。通過結合關鍵字,發現了一篇關於甲狀腺癌和相關主題的英文學術出版物。在移除 109 篇重複文獻後,原始搜尋共回傳 267 篇論文。在根據預先確定的標準,淘汰了 124 篇文章的摘要和標題後,選出了相關研究。在進行全面分析後,額外排除了六項研究。在納入的 28 項研究中,結合超音波 (US) 影像的放射特徵分析,證明了其在診斷甲狀腺癌方面的有效性。研究結果不一,有些研究提出了優於現狀的新策略。文獻強調了人工智能模型面臨的各種挑戰,包括可解釋性問題、資料集限制和操作員依賴性。28 項納入研究的綜合發現提到,需要標準化工作和前瞻性多中心研究來解決這些問題。此外,還確定了克服這些障礙的方法,例如可解釋人工智能技術和個人化醫療技術的進步。本篇評論重點探討了人工智能和放射特徵分析如何轉變甲狀腺癌的診斷和治療。儘管存在挑戰,但未來對多學科合作、臨床適用性驗證和演算法改進的研究,仍有潛力改善甲狀腺癌治療中的患者預後和診斷精準度。 -##### **NLoRA: Nyström-Initiated Low-Rank Adaptation for Large Language Models** -2502.14482v1 by Chenlu Guo, Yuan Wu, Yi Chang +##### **Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI** +2404.04686v1 by Taminul Islam, Md. Alif Sheakh, Mst. Sazia Tahosin, Most. Hasna Hena, Shopnil Akash, Yousef A. Bin Jardan, Gezahign Fentahun Wondmie, Hiba-Allah Nafidi, Mohammed Bourhia -Parameter-efficient fine-tuning (PEFT) is essential for adapting large -language models (LLMs), with low-rank adaptation (LoRA) being the most popular -approach. However, LoRA suffers from slow convergence, and some recent LoRA -variants, such as PiSSA, primarily rely on Singular Value Decomposition (SVD) -for initialization, leading to expensive computation. To mitigate these -problems, we use the Nystr\"om method, which follows a three-matrix -manipulation. We first introduce StructuredLoRA (SLoRA), which investigates -adding a small intermediate matrix between the low-rank matrices A and B. 
-Secondly, we propose Nystr\"omLoRA (NLoRA), which leverages Nystr\"om-based -initialization for SLoRA to improve its effectiveness and efficiency. Finally, -we propose IntermediateTune (IntTune), which explores fine-tuning exclusively -on the intermediate matrix of NLoRA to further boost LLM efficiency. We -evaluate our methods on five natural language generation (NLG) tasks and eight -natural language understanding (NLU) tasks. On GSM8K, SLoRA and NLoRA achieve -accuracies of 56.48% and 57.70%, surpassing LoRA by 33.52% and 36.41%, with -only 3.67 million additional trainable parameters. IntTune improves average NLG -performance over LoRA by 7.45% while using only 1.25% of its parameters. These -results demonstrate the efficiency and effectiveness of our approach in -enhancing model performance with minimal parameter overhead. +Breast cancer has rapidly increased in prevalence in recent years, making it +one of the leading causes of mortality worldwide. Among all cancers, it is by +far the most common. Diagnosing this illness manually requires significant time +and expertise. Since detecting breast cancer is a time-consuming process, +preventing its further spread can be aided by creating machine-based forecasts. +Machine learning and Explainable AI are crucial in classification as they not +only provide accurate predictions but also offer insights into how the model +arrives at its decisions, aiding in the understanding and trustworthiness of +the classification results. In this study, we evaluate and compare the +classification accuracy, precision, recall, and F-1 scores of five different +machine learning methods using a primary dataset (500 patients from Dhaka +Medical College Hospital). Five different supervised machine learning +techniques, including decision tree, random forest, logistic regression, naive +bayes, and XGBoost, have been used to achieve optimal results on our dataset. 
+Additionally, this study applied SHAP analysis to the XGBoost model to +interpret the model's predictions and understand the impact of each feature on +the model's output. We compared the accuracy with which several algorithms +classified the data, as well as contrasted with other literature in this field. +After final evaluation, this study found that XGBoost achieved the best model +accuracy, which is 97%. -摘要:參數高效微調 (PEFT) 對於調整大型語言模型 (LLM) 至關重要,其中低秩調整 (LoRA) 是最受歡迎的方法。然而,LoRA 存在收斂速度慢的問題,而一些最近的 LoRA 變體,例如 PiSSA,主要依賴奇異值分解 (SVD) 進行初始化,導致運算成本高昂。為了減輕這些問題,我們使用了 Nystr\"om 方法,它遵循三矩陣操作。我們首先介紹 StructuredLoRA (SLoRA),它研究在低秩矩陣 A 和 B 之間添加一個小的中間矩陣。其次,我們提出了 Nystr\"omLoRA (NLoRA),它利用基於 Nystr\"om 的初始化方法為 SLoRA 提升其有效性和效率。最後,我們提出了 IntermediateTune (IntTune),它探討了僅對 NLoRA 的中間矩陣進行微調,以進一步提升 LLM 效率。我們在五項自然語言生成 (NLG) 任務和八項自然語言理解 (NLU) 任務上評估了我們的這些方法。在 GSM8K 上,SLoRA 和 NLoRA 分別達到了 56.48% 和 57.70% 的準確率,比 LoRA 高出 33.52% 和 36.41%,而僅增加了 367 萬個可訓練參數。IntTune 在僅使用 LoRA 1.25% 的參數的情況下,將平均 NLG 效能提升了 7.45%。這些結果證明了我們的方法在以最少的參數開銷提升模型效能方面的效率和有效性。 +摘要:近年來,乳癌的盛行率迅速增加,使其成為全球主要的死亡原因之一。在所有癌症中,乳癌迄今為止是最常見的。手動診斷此疾病需要大量的時間和專業知識。由於乳癌的檢測過程耗時,因此透過建立機器學習模型來預測,有助於防止其進一步擴散。機器學習和可解釋 AI 在分類中至關重要,因為它們不僅可以提供準確的預測,還可以深入了解模型如何做出決策,有助於理解和信賴分類結果。在此研究中,我們評估並比較了五種不同的機器學習方法的分類準確度、精確度、召回率和 F1 分數,使用了一個主要的資料集(達卡醫學院醫院的 500 名患者)。五種不同的監督式機器學習技術,包括決策樹、隨機森林、邏輯迴歸、朴素貝氏和 XGBoost,已用於在我們的資料集上取得最佳結果。此外,本研究將 SHAP 分析應用於 XGBoost 模型,以解釋模型的預測並了解每個特徵對模型輸出的影響。我們比較了幾種演算法對資料進行分類的準確度,並與該領域的其他文獻進行對比。在最後評估後,本研究發現 XGBoost 達到了最佳的模型準確度,為 97%。 -##### **Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression** -2502.14477v1 by Haoyu Wang, Tong Teng, Tianyu Guo, An Xiao, Duyu Tang, Hanting Chen, Yunhe Wang +##### **Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI** +2404.03892v3 by Maryam Ahmed, Tooba Bibi, Rizwan Ahmed Khan, Sidra Nasir -Handling long-context sequences efficiently remains a significant challenge -in large language models 
(LLMs). Existing methods for token selection in -sequence extrapolation either employ a permanent eviction strategy or select -tokens by chunk, which may lead to the loss of critical information. We propose -Efficient Selective Attention (ESA), a novel approach that extends context -length by efficiently selecting the most critical tokens at the token level to -compute attention. ESA reduces the computational complexity of token selection -by compressing query and key vectors into lower-dimensional representations. We -evaluate ESA on long sequence benchmarks with maximum lengths up to 256k using -open-source LLMs with context lengths of 8k and 32k. ESA outperforms other -selective attention methods, especially in tasks requiring the retrieval of -multiple pieces of information, achieving comparable performance to -full-attention extrapolation methods across various tasks, with superior -results in certain tasks. +The Deep learning (DL) models for diagnosing breast cancer from mammographic +images often operate as "black boxes", making it difficult for healthcare +professionals to trust and understand their decision-making processes. The +study presents an integrated framework combining Convolutional Neural Networks +(CNNs) and Explainable Artificial Intelligence (XAI) for the enhanced diagnosis +of breast cancer using the CBIS-DDSM dataset. The methodology encompasses an +elaborate data preprocessing pipeline and advanced data augmentation techniques +to counteract dataset limitations and transfer learning using pre-trained +networks such as VGG-16, Inception-V3 and ResNet was employed. A focal point of +our study is the evaluation of XAI's effectiveness in interpreting model +predictions, highlighted by utilizing the Hausdorff measure to assess the +alignment between AI-generated explanations and expert annotations +quantitatively. This approach is critical for XAI in promoting trustworthiness +and ethical fairness in AI-assisted diagnostics. 
The findings from our research +illustrate the effective collaboration between CNNs and XAI in advancing +diagnostic methods for breast cancer, thereby facilitating a more seamless +integration of advanced AI technologies within clinical settings. By enhancing +the interpretability of AI driven decisions, this work lays the groundwork for +improved collaboration between AI systems and medical practitioners, ultimately +enriching patient care. Furthermore, the implications of our research extended +well beyond the current methodologies. It encourages further research into how +to combine multimodal data and improve AI explanations to meet the needs of +clinical practice. -摘要:在大型語言模型 (LLM) 中,有效處理長語境序列仍然是一項重大挑戰。現有的序列外推標記選擇方法採用永久驅逐策略或按塊選擇標記,這可能會導致關鍵資訊遺失。我們提出高效選擇性注意 (ESA),這是一種新穎的方法,它透過在標記層級有效選擇最關鍵的標記來計算注意,從而延伸語境長度。ESA 透過將查詢和關鍵向量壓縮成較低維度的表示,來降低標記選擇的運算複雜度。我們使用開放原始碼 LLM,在語境長度為 8k 和 32k 的情況下,對長序列基準進行評估,最大長度達 256k。ESA 的表現優於其他選擇性注意方法,特別是在需要擷取多條資訊的任務中,在各種任務中達到與全注意外推方法相當的效能,並且在某些任務中獲得更佳的結果。 +摘要:深度學習 (DL) 用於從乳房攝影術影像診斷乳癌的模型通常以「黑盒子」方式運作,這使得醫療保健專業人員難以信任和理解其決策過程。本研究提出一個整合架構,結合卷積神經網路 (CNN) 和可解釋人工智慧 (XAI),以使用 CBIS-DDSM 資料集增強乳癌的診斷。方法包含一個精細的資料前處理管線和進階資料擴充技術,以對抗資料集限制,並採用預先訓練的網路(例如 VGG-16、Inception-V3 和 ResNet)進行遷移學習。我們研究的重點是評估 XAI 在解釋模型預測中的有效性,重點利用豪斯多夫測度量化評估 AI 生成的解釋和專家註解之間的一致性。這種方法對於 XAI 在促進 AI 輔助診斷中的可信度和倫理公平性至關重要。我們研究的發現說明了 CNN 和 XAI 在推進乳癌診斷方法中的有效協作,從而促進了先進 AI 技術在臨床環境中的更順暢整合。透過增強 AI 驅動決策的可解釋性,這項工作為 AI 系統和醫療從業人員之間的改善協作奠定了基礎,最終豐富了患者照護。此外,我們研究的影響遠遠超出了目前的技術。它鼓勵進一步研究如何結合多模式資料並改善 AI 解釋,以滿足臨床實務的需求。 -##### **Argument-Based Comparative Question Answering Evaluation Benchmark** -2502.14476v1 by Irina Nikishina, Saba Anwar, Nikolay Dolgov, Maria Manina, Daria Ignatenko, Viktor Moskvoretskii, Artem Shelmanov, Tim Baldwin, Chris Biemann +##### **Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives** +2404.00320v2 by Xingrui Gu, Zhixuan Wang, Irisa Jin, Zekun Wu -In this paper, we aim to solve the problems standing in the 
way of automatic -comparative question answering. To this end, we propose an evaluation framework -to assess the quality of comparative question answering summaries. We formulate -15 criteria for assessing comparative answers created using manual annotation -and annotation from 6 large language models and two comparative question -asnwering datasets. We perform our tests using several LLMs and manual -annotation under different settings and demonstrate the constituency of both -evaluations. Our results demonstrate that the Llama-3 70B Instruct model -demonstrates the best results for summary evaluation, while GPT-4 is the best -for answering comparative questions. All used data, code, and evaluation -results are publicly -available\footnote{\url{https://anonymous.4open.science/r/cqa-evaluation-benchmark-4561/README.md}}. +This research presents a novel multimodal data fusion methodology for pain +behavior recognition, integrating statistical correlation analysis with +human-centered insights. Our approach introduces two key innovations: 1) +integrating data-driven statistical relevance weights into the fusion strategy +to effectively utilize complementary information from heterogeneous modalities, +and 2) incorporating human-centric movement characteristics into multimodal +representation learning for detailed modeling of pain behaviors. Validated +across various deep learning architectures, our method demonstrates superior +performance and broad applicability. We propose a customizable framework that +aligns each modality with a suitable classifier based on statistical +significance, advancing personalized and effective multimodal fusion. +Furthermore, our methodology provides explainable analysis of multimodal data, +contributing to interpretable and explainable AI in healthcare. 
By highlighting +the importance of data diversity and modality-specific representations, we +enhance traditional fusion techniques and set new standards for recognizing +complex pain behaviors. Our findings have significant implications for +promoting patient-centered healthcare interventions and supporting explainable +clinical decision-making. -摘要:在本文中,我們旨在解決阻礙自動比較性問題解答的難題。為此,我們提出一個評估框架,用於評估比較性問題解答摘要的品質。我們制定了 15 項準則,用於評估使用手動標註和來自 6 個大型語言模型和兩個比較性問題解答資料集的標註所建立的比較性答案。我們在不同的設定下使用幾個 LLM 和手動標註執行測試,並展示兩種評估的組成。我們的結果表明,Llama-3 70B Instruct 模型在摘要評估中表現最佳,而 GPT-4 在回答比較性問題方面表現最佳。所有使用過的資料、程式碼和評估結果均公開可用\footnote{\url{https://anonymous.4open.science/r/cqa-evaluation-benchmark-4561/README.md}}。 +摘要:本研究提出了一種創新的多模態數據融合方法,用於疼痛行為識別,將統計相關分析與以人為中心的見解相結合。我們的做法引入了兩項關鍵創新:1) 將數據驅動的統計相關權重整合到融合策略中,以有效利用來自異質模態的補充信息,以及 2) 將以人為中心的運動特徵納入多模態表示學習中,以詳細建模疼痛行為。我們的模型在各種深度學習架構中得到驗證,展示了卓越的性能和廣泛的適用性。我們提出了一個可自定義的框架,根據統計顯著性將每個模態與合適的分類器對齊,推進個性化和有效的多模態融合。此外,我們的模型提供對多模態數據的可解釋分析,有助於醫療保健中的可解釋和可解釋 AI。通過強調數據多樣性和模態特定表示的重要性,我們增強了傳統的融合技術,並為識別複雜的疼痛行為設定了新的標準。我們的發現對促進以患者為中心的醫療保健干預和支持可解釋的臨床決策制定具有重要意義。 -##### **Enhancing Smart Environments with Context-Aware Chatbots using Large Language Models** -2502.14469v1 by Aurora Polo-Rodríguez, Laura Fiorini, Erika Rovini, Filippo Cavallo, Javier Medina-Quero +##### **Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach** +2403.17873v1 by Andrea Ferrario, Alberto Termine, Alessandro Facchini -This work presents a novel architecture for context-aware interactions within -smart environments, leveraging Large Language Models (LLMs) to enhance user -experiences. Our system integrates user location data obtained through UWB tags -and sensor-equipped smart homes with real-time human activity recognition (HAR) -to provide a comprehensive understanding of user context. 
This contextual -information is then fed to an LLM-powered chatbot, enabling it to generate -personalised interactions and recommendations based on the user's current -activity and environment. This approach moves beyond traditional static chatbot -interactions by dynamically adapting to the user's real-time situation. A case -study conducted from a real-world dataset demonstrates the feasibility and -effectiveness of our proposed architecture, showcasing its potential to create -more intuitive and helpful interactions within smart homes. The results -highlight the significant benefits of integrating LLM with real-time activity -and location data to deliver personalised and contextually relevant user -experiences. +Human-centered explainable AI (HCXAI) advocates for the integration of social +aspects into AI explanations. Central to the HCXAI discourse is the Social +Transparency (ST) framework, which aims to make the socio-organizational +context of AI systems accessible to their users. In this work, we suggest +extending the ST framework to address the risks of social misattributions in +Large Language Models (LLMs), particularly in sensitive areas like mental +health. In fact LLMs, which are remarkably capable of simulating roles and +personas, may lead to mismatches between designers' intentions and users' +perceptions of social attributes, risking to promote emotional manipulation and +dangerous behaviors, cases of epistemic injustice, and unwarranted trust. To +address these issues, we propose enhancing the ST framework with a fifth +'W-question' to clarify the specific social attributions assigned to LLMs by +its designers and users. This addition aims to bridge the gap between LLM +capabilities and user perceptions, promoting the ethically responsible +development and use of LLM-based technology. 
-摘要:本研究提出了一種創新的架構,用於在智慧環境中進行情境感知互動,利用大型語言模型 (LLM) 來提升使用者體驗。我們的系統整合了透過超寬頻標籤取得的使用者位置資料,以及配備感測器的智慧家庭,並具備即時人類活動辨識 (HAR),以全面了解使用者的情境。接著,將這些情境資訊輸入 LLM 驅動的聊天機器人,讓它能根據使用者的當前活動和環境產生個人化的互動和建議。這種方法超越了傳統的靜態聊天機器人互動,能動態地適應使用者的即時狀況。從真實世界資料集進行的案例研究,展示了我們提出的架構的可行性和有效性,突顯出它在智慧家庭中創造更直覺且有用的互動的潛力。結果突顯了將 LLM 與即時活動和位置資料整合,以提供個人化且與情境相關的使用者體驗的顯著優點。 +摘要:以人为本的可解释 AI (HCXAI) 倡导将社会层面整合到 AI 解释中。HCXAI 话语的核心是社会透明度 (ST) 框架,其目标是让 AI 系统的社会组织背景对用户来说是可理解的。在这项工作中,我们建议扩展 ST 框架以解决大型语言模型 (LLM) 中社会错误归因的风险,尤其是在心理健康等敏感领域。事实上,LLM 能够出色地模拟角色和人格,这可能导致设计者的意图和用户对社会属性的认知之间出现错配,从而有风险促进情绪操纵和危险行为、认知不公正和不合理的信任。为了解决这些问题,我们建议用第五个“W 问题”来增强 ST 框架,以明确设计者和用户赋予 LLM 的具体社会属性。此补充旨在弥合 LLM 能力和用户认知之间的差距,促进基于 LLM 的技术在道德上负责任地开发和使用。 -##### **Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing** -2502.14458v1 by Aviv Bick, Tobias Katsch, Nimit Sohoni, Arjun Desai, Albert Gu +##### **Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification** +2403.18871v1 by Han Yuan, Chuan Hong, Pengtao Jiang, Gangming Zhao, Nguyen Tuan Anh Tran, Xinxing Xu, Yet Yen Yan, Nan Liu -We introduce Llamba, a family of efficient recurrent language models -distilled from Llama-3.x into the Mamba architecture. The series includes -Llamba-1B, Llamba-3B, and Llamba-8B, which achieve higher inference throughput -and handle significantly larger batch sizes than Transformer-based models while -maintaining comparable benchmark performance. Furthermore, Llamba demonstrates -the effectiveness of cross-architecture distillation using MOHAWK (Bick et al., -2024), achieving these results with less than 0.1% of the training data -typically used for models of similar size. To take full advantage of their -efficiency, we provide an optimized implementation of Llamba for -resource-constrained devices such as smartphones and edge platforms, offering a -practical and memory-efficient alternative to Transformers. 
Overall, Llamba -improves the tradeoff between speed, memory efficiency, and performance, making -high-quality language models more accessible. +Background: Pneumothorax is an acute thoracic disease caused by abnormal air +collection between the lungs and chest wall. To address the opaqueness often +associated with deep learning (DL) models, explainable artificial intelligence +(XAI) methods have been introduced to outline regions related to pneumothorax +diagnoses made by DL models. However, these explanations sometimes diverge from +actual lesion areas, highlighting the need for further improvement. Method: We +propose a template-guided approach to incorporate the clinical knowledge of +pneumothorax into model explanations generated by XAI methods, thereby +enhancing the quality of these explanations. Utilizing one lesion delineation +created by radiologists, our approach first generates a template that +represents potential areas of pneumothorax occurrence. This template is then +superimposed on model explanations to filter out extraneous explanations that +fall outside the template's boundaries. To validate its efficacy, we carried +out a comparative analysis of three XAI methods with and without our template +guidance when explaining two DL models in two real-world datasets. Results: The +proposed approach consistently improved baseline XAI methods across twelve +benchmark scenarios built on three XAI methods, two DL models, and two +datasets. The average incremental percentages, calculated by the performance +improvements over the baseline performance, were 97.8% in Intersection over +Union (IoU) and 94.1% in Dice Similarity Coefficient (DSC) when comparing model +explanations and ground-truth lesion areas. Conclusions: In the context of +pneumothorax diagnoses, we proposed a template-guided approach for improving AI +explanations. 
We anticipate that our template guidance will forge a fresh +approach to elucidating AI models by integrating clinical domain expertise. -摘要:我們推出 Llamba,一種高效的遞迴語言模型家族,從 Llama-3.x 萃取到 Mamba 架構中。該系列包含 Llamba-1B、Llamba-3B 和 Llamba-8B,它們比基於 Transformer 的模型實現更高的推理吞吐量,並處理顯著更大的批次大小,同時保持可比較的基準效能。此外,Llamba 證明了使用 MOHAWK(Bick 等人,2024 年)進行跨架構萃取的有效性,在訓練資料不到類似大小模型通常使用的 0.1% 的情況下實現了這些結果。為了充分利用其效率,我們為 Llamba 提供了針對資源受限裝置(例如智慧型手機和邊緣平台)的最佳化實作,提供實用且記憶體效率高的 Transformer 替代方案。總體而言,Llamba 改善了速度、記憶體效率和效能之間的權衡,讓高品質語言模型更易於取得。 +摘要:背景:氣胸是一種因肺部與胸壁之間異常集氣所引起的急性胸腔疾病。為了解決深度學習(DL)模型經常伴隨的不透明性,可解釋人工智慧(XAI)方法已被引入,用於概述與 DL 模型做出的氣胸診斷相關的區域。然而,這些解釋有時會與實際病灶區域有所出入,突顯出進一步改進的必要性。方法:我們提出了一種模板引導式方法,將氣胸的臨床知識納入 XAI 方法產生的模型解釋中,從而提升這些解釋的品質。利用放射科醫師建立的病灶描繪,我們的做法首先產生一個模板,用於表示氣胸可能發生的區域。然後將此模板疊加在模型解釋上,以篩選出超出模板邊界的無關解釋。為了驗證其效力,我們對三種 XAI 方法進行了比較分析,在兩個真實世界資料集中解釋兩個 DL 模型時,分別採用和不採用我們的模板引導。結果:所提出的方法在建立於三種 XAI 方法、兩個 DL 模型和兩個資料集的十二種基準情境中,始終改善了基準 XAI 方法。在比較模型解釋和真實病灶區域時,透過基準效能的效能改進計算出的平均增量百分比為交集比(IoU)的 97.8% 和骰子相似性係數(DSC)的 94.1%。結論:在氣胸診斷的背景下,我們提出了一種模板引導式方法,用於改善 AI 解釋。我們預期我們的模板引導將透過整合臨床領域專業知識,為闡明 AI 模型建立一種新方法。 -##### **Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization** -2502.14456v1 by Ran Ding, Ziyu Zhang, Ying Zhu, Ziqian Kong, Peilan Xu +##### **Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures** +2403.01580v1 by Séamus Lankford -To enhance tourists' experiences and immersion, this paper proposes a -narrative-driven travel planning framework called NarrativeGuide, which -generates a geoculturally-grounded narrative script for travelers, offering a -novel, role-playing experience for their journey. In the initial stage, -NarrativeGuide constructs a knowledge graph for attractions within a city, then -configures the worldview, character setting, and exposition based on the -knowledge graph. 
Using this foundation, the knowledge graph is combined to -generate an independent scene unit for each attraction. During the itinerary -planning stage, NarrativeGuide models narrative-driven travel planning as an -optimization problem, utilizing a genetic algorithm (GA) to refine the -itinerary. Before evaluating the candidate itinerary, transition scripts are -generated for each pair of adjacent attractions, which, along with the scene -units, form a complete script. The weighted sum of script coherence, travel -time, and attraction scores is then used as the fitness value to update the -candidate solution set. Experimental results across four cities, i.e., Nanjing -and Yangzhou in China, Paris in France, and Berlin in Germany, demonstrate -significant improvements in narrative coherence and cultural fit, alongside a -notable reduction in travel time and an increase in the quality of visited -attractions. Our study highlights that incorporating external evolutionary -optimization effectively addresses the limitations of large language models in -travel planning.Our codes are available at -https://github.com/Evan01225/Narrative-Driven-Travel-Planning. +In the current machine translation (MT) landscape, the Transformer +architecture stands out as the gold standard, especially for high-resource +language pairs. This research delves into its efficacy for low-resource +language pairs including both the English$\leftrightarrow$Irish and +English$\leftrightarrow$Marathi language pairs. Notably, the study identifies +the optimal hyperparameters and subword model type to significantly improve the +translation quality of Transformer models for low-resource language pairs. + The scarcity of parallel datasets for low-resource languages can hinder MT +development. To address this, gaHealth was developed, the first bilingual +corpus of health data for the Irish language. 
Focusing on the health domain, +models developed using this in-domain dataset exhibited very significant +improvements in BLEU score when compared with models from the LoResMT2021 +Shared Task. A subsequent human evaluation using the multidimensional quality +metrics error taxonomy showcased the superior performance of the Transformer +system in reducing both accuracy and fluency errors compared to an RNN-based +counterpart. + Furthermore, this thesis introduces adaptNMT and adaptMLLM, two open-source +applications streamlined for the development, fine-tuning, and deployment of +neural machine translation models. These tools considerably simplify the setup +and evaluation process, making MT more accessible to both developers and +translators. Notably, adaptNMT, grounded in the OpenNMT ecosystem, promotes +eco-friendly natural language processing research by highlighting the +environmental footprint of model development. Fine-tuning of MLLMs by adaptMLLM +demonstrated advancements in translation performance for two low-resource +language pairs: English$\leftrightarrow$Irish and +English$\leftrightarrow$Marathi, compared to baselines from the LoResMT2021 +Shared Task. 
-摘要:為了增強遊客的體驗和沉浸感,本文提出了一個名為 NarrativeGuide 的敘事驅動旅遊規劃框架,它會為旅客產生一個以地理文化為基礎的敘事腳本,為他們的旅程提供一個新穎的角色扮演體驗。在初始階段,NarrativeGuide 會為城市內的景點建立一個知識圖譜,然後根據知識圖譜配置世界觀、角色設定和說明。利用這個基礎,知識圖譜會與每個景點結合,為其產生一個獨立的場景單元。在行程規劃階段,NarrativeGuide 將敘事驅動的旅遊規劃建模為一個最佳化問題,利用遺傳演算法 (GA) 來優化行程。在評估候選行程之前,會為每對相鄰景點產生過場腳本,這些腳本會與場景單元一起形成一個完整的腳本。接著,將腳本連貫性、旅遊時間和景點分數的加權和用作適應值,以更新候選解集。在四個城市(即中國的南京和揚州、法國的巴黎和德國的柏林)進行的實驗結果顯示,敘事連貫性和文化契合度都有顯著的提升,同時旅遊時間大幅減少,且所參觀景點的品質也提升了。我們的研究強調,納入外部演化最佳化能有效解決大型語言模型在旅遊規劃中的限制。我們的程式碼可在 https://github.com/Evan01225/Narrative-Driven-Travel-Planning 取得。 +摘要:在當前機器翻譯 (MT) 領域中,Transformer 架構脫穎而出,成為黃金標準,特別是對於高資源語言對。本研究探討其對低資源語言對的效能,包括英語↔愛爾蘭語和英語↔馬拉地語語言對。值得注意的是,本研究識別出最佳超參數和子詞模型類型,以顯著提高 Transformer 模型對低資源語言對的翻譯品質。 +低資源語言的平行資料集的稀缺會阻礙 MT 的發展。為了解決這個問題,開發了 gaHealth,這是愛爾蘭語的第一個雙語健康資料語料庫。專注於健康領域,使用此域內資料集開發的模型在 BLEU 得分方面表現出非常顯著的進步,與 LoResMT2021 共享任務中的模型相比。隨後使用多維品質指標錯誤分類法進行的人工評估顯示,與基於 RNN 的對應模型相比,Transformer 系統在減少準確性和流暢性錯誤方面表現出優異的性能。 +此外,本論文介紹了 adaptNMT 和 adaptMLLM,這兩個開源應用程式簡化了神經機器翻譯模型的開發、微調和部署。這些工具大幅簡化了設定和評估流程,讓 MT 更容易讓開發人員和翻譯人員使用。值得注意的是,adaptNMT 以 OpenNMT 生態系統為基礎,通過強調模型開發的環境足跡來促進生態友好的自然語言處理研究。與 LoResMT2021 共享任務中的基準相比,adaptMLLM 對 MLLM 的微調證明了英語↔愛爾蘭語和英語↔馬拉地語這兩個低資源語言對的翻譯性能進步。 -##### **Optimal word order for non-causal text generation with Large Language Models: the Spanish case** -2502.14451v1 by Andrea Busto-Castiñeira, Silvia García-Méndez, Francisco de Arriba-Pérez, Francisco J. González-Castaño +##### **Cause and Effect: Can Large Language Models Truly Understand Causality?** +2402.18139v3 by Swagata Ashwani, Kshiteesh Hegde, Nishith Reddy Mannuru, Mayank Jindal, Dushyant Singh Sengar, Krishna Chaitanya Rao Kathala, Dishant Banga, Vinija Jain, Aman Chadha -Natural Language Generation (NLG) popularity has increased owing to the -progress in Large Language Models (LLMs), with zero-shot inference -capabilities. 
However, most neural systems utilize decoder-only causal -(unidirectional) transformer models, which are effective for English but may -reduce the richness of languages with less strict word order, subject omission, -or different relative clause attachment preferences. This is the first work -that analytically addresses optimal text generation order for non-causal -language models. We present a novel Viterbi algorithm-based methodology for -maximum likelihood word order estimation. We analyze the non-causal -most-likelihood order probability for NLG in Spanish and, then, the probability -of generating the same phrases with Spanish causal NLG. This comparative -analysis reveals that causal NLG prefers English-like SVO structures. We also -analyze the relationship between optimal generation order and causal -left-to-right generation order using Spearman's rank correlation. Our results -demonstrate that the ideal order predicted by the maximum likelihood estimator -is not closely related to the causal order and may be influenced by the -syntactic structure of the target sentence. +With the rise of Large Language Models(LLMs), it has become crucial to +understand their capabilities and limitations in deciphering and explaining the +complex web of causal relationships that language entails. Current methods use +either explicit or implicit causal reasoning, yet there is a strong need for a +unified approach combining both to tackle a wide array of causal relationships +more effectively. This research proposes a novel architecture called Context +Aware Reasoning Enhancement with Counterfactual Analysis(CARE CA) framework to +enhance causal reasoning and explainability. The proposed framework +incorporates an explicit causal detection module with ConceptNet and +counterfactual statements, as well as implicit causal detection through LLMs. +Our framework goes one step further with a layer of counterfactual explanations +to accentuate LLMs understanding of causality. 
The knowledge from ConceptNet +enhances the performance of multiple causal reasoning tasks such as causal +discovery, causal identification and counterfactual reasoning. The +counterfactual sentences add explicit knowledge of the not caused by scenarios. +By combining these powerful modules, our model aims to provide a deeper +understanding of causal relationships, enabling enhanced interpretability. +Evaluation of benchmark datasets shows improved performance across all metrics, +such as accuracy, precision, recall, and F1 scores. We also introduce +CausalNet, a new dataset accompanied by our code, to facilitate further +research in this domain. -摘要:自然語言生成 (NLG) 的普及歸功於大型語言模型 (LLM) 的進步,以及零次學習推論能力。然而,大多數神經系統使用僅解碼器因果 (單向) Transformer模型,這對英語很有效,但可能會減少語序較不嚴謹、省略主詞或相對從句附加偏好不同的語言的豐富性。這是第一個針對非因果語言模型分析性地解決最佳文字生成順序的研究。我們提出了一種基於維特比演算法的新方法,用於最大似然詞序估計。我們分析了西班牙語 NLG 的非因果最大似然順序機率,然後分析了使用西班牙語因果 NLG 生成相同短語的機率。這種比較分析顯示,因果 NLG 偏好英語式的 SVO 結構。我們還使用 Spearman 等級相關性分析最佳生成順序和因果從左到右生成順序之間的關係。我們的結果表明,最大似然估計器預測的理想順序與因果順序沒有密切關係,並且可能會受到目標句子的語法結構影響。 +摘要:隨著大型語言模型 (LLM) 的興起,了解它們在解碼和解釋語言所蘊含的複雜因果關係網路中的能力和限制變得至關重要。目前的技術使用明確或隱含的因果推理,但強烈需要一種統一的方法,結合兩者以更有效地處理廣泛的因果關係。本研究提出了一種稱為情境感知推理增強與反事實分析 (CARE CA) 框架的新架構,以增強因果推理和可解釋性。提出的框架結合了使用 ConceptNet 和反事實陳述的明確因果檢測模組,以及透過 LLM 進行的隱含因果檢測。我們的框架更進一步,加入一層反事實解釋,以強調 LLM 對因果關係的理解。來自 ConceptNet 的知識增強了多項因果推理任務的執行,例如因果發現、因果識別和反事實推理。反事實句加入了未由情境造成的明確知識。透過結合這些強大的模組,我們的模型旨在提供對因果關係更深入的理解,實現增強的可解釋性。基準資料集的評估顯示在所有指標(例如準確度、精確度、召回率和 F1 分數)上都有所提升。我們還引入了 CausalNet,一個新的資料集,並附上了我們的程式碼,以促進在這個領域的進一步研究。 +##### **Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina** +2402.18600v1 by Yasin Sadeghi Bazargani, Majid Mirzaei, Navid Sobhi, Mirsaeed Abdollahi, Ali Jafarizadeh, Siamak Pedrammehr, Roohallah Alizadehsani, Ru San Tan, Sheikh Mohammed Shariful Islam, U. Rajendra Acharya -### Knowledge Graphs +Diabetes mellitus (DM) predisposes patients to vascular complications. +Retinal images and vasculature reflect the body's micro- and macrovascular +health. 
They can be used to diagnose DM complications, including diabetic +retinopathy (DR), neuropathy, nephropathy, and atherosclerotic cardiovascular +disease, as well as forecast the risk of cardiovascular events. Artificial +intelligence (AI)-enabled systems developed for high-throughput detection of DR +using digitized retinal images have become clinically adopted. Beyond DR +screening, AI integration also holds immense potential to address challenges +associated with the holistic care of the patient with DM. In this work, we aim +to comprehensively review the literature for studies on AI applications based +on retinal images related to DM diagnosis, prognostication, and management. We +will describe the findings of holistic AI-assisted diabetes care, including but +not limited to DR screening, and discuss barriers to implementing such systems, +including issues concerning ethics, data privacy, equitable access, and +explainability. With the ability to evaluate the patient's health status vis a +vis DM complication as well as risk prognostication of future cardiovascular +complications, AI-assisted retinal image analysis has the potential to become a +central tool for modern personalized medicine in patients with DM. 
+ +摘要:糖尿病(DM)使患者容易出現血管併發症。 +視網膜影像和血管反映身體的微血管和巨血管健康狀況。它們可用於診斷糖尿病併發症,包括糖尿病視網膜病變(DR)、神經病變、腎病和動脈粥樣硬化性心血管疾病,以及預測心血管事件的風險。為使用數位化視網膜影像進行高通量 DR 檢測而開發的人工智慧(AI)啟用系統已在臨床採用。除了 DR 篩檢外,AI 整合也具有巨大的潛力來應對與糖尿病患者整體照護相關的挑戰。在這項工作中,我們旨在全面回顧基於視網膜影像的 AI 應用相關研究的文獻,這些研究與糖尿病的診斷、預後和管理有關。我們將描述整體 AI 輔助糖尿病照護的發現,包括但不限於 DR 篩檢,並討論實施此類系統的障礙,包括與倫理、資料隱私、公平存取和可解釋性有關的問題。透過評估患者的健康狀況,同時考量糖尿病併發症以及未來心血管併發症的風險預後,AI 輔助視網膜影像分析有潛力成為糖尿病患者現代化個人化醫療的中心工具。 + + +### Medical |Publish Date|Title|Authors|Homepage|Code| | :---: | :---: | :---: | :---: | :---: | -|**2025-02-20**|**GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks**|Jianwen Luo et.al.|[2502.14848v1](http://arxiv.org/abs/2502.14848v1)|null| -|**2025-02-20**|**From RAG to Memory: Non-Parametric Continual Learning for Large Language Models**|Bernal Jiménez Gutiérrez et.al.|[2502.14802v1](http://arxiv.org/abs/2502.14802v1)|[link](https://github.com/osu-nlp-group/hipporag)| -|**2025-02-20**|**Plan-over-Graph: Towards Parallelable LLM Agent Schedule**|Shiqi Zhang et.al.|[2502.14563v1](http://arxiv.org/abs/2502.14563v1)|null| -|**2025-02-20**|**Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization**|Ran Ding et.al.|[2502.14456v1](http://arxiv.org/abs/2502.14456v1)|null| -|**2025-02-20**|**Fact or Guesswork? 
Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment**|Jiaxi Li et.al.|[2502.14275v1](http://arxiv.org/abs/2502.14275v1)|null| -|**2025-02-20**|**Mitigating Lost-in-Retrieval Problems in Retrieval Augmented Multi-Hop Question Answering**|Rongzhi Zhu et.al.|[2502.14245v1](http://arxiv.org/abs/2502.14245v1)|null| -|**2025-02-20**|**NLP-AKG: Few-Shot Construction of NLP Academic Knowledge Graph Based on LLM**|Jiayin Lan et.al.|[2502.14192v1](http://arxiv.org/abs/2502.14192v1)|null| -|**2025-02-19**|**Object-centric Binding in Contrastive Language-Image Pretraining**|Rim Assouel et.al.|[2502.14113v1](http://arxiv.org/abs/2502.14113v1)|null| +|**2025-02-20**|**FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis**|Fadillah Maani et.al.|[2502.14807v1](http://arxiv.org/abs/2502.14807v1)|null| +|**2025-02-20**|**Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning**|Juraj Vladika et.al.|[2502.14765v1](http://arxiv.org/abs/2502.14765v1)|null| +|**2025-02-20**|**MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders**|Maya Varma et.al.|[2502.14753v1](http://arxiv.org/abs/2502.14753v1)|null| +|**2025-02-20**|**Data-Constrained Synthesis of Training Data for De-Identification**|Thomas Vakili et.al.|[2502.14677v1](http://arxiv.org/abs/2502.14677v1)|null| +|**2025-02-20**|**ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation**|Angxiao Yue et.al.|[2502.14637v1](http://arxiv.org/abs/2502.14637v1)|[link](https://github.com/AngxiaoYue/ReQFlow)| +|**2025-02-20**|**MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models**|Shrey Pandit et.al.|[2502.14302v1](http://arxiv.org/abs/2502.14302v1)|null| +|**2025-02-20**|**EyeBench: A Call for More Rigorous Evaluation of Retinal Image Enhancement**|Wenhui Zhu et.al.|[2502.14260v1](http://arxiv.org/abs/2502.14260v1)|null| 
|**2025-02-19**|**Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning**|Cole Gawin et.al.|[2502.14086v1](http://arxiv.org/abs/2502.14086v1)|null| -|**2025-02-19**|**Neurosymbolic artificial intelligence via large language models and coherence-driven inference**|Steve Huntsman et.al.|[2502.13953v1](http://arxiv.org/abs/2502.13953v1)|null| -|**2025-02-19**|**Complex Ontology Matching with Large Language Model Embeddings**|Guilherme Sousa et.al.|[2502.13619v1](http://arxiv.org/abs/2502.13619v1)|null| -|**2025-02-19**|**Are Large Language Models In-Context Graph Learners?**|Jintang Li et.al.|[2502.13562v1](http://arxiv.org/abs/2502.13562v1)|null| +|**2025-02-19**|**Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging**|Shansong Wang et.al.|[2502.14064v1](http://arxiv.org/abs/2502.14064v1)|null| +|**2025-02-19**|**VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare**|Anudeex Shetty et.al.|[2502.13775v1](http://arxiv.org/abs/2502.13775v1)|null| +|**2025-02-19**|**PeerQA: A Scientific Question Answering Dataset from Peer Reviews**|Tim Baumgärtner et.al.|[2502.13668v1](http://arxiv.org/abs/2502.13668v1)|[link](https://github.com/ukplab/peerqa)| |**2025-02-19**|**Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs**|Yushi Feng et.al.|[2502.13555v1](http://arxiv.org/abs/2502.13555v1)|[link](https://github.com/ys-feng/DemoGraph)| -|**2025-02-19**|**PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference**|Burc Gokden et.al.|[2502.13502v1](http://arxiv.org/abs/2502.13502v1)|[link](https://github.com/burcgokden/PLDR-LLM-with-KVG-cache)| -|**2025-02-19**|**Explore-Construct-Filter: An Automated Framework for Rich and Reliable API Knowledge Graph Construction**|Yanbang Sun et.al.|[2502.13412v1](http://arxiv.org/abs/2502.13412v1)|null| -|**2025-02-19**|**Reducing Hallucinations in Language Model-based SPARQL Query 
Generation Using Post-Generation Memory Retrieval**|Aditya Sharma et.al.|[2502.13369v1](http://arxiv.org/abs/2502.13369v1)|null| -|**2025-02-19**|**Craw4LLM: Efficient Web Crawling for LLM Pretraining**|Shi Yu et.al.|[2502.13347v1](http://arxiv.org/abs/2502.13347v1)|[link](https://github.com/cxcscmu/crawl4llm)| -|**2025-02-18**|**K-Paths: Reasoning over Graph Paths for Drug Repurposing and Drug Interaction Prediction**|Tassallah Abdullahi et.al.|[2502.13344v1](http://arxiv.org/abs/2502.13344v1)|[link](https://github.com/rsinghlab/K-Paths)| -|**2025-02-18**|**Grounding LLM Reasoning with Knowledge Graphs**|Alfonso Amayuelas et.al.|[2502.13247v1](http://arxiv.org/abs/2502.13247v1)|null| -|**2025-02-18**|**Learning to Defer for Causal Discovery with Imperfect Experts**|Oscar Clivio et.al.|[2502.13132v1](http://arxiv.org/abs/2502.13132v1)|null| -|**2025-02-18**|**Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks**|Markus J. Buehler et.al.|[2502.13025v1](http://arxiv.org/abs/2502.13025v1)|null| -|**2025-02-18**|**Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge**|Mohammad Reza Rezaei et.al.|[2502.13010v1](http://arxiv.org/abs/2502.13010v1)|null| -|**2025-02-18**|**R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs**|Sumin Jo et.al.|[2502.12767v1](http://arxiv.org/abs/2502.12767v1)|null| -|**2025-02-18**|**Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research**|Xiang Liu et.al.|[2502.12669v1](http://arxiv.org/abs/2502.12669v1)|null| -|**2025-02-18**|**G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation**|Yuhan Li et.al.|[2502.12586v1](http://arxiv.org/abs/2502.12586v1)|[link](https://github.com/yuhan1i/g-refer)| -|**2025-02-17**|**A-MEM: Agentic Memory for LLM Agents**|Wujiang Xu et.al.|[2502.12110v1](http://arxiv.org/abs/2502.12110v1)|null| -|**2025-02-17**|**KnowPath: 
Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs**|Qi Zhao et.al.|[2502.12029v1](http://arxiv.org/abs/2502.12029v1)|null| -|**2025-02-17**|**Atom of Thoughts for Markov LLM Test-Time Scaling**|Fengwei Teng et.al.|[2502.12018v1](http://arxiv.org/abs/2502.12018v1)|null| -|**2025-02-17**|**Generating Text from Uniform Meaning Representation**|Emma Markle et.al.|[2502.11973v1](http://arxiv.org/abs/2502.11973v1)|[link](https://github.com/acnlplab/umr-text-gen)| -|**2025-02-17**|**GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs**|Yi Fang et.al.|[2502.11925v1](http://arxiv.org/abs/2502.11925v1)|null| -|**2025-02-17**|**Exploring LLM-based Student Simulation for Metacognitive Cultivation**|Haoxuan Li et.al.|[2502.11678v1](http://arxiv.org/abs/2502.11678v1)|null| -|**2025-02-17**|**Ontology-Guided Reverse Thinking Makes Large Language Models Stronger on Knowledge Graph Question Answering**|Runxuan Liu et.al.|[2502.11491v1](http://arxiv.org/abs/2502.11491v1)|null| -|**2025-02-17**|**GLTW: Joint Improved Graph Transformer and LLM via Three-Word Language for Knowledge Graph Completion**|Kangyang Luo et.al.|[2502.11471v1](http://arxiv.org/abs/2502.11471v1)|null| -|**2025-02-16**|**Large Language-Geometry Model: When LLM meets Equivariance**|Zongzhao Li et.al.|[2502.11149v2](http://arxiv.org/abs/2502.11149v2)|null| -|**2025-02-16**|**Beyond Pairwise: Global Zero-shot Temporal Graph Generation**|Alon Eirew et.al.|[2502.11114v1](http://arxiv.org/abs/2502.11114v1)|null| +|**2025-02-19**|**MobileViM: A Light-weight and Dimension-independent Vision Mamba for 3D Medical Image Analysis**|Wei Dai et.al.|[2502.13524v1](http://arxiv.org/abs/2502.13524v1)|[link](https://github.com/anthonyweidai/MobileViM_3D)| +|**2025-02-19**|**Unlocking Multimodal Integration in EHRs: A Prompt Learning Framework for Language and Time Series Fusion**|Shuai Niu et.al.|[2502.13509v1](http://arxiv.org/abs/2502.13509v1)|null| 
+|**2025-02-19**|**Towards a perturbation-based explanation for medical AI as differentiable programs**|Takeshi Abe et.al.|[2502.14001v1](http://arxiv.org/abs/2502.14001v1)|null| +|**2025-02-19**|**RGAR: Recurrence Generation-augmented Retrieval for Factual-aware Medical Question Answering**|Sichu Liang et.al.|[2502.13361v1](http://arxiv.org/abs/2502.13361v1)|null| +|**2025-02-18**|**Adjust for Trust: Mitigating Trust-Induced Inappropriate Reliance on AI Assistance**|Tejas Srinivasan et.al.|[2502.13321v1](http://arxiv.org/abs/2502.13321v1)|null| +|**2025-02-18**|**Prediction of Clinical Complication Onset using Neural Point Processes**|Sachini Weerasekara et.al.|[2502.13290v1](http://arxiv.org/abs/2502.13290v1)|null| +|**2025-02-18**|**SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?**|Yucheng Shi et.al.|[2502.13233v1](http://arxiv.org/abs/2502.13233v1)|null| +|**2025-02-18**|**Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions**|Taedong Yun et.al.|[2502.13135v1](http://arxiv.org/abs/2502.13135v1)|null| +|**2025-02-18**|**Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization**|Priyaranjan Pattnayak et.al.|[2502.13108v1](http://arxiv.org/abs/2502.13108v1)|null| +|**2025-02-18**|**Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection**|Athira J Jacob et.al.|[2502.12948v1](http://arxiv.org/abs/2502.12948v1)|null| +|**2025-02-18**|**Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models**|Rubing Li et.al.|[2502.12825v2](http://arxiv.org/abs/2502.12825v2)|null| +|**2025-02-18**|**LLM Safety for Children**|Prasanjit Rath et.al.|[2502.12552v1](http://arxiv.org/abs/2502.12552v1)|[link](https://github.com/Avenge-PRC777/LLM-Safety-For-Children-Code)| 
+|**2025-02-17**|**Classifiers of Data Sharing Statements in Clinical Trial Records**|Saber Jelodari Mamaghani et.al.|[2502.12362v1](http://arxiv.org/abs/2502.12362v1)|null| +|**2025-02-17**|**Relational Norms for Human-AI Cooperation**|Brian D. Earp et.al.|[2502.12102v1](http://arxiv.org/abs/2502.12102v1)|null| +|**2025-02-17**|**Deep Spatio-Temporal Neural Network for Air Quality Reanalysis**|Ammar Kheder et.al.|[2502.11941v1](http://arxiv.org/abs/2502.11941v1)|[link](https://github.com/AmmarKheder/AQ-Net)| +|**2025-02-17**|**Proactive Depot Discovery: A Generative Framework for Flexible Location-Routing**|Site Qu et.al.|[2502.11715v1](http://arxiv.org/abs/2502.11715v1)|null| +|**2025-02-17**|**LLM Agents Making Agent Tools**|Georg Wölflein et.al.|[2502.11705v1](http://arxiv.org/abs/2502.11705v1)|null| +|**2025-02-17**|**MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression**|Linjie Mu et.al.|[2502.11651v1](http://arxiv.org/abs/2502.11651v1)|[link](https://github.com/linjiemu/mmxu)| +|**2025-02-17**|**A Survey of Personalized Large Language Models: Progress and Future Directions**|Jiahong Liu et.al.|[2502.11528v1](http://arxiv.org/abs/2502.11528v1)|null| +|**2025-02-17**|**Variable-frame CNNLSTM for Breast Nodule Classification using Ultrasound Videos**|Xiangxiang Cui et.al.|[2502.11481v1](http://arxiv.org/abs/2502.11481v1)|null| +|**2025-02-17**|**Leveraging Labelled Data Knowledge: A Cooperative Rectification Learning Network for Semi-supervised 3D Medical Image Segmentation**|Yanyan Wang et.al.|[2502.11456v1](http://arxiv.org/abs/2502.11456v1)|[link](https://github.com/Yaan-Wang/CRLN)| +|**2025-02-16**|**A Survey of LLM-based Agents in Medicine: How far are we from Baymax?**|Wenxuan Wang et.al.|[2502.11211v1](http://arxiv.org/abs/2502.11211v1)|null| +|**2025-02-16**|**RT-DEMT: A hybrid real-time acupoint detection model combining mamba and transformer**|Shilong Yang 
et.al.|[2502.11179v1](http://arxiv.org/abs/2502.11179v1)|[link](https://github.com/sohyu1/rt-demt)| |**2025-02-16**|**Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications**|Alexandru Lecu et.al.|[2502.11108v1](http://arxiv.org/abs/2502.11108v1)|[link](https://github.com/alexlecu/llmkgraph)| -|**2025-02-16**|**Beyond Similarity: A Gradient-based Graph Method for Instruction Tuning Data Selection**|Yang Zhao et.al.|[2502.11062v1](http://arxiv.org/abs/2502.11062v1)|null| -|**2025-02-16**|**CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models**|Yuefei Chen et.al.|[2502.11008v1](http://arxiv.org/abs/2502.11008v1)|null| -|**2025-02-16**|**RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation**|Pengcheng Jiang et.al.|[2502.10996v1](http://arxiv.org/abs/2502.10996v1)|[link](https://github.com/pat-jj/Retrieval-And-Structure)| -|**2025-02-15**|**Developing Conversational Speech Systems for Robots to Detect Speech Biomarkers of Cognition in People Living with Dementia**|Rohith Perumandla et.al.|[2502.10896v1](http://arxiv.org/abs/2502.10896v1)|null| -|**2025-02-15**|**Evaluating improvements on using Large Language Models (LLMs) for property extraction in the Open Research Knowledge Graph (ORKG)**|Sandra Schaftner et.al.|[2502.10768v1](http://arxiv.org/abs/2502.10768v1)|null| -|**2025-02-15**|**K-Edit: Language Model Editing with Contextual Knowledge Awareness**|Elan Markowitz et.al.|[2502.10626v1](http://arxiv.org/abs/2502.10626v1)|null| +|**2025-02-16**|**Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration**|Xianbing Zhao et.al.|[2502.12204v1](http://arxiv.org/abs/2502.12204v1)|null| +|**2025-02-16**|**CL-MFAP: A Contrastive Learning-Based Multimodal Foundation Model for Molecular Property Prediction and Antibiotic Screening**|Gen Zhou 
et.al.|[2502.11001v1](http://arxiv.org/abs/2502.11001v1)|[link](https://github.com/clmfap/clmfap)| +|**2025-02-15**|**Automatic Quality Assessment of First Trimester Crown-Rump-Length Ultrasound Images**|Sevim Cengiz et.al.|[2502.10908v1](http://arxiv.org/abs/2502.10908v1)|null| +|**2025-02-15**|**Breaking Down the Hierarchy: A New Approach to Leukemia Classification**|Ibraheem Hamdi et.al.|[2502.10899v1](http://arxiv.org/abs/2502.10899v1)|null| +|**2025-02-15**|**An Empirical Analysis of Uncertainty in Large Language Model Evaluations**|Qiujie Xie et.al.|[2502.10709v1](http://arxiv.org/abs/2502.10709v1)|[link](https://github.com/hasakixie123/llm-evaluator-uncertainty)| +|**2025-02-15**|**Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model**|Jiarui Jin et.al.|[2502.10707v1](http://arxiv.org/abs/2502.10707v1)|[link](https://github.com/pkudigitalhealth/heartlang)| +|**2025-02-15**|**Self-Explaining Hypergraph Neural Networks for Diagnosis Prediction**|Leisheng Yu et.al.|[2502.10689v1](http://arxiv.org/abs/2502.10689v1)|null| |**2025-02-15**|**ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis**|Xueshen Li et.al.|[2502.10620v1](http://arxiv.org/abs/2502.10620v1)|null| -|**2025-02-14**|**GraphiT: Efficient Node Classification on Text-Attributed Graphs with Prompt Optimized LLMs**|Shima Khoshraftar et.al.|[2502.10522v1](http://arxiv.org/abs/2502.10522v1)|null| -|**2025-02-14**|**Do Large Language Models Reason Causally Like Us? Even Better?**|Hanna M. 
Dettki et.al.|[2502.10215v1](http://arxiv.org/abs/2502.10215v1)|null| -|**2025-02-14**|**Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages**|Daniil Gurgurov et.al.|[2502.10140v1](http://arxiv.org/abs/2502.10140v1)|null| -|**2025-02-14**|**Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models**|Chenrui Tie et.al.|[2502.10090v1](http://arxiv.org/abs/2502.10090v1)|null| -|**2025-02-14**|**Decision Information Meets Large Language Models: The Future of Explainable Operations Research**|Yansen Zhang et.al.|[2502.09994v1](http://arxiv.org/abs/2502.09994v1)|null| -|**2025-02-14**|**KGGen: Extracting Knowledge Graphs from Plain Text with Language Models**|Belinda Mo et.al.|[2502.09956v1](http://arxiv.org/abs/2502.09956v1)|null| -|**2025-02-14**|**ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation**|Shu Wang et.al.|[2502.09891v1](http://arxiv.org/abs/2502.09891v1)|null| -|**2025-02-13**|**Visual Graph Question Answering with ASP and LLMs for Language Parsing**|Jakob Johannes Bauer et.al.|[2502.09211v1](http://arxiv.org/abs/2502.09211v1)|null| +|**2025-02-15**|**Optimizing CNN Architectures for Advanced Thoracic Disease Classification**|Tejas Mirthipati et.al.|[2502.10614v1](http://arxiv.org/abs/2502.10614v1)|null| +|**2025-02-14**|**PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation**|Faruk Ahmed et.al.|[2502.10536v1](http://arxiv.org/abs/2502.10536v1)|null| +|**2025-02-14**|**Tempo: Helping Data Scientists and Domain Experts Collaboratively Specify Predictive Modeling Tasks**|Venkatesh Sivaraman et.al.|[2502.10526v2](http://arxiv.org/abs/2502.10526v2)|null| +|**2025-02-14**|**A Robust Attack: Displacement Backdoor Attack**|Yong Li et.al.|[2502.10490v1](http://arxiv.org/abs/2502.10490v1)|null| +|**2025-02-14**|**3D ReX: Causal Explanations in 3D Neuroimaging 
Classification**|Melane Navaratnarajah et.al.|[2502.12181v1](http://arxiv.org/abs/2502.12181v1)|null| +|**2025-02-14**|**Analyzing Patient Daily Movement Behavior Dynamics Using Two-Stage Encoding Model**|Jin Cui et.al.|[2502.09947v1](http://arxiv.org/abs/2502.09947v1)|null| +|**2025-02-14**|**TransGUNet: Transformer Meets Graph-based Skip Connection for Medical Image Segmentation**|Ju-Hyeon Nam et.al.|[2502.09931v1](http://arxiv.org/abs/2502.09931v1)|null| +|**2025-02-14**|**Video2Policy: Scaling up Manipulation Tasks in Simulation through Internet Videos**|Weirui Ye et.al.|[2502.09886v1](http://arxiv.org/abs/2502.09886v1)|null| +|**2025-02-14**|**HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation**|Tianwei Lin et.al.|[2502.09838v2](http://arxiv.org/abs/2502.09838v2)|[link](https://github.com/dcdmllm/healthgpt)| +|**2025-02-13**|**Incentivize without Bonus: Provably Efficient Model-based Online Multi-agent RL for Markov Games**|Tong Yang et.al.|[2502.09780v1](http://arxiv.org/abs/2502.09780v1)|null| +|**2025-02-13**|**The AI-Therapist Duo: Exploring the Potential of Human-AI Collaboration in Personalized Art Therapy for PICS Intervention**|Bereket A. Yilma et.al.|[2502.09757v1](http://arxiv.org/abs/2502.09757v1)|null| +|**2025-02-13**|**A CNN Approach to Automated Detection and Classification of Brain Tumors**|Md. Zahid Hasan et.al.|[2502.09731v1](http://arxiv.org/abs/2502.09731v1)|null| +|**2025-02-13**|**Evaluating GPT's Capability in Identifying Stages of Cognitive Impairment from Electronic Health Data**|Yu Leng et.al.|[2502.09715v1](http://arxiv.org/abs/2502.09715v1)|null| +|**2025-02-13**|**Metamorphic Testing for Pose Estimation Systems**|Matias Duran et.al.|[2502.09460v1](http://arxiv.org/abs/2502.09460v1)|null| +|**2025-02-13**|**Towards Virtual Clinical Trials of Radiology AI with Conditional Generative Modeling**|Benjamin D. 
Killeen et.al.|[2502.09688v1](http://arxiv.org/abs/2502.09688v1)|null| +|**2025-02-13**|**Mind What You Ask For: Emotional and Rational Faces of Persuasion by Large Language Models**|Wiktoria Mieleszczenko-Kowszewicz et.al.|[2502.09687v1](http://arxiv.org/abs/2502.09687v1)|null| +|**2025-02-13**|**The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**|Danni Feng et.al.|[2502.09247v1](http://arxiv.org/abs/2502.09247v1)|null| +|**2025-02-13**|**From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**|Lukas Buess et.al.|[2502.09242v1](http://arxiv.org/abs/2502.09242v1)|null| +|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null| +|**2025-02-13**|**Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**|Sanskar Sehgal et.al.|[2502.09204v1](http://arxiv.org/abs/2502.09204v1)|null| +|**2025-02-13**|**Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**|Jin Cui et.al.|[2502.09173v1](http://arxiv.org/abs/2502.09173v1)|null| +|**2025-02-13**|**TastepepAI, An artificial intelligence platform for taste peptide de novo design**|Jianda Yue et.al.|[2502.12167v1](http://arxiv.org/abs/2502.12167v1)|null| +|**2025-02-12**|**HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**|Valentina Vadori et.al.|[2502.08754v1](http://arxiv.org/abs/2502.08754v1)|[link](https://github.com/Vadori/CytoArk)| +|**2025-02-12**|**Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion**|Lemuel Puglisi et.al.|[2502.08560v1](http://arxiv.org/abs/2502.08560v1)|[link](https://github.com/lemuelpuglisi/brlp)| 
|**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null| -|**2025-02-12**|**Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy**|Ruizhan Xue et.al.|[2502.08353v1](http://arxiv.org/abs/2502.08353v1)|null| -|**2025-02-12**|**Graph Foundation Models for Recommendation: A Comprehensive Survey**|Bin Wu et.al.|[2502.08346v3](http://arxiv.org/abs/2502.08346v3)|null| -|**2025-02-12**|**Self-Evaluation for Job-Shop Scheduling**|Imanol Echeverria et.al.|[2502.08684v1](http://arxiv.org/abs/2502.08684v1)|null| -|**2025-02-12**|**Improving Existing Optimization Algorithms with LLMs**|Camilo Chacón Sartori et.al.|[2502.08298v1](http://arxiv.org/abs/2502.08298v1)|null| -|**2025-02-12**|**LLM4GNAS: A Large Language Model Based Toolkit for Graph Neural Architecture Search**|Yang Gao et.al.|[2502.10459v1](http://arxiv.org/abs/2502.10459v1)|null| -|**2025-02-12**|**ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning**|Vy Vo et.al.|[2502.08148v1](http://arxiv.org/abs/2502.08148v1)|null| -|**2025-02-12**|**Neuro-Conceptual Artificial Intelligence: Integrating OPM with Deep Learning to Enhance Question Answering Quality**|Xin Kang et.al.|[2502.09658v1](http://arxiv.org/abs/2502.09658v1)|null| -|**2025-02-12**|**GCoT: Chain-of-Thought Prompt Learning for Graphs**|Xingtong Yu et.al.|[2502.08092v1](http://arxiv.org/abs/2502.08092v1)|null| -|**2025-02-12**|**Linking Cryptoasset Attribution Tags to Knowledge Graph Entities: An LLM-based Approach**|Régnier Avice et.al.|[2502.10453v1](http://arxiv.org/abs/2502.10453v1)|[link](https://github.com/ravice234/cryptoasset-attribution-tag-linker)| -|**2025-02-11**|**Deep Semantic Graph Learning via LLM based Node Enhancement**|Chuanqi Shi et.al.|[2502.07982v1](http://arxiv.org/abs/2502.07982v1)|null| -|**2025-02-10**|**Cardiverse: Harnessing LLMs for Novel Card Game Prototyping**|Danrui 
Li et.al.|[2502.07128v1](http://arxiv.org/abs/2502.07128v1)|null| -|**2025-02-10**|**GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural Processing Units**|Arghadip Das et.al.|[2502.06921v2](http://arxiv.org/abs/2502.06921v2)|[link](https://github.com/arghadippurdue/GraNNite)| -|**2025-02-10**|**Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language**|Zhiqiang Zhong et.al.|[2502.06634v1](http://arxiv.org/abs/2502.06634v1)|null| -|**2025-02-10**|**KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment**|Yuxing Lu et.al.|[2502.06472v1](http://arxiv.org/abs/2502.06472v1)|[link](https://github.com/YuxingLu613/KARMA)| -|**2025-02-10**|**RoToR: Towards More Reliable Responses for Order-Invariant Inputs**|Soyoung Yoon et.al.|[2502.08662v1](http://arxiv.org/abs/2502.08662v1)|null| -|**2025-02-10**|**K-ON: Stacking Knowledge On the Head Layer of Large Language Model**|Lingbing Guo et.al.|[2502.06257v1](http://arxiv.org/abs/2502.06257v1)|null| -|**2025-02-10**|**LegalViz: Legal Text Visualization by Text To Diagram Generation**|Eri Onami et.al.|[2502.06147v2](http://arxiv.org/abs/2502.06147v2)|null| -|**2025-02-09**|**Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs**|Han Meng et.al.|[2502.06075v1](http://arxiv.org/abs/2502.06075v1)|null| -|**2025-02-09**|**LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification**|Shubham Kumar Nigam et.al.|[2502.05836v1](http://arxiv.org/abs/2502.05836v1)|null| -|**2025-02-08**|**LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning**|Hanqing Yang et.al.|[2502.05453v1](http://arxiv.org/abs/2502.05453v1)|null| -|**2025-02-08**|**SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation**|Xingtong Yu 
et.al.|[2502.05424v1](http://arxiv.org/abs/2502.05424v1)|null| -|**2025-02-08**|**Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints**|Ali Al-Lawati et.al.|[2502.05414v1](http://arxiv.org/abs/2502.05414v1)|null| -|**2025-02-08**|**Knowledge Graph-Guided Retrieval Augmented Generation**|Xiangrong Zhu et.al.|[2502.06864v1](http://arxiv.org/abs/2502.06864v1)|[link](https://github.com/nju-websoft/KG2RAG)| -|**2025-02-07**|**Can Large Language Models Understand Intermediate Representations?**|Hailong Jiang et.al.|[2502.06854v1](http://arxiv.org/abs/2502.06854v1)|null| -|**2025-02-07**|**GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?**|Yang Zhou et.al.|[2502.05252v1](http://arxiv.org/abs/2502.05252v1)|[link](https://github.com/Infini-AI-Lab/gsm_infinite)| -|**2025-02-07**|**Causality can systematically address the monsters under the bench(marks)**|Felix Leeb et.al.|[2502.05085v1](http://arxiv.org/abs/2502.05085v1)|null| -|**2025-02-07**|**Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures**|Tushar Pandey et.al.|[2502.05078v1](http://arxiv.org/abs/2502.05078v1)|[link](https://github.com/AgnostiqHQ/multi-agent-llm)| -|**2025-02-07**|**Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics**|Hussam Ghanem et.al.|[2502.05239v1](http://arxiv.org/abs/2502.05239v1)|null| -|**2025-02-07**|**Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research**|Junde Wu et.al.|[2502.04644v1](http://arxiv.org/abs/2502.04644v1)|[link](https://github.com/theworldofagents/agentic-reasoning)| -|**2025-02-07**|**Position-aware Automatic Circuit Discovery**|Tal Haklay et.al.|[2502.04577v1](http://arxiv.org/abs/2502.04577v1)|[link](https://github.com/technion-cs-nlp/peap)| -|**2025-02-06**|**Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems**|Shangbin Feng 
et.al.|[2502.04510v1](http://arxiv.org/abs/2502.04510v1)|null| -|**2025-02-06**|**MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**|Xuejiao Zhao et.al.|[2502.04413v1](http://arxiv.org/abs/2502.04413v1)|[link](https://github.com/snowteam2023/medrag)| -|**2025-02-06**|**Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering**|Longquan Jiang et.al.|[2502.03992v1](http://arxiv.org/abs/2502.03992v1)|[link](https://github.com/longquanjiang/ontoscprompt)| -|**2025-02-06**|**Multimodal Medical Code Tokenizer**|Xiaorui Su et.al.|[2502.04397v2](http://arxiv.org/abs/2502.04397v2)|null| -|**2025-02-06**|**Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents**|Chenyang Shao et.al.|[2502.04392v1](http://arxiv.org/abs/2502.04392v1)|null| -|**2025-02-06**|**Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models**|Rui Cai et.al.|[2502.03715v1](http://arxiv.org/abs/2502.03715v1)|null| -|**2025-02-05**|**A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)**|Yiye Chen et.al.|[2502.03450v1](http://arxiv.org/abs/2502.03450v1)|null| -|**2025-02-05**|**SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**|Ben Liu et.al.|[2502.03283v2](http://arxiv.org/abs/2502.03283v2)|null| -|**2025-02-05**|**Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**|Daniil Laptev et.al.|[2502.03032v2](http://arxiv.org/abs/2502.03032v2)|null| -|**2025-02-05**|**A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs**|Bradley P. 
Allen et.al.|[2502.02896v1](http://arxiv.org/abs/2502.02896v1)|null| -|**2025-02-05**|**Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization**|Chanhui Lee et.al.|[2502.02810v1](http://arxiv.org/abs/2502.02810v1)|null| -|**2025-02-05**|**Leveraging the true depth of LLMs**|Ramón Calvo González et.al.|[2502.02790v1](http://arxiv.org/abs/2502.02790v1)|null| -|**2025-02-04**|**Modular Training of Neural Networks aids Interpretability**|Satvik Golechha et.al.|[2502.02470v2](http://arxiv.org/abs/2502.02470v2)|null| -|**2025-02-04**|**Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs**|Sagnik Mukherjee et.al.|[2502.02362v3](http://arxiv.org/abs/2502.02362v3)|null| -|**2025-02-04**|**AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement**|Shivam Singh et.al.|[2502.02067v1](http://arxiv.org/abs/2502.02067v1)|[link](https://github.com/sssshivvvv/adaptbot)| -|**2025-02-03**|**On Bob Dylan: A Computational Perspective**|Prashant Garg et.al.|[2502.01772v1](http://arxiv.org/abs/2502.01772v1)|null| -|**2025-02-03**|**VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos**|Xubin Ren et.al.|[2502.01549v1](http://arxiv.org/abs/2502.01549v1)|null| -|**2025-02-03**|**Transformers trained on proteins can learn to attend to Euclidean distance**|Isaac Ellmen et.al.|[2502.01533v1](http://arxiv.org/abs/2502.01533v1)|[link](https://github.com/Ellmen/attending-to-distance)| +|**2025-02-12**|**EEG Artifact Detection and Correction with Deep Autoencoders**|David Aquilué-Llorens et.al.|[2502.08686v1](http://arxiv.org/abs/2502.08686v1)|null| +|**2025-02-12**|**SycEval: Evaluating LLM Sycophancy**|Aaron Fanous et.al.|[2502.08177v1](http://arxiv.org/abs/2502.08177v1)|null| +|**2025-02-12**|**Cancer Vaccine Adjuvant Name Recognition from Biomedical Literature using Large Language Models**|Hasin Rehana 
et.al.|[2502.09659v1](http://arxiv.org/abs/2502.09659v1)|null| +|**2025-02-11**|**Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?**|Hye Sun Yun et.al.|[2502.07963v1](http://arxiv.org/abs/2502.07963v1)|null| +|**2025-02-11**|**An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating**|Mohammad Ali Labbaf Khaniki et.al.|[2502.07755v1](http://arxiv.org/abs/2502.07755v1)|null| +|**2025-02-11**|**Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension**|Wenbo Gong et.al.|[2502.07752v2](http://arxiv.org/abs/2502.07752v2)|null| +|**2025-02-11**|**The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation**|Raman Dutt et.al.|[2502.07516v2](http://arxiv.org/abs/2502.07516v2)|[link](https://github.com/Raman1121/diffusion_memorization)| +|**2025-02-11**|**KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level**|Ruining Deng et.al.|[2502.07288v1](http://arxiv.org/abs/2502.07288v1)|[link](https://github.com/agaldran/kpis)| +|**2025-02-11**|**Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**|Jiaying Lu et.al.|[2502.07158v2](http://arxiv.org/abs/2502.07158v2)|null| +|**2025-02-11**|**Explaining 3D Computed Tomography Classifiers with Counterfactuals**|Joseph Paul Cohen et.al.|[2502.07156v1](http://arxiv.org/abs/2502.07156v1)|[link](https://github.com/ieee8023/ct-counterfactuals)| +|**2025-02-10**|**Interactive Data Harmonization with LLM Agents**|Aécio Santos et.al.|[2502.07132v1](http://arxiv.org/abs/2502.07132v1)|null| +|**2025-02-10**|**Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML**|Mohammad Amir Salari et.al.|[2502.07026v1](http://arxiv.org/abs/2502.07026v1)|null| +|**2025-02-10**|**AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate 
Statements**|Adriana Eufrosiana Bora et.al.|[2502.07022v1](http://arxiv.org/abs/2502.07022v1)|null| +|**2025-02-10**|**Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium**|Amin Adibi et.al.|[2502.06693v1](http://arxiv.org/abs/2502.06693v1)|null| +|**2025-02-10**|**Automatic Evaluation of Healthcare LLMs Beyond Question-Answering**|Anna Arias-Duart et.al.|[2502.06666v1](http://arxiv.org/abs/2502.06666v1)|null| +|**2025-02-10**|**Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging**|Mohammed Abdul Hafeez Khan et.al.|[2502.06632v1](http://arxiv.org/abs/2502.06632v1)|null| +|**2025-02-10**|**Illegal Waste Detection in Remote Sensing Images: A Case Study**|Federico Gibellini et.al.|[2502.06607v2](http://arxiv.org/abs/2502.06607v2)|null| +|**2025-02-10**|**FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model**|Anna Tegon et.al.|[2502.06438v1](http://arxiv.org/abs/2502.06438v1)|null| +|**2025-02-10**|**Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?**|Qingshan Hou et.al.|[2502.06289v1](http://arxiv.org/abs/2502.06289v1)|null| +|**2025-02-10**|**Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning**|Liuqing Chen et.al.|[2502.06134v1](http://arxiv.org/abs/2502.06134v1)|null| +|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null| +|**2025-02-10**|**Can ChatGPT Diagnose Alzheimer's Disease?**|Quoc-Toan Nguyen et.al.|[2502.06907v1](http://arxiv.org/abs/2502.06907v1)|null| +|**2025-02-09**|**Protecting Intellectual Property of EEG-based Neural Networks with Watermarking**|Ahmed Abdelaziz 
et.al.|[2502.05931v1](http://arxiv.org/abs/2502.05931v1)|[link](https://github.com/Prog-Jacob/watermarking-eeg-models)| +|**2025-02-09**|**Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models**|Shiyu Teng et.al.|[2502.05879v1](http://arxiv.org/abs/2502.05879v1)|null| +|**2025-02-09**|**LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison**|Gabriele De Vito et.al.|[2502.06890v1](http://arxiv.org/abs/2502.06890v1)|null| +|**2025-02-09**|**Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)**|Lokesh Koli et.al.|[2502.07815v1](http://arxiv.org/abs/2502.07815v1)|null| +|**2025-02-09**|**WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch**|Ying Lei et.al.|[2502.05783v1](http://arxiv.org/abs/2502.05783v1)|null| +|**2025-02-09**|**RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care**|Ziqi Yang et.al.|[2502.05740v1](http://arxiv.org/abs/2502.05740v1)|null| +|**2025-02-08**|**4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis**|An Zhao et.al.|[2502.05713v1](http://arxiv.org/abs/2502.05713v1)|null| +|**2025-02-08**|**KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy**|Hyunjong Kim et.al.|[2502.05651v1](http://arxiv.org/abs/2502.05651v1)|null| +|**2025-02-08**|**ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. 
A Case Study on Clinical Reports**|Aynur Guluzade et.al.|[2502.05638v1](http://arxiv.org/abs/2502.05638v1)|[link](https://gitlab.cc-asp.fraunhofer.de/health-open/elmtex)| #### Abstracts -##### **GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks** -2502.14848v1 by Jianwen Luo, Yiming Huang, Jinxiang Meng, Fangyu Lei, Shizhu He, Xiao Liu, Shanshan Jiang, Bin Dong, Jun Zhao, Kang Liu - -Large Language Models (LLMs) have shown great promise in tool-making, yet -existing frameworks often struggle to efficiently construct reliable toolsets -and are limited to single-task settings. To address these challenges, we -propose GATE (Graph-based Adaptive Tool Evolution), an adaptive framework that -dynamically constructs and evolves a hierarchical graph of reusable tools -across multiple scenarios. We evaluate GATE on open-ended tasks (Minecraft), -agent-based tasks (TextCraft, DABench), and code generation tasks (MATH, Date, -TabMWP). Our results show that GATE achieves up to 4.3x faster milestone -completion in Minecraft compared to the previous SOTA, and provides an average -improvement of 9.23% over existing tool-making methods in code generation tasks -and 10.03% in agent tasks. GATE demonstrates the power of adaptive evolution, -balancing tool quantity, complexity, and functionality while maintaining high -efficiency. Code and data are available at -\url{https://github.com/ayanami2003/GATE}. 
- -Abstract: Large language models (LLMs) have shown great promise in tool-making, yet existing frameworks often struggle to construct reliable toolsets efficiently and are limited to single-task settings. To address these challenges, we propose GATE (Graph-based Adaptive Tool Evolution), an adaptive framework that dynamically constructs and evolves a hierarchical graph of reusable tools across multiple scenarios. We evaluate GATE on open-ended tasks (Minecraft), agent-based tasks (TextCraft, DABench), and code generation tasks (MATH, Date, TabMWP). Our results show that GATE reaches milestones in Minecraft up to 4.3x faster than the previous SOTA, with an average improvement of 9.23% over existing tool-making methods on code generation tasks and 10.03% on agent tasks. GATE demonstrates the power of adaptive evolution, balancing tool quantity, complexity, and functionality while maintaining high efficiency. Code and data are available at \url{https://github.com/ayanami2003/GATE}. - -##### **From RAG to Memory: Non-Parametric Continual Learning for Large Language Models** -2502.14802v1 by Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, Yu Su +##### **FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis** +2502.14807v1 by Fadillah Maani, Numan Saeed, Tausifa Saleem, Zaid Farooq, Hussain Alasmawi, Werner Diehl, Ameera Mohammad, Gareth Waring, Saudabi Valappi, Leanne Bricker, Mohammad Yaqub -Our ability to continuously acquire, organize, and leverage knowledge is a -key feature of human intelligence that AI systems must approximate to unlock -their full potential. Given the challenges in continual learning with large -language models (LLMs), retrieval-augmented generation (RAG) has become the -dominant way to introduce new information. However, its reliance on vector -retrieval hinders its ability to mimic the dynamic and interconnected nature of -human long-term memory. Recent RAG approaches augment vector embeddings with -various structures like knowledge graphs to address some of these gaps, namely -sense-making and associativity. However, their performance on more basic -factual memory tasks drops considerably below standard RAG. We address this -unintended deterioration and propose HippoRAG 2, a framework that outperforms -standard RAG comprehensively on factual, sense-making, and associative memory -tasks. HippoRAG 2 builds upon the Personalized PageRank algorithm used in -HippoRAG and enhances it with deeper passage integration and more effective -online use of an LLM.
This combination pushes this RAG system closer to the -effectiveness of human long-term memory, achieving a 7% improvement in -associative memory tasks over the state-of-the-art embedding model while also -exhibiting superior factual knowledge and sense-making memory capabilities. -This work paves the way for non-parametric continual learning for LLMs. Our -code and data will be released at https://github.com/OSU-NLP-Group/HippoRAG. +Foundation models are becoming increasingly effective in the medical domain, +offering pre-trained models on large datasets that can be readily adapted for +downstream tasks. Despite progress, fetal ultrasound images remain a +challenging domain for foundation models due to their inherent complexity, +often requiring substantial additional training and facing limitations due to +the scarcity of paired multimodal data. To overcome these challenges, here we +introduce FetalCLIP, a vision-language foundation model capable of generating +universal representation of fetal ultrasound images. FetalCLIP was pre-trained +using a multimodal learning approach on a diverse dataset of 210,035 fetal +ultrasound images paired with text. This represents the largest paired dataset +of its kind used for foundation model development to date. This unique training +approach allows FetalCLIP to effectively learn the intricate anatomical +features present in fetal ultrasound images, resulting in robust +representations that can be used for a variety of downstream applications. In +extensive benchmarking across a range of key fetal ultrasound applications, +including classification, gestational age estimation, congenital heart defect +(CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all +baselines while demonstrating remarkable generalizability and strong +performance even with limited labeled data. We plan to release the FetalCLIP +model publicly for the benefit of the broader scientific community. 
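-Abstract: Our ability to continuously acquire, organize, and use knowledge is a key feature of human intelligence, one that AI systems must approximate to reach their full potential. Because continual learning with large language models (LLMs) is challenging, retrieval-augmented generation (RAG) has become the dominant way to introduce new information. However, its reliance on vector retrieval hinders it from mimicking the dynamic and interconnected nature of human long-term memory. Recent RAG approaches augment vector embeddings with various structures, such as knowledge graphs, to address some of these gaps, namely sense-making and associativity. However, their performance on more basic factual memory tasks falls far below standard RAG. We address this unintended deterioration and propose HippoRAG 2, a framework that comprehensively outperforms standard RAG on factual, sense-making, and associative memory tasks. HippoRAG 2 builds on the Personalized PageRank algorithm used in HippoRAG and enhances it with deeper passage integration and more effective online use of an LLM. This combination brings the RAG system closer to the effectiveness of human long-term memory, achieving a 7% improvement on associative memory tasks over the state-of-the-art embedding model while also showing superior factual knowledge and sense-making memory capabilities. This work paves the way for non-parametric continual learning for LLMs. Our code and data will be released at https://github.com/OSU-NLP-Group/HippoRAG. +Abstract: Foundation models are becoming increasingly effective in the medical domain, offering models pre-trained on large datasets that can be readily adapted to downstream tasks. Despite this progress, fetal ultrasound images remain a challenging domain for foundation models because of their inherent complexity, often requiring substantial additional training and facing limitations from the scarcity of paired multimodal data. To overcome these challenges, we introduce FetalCLIP, a vision-language foundation model capable of generating universal representations of fetal ultrasound images. FetalCLIP was pre-trained with a multimodal learning approach on a diverse dataset of 210,035 fetal ultrasound images paired with text, the largest paired dataset of its kind used for foundation model development to date. This training approach allows FetalCLIP to learn the intricate anatomical features present in fetal ultrasound images, yielding robust representations that can be used in a variety of downstream applications. In extensive benchmarking across key fetal ultrasound applications, including classification, gestational age estimation, congenital heart defect (CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all baselines while demonstrating remarkable generalizability and strong performance even with limited labeled data. We plan to release the FetalCLIP model publicly for the benefit of the broader scientific community. -##### **Plan-over-Graph: Towards Parallelable LLM Agent Schedule** -2502.14563v1 by Shiqi Zhang, Xinbei Ma, Zouying Cao, Zhuosheng Zhang, Hai Zhao +##### **Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning** +2502.14765v1 by Juraj Vladika, Ivana Hacajová, Florian Matthes -Large Language Models (LLMs) have demonstrated exceptional abilities in -reasoning for task planning. However, challenges remain under-explored for -parallel schedules. This paper introduces a novel paradigm, plan-over-graph, in -which the model first decomposes a real-life textual task into executable -subtasks and constructs an abstract task graph. The model then understands this -task graph as input and generates a plan for parallel execution. To enhance the -planning capability of complex, scalable graphs, we design an automated and -controllable pipeline to generate synthetic graphs and propose a two-stage -training scheme.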
Experimental results show that our plan-over-graph method -significantly improves task performance on both API-based LLMs and trainable -open-sourced LLMs. By normalizing complex tasks as graphs, our method naturally -supports parallel execution, demonstrating global efficiency. The code and data -are available at https://github.com/zsq259/Plan-over-Graph. +Fact verification (FV) aims to assess the veracity of a claim based on +relevant evidence. The traditional approach for automated FV includes a +three-part pipeline relying on short evidence snippets and encoder-only +inference models. More recent approaches leverage the multi-turn nature of LLMs +to address FV as a step-by-step problem where questions inquiring additional +context are generated and answered until there is enough information to make a +decision. This iterative method makes the verification process rational and +explainable. While these methods have been tested for encyclopedic claims, +exploration on domain-specific and realistic claims is missing. In this work, +we apply an iterative FV system on three medical fact-checking datasets and +evaluate it with multiple settings, including different LLMs, external web +search, and structured reasoning using logic predicates. We demonstrate +improvements in the final performance over traditional approaches and the high +potential of step-by-step FV systems for domain-specific claims. 
-摘要:大型語言模型 (LLM) 已展現出在任務規劃推理方面的非凡能力。然而,對於並行時程表的挑戰仍未充分探討。本文介紹了一個新穎的範例,即圖形規劃,其中模型首先將現實生活中的文字任務分解為可執行的子任務,並建構一個抽象任務圖。然後,模型將此任務圖理解為輸入,並產生一個並行執行的計畫。為了增強複雜、可擴充圖形的規劃能力,我們設計了一個自動化且可控的管道來產生合成圖形,並提出了一個兩階段訓練方案。實驗結果表明,我們的圖形規劃方法顯著提升了基於 API 的 LLM 和可訓練的開源 LLM 的任務效能。透過將複雜任務標準化為圖形,我們的模型自然支援並行執行,展現出整體效率。程式碼和資料可在 https://github.com/zsq259/Plan-over-Graph 取得。 +摘要:事實驗證 (FV) 旨在根據相關證據評估主張的真實性。自動化 FV 的傳統方法包括依賴於短證據片段和僅編碼器推論模型的三部分管道。最近的方法利用 LLM 的多輪特性,將 FV 視為一個逐步問題,其中會產生問題來詢問額外背景並回答,直到有足夠的資訊可以做出決定。這種迭代方法使驗證過程合理且可解釋。雖然這些方法已針對百科全書式主張進行測試,但缺乏對特定領域和現實主張的探討。在這項工作中,我們在三個醫學事實查核資料集上應用了一個迭代 FV 系統,並使用多種設定對其進行評估,包括不同的 LLM、外部網路搜尋和使用邏輯謂詞的結構化推理。我們展示了傳統方法的最終效能改進,以及逐步 FV 系統對特定領域主張的高潛力。 -##### **Narrative-Driven Travel Planning: Geoculturally-Grounded Script Generation with Evolutionary Itinerary Optimization** -2502.14456v1 by Ran Ding, Ziyu Zhang, Ying Zhu, Ziqian Kong, Peilan Xu +##### **MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders** +2502.14753v1 by Maya Varma, Ashwin Kumar, Rogier van der Sluijs, Sophie Ostmeier, Louis Blankemeier, Pierre Chambon, Christian Bluethgen, Jip Prince, Curtis Langlotz, Akshay Chaudhari -To enhance tourists' experiences and immersion, this paper proposes a -narrative-driven travel planning framework called NarrativeGuide, which -generates a geoculturally-grounded narrative script for travelers, offering a -novel, role-playing experience for their journey. In the initial stage, -NarrativeGuide constructs a knowledge graph for attractions within a city, then -configures the worldview, character setting, and exposition based on the -knowledge graph. Using this foundation, the knowledge graph is combined to -generate an independent scene unit for each attraction. During the itinerary -planning stage, NarrativeGuide models narrative-driven travel planning as an -optimization problem, utilizing a genetic algorithm (GA) to refine the -itinerary. 
Before evaluating the candidate itinerary, transition scripts are -generated for each pair of adjacent attractions, which, along with the scene -units, form a complete script. The weighted sum of script coherence, travel -time, and attraction scores is then used as the fitness value to update the -candidate solution set. Experimental results across four cities, i.e., Nanjing -and Yangzhou in China, Paris in France, and Berlin in Germany, demonstrate -significant improvements in narrative coherence and cultural fit, alongside a -notable reduction in travel time and an increase in the quality of visited -attractions. Our study highlights that incorporating external evolutionary -optimization effectively addresses the limitations of large language models in -travel planning.Our codes are available at -https://github.com/Evan01225/Narrative-Driven-Travel-Planning. +Medical images are acquired at high resolutions with large fields of view in +order to capture fine-grained features necessary for clinical decision-making. +Consequently, training deep learning models on medical images can incur large +computational costs. In this work, we address the challenge of downsizing +medical images in order to improve downstream computational efficiency while +preserving clinically-relevant features. We introduce MedVAE, a family of six +large-scale 2D and 3D autoencoders capable of encoding medical images as +downsized latent representations and decoding latent representations back to +high-resolution images. We train MedVAE autoencoders using a novel two-stage +training approach with 1,052,730 medical images. 
Across diverse tasks obtained +from 20 medical image datasets, we demonstrate that (1) utilizing MedVAE latent +representations in place of high-resolution images when training downstream +models can lead to efficiency benefits (up to 70x improvement in throughput) +while simultaneously preserving clinically-relevant features and (2) MedVAE can +decode latent representations back to high-resolution images with high +fidelity. Our work demonstrates that large-scale, generalizable autoencoders +can help address critical efficiency challenges in the medical domain. Our code +is available at https://github.com/StanfordMIMI/MedVAE. -摘要:為了增強遊客的體驗和沉浸感,本文提出了一個名為 NarrativeGuide 的敘事驅動旅遊規劃框架,它會為旅客產生一個以地理文化為基礎的敘事腳本,為他們的旅程提供一個新穎的角色扮演體驗。在初始階段,NarrativeGuide 會為城市內的景點建立一個知識圖譜,然後根據知識圖譜配置世界觀、角色設定和說明。利用這個基礎,知識圖譜會與每個景點結合,為其產生一個獨立的場景單元。在行程規劃階段,NarrativeGuide 將敘事驅動的旅遊規劃建模為一個最佳化問題,利用遺傳演算法 (GA) 來優化行程。在評估候選行程之前,會為每對相鄰景點產生過場腳本,這些腳本會與場景單元一起形成一個完整的腳本。接著,將腳本連貫性、旅遊時間和景點分數的加權和用作適應值,以更新候選解集。在四個城市(即中國的南京和揚州、法國的巴黎和德國的柏林)進行的實驗結果顯示,敘事連貫性和文化契合度都有顯著的提升,同時旅遊時間大幅減少,且所參觀景點的品質也提升了。我們的研究強調,納入外部演化最佳化能有效解決大型語言模型在旅遊規劃中的限制。我們的程式碼可在 https://github.com/Evan01225/Narrative-Driven-Travel-Planning 取得。 +摘要:医学影像以高解析度和广阔的视野获取,以便捕捉临床决策所需的细微特征。因此,在医学影像上训练深度学习模型可能会产生巨大的计算成本。在这项工作中,我们解决了缩小医学影像以提高下游计算效率同时保留临床相关特征的挑战。我们介绍了 MedVAE,这是一个由六个大型 2D 和 3D 自动编码器组成的系列,能够将医学影像编码为缩小的潜在表示,并将潜在表示解码回高分辨率影像。我们使用一种新颖的两阶段训练方法,利用 1,052,730 张医学影像来训练 MedVAE 自动编码器。在从 20 个医学影像数据集获得的不同任务中,我们证明了 (1) 在训练下游模型时,利用 MedVAE 潜在表示代替高分辨率影像可以带来效率优势(吞吐量提高高达 70 倍),同时保留临床相关特征;(2) MedVAE 可以将潜在表示解码回高分辨率影像,且保真度高。我们的工作表明,大规模、可推广的自动编码器可以帮助解决医学领域的重大效率挑战。我们的代码可在 https://github.com/StanfordMIMI/MedVAE 获得。 -##### **Fact or Guesswork? 
Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment** -2502.14275v1 by Jiaxi Li, Yiwei Wang, Kai Zhang, Yujun Cai, Bryan Hooi, Nanyun Peng, Kai-Wei Chang, Jin Lu +##### **Data-Constrained Synthesis of Training Data for De-Identification** +2502.14677v1 by Thomas Vakili, Aron Henriksson, Hercules Dalianis -Large language models (LLMs) have been widely adopted in various downstream -task domains. However, their ability to directly recall and apply factual -medical knowledge remains under-explored. Most existing medical QA benchmarks -assess complex reasoning or multi-hop inference, making it difficult to isolate -LLMs' inherent medical knowledge from their reasoning capabilities. Given the -high-stakes nature of medical applications, where incorrect information can -have critical consequences, it is essential to evaluate how well LLMs encode, -retain, and recall fundamental medical facts. - To bridge this gap, we introduce the Medical Knowledge Judgment, a dataset -specifically designed to measure LLMs' one-hop factual medical knowledge. MKJ -is constructed from the Unified Medical Language System (UMLS), a large-scale -repository of standardized biomedical vocabularies and knowledge graphs. We -frame knowledge assessment as a binary judgment task, requiring LLMs to verify -the correctness of medical statements extracted from reliable and structured -knowledge sources. - Our experiments reveal that LLMs struggle with factual medical knowledge -retention, exhibiting significant performance variance across different -semantic categories, particularly for rare medical conditions. Furthermore, -LLMs show poor calibration, often being overconfident in incorrect answers. To -mitigate these issues, we explore retrieval-augmented generation, demonstrating -its effectiveness in improving factual accuracy and reducing uncertainty in -medical decision-making. 
+Many sensitive domains -- such as the clinical domain -- lack widely +available datasets due to privacy risks. The increasing generative capabilities +of large language models (LLMs) have made synthetic datasets a viable path +forward. In this study, we domain-adapt LLMs to the clinical domain and +generate synthetic clinical texts that are machine-annotated with tags for +personally identifiable information using capable encoder-based NER models. The +synthetic corpora are then used to train synthetic NER models. The results show +that training NER models using synthetic corpora incurs only a small drop in +predictive performance. The limits of this process are investigated in a +systematic ablation study -- using both Swedish and Spanish data. Our analysis +shows that smaller datasets can be sufficient for domain-adapting LLMs for data +synthesis. Instead, the effectiveness of this process is almost entirely +contingent on the performance of the machine-annotating NER models trained +using the original data. 
-摘要:大型語言模型 (LLM) 已廣泛應用於各種下游 -任務領域。然而,它們直接回憶和應用事實 -醫學知識的能力仍未得到充分探索。大多數現有的醫療問答基準 -評估複雜推理或多跳躍推論,這使得難以將 -LLM 內在的醫學知識從其推理能力中分離出來。鑑於 -醫療應用具有高風險,其中不正確的資訊可能會 -造成嚴重後果,因此評估 LLM 編碼、 -保留和回憶基本醫學事實的能力至關重要。 -為了彌合這一差距,我們引入了醫學知識判斷,這是一個專門設計用於測量 LLM 的一跳事實醫學知識的數據集。MKJ -是由統一醫學語言系統 (UMLS) 構建的,UMLS 是標準化生物醫學詞彙和知識圖譜的大型庫。我們 -將知識評估構建為二元判斷任務,要求 LLM 驗證從可靠且結構化的 -知識來源中提取的醫學陳述的正確性。 -我們的實驗表明,LLM 難以保留事實醫學知識,在不同的 -語義類別中表現出顯著的性能差異,特別是對於罕見的醫療狀況。此外, -LLM 表現出校準不佳,通常對不正確的答案過於自信。為了 -減輕這些問題,我們探索了檢索增強生成,證明了其在提高事實準確性和降低不確定性方面的有效性 -在醫療決策制定中。 +摘要:許多敏感領域(例如臨床領域)由於隱私風險而缺乏廣泛可用的資料集。大型語言模型 (LLM) 不斷增強的生成能力已使合成資料集成為可行的途徑。在這項研究中,我們將領域適應 LLM 應用於臨床領域,並生成使用具備編碼器功能的 NER 模型以個人可識別資訊標籤進行機器標註的合成臨床文本。然後使用合成語料庫來訓練合成 NER 模型。結果顯示,使用合成語料庫訓練 NER 模型僅會導致預測效能略微下降。在系統消融研究中調查此程序的限制,同時使用瑞典語和西班牙語資料。我們的分析顯示,較小的資料集足以用於領域適應 LLM 以進行資料合成。相反地,此程序的有效性幾乎完全取決於使用原始資料訓練的機器標註 NER 模型的效能。 -##### **Mitigating Lost-in-Retrieval Problems in Retrieval Augmented Multi-Hop Question Answering** -2502.14245v1 by Rongzhi Zhu, Xiangyu Liu, Zequn Sun, Yiwei Wang, Wei Hu +##### **ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation** +2502.14637v1 by Angxiao Yue, Zichong Wang, Hongteng Xu -In this paper, we identify a critical problem, "lost-in-retrieval", in -retrieval-augmented multi-hop question answering (QA): the key entities are -missed in LLMs' sub-question decomposition. "Lost-in-retrieval" significantly -degrades the retrieval performance, which disrupts the reasoning chain and -leads to the incorrect answers. To resolve this problem, we propose a -progressive retrieval and rewriting method, namely ChainRAG, which sequentially -handles each sub-question by completing missing key entities and retrieving -relevant sentences from a sentence graph for answer generation. Each step in -our retrieval and rewriting process builds upon the previous one, creating a -seamless chain that leads to accurate retrieval and answers. 
Finally, all -retrieved sentences and sub-question answers are integrated to generate a -comprehensive answer to the original question. We evaluate ChainRAG on three -multi-hop QA datasets$\unicode{x2013}$MuSiQue, 2Wiki, and -HotpotQA$\unicode{x2013}$using three large language models: GPT4o-mini, -Qwen2.5-72B, and GLM-4-Plus. Empirical results demonstrate that ChainRAG -consistently outperforms baselines in both effectiveness and efficiency. +Protein backbone generation plays a central role in de novo protein design +and is significant for many biological and medical applications. Although +diffusion and flow-based generative models provide potential solutions to this +challenging task, they often generate proteins with undesired designability and +suffer computational inefficiency. In this study, we propose a novel rectified +quaternion flow (ReQFlow) matching method for fast and high-quality protein +backbone generation. In particular, our method generates a local translation +and a 3D rotation from random noise for each residue in a protein chain, which +represents each 3D rotation as a unit quaternion and constructs its flow by +spherical linear interpolation (SLERP) in an exponential format. We train the +model by quaternion flow (QFlow) matching with guaranteed numerical stability +and rectify the QFlow model to accelerate its inference and improve the +designability of generated protein backbones, leading to the proposed ReQFlow +model. Experiments show that ReQFlow achieves state-of-the-art performance in +protein backbone generation while requiring much fewer sampling steps and +significantly less inference time (e.g., being 37x faster than RFDiffusion and +62x faster than Genie2 when generating a backbone of length 300), demonstrating +its effectiveness and efficiency. The code is available at +https://github.com/AngxiaoYue/ReQFlow. 
-摘要:在本文中,我們在檢索增強的多跳問答 (QA) 中發現了一個關鍵問題「檢索中遺失」,關鍵實體遺失在 LLM 的子問題分解中。「檢索中遺失」顯著降低檢索效能,這會中斷推理鏈並導致錯誤的答案。為了解決此問題,我們提出了一種漸進式檢索和重寫方法,即 ChainRAG,它通過完成遺失的關鍵實體並從句子圖中檢索相關句子來順序處理每個子問題以產生答案。我們檢索和重寫過程中每一步都建立在前一步之上,創造了一個無縫的鏈,導致準確的檢索和答案。最後,所有檢索到的句子和子問題答案都整合起來,以產生對原始問題的全面答案。我們在三個多跳問答資料集$\unicode{x2013}$MuSiQue、2Wiki 和 HotpotQA$\unicode{x2013}$上評估 ChainRAG,使用三個大型語言模型:GPT4o-mini、Qwen2.5-72B 和 GLM-4-Plus。實證結果表明,ChainRAG 在有效性和效率方面都持續優於基準。 +摘要:蛋白骨架生成在從頭蛋白質設計中扮演核心角色,且對於許多生物和醫學應用來說意義重大。儘管擴散和基於流的生成模型提供了解決此項挑戰性任務的潛在方案,但它們經常生成具有不受歡迎的可設計性的蛋白質,且遭受運算效率不彰之苦。在本研究中,我們提出了一種新穎的修正四元數流 (ReQFlow) 匹配方法,用於快速且高品質的蛋白質骨架生成。特別是,我們的模型會為蛋白質鏈中的每個殘基從隨機雜訊中生成一個局部平移和一個 3D 旋轉,將每個 3D 旋轉表示為單位四元數,並以指數格式透過球面線性插值 (SLERP) 建構其流。我們透過四元數流 (QFlow) 匹配訓練模型,並保證數值穩定性,並修正 QFlow 模型以加速其推論並改善生成蛋白質骨架的可設計性,進而提出建議的 ReQFlow 模型。實驗顯示,ReQFlow 在蛋白質骨架生成中達成最先進的效能,同時所需採樣步驟少得多,且推論時間大幅減少(例如,在生成長度為 300 的骨架時比 RFDiffusion 快 37 倍,比 Genie2 快 62 倍),證明其有效性和效率。程式碼可在 https://github.com/AngxiaoYue/ReQFlow 取得。 -##### **NLP-AKG: Few-Shot Construction of NLP Academic Knowledge Graph Based on LLM** -2502.14192v1 by Jiayin Lan, Jiaqi Li, Baoxin Wang, Ming Liu, Dayong Wu, Shijin Wang, Bing Qin +##### **MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models** +2502.14302v1 by Shrey Pandit, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, Ying Ding -Large language models (LLMs) have been widely applied in question answering -over scientific research papers. To enhance the professionalism and accuracy of -responses, many studies employ external knowledge augmentation. However, -existing structures of external knowledge in scientific literature often focus -solely on either paper entities or domain concepts, neglecting the intrinsic -connections between papers through shared domain concepts. This results in less -comprehensive and specific answers when addressing questions that combine -papers and concepts. 
To address this, we propose a novel knowledge graph -framework that captures deep conceptual relations between academic papers, -constructing a relational network via intra-paper semantic elements and -inter-paper citation relations. Using a few-shot knowledge graph construction -method based on LLM, we develop NLP-AKG, an academic knowledge graph for the -NLP domain, by extracting 620,353 entities and 2,271,584 relations from 60,826 -papers in ACL Anthology. Based on this, we propose a 'sub-graph community -summary' method and validate its effectiveness on three NLP scientific -literature question answering datasets. +Advancements in Large Language Models (LLMs) and their increasing use in +medical question-answering necessitate rigorous evaluation of their +reliability. A critical challenge lies in hallucination, where models generate +plausible yet factually incorrect outputs. In the medical domain, this poses +serious risks to patient safety and clinical decision-making. To address this, +we introduce MedHallu, the first benchmark specifically designed for medical +hallucination detection. MedHallu comprises 10,000 high-quality question-answer +pairs derived from PubMedQA, with hallucinated answers systematically generated +through a controlled pipeline. Our experiments show that state-of-the-art LLMs, +including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, +struggle with this binary hallucination detection task, with the best model +achieving an F1 score as low as 0.625 for detecting "hard" category +hallucinations. Using bidirectional entailment clustering, we show that +harder-to-detect hallucinations are semantically closer to ground truth. +Through experiments, we also show incorporating domain-specific knowledge and +introducing a "not sure" category as one of the answer categories improves the +precision and F1 scores by up to 38% relative to baselines. 
-摘要:大型语言模型 (LLM) 已广泛应用于科学研究论文的问答中。为了提高响应的专业性和准确性,许多研究采用外部知识增强。然而,科学文献中现有外部知识的结构通常仅关注论文实体或领域概念,而忽略了论文之间通过共享领域概念而形成的内在联系。这导致在解决结合论文和概念的问题时,答案不够全面和具体。为了解决这个问题,我们提出了一种新颖的知识图谱框架,该框架捕获了学术论文之间的深层概念关系,通过论文内部语义元素和论文之间的引用关系构建关系网络。我们使用基于 LLM 的少量知识图谱构建方法,从 ACL Anthology 中的 60,826 篇论文中提取了 620,353 个实体和 2,271,584 个关系,开发了 NLP 领域的学术知识图谱 NLP-AKG。在此基础上,我们提出了一种“子图社区摘要”方法,并在三个 NLP 科学文献问答数据集上验证了其有效性。 +摘要:大型語言模型 (LLM) 的進步及其在醫療問答中的使用日益增加,因此需要嚴格評估其可靠性。一個關鍵的挑戰在於幻覺,模型會產生看似合理但事實上不正確的輸出。在醫療領域,這對患者安全和臨床決策構成嚴重風險。為了解決此問題,我們推出了 MedHallu,這是第一個專門設計用於檢測醫療幻覺的基準。MedHallu 包含 10,000 個從 PubMedQA 衍生的高品質問答對,並透過受控管道系統性地產生幻覺答案。我們的實驗顯示,包括 GPT-4o、Llama-3.1 和經過醫學微調的 UltraMedical 在內的最新 LLM 難以執行這個二元幻覺檢測任務,最佳模型在檢測「困難」類別幻覺時達到的 F1 分數低至 0.625。使用雙向蘊涵聚類,我們表明較難檢測的幻覺在語義上更接近真實。透過實驗,我們還表明,納入特定領域的知識並將「不確定」類別作為其中一個答案類別,可以將精確度和 F1 分數相對於基線提高多達 38%。 -##### **Object-centric Binding in Contrastive Language-Image Pretraining** -2502.14113v1 by Rim Assouel, Pietro Astolfi, Florian Bordes, Michal Drozdzal, Adriana Romero-Soriano +##### **EyeBench: A Call for More Rigorous Evaluation of Retinal Image Enhancement** +2502.14260v1 by Wenhui Zhu, Xuanzhao Dong, Xin Li, Yujian Xiong, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Zhangsihao Yang, Yi Su, Oana Dumitrascu, Yalin Wang -Recent advances in vision language models (VLM) have been driven by -contrastive models such as CLIP, which learn to associate visual information -with their corresponding text descriptions. However, these models have -limitations in understanding complex compositional scenes involving multiple -objects and their spatial relationships. To address these challenges, we -propose a novel approach that diverges from commonly used strategies, which -rely on the design of hard-negative augmentations. Instead, our work focuses on -integrating inductive biases into pre-trained CLIP-like models to improve their -compositional understanding without using any additional hard-negatives. 
To -that end, we introduce a binding module that connects a scene graph, derived -from a text description, with a slot-structured image representation, -facilitating a structured similarity assessment between the two modalities. We -also leverage relationships as text-conditioned visual constraints, thereby -capturing the intricate interactions between objects and their contextual -relationships more effectively. Our resulting model not only enhances the -performance of CLIP-based models in multi-object compositional understanding -but also paves the way towards more accurate and sample-efficient image-text -matching of complex scenes. +Over the past decade, generative models have achieved significant success in +enhancing fundus images. However, the evaluation of these models still +presents a considerable challenge. A comprehensive evaluation benchmark for +fundus image enhancement is indispensable for three main reasons: 1) The +existing denoising metrics (e.g., PSNR, SSIM) are hard to extend to +downstream real-world clinical research (e.g., vessel morphology consistency). +2) There is a lack of comprehensive evaluation for both paired and unpaired +enhancement methods, along with the need for expert protocols to accurately +assess clinical value. 3) An ideal evaluation system should provide insights to +inform future developments of fundus image enhancement. To this end, we propose +a novel comprehensive benchmark, EyeBench, to provide insights that align +enhancement models with clinical needs, offering a foundation for future work +to improve the clinical relevance and applicability of generative models for +fundus image enhancement. EyeBench has three appealing properties: 1) +multi-dimensional clinical alignment downstream evaluation: In addition to +evaluating the enhancement task, we provide several clinically significant +downstream tasks for fundus images, including vessel segmentation, DR grading, +denoising generalization, and lesion segmentation. 
2) Medical expert-guided +evaluation design: We introduce a novel dataset that promotes comprehensive and +fair comparisons between paired and unpaired methods and includes a manual +evaluation protocol by medical experts. 3) Valuable insights: Our benchmark +study provides a comprehensive and rigorous evaluation of existing methods +across different downstream tasks, assisting medical experts in making informed +choices. Additionally, we offer further analysis of the challenges faced by +existing methods. The code is available at +https://github.com/Retinal-Research/EyeBench. -摘要:最近视觉语言模型 (VLM) 的进步是由对比模型(例如 CLIP)推动的,该模型学习将视觉信息与其对应的文本描述联系起来。然而,这些模型在理解涉及多个对象及其空间关系的复杂组合场景方面存在局限性。为了应对这些挑战,我们提出了一种新颖的方法,它偏离了常用的策略,即依赖于硬负增强设计。相反,我们的工作重点是将归纳偏差集成到预训练的类似 CLIP 的模型中,以提高其组合理解能力,而无需使用任何其他硬否定。为此,我们引入了一个绑定模块,它将从文本描述中派生的场景图与槽结构图像表示连接起来,从而促进了两种模式之间的结构化相似性评估。我们还利用关系作为文本条件的视觉约束,从而更有效地捕捉对象及其上下文关系之间的复杂交互。我们由此产生的模型不仅增强了基于 CLIP 的模型在多对象组合理解中的性能,而且还为复杂场景的更准确和样本高效的图像文本匹配铺平了道路。 +摘要:在過去的十年中,生成模型在增強眼底影像方面取得了顯著的成功。然而,這些模型的評估仍然是一個相當大的挑戰。一個全面的眼底影像增強評估基準對於三個主要原因是不可或缺的:1) 現有的去噪指標(例如 PSNR、SSIM)很難擴展到下游的真實世界臨床研究(例如血管形態一致性)。2) 缺乏對配對和非配對增強方法的全面評估,以及需要專家協議來準確評估臨床價值。3) 一個理想的評估系統應該提供見解,以告知眼底影像增強的未來發展。為此,我們提出了一個新的綜合基準 EyeBench,以提供見解,將增強模型與臨床需求相結合,為未來的研究奠定基礎,以提高生成模型在眼底影像增強方面的臨床相關性和適用性。EyeBench 有三個吸引人的特性:1) 多維臨床對齊下游評估:除了評估增強任務外,我們還為眼底影像提供了幾個臨床上重要的下游任務,包括血管分割、DR 分級、去噪泛化和病灶分割。2) 醫學專家指導的評估設計:我們引入了一個新的數據集,以促進對配對和非配對方法的全面和公平比較,並包括由醫學專家進行的手動評估協議。3) 有價值的見解:我們的基準研究提供了對現有方法在不同下游任務中的全面且嚴格的評估,協助醫學專家做出明智的選擇。此外,我們還進一步分析了現有方法面臨的挑戰。程式碼可在 https://github.com/Retinal-Research/EyeBench 獲得 ##### **Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning** 2502.14086v1 by Cole Gawin, Yidan Sun, Mayank Kejriwal @@ -7975,58 +7904,81 @@ selective retrieval, for obtaining better performance. 
摘要:大型語言模型 (LLM) 在生成類人文本和解決中等複雜度推理任務方面取得了顯著的成果,例如問答和數學問題解決。然而,它們在需要更深層認知技能的任務中的能力,例如常識理解和抽象推理,仍然處於探索不足的階段。在本文中,我們使用 ConceptNet 知識圖系統地評估了 LLM 中的抽象常識推理。我們提出了兩種提示方法:指導提示,其中模型根據提供的定義預測合理的語義關係,以及少次提示,其中模型使用示例作為指導來識別關係。我們使用 gpt-4o-mini 模型進行的實驗表明,在指導提示中,在對多個關係進行排名時獲得了一致的性能,但在模型僅限於預測一個關係時大幅下降。在少次提示中,模型在從五個關係中選擇而不是從完整集合中選擇時,其準確性顯著提高,儘管對某些關係存在顯著偏差。這些結果表明,與人類層面的理解相比,即使在商業使用的 LLM 中,抽象常識推理能力仍然存在顯著差距。然而,這些發現也強調了基於選擇性檢索的仔細提示工程的希望,以獲得更好的性能。 -##### **Neurosymbolic artificial intelligence via large language models and coherence-driven inference** -2502.13953v1 by Steve Huntsman, Jewell Thomas +##### **Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging** +2502.14064v1 by Shansong Wang, Mojtaba Safari, Qiang Li, Chih-Wei Chang, Richard LJ Qiu, Justin Roper, David S. Yu, Xiaofeng Yang -We devise an algorithm to generate sets of propositions that objectively -instantiate graphs that support coherence-driven inference. We then benchmark -the ability of large language models (LLMs) to reconstruct coherence graphs -from (a straightforward transformation of) propositions expressed in natural -language, with promising results from a single prompt to models optimized for -reasoning. Combining coherence-driven inference with consistency evaluations by -neural models may advance the state of the art in machine cognition. +Vision foundation models (VFMs) are pre-trained on extensive image datasets +to learn general representations for diverse types of data. These models can +subsequently be fine-tuned for specific downstream tasks, significantly +boosting performance across a broad range of applications. However, existing +vision foundation models that claim to be applicable to various radiology tasks +are mostly pre-trained on 3D computed tomography (CT), which benefits from the +availability of extensive 3D CT databases. 
Significant differences between CT +and magnetic resonance imaging (MRI) in imaging principles, signal +characteristics, and data distribution may hinder their practical performance +and versatility in MRI-specific applications. Here, we propose Triad, a vision +foundation model for 3D MRI. Triad adopts a widely used autoencoder +architecture to learn robust representations from 131,170 3D MRI volumes and +uses organ-independent imaging descriptions to constrain the semantic +distribution of the visual modality. The above pre-training dataset is called +Triad-131K, which is currently the largest 3D MRI pre-training dataset. We +evaluate Triad across three tasks, namely, organ/tumor segmentation, +organ/cancer classification, and medical image registration, in two data +modalities (within-domain and out-of-domain) settings using 25 downstream +datasets. By initializing models with Triad's pre-trained weights, nnUNet-Triad +improves segmentation performance by 6.88% compared to nnUNet-Scratch across 17 +datasets. Swin-B-Triad achieves a 3.97% improvement over Swin-B-Scratch in +classification tasks across five datasets. SwinUNETR-Triad improves by 4.00% +compared to SwinUNETR-Scratch in registration tasks across two datasets. Our +study demonstrates that pre-training can maximize performance when the data +modalities and organs of upstream and downstream tasks are consistent. 
-摘要:我們設計一種演算法,用來產生命題集合,以客觀地實例化支援連貫性驅動推論的圖形。接著,我們基準化大型語言模型 (LLM) 從以自然語言表達的命題(經過直接轉換)重建連貫性圖形的能力,結果顯示,單一提示就能從最佳化用於推理的模型中獲得有希望的結果。將連貫性驅動推論與神經模型的一致性評估結合起來,可能會提升機器認知的現有技術。 +摘要:視覺基礎模型 (VFM) 在廣泛的影像資料集上進行預訓練,以學習各種資料類型的通用表示。這些模型隨後可以針對特定的下游任務進行微調,大幅提升各種應用程式的效能。然而,現有的視覺基礎模型聲稱適用於各種放射學任務,但大多是針對 3D 電腦斷層攝影 (CT) 進行預訓練,這得利於廣泛的 3D CT 資料庫。CT 和磁振造影 (MRI) 在影像原理、訊號特性和資料分佈上的顯著差異,可能會阻礙其在 MRI 特定應用中的實際效能和多功能性。在此,我們提出 Triad,一個適用於 3D MRI 的視覺基礎模型。Triad 採用廣泛使用的自動編碼器架構,從 131,170 個 3D MRI 體積中學習穩健的表示,並使用與器官無關的影像描述來約束視覺模式的語義分佈。上述預訓練資料集稱為 Triad-131K,目前是最大的 3D MRI 預訓練資料集。我們在三個任務中評估 Triad,即器官/腫瘤分割、器官/癌症分類和醫學影像配準,在兩個資料模式(域內和域外)設定中使用 25 個下游資料集。透過使用 Triad 的預訓練權重初始化模型,nnUNet-Triad 在 17 個資料集中的分割效能比 nnUNet-Scratch 提升了 6.88%。Swin-B-Triad 在五個資料集的分類任務中,比 Swin-B-Scratch 提升了 3.97%。SwinUNETR-Triad 在兩個資料集的配準任務中,比 SwinUNETR-Scratch 提升了 4.00%。我們的研究證明,當上游和下游任務的資料模式和器官一致時,預訓練可以最大化效能。 -##### **Complex Ontology Matching with Large Language Model Embeddings** -2502.13619v1 by Guilherme Sousa, Rinaldo Lima, Cassia Trojahn +##### **VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare** +2502.13775v1 by Anudeex Shetty, Amin Beheshti, Mark Dras, Usman Naseem -Ontology, and more broadly, Knowledge Graph Matching is a challenging task in -which expressiveness has not been fully addressed. Despite the increasing use -of embeddings and language models for this task, approaches for generating -expressive correspondences still do not take full advantage of these models, in -particular, large language models (LLMs). This paper proposes to integrate LLMs -into an approach for generating expressive correspondences based on alignment -need and ABox-based relation discovery. The generation of correspondences is -performed by matching similar surroundings of instance sub-graphs. The -integration of LLMs results in different architectural modifications, including -label similarity, sub-graph matching, and entity matching. The performance word -embeddings, sentence embeddings, and LLM-based embeddings, was compared. 
The -results demonstrate that integrating LLMs surpasses all other models, enhancing -the baseline version of the approach with a 45\% increase in F-measure. +Alignment techniques have become central to ensuring that Large Language +Models (LLMs) generate outputs consistent with human values. However, existing +alignment paradigms often model an averaged or monolithic preference, failing +to account for the diversity of perspectives across cultures, demographics, and +communities. This limitation is particularly critical in health-related +scenarios, where plurality is essential due to the influence of culture, +religion, personal values, and conflicting opinions. Despite progress in +pluralistic alignment, no prior work has focused on health, likely due to the +unavailability of publicly available datasets. To address this gap, we +introduce VITAL, a new benchmark dataset comprising 13.1K value-laden +situations and 5.4K multiple-choice questions focused on health, designed to +assess and benchmark pluralistic alignment methodologies. Through extensive +evaluation of eight LLMs of varying sizes, we demonstrate that existing +pluralistic alignment techniques fall short in effectively accommodating +diverse healthcare beliefs, underscoring the need for tailored AI alignment in +specific domains. This work highlights the limitations of current approaches +and lays the groundwork for developing health-specific alignment solutions. 
-摘要:本体论,更广泛地说,知识图谱匹配是一项具有挑战性的任务,其中表达力尚未得到充分解决。尽管越来越多地使用嵌入和语言模型来完成此任务,但生成表达性对应关系的方法仍然没有充分利用这些模型,特别是大型语言模型 (LLM)。本文提出将 LLM 集成到一种基于对齐需求和基于 ABox 的关系发现来生成表达性对应关系的方法中。对应关系的生成是通过匹配实例子图的相似周围环境来执行的。LLM 的集成导致了不同的架构修改,包括标签相似性、子图匹配和实体匹配。比较了单词嵌入、句子嵌入和基于 LLM 的嵌入的性能。结果表明,集成 LLM 超越了所有其他模型,通过 F-measure 提高了 45% 的基准版本的方法。 +摘要:對齊技術已成為確保大型語言模型 (LLM) 產生與人類價值觀一致的輸出的核心。然而,現有的對齊範例通常會建模平均或單一的偏好,無法考量跨文化、人口統計和社群的不同觀點。此限制在與健康相關的場景中特別重要,因為在這種場景中,由於文化、宗教、個人價值觀和相互衝突的意見的影響,多元性是必要的。儘管多元對齊已取得進展,但沒有任何先前的工作專注於健康,這可能是因為缺乏公開可用的資料集。為了解決此差距,我們引入了 VITAL,這是一個新的基準資料集,包含 13.1K 個價值觀念的情境和 5.4K 個選擇題,專注於健康,旨在評估和基準多元對齊方法。透過對八個不同規模的 LLM 進行廣泛評估,我們證明現有的多元對齊技術無法有效適應不同的醫療保健信念,這強調了在特定領域中需要量身打造的 AI 對齊。這項工作突顯了當前方法的限制,並為開發特定於健康的對齊解決方案奠定了基礎。 -##### **Are Large Language Models In-Context Graph Learners?** -2502.13562v1 by Jintang Li, Ruofan Wu, Yuchang Zhu, Huizhe Zhang, Liang Chen, Zibin Zheng +##### **PeerQA: A Scientific Question Answering Dataset from Peer Reviews** +2502.13668v1 by Tim Baumgärtner, Ted Briscoe, Iryna Gurevych -Large language models (LLMs) have demonstrated remarkable in-context -reasoning capabilities across a wide range of tasks, particularly with -unstructured inputs such as language or images. However, LLMs struggle to -handle structured data, such as graphs, due to their lack of understanding of -non-Euclidean structures. As a result, without additional fine-tuning, their -performance significantly lags behind that of graph neural networks (GNNs) in -graph learning tasks. In this paper, we show that learning on graph data can be -conceptualized as a retrieval-augmented generation (RAG) process, where -specific instances (e.g., nodes or edges) act as queries, and the graph itself -serves as the retrieved context. Building on this insight, we propose a series -of RAG frameworks to enhance the in-context learning capabilities of LLMs for -graph learning tasks. 
Comprehensive evaluations demonstrate that our proposed -RAG frameworks significantly improve LLM performance on graph-based tasks, -particularly in scenarios where a pretrained LLM must be used without -modification or accessed via an API. +We present PeerQA, a real-world, scientific, document-level Question +Answering (QA) dataset. PeerQA questions have been sourced from peer reviews, +which contain questions that reviewers raised while thoroughly examining the +scientific article. Answers have been annotated by the original authors of each +paper. The dataset contains 579 QA pairs from 208 academic articles, with a +majority from ML and NLP, as well as a subset of other scientific communities +like Geoscience and Public Health. PeerQA supports three critical tasks for +developing practical QA systems: evidence retrieval, unanswerable question +classification, and answer generation. We provide a detailed analysis of the +collected dataset and conduct experiments establishing baseline systems for all +three tasks. Our experiments and analyses reveal the need for +decontextualization in document-level retrieval, where we find that even simple +decontextualization approaches consistently improve retrieval performance +across architectures. On answer generation, PeerQA serves as a challenging +benchmark for long-context modeling, as the papers have an average size of 12k +tokens. Our code and data are available at https://github.com/UKPLab/peerqa. 
-摘要:大型語言模型 (LLM) 在廣泛的任務中展示了非凡的語境推理能力,特別是對於語言或影像等非結構化輸入。然而,LLM 難以處理結構化資料,例如圖形,因為它們無法理解非歐幾何結構。因此,在沒有額外微調的情況下,它們在圖形學習任務中的表現遠遠落後於圖形神經網路 (GNN)。在本文中,我們展示了在圖形資料上學習可以被概念化為檢索增強生成 (RAG) 過程,其中特定實例(例如,節點或邊)充當查詢,而圖形本身則作為檢索的語境。基於這個見解,我們提出了一系列 RAG 架構,以增強 LLM 在圖形學習任務中的語境學習能力。全面的評估表明,我們提出的 RAG 架構顯著提升了 LLM 在基於圖形的任務上的表現,特別是在預訓練的 LLM 必須在不修改或透過 API 存取的情況下使用的場景中。 +摘要:我們提出 PeerQA,一個真實世界、科學的、文件層級的問答 (QA) 資料集。PeerQA 問題來自於同行評審,其中包含審查者在徹底審查科學文章時提出的問題。答案是由每篇論文的原始作者註解的。此資料集包含來自 208 篇學術文章的 579 個 QA 對,其中大部分來自 ML 和 NLP,以及其他科學社群(例如地球科學和公共衛生)的子集。PeerQA 支援開發實用 QA 系統的三項重要任務:證據檢索、無解答問題分類和答案產生。我們提供收集到的資料集的詳細分析,並進行實驗,為所有三項任務建立基準系統。我們的實驗和分析揭示了在文件層級檢索中去脈絡化的必要性,我們發現即使是簡單的去脈絡化方法也能持續改善跨架構的檢索效能。在答案產生方面,PeerQA 是一個用於長脈絡建模的具挑戰性基準,因為論文的平均大小為 12k 個符號。我們的程式碼和資料可於 https://github.com/UKPLab/peerqa 取得。 ##### **Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs** 2502.13555v1 by Yushi Feng, Tsai Hor Chan, Guosheng Yin, Lequan Yu @@ -8056,550 +8008,583 @@ knowledge, leading to enhanced predictive performance and interpretability. 
摘要:由於圖表資料的稀少性和雜訊,資料擴充對於圖表表示學習來說是必要的。現有的擴充方法大多忽略了從資料集中繼承的背景資訊,因為它們僅依賴於圖表的結構進行擴充。儘管一些大型語言模型 (LLM) 基於圖表學習方法獲得成功,但它們大多是白盒,需要存取開放式 LLM 的權重或潛在特徵,由於現有的 LLM 主要基於商業考量而封閉原始碼,因此難以讓所有人都能使用。為了克服這些限制,我們提出了一個黑盒背景驅動圖表資料擴充方法,在 LLM 的指導下——DemoGraph。利用文字提示作為與背景相關的資訊,我們讓 LLM 產生知識圖譜 (KG),這讓我們能夠從文字輸出中擷取結構化互動。然後,我們設計了一個動態合併模式,在訓練期間將 LLM 產生的 KG 隨機整合到原始圖表中。為了控制擴充圖表的稀疏性,我們進一步設計了一個粒度感知提示策略和一個指令微調模組,它可以根據資料集的不同粒度層級無縫產生文字提示。在各種圖表學習任務上的大量實驗驗證了我們的方法比現有的圖表資料擴充方法更有效。值得注意的是,我們的做法在涉及電子健康記錄 (EHR) 的場景中表現出色,這驗證了它對上下文知識的最大利用,從而提高了預測效能和可解釋性。 -##### **PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference** -2502.13502v1 by Burc Gokden +##### **MobileViM: A Light-weight and Dimension-independent Vision Mamba for 3D Medical Image Analysis** +2502.13524v1 by Wei Dai, Steven Wang, Jun Liu -We show that Large Language Model from Power Law Decoder Representations -(PLDR-LLM) is a foundational model whose deductive outputs are invariant -tensors up to a small perturbation. PLDR-LLM learns a singularity condition for -the deductive outputs that enable the once-inferred energy-curvature tensor -$\mathbf{G}_{LM}$ to replace the deep neural network of power law graph -attention (PLGA) generating the deductive outputs at inference. We demonstrate -that a cache for $\mathbf{G}_{LM}$ (G-cache) and KV-cache can be implemented in -a straightforward manner to improve the inference time. The invariance and -generalizable nature of deductive outputs is at a very high fidelity where -deductive outputs have same RMSE and determinant values up to 15 decimal places -after caching, and zero-shot benchmark scores remain unchanged. Ablation -studies show that learned deductive outputs have distinct loss and accuracy -characteristics from models pretrained with transferred, randomly initialized -or identity tensors as a constant tensor operator and an LLM with scaled-dot -product attention (SDPA) is a special case of PLDR-LLM where $\mathbf{G}_{LM}$ -is predefined as identity. 
The observed invariance characteristic introduces a -novel asymmetry between training and inference phases with caching. We outline -observed common characteristics of the deductive outputs for the learned -singularity condition. We provide an implementation of a training and inference -framework for PLDR-LLM with KV-cache and G-cache. +Efficient evaluation of three-dimensional (3D) medical images is crucial for +diagnostic and therapeutic practices in healthcare. Recent years have seen a +substantial uptake in applying deep learning and computer vision to analyse and +interpret medical images. Traditional approaches, such as convolutional neural +networks (CNNs) and vision transformers (ViTs), face significant computational +challenges, prompting the need for architectural advancements. Recent efforts +have led to the introduction of novel architectures like the ``Mamba'' model as +alternative solutions to traditional CNNs or ViTs. The Mamba model excels in +the linear processing of one-dimensional data with low computational demands. +However, Mamba's potential for 3D medical image analysis remains underexplored +and could face significant computational challenges as the dimension increases. +This manuscript presents MobileViM, a streamlined architecture for efficient +segmentation of 3D medical images. In the MobileViM network, we invent a new +dimension-independent mechanism and a dual-direction traversing approach to +incorporate with a vision-Mamba-based framework. MobileViM also features a +cross-scale bridging technique to improve efficiency and accuracy across +various medical imaging modalities. With these enhancements, MobileViM achieves +segmentation speeds exceeding 90 frames per second (FPS) on a single graphics +processing unit (i.e., NVIDIA RTX 4090). This performance is over 24 FPS faster +than the state-of-the-art deep learning models for processing 3D images with +the same computational resources. 
In addition, experimental evaluations +demonstrate that MobileViM delivers superior performance, with Dice similarity +scores reaching 92.72%, 86.69%, 80.46%, and 77.43% for PENGWIN, BraTS2024, +ATLAS, and Toothfairy2 datasets, respectively, which significantly surpasses +existing models. -摘要:我們展示了來自冪律解碼器表示 (PLDR-LLM) 的大型語言模型是一個基礎模型,其演繹輸出是直到一個小擾動的不變張量。PLDR-LLM 學習演繹輸出的奇異條件,使曾經推斷出的能量曲率張量 $\mathbf{G}_{LM}$ 能夠取代產生演繹輸出的冪律圖注意力 (PLGA) 深度神經網路,進行推論。我們證明了 $\mathbf{G}_{LM}$ 快取 (G 快取) 和 KV 快取能夠以一種直接的方式實作,以改善推論時間。演繹輸出的不變性和可概化性質具有非常高的保真度,其中演繹輸出在快取後具有相同的 RMSE 和行列式值,直到小數點後 15 位,且零次學習基準分數保持不變。消融研究表明,學習的演繹輸出具有與使用轉移、隨機初始化或恆等張量作為常數張量算子和具有縮放點積注意力的 LLM 預先訓練的模型不同的損失和準確性特徵,並且 $\mathbf{G}_{LM}$ 被預先定義為恆等的 PLDR-LLM 的一個特例,其中 $\mathbf{G}_{LM}$ 被預先定義為恆等。觀察到的不變特徵引入了訓練和推論階段之間一個新的不對稱性,並帶有快取。我們概述了學習的奇異條件演繹輸出的觀察到的共同特徵。我們提供了一個具有 KV 快取和 G 快取的 PLDR-LLM 訓練和推論框架的實作。 +摘要:有效評估三維 (3D) 醫學影像對於醫療保健中的診斷和治療實務至關重要。近年來,將深度學習和電腦視覺應用於分析和詮釋醫學影像的應用大幅增加。傳統方法,例如卷積神經網路 (CNN) 和視覺Transformer (ViT),面臨重大的運算挑戰,促使需要架構上的進步。最近的努力已導致引進創新的架構,例如「Mamba」模型,作為傳統 CNN 或 ViT 的替代解決方案。Mamba 模型擅長以低運算需求進行一維資料的線性處理。然而,Mamba 在 3D 醫學影像分析方面的潛力仍未被充分探索,並且隨著維度的增加可能會面臨重大的運算挑戰。本手稿提出 MobileViM,這是一種簡化的架構,可有效分割 3D 醫學影像。在 MobileViM 網路中,我們發明了一種新的與維度無關的機制和雙向遍歷方法,以與基於視覺 Mamba 的架構結合。MobileViM 還具備跨尺度橋接技術,以提高各種醫學影像模式的效率和準確性。透過這些增強功能,MobileViM 在單一顯示卡 (即 NVIDIA RTX 4090) 上達到了每秒超過 90 幀 (FPS) 的分割速度。此效能比現有最先進的深度學習模型快了超過 24 FPS,這些模型使用相同的運算資源處理 3D 影像。此外,實驗評估證明 MobileViM 提供了卓越的效能,Dice 相似性評分對於 PENGWIN、BraTS2024、ATLAS 和 Toothfairy2 資料集分別達到 92.72%、86.69%、80.46% 和 77.43%,顯著超越現有模型。 -##### **Explore-Construct-Filter: An Automated Framework for Rich and Reliable API Knowledge Graph Construction** -2502.13412v1 by Yanbang Sun, Qing Huang, Xiaoxue Ren, Zhenchang Xing, Xiaohong Li, Junjie Wang +##### **Unlocking Multimodal Integration in EHRs: A Prompt Learning Framework for Language and Time Series Fusion** +2502.13509v1 by Shuai Niu, Jing Ma, Hongzhan Lin, Liang Bai, Zhihua Wang, Wei Bi, Yida Xu, Guo Li, Xian Yang -The API Knowledge Graph (API KG) is a structured network that 
models API -entities and their relations, providing essential semantic insights for tasks -such as API recommendation, code generation, and API misuse detection. However, -constructing a knowledge-rich and reliable API KG presents several challenges. -Existing schema-based methods rely heavily on manual annotations to design KG -schemas, leading to excessive manual overhead. On the other hand, schema-free -methods, due to the lack of schema guidance, are prone to introducing noise, -reducing the KG's reliability. To address these issues, we propose the -Explore-Construct-Filter framework, an automated approach for API KG -construction based on large language models (LLMs). This framework consists of -three key modules: 1) KG exploration: LLMs simulate the workflow of annotators -to automatically design a schema with comprehensive type triples, minimizing -human intervention; 2) KG construction: Guided by the schema, LLMs extract -instance triples to construct a rich yet unreliable API KG; 3) KG filtering: -Removing invalid type triples and suspicious instance triples to construct a -rich and reliable API KG. Experimental results demonstrate that our method -surpasses the state-of-the-art method, achieving a 25.2% improvement in F1 -score. Moreover, the Explore-Construct-Filter framework proves effective, with -the KG exploration module increasing KG richness by 133.6% and the KG filtering -module improving reliability by 26.6%. Finally, cross-model experiments confirm -the generalizability of our framework. +Large language models (LLMs) have shown remarkable performance in +vision-language tasks, but their application in the medical field remains +underexplored, particularly for integrating structured time series data with +unstructured clinical notes. In clinical practice, dynamic time series data +such as lab test results capture critical temporal patterns, while clinical +notes provide rich semantic context. 
Merging these modalities is challenging +due to the inherent differences between continuous signals and discrete text. +To bridge this gap, we introduce ProMedTS, a novel self-supervised multimodal +framework that employs prompt-guided learning to unify these heterogeneous data +types. Our approach leverages lightweight anomaly detection to generate anomaly +captions that serve as prompts, guiding the encoding of raw time series data +into informative embeddings. These embeddings are aligned with textual +representations in a shared latent space, preserving fine-grained temporal +nuances alongside semantic insights. Furthermore, our framework incorporates +tailored self-supervised objectives to enhance both intra- and inter-modal +alignment. We evaluate ProMedTS on disease diagnosis tasks using real-world +datasets, and the results demonstrate that our method consistently outperforms +state-of-the-art approaches. -摘要:API 知識圖譜 (API KG) 是一個結構化網路,用於建模 API 實體及其關係,提供基本語義見解,以執行 API 建議、程式碼產生和 API 誤用偵測等任務。然而,建構一個知識豐富且可靠的 API KG 會產生若干挑戰。現有的基於架構的方法嚴重依賴手動註解來設計 KG 架構,導致過度的手動開銷。另一方面,由於缺乏架構指導,無架構的方法容易引入雜訊,降低 KG 的可靠性。為了解決這些問題,我們提出了探索建構過濾架構,這是一種基於大型語言模型 (LLM) 的自動化 API KG 建構方法。此架構包含三個關鍵模組:1) KG 探索:LLM 模擬註解者的工作流程,自動設計具有完整類型三元組的架構,將人為干預降至最低;2) KG 建構:在架構的指導下,LLM 提取實例三元組來建構豐富但不可靠的 API KG;3) KG 過濾:移除無效的類型三元組和可疑的實例三元組,以建構豐富且可靠的 API KG。實驗結果表明,我們的方法優於最先進的方法,在 F1 分數上提高了 25.2%。此外,探索建構過濾架構被證明是有效的,其中 KG 探索模組將 KG 豐富度提高了 133.6%,而 KG 過濾模組將可靠性提高了 26.6%。最後,跨模型實驗證實了我們架構的泛化性。 +摘要:大型語言模型(LLM)在視覺語言任務中表現出色,但其在醫療領域的應用仍未得到充分探索,特別是在將結構化時間序列數據與非結構化臨床筆記整合方面。在臨床實務中,動態時間序列數據(例如實驗室檢驗結果)會擷取關鍵的時間模式,而臨床筆記則提供豐富的語意脈絡。由於連續訊號與離散文字之間的固有差異,合併這些方式具有挑戰性。為了彌補這個差距,我們引入了 ProMedTS,這是一個新穎的自監督多模態框架,採用提示引導學習來統一這些異質化的數據類型。我們的做法利用輕量級異常偵測來產生異常標題,作為提示,引導將原始時間序列數據編碼成資訊性的嵌入。這些嵌入與共享潛在空間中的文字表示對齊,同時保留細微的時間差異和語意見解。此外,我們的框架納入了客製化的自監督目標,以增強模態內和模態間對齊。我們在疾病診斷任務中使用真實世界的數據集評估 ProMedTS,結果表明,我們的模型始終優於最先進的方法。 -##### **Reducing Hallucinations in Language Model-based SPARQL Query Generation Using Post-Generation Memory Retrieval** -2502.13369v1 by 
Aditya Sharma, Luis Lara, Amal Zouaq, Christopher J. Pal +##### **Towards a perturbation-based explanation for medical AI as differentiable programs** +2502.14001v1 by Takeshi Abe, Yoshiyuki Asai -The ability to generate SPARQL queries from natural language questions is -crucial for ensuring efficient and accurate retrieval of structured data from -knowledge graphs (KG). While large language models (LLMs) have been widely -adopted for SPARQL query generation, they are often susceptible to -hallucinations and out-of-distribution errors when producing KG elements like -Uniform Resource Identifiers (URIs) based on internal parametric knowledge. -This often results in content that appears plausible but is factually -incorrect, posing significant challenges for their use in real-world -information retrieval (IR) applications. This has led to increased research -aimed at detecting and mitigating such errors. In this paper, we introduce PGMR -(Post-Generation Memory Retrieval), a modular framework that incorporates a -non-parametric memory module to retrieve KG elements and enhance LLM-based -SPARQL query generation. Our experimental results indicate that PGMR -consistently delivers strong performance across diverse datasets, data -distributions, and LLMs. Notably, PGMR significantly mitigates URI -hallucinations, nearly eliminating the problem in several scenarios. +Recent advancement in machine learning algorithms reaches a point where +medical devices can be equipped with artificial intelligence (AI) models for +diagnostic support and routine automation in clinical settings. In medicine and +healthcare, there is a particular demand for sufficient and objective +explainability of the outcome generated by AI models. However, AI models are +generally considered as black boxes due to their complexity, and the +computational process leading to their response is often opaque. 
Although +several methods have been proposed to explain the behavior of models by +evaluating the importance of each feature in discrimination and prediction, +they may suffer from biases and opacities arising from the scale and sampling +protocol of the dataset used for training or testing. To overcome the +shortcomings of existing methods, we explore an alternative approach to provide +an objective explanation of AI models that can be defined independently of the +learning process and does not require additional data. As a preliminary study +for this direction of research, this work examines a numerical availability of +the Jacobian matrix of deep learning models that measures how stably a model +responses against small perturbations added to the input. The indicator, if +available, are calculated from a trained AI model for a given target input. +This is a first step towards a perturbation-based explanation, which will +assist medical practitioners in understanding and interpreting the response of +the AI model in its clinical application. 
-摘要:從自然語言問題中產生 SPARQL 查詢的能力對於確保從知識圖譜 (KG) 中有效率且準確地擷取結構化資料至關重要。儘管大型語言模型 (LLM) 已廣泛用於 SPARQL 查詢產生,但它們在根據內部參數化知識產生像統一資源識別碼 (URI) 等 KG 元素時,通常容易出現幻覺和分布外錯誤。這通常會導致內容看似合理,但事實上並不正確,對其在真實世界資訊檢索 (IR) 應用中的使用構成重大挑戰。這導致針對偵測和減輕此類錯誤的研究增加。在本文中,我們介紹 PGMR(後產生記憶體檢索),這是一個模組化架構,它結合了一個非參數記憶體模組來檢索 KG 元素並增強基於 LLM 的 SPARQL 查詢產生。我們的實驗結果表明,PGMR 在不同的資料集、資料分佈和 LLM 中始終提供強大的效能。值得注意的是,PGMR 大幅減輕了 URI 幻覺,在許多情況下幾乎消除了問題。 +摘要:機器學習演算法的最新進展已達到一個階段,醫療裝置可以配備人工智慧 (AI) 模型,以在臨床環境中提供診斷支援和例行自動化。在醫學和保健領域,對於 AI 模型產生的結果有足夠且客觀的可解釋性有特別的需求。然而,由於 AI 模型的複雜性,它們通常被視為黑盒子,而導致其反應的運算過程通常是不透明的。儘管已經提出多種方法來解釋模型的行為,方法是評估每個特徵在判別和預測中的重要性,但它們可能會受到訓練或測試所用資料集的規模和抽樣協定的偏差和不透明性的影響。為了克服現有方法的缺點,我們探索一種替代方法,以提供 AI 模型的客觀解釋,這種方法可以獨立於學習過程定義,而且不需要額外的資料。作為這個研究方向的初步研究,這項工作探討了深度學習模型的雅可比矩陣的數值可用性,它衡量了模型對輸入中新增的小擾動的穩定反應程度。如果可用,指標會從訓練好的 AI 模型計算得出,以取得給定的目標輸入。這是基於擾動的解釋的第一步,它將協助醫療從業人員了解和詮釋 AI 模型在其臨床應用中的反應。 -##### **Craw4LLM: Efficient Web Crawling for LLM Pretraining** -2502.13347v1 by Shi Yu, Zhiyuan Liu, Chenyan Xiong +##### **RGAR: Recurrence Generation-augmented Retrieval for Factual-aware Medical Question Answering** +2502.13361v1 by Sichu Liang, Linhai Zhang, Hongyu Zhu, Wenwen Wang, Yulan He, Deyu Zhou -Web crawl is a main source of large language models' (LLMs) pretraining data, -but the majority of crawled web pages are discarded in pretraining due to low -data quality. This paper presents Crawl4LLM, an efficient web crawling method -that explores the web graph based on the preference of LLM pretraining. -Specifically, it leverages the influence of a webpage in LLM pretraining as the -priority score of the web crawler's scheduler, replacing the standard graph -connectivity based priority. Our experiments on a web graph containing 900 -million webpages from a commercial search engine's index demonstrate the -efficiency of Crawl4LLM in obtaining high-quality pretraining data. 
With just -21% URLs crawled, LLMs pretrained on Crawl4LLM data reach the same downstream -performances of previous crawls, significantly reducing the crawling waste and -alleviating the burdens on websites. Our code is publicly available at -https://github.com/cxcscmu/Crawl4LLM. +Medical question answering requires extensive access to specialized +conceptual knowledge. The current paradigm, Retrieval-Augmented Generation +(RAG), acquires expertise medical knowledge through large-scale corpus +retrieval and uses this knowledge to guide a general-purpose large language +model (LLM) for generating answers. However, existing retrieval approaches +often overlook the importance of factual knowledge, which limits the relevance +of retrieved conceptual knowledge and restricts its applicability in real-world +scenarios, such as clinical decision-making based on Electronic Health Records +(EHRs). This paper introduces RGAR, a recurrence generation-augmented retrieval +framework that retrieves both relevant factual and conceptual knowledge from +dual sources (i.e., EHRs and the corpus), allowing them to interact and refine +each another. Through extensive evaluation across three factual-aware medical +question answering benchmarks, RGAR establishes a new state-of-the-art +performance among medical RAG systems. Notably, the Llama-3.1-8B-Instruct model +with RGAR surpasses the considerably larger, RAG-enhanced GPT-3.5. Our findings +demonstrate the benefit of extracting factual knowledge for retrieval, which +consistently yields improved generation quality. 
-摘要:網路爬蟲是大型語言模型 (LLM) 預訓練資料的主要來源, -但大多數已爬取的網頁在預訓練中會因為資料品質低落而被捨棄。 -本文提出 Crawl4LLM,這是一種有效率的網路爬取方法, -它會根據 LLM 預訓練的偏好來探索網路圖。 -具體來說,它利用網頁在 LLM 預訓練中的影響力作為網路爬蟲排程器的優先分數, -取代標準的圖形連線優先順序。 -我們在一個包含來自商業搜尋引擎索引的 9 億個網頁的網路圖上進行的實驗, -證明了 Crawl4LLM 在取得高品質預訓練資料方面的效率。 -只爬取了 21% 的網址,以 Crawl4LLM 資料預訓練的 LLM 就達到了先前爬取的相同下游效能, -大幅減少了爬取浪費,並減輕了對網站的負擔。 -我們的程式碼已公開於 https://github.com/cxcscmu/Crawl4LLM。 +摘要:醫療問題解答需要大量取得專業概念知識。目前的典範,檢索增強生成(RAG),透過大規模語料庫檢索取得專業醫療知識,並使用此知識引導通用大型語言模型(LLM)來產生答案。然而,現有的檢索方法經常忽略事實知識的重要性,這會限制檢索到的概念知識的相關性,並限制其在現實世界情境中的適用性,例如基於電子健康記錄(EHR)的臨床決策制定。本文介紹 RGAR,一個遞迴生成增強檢索架構,從雙重來源(即 EHR 和語料庫)檢索相關的事實和概念知識,讓它們互動並互相精煉。透過在三個事實感知醫療問題解答基準上進行廣泛評估,RGAR 在醫療 RAG 系統中建立了新的最先進效能。值得注意的是,採用 RGAR 的 Llama-3.1-8B-Instruct 模型超越了規模大得多的 RAG 增強型 GPT-3.5。我們的研究結果證明了提取事實知識以進行檢索的好處,這會持續產生改善的生成品質。 -##### **K-Paths: Reasoning over Graph Paths for Drug Repurposing and Drug Interaction Prediction** -2502.13344v1 by Tassallah Abdullahi, Ioanna Gemou, Nihal V. Nayak, Ghulam Murtaza, Stephen H. Bach, Carsten Eickhoff, Ritambhara Singh +##### **Adjust for Trust: Mitigating Trust-Induced Inappropriate Reliance on AI Assistance** +2502.13321v1 by Tejas Srinivasan, Jesse Thomason -Drug discovery is a complex and time-intensive process that requires -identifying and validating new therapeutic candidates. Computational approaches -using large-scale biomedical knowledge graphs (KGs) offer a promising solution -to accelerate this process. However, extracting meaningful insights from -large-scale KGs remains challenging due to the complexity of graph traversal. -Existing subgraph-based methods are tailored to graph neural networks (GNNs), -making them incompatible with other models, such as large language models -(LLMs). We introduce K-Paths, a retrieval framework that extracts structured, -diverse, and biologically meaningful paths from KGs. Integrating these paths -enables LLMs and GNNs to effectively predict unobserved drug-drug and -drug-disease interactions. 
Unlike traditional path-ranking approaches, K-Paths -retrieves and transforms paths into a structured format that LLMs can directly -process, facilitating explainable reasoning. K-Paths employs a diversity-aware -adaptation of Yen's algorithm to retrieve the K shortest loopless paths between -entities in an interaction query, prioritizing biologically relevant and -diverse relationships. Our experiments on benchmark datasets show that K-Paths -improves the zero-shot performance of Llama 8.1B's F1-score by 12.45 points on -drug repurposing and 13.42 points on interaction severity prediction. We also -show that Llama 70B achieves F1-score gains of 6.18 and 8.46 points, -respectively. K-Paths also improves the supervised training efficiency of -EmerGNN, a state-of-the-art GNN, by reducing KG size by 90% while maintaining -strong predictive performance. Beyond its scalability and efficiency, K-Paths -uniquely bridges the gap between KGs and LLMs, providing explainable rationales -for predicted interactions. These capabilities show that K-Paths is a valuable -tool for efficient data-driven drug discovery. +Trust biases how users rely on AI recommendations in AI-assisted +decision-making tasks, with low and high levels of trust resulting in increased +under- and over-reliance, respectively. We propose that AI assistants should +adapt their behavior through trust-adaptive interventions to mitigate such +inappropriate reliance. For instance, when user trust is low, providing an +explanation can elicit more careful consideration of the assistant's advice by +the user. In two decision-making scenarios -- laypeople answering science +questions and doctors making medical diagnoses -- we find that providing +supporting and counter-explanations during moments of low and high trust, +respectively, yields up to 38% reduction in inappropriate reliance and 20% +improvement in decision accuracy. 
We are similarly able to reduce over-reliance +by adaptively inserting forced pauses to promote deliberation. Our results +highlight how AI adaptation to user trust facilitates appropriate reliance, +presenting exciting avenues for improving human-AI collaboration. -摘要:藥物發現是一個複雜且耗時的過程,需要識別和驗證新的治療候選藥物。使用大型生物醫學知識圖譜 (KG) 的計算方法提供了一個有希望的解決方案來加速這個過程。然而,由於圖形遍歷的複雜性,從大型 KG 中提取有意義的見解仍然具有挑戰性。現有的子圖方法是針對圖神經網路 (GNN) 量身打造的,這使得它們與其他模型(例如大型語言模型 (LLM))不兼容。我們介紹了 K-Paths,這是一個檢索框架,它從 KG 中提取結構化、多樣化且具有生物意義的路徑。整合這些路徑使 LLM 和 GNN 能夠有效預測未觀察到的藥物-藥物和藥物-疾病交互。與傳統的路徑排序方法不同,K-Paths 檢索路徑並將其轉換為 LLM 可以直接處理的結構化格式,從而促進可解釋的推理。K-Paths 採用了 Yen 演算法的多樣性感知適應,以檢索交互查詢中實體之間的 K 個最短無環路徑,優先考慮生物相關且多樣化的關係。我們在基準資料集上的實驗表明,K-Paths 將 Llama 8.1B 的 F1 分數在藥物再利用上提高了 12.45 分,在交互嚴重性預測上提高了 13.42 分。我們還表明,Llama 70B 分別獲得了 6.18 分和 8.46 分的 F1 分數增益。K-Paths 還提高了最先進的 GNN EmerGNN 的監督訓練效率,同時將 KG 大小減少了 90%,同時保持強大的預測性能。除了其可擴展性和效率之外,K-Paths 獨特地彌合了 KG 和 LLM 之間的差距,為預測的交互提供了可解釋的依據。這些功能表明,K-Paths 是用於高效資料驅動藥物發現的寶貴工具。 +摘要:信任偏見影響使用者在 AI 輔助決策任務中如何依賴 AI 建議,信任程度低和高分別導致依賴不足和過度依賴。我們建議 AI 助理應透過信任適應式干預調整其行為,以減輕這種不適當的依賴。例如,當使用者信任度低時,提供解釋可以引發使用者更仔細地考慮助理的建議。在兩種決策情境中——外行人回答科學問題和醫生進行醫療診斷——我們發現,分別在信任度低和高的時刻提供支持性和反向解釋,可以將不適當的依賴降低多達 38%,並將決策準確性提高 20%。我們同樣能夠透過適應性地插入強制暫停來促進審議,以減少過度依賴。我們的結果強調 AI 如何適應使用者信任以促進適當的依賴,為改善人機協作提供了令人興奮的途徑。 -##### **Grounding LLM Reasoning with Knowledge Graphs** -2502.13247v1 by Alfonso Amayuelas, Joy Sain, Simerjot Kaur, Charese Smiley +##### **Prediction of Clinical Complication Onset using Neural Point Processes** +2502.13290v1 by Sachini Weerasekara, Sagar Kamarthi, Jacqueline Isaacs -Knowledge Graphs (KGs) are valuable tools for representing relationships -between entities in a structured format. Traditionally, these knowledge bases -are queried to extract specific information. However, question-answering (QA) -over such KGs poses a challenge due to the intrinsic complexity of natural -language compared to the structured format and the size of these graphs. 
-Despite these challenges, the structured nature of KGs can provide a solid -foundation for grounding the outputs of Large Language Models (LLMs), offering -organizations increased reliability and control. - Recent advancements in LLMs have introduced reasoning methods at inference -time to improve their performance and maximize their capabilities. In this -work, we propose integrating these reasoning strategies with KGs to anchor -every step or "thought" of the reasoning chains in KG data. Specifically, we -evaluate both agentic and automated search methods across several reasoning -strategies, including Chain-of-Thought (CoT), Tree-of-Thought (ToT), and -Graph-of-Thought (GoT), using GRBench, a benchmark dataset for graph reasoning -with domain-specific graphs. Our experiments demonstrate that this approach -consistently outperforms baseline models, highlighting the benefits of -grounding LLM reasoning processes in structured KG data. +Predicting medical events in advance within critical care settings is +paramount for patient outcomes and resource management. Utilizing predictive +models, healthcare providers can anticipate issues such as cardiac arrest, +sepsis, or respiratory failure before they manifest. Recently, there has been a +surge in research focusing on forecasting adverse medical event onsets prior to +clinical manifestation using machine learning. However, while these models +provide temporal prognostic predictions for the occurrence of a specific +adverse event of interest within defined time intervals, their interpretability +often remains a challenge. In this work, we explore the applicability of neural +temporal point processes in the context of adverse event onset prediction, with +the aim of explaining clinical pathways and providing interpretable insights. +Our experiments span six state-of-the-art neural point processes and six +critical care datasets, each focusing on the onset of distinct adverse events. 
+This work represents a novel application class of neural temporal point +processes in event prediction. -摘要:知識圖譜 (KG) 是以結構化格式表示實體之間關係的寶貴工具。傳統上,這些知識庫會被查詢以萃取特定資訊。然而,由於自然語言與結構化格式之間的內在複雜性,以及這些圖譜的規模,在這些 KG 上進行問答 (QA) 會構成挑戰。儘管有這些挑戰,KG 的結構化特性可以為大型語言模型 (LLM) 的輸出提供穩固的基礎,為組織提供更高的可靠性和控制力。 -LLM 的最新進展在推論時間引入了推理方法,以提升其效能並最大化其能力。在這項工作中,我們建議將這些推理策略與 KG 整合,以將推理鏈的每一步或「思考」錨定在 KG 資料中。具體來說,我們在多種推理策略中評估代理和自動化搜尋方法,包括思考鏈 (CoT)、思考樹 (ToT) 和思考圖 (GoT),使用 GRBench,這是一個針對圖形推理的基準資料集,其中包含特定領域的圖形。我們的實驗證明,這種方法始終優於基準模型,突顯了將 LLM 推理過程建立在結構化 KG 資料中的好處。 +摘要:在重症監護環境中預先預測醫療事件對於患者的預後和資源管理至關重要。利用預測模型,醫療保健提供者可以在心臟驟停、敗血症或呼吸衰竭等問題發生之前預測到這些問題。最近,專注於在臨床表現之前使用機器學習預測不良醫療事件發生的研究激增。然而,儘管這些模型為特定不良事件在定義的時間間隔內發生提供了時間預後預測,但它們的可解釋性仍然是一個挑戰。在這項工作中,我們探討了神經時間點過程在不良事件發作預測中的適用性,目的是解釋臨床途徑並提供可解釋的見解。我們的實驗涵蓋了六種最先進的神經點過程和六個重症監護資料集,每個資料集都專注於不同不良事件的發作。這項工作代表了神經時間點過程在事件預測中的一種新的應用類別。 -##### **Learning to Defer for Causal Discovery with Imperfect Experts** -2502.13132v1 by Oscar Clivio, Divyat Mahajan, Perouz Taslakian, Sara Magliacane, Ioannis Mitliagkas, Valentina Zantedeschi, Alexandre Drouin +##### **SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?** +2502.13233v1 by Yucheng Shi, Tianze Yang, Canyu Chen, Quanzheng Li, Tianming Liu, Xiang Li, Ninghao Liu -Integrating expert knowledge, e.g. from large language models, into causal -discovery algorithms can be challenging when the knowledge is not guaranteed to -be correct. Expert recommendations may contradict data-driven results, and -their reliability can vary significantly depending on the domain or specific -query. Existing methods based on soft constraints or inconsistencies in -predicted causal relationships fail to account for these variations in -expertise. To remedy this, we propose L2D-CD, a method for gauging the -correctness of expert recommendations and optimally combining them with -data-driven causal discovery results. 
By adapting learning-to-defer (L2D) -algorithms for pairwise causal discovery (CD), we learn a deferral function -that selects whether to rely on classical causal discovery methods using -numerical data or expert recommendations based on textual meta-data. We -evaluate L2D-CD on the canonical T\"ubingen pairs dataset and demonstrate its -superior performance compared to both the causal discovery method and the -expert used in isolation. Moreover, our approach identifies domains where the -expert's performance is strong or weak. Finally, we outline a strategy for -generalizing this approach to causal discovery on graphs with more than two -variables, paving the way for further research in this area. +Large Language Models (LLMs) have shown remarkable capabilities in general +domains but often struggle with tasks requiring specialized knowledge. +Conventional Retrieval-Augmented Generation (RAG) techniques typically retrieve +external information from static knowledge bases, which can be outdated or +incomplete, missing fine-grained clinical details essential for accurate +medical question answering. In this work, we propose SearchRAG, a novel +framework that overcomes these limitations by leveraging real-time search +engines. Our method employs synthetic query generation to convert complex +medical questions into search-engine-friendly queries and utilizes +uncertainty-based knowledge selection to filter and incorporate the most +relevant and informative medical knowledge into the LLM's input. Experimental +results demonstrate that our method significantly improves response accuracy in +medical question answering tasks, particularly for complex questions requiring +detailed and up-to-date knowledge. 
-摘要:整合专家知識,例如從大型語言模型中整合到因果發現演算法中,當知識無法保證正確時會很有挑戰性。專家建議可能會與資料驅動的結果相矛盾,而且他們的可靠性可能會根據領域或特定查詢而有顯著差異。現有的基於軟約束或預測因果關係中不一致的方法無法說明專業知識中的這些變化。為了補救這一點,我們提出了 L2D-CD,一種用於評估專家建議的正確性並將其與資料驅動的因果發現結果最佳結合的方法。透過調整學習延遲 (L2D) 演算法以進行成對因果發現 (CD),我們學習了一個延遲函數,用於選擇依賴使用數值資料的傳統因果發現方法或基於文字元資料的專家建議。我們在經典的 T\"ubingen 對資料集上評估 L2D-CD,並證明其與單獨使用的因果發現方法和專家相比具有優越的效能。此外,我們的做法識別出專家表現強或弱的領域。最後,我們概述了一種將此方法推廣到具有兩個以上變數的圖表上進行因果發現的策略,為此領域的進一步研究鋪平了道路。 +摘要:大型語言模型 (LLM) 在一般領域展現出驚人的能力,但經常在需要專業知識的任務中掙扎。 +傳統的檢索增強生成 (RAG) 技術通常從靜態知識庫中檢索外部資訊,這些資訊可能過時或不完整,缺少準確回答醫療問題所需的細微臨床細節。在這項工作中,我們提出 SearchRAG,這是一種新穎的架構,透過利用即時搜尋引擎克服這些限制。我們的模型採用合成查詢生成,將複雜的醫療問題轉換成搜尋引擎友善的查詢,並利用基於不確定性的知識選擇來過濾和納入 LLM 輸入中最相關且最有資訊的醫療知識。實驗結果證明,我們的模型顯著改善了醫療問題回答任務中的回應準確度,特別是需要詳細且最新的知識的複雜問題。 -##### **Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks** -2502.13025v1 by Markus J. Buehler +##### **Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions** +2502.13135v1 by Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić -We present an agentic, autonomous graph expansion framework that iteratively -structures and refines knowledge in situ. Unlike conventional knowledge graph -construction methods relying on static extraction or single-pass learning, our -approach couples a reasoning-native large language model with a continually -updated graph representation. At each step, the system actively generates new -concepts and relationships, merges them into a global graph, and formulates -subsequent prompts based on its evolving structure. Through this -feedback-driven loop, the model organizes information into a scale-free network -characterized by hub formation, stable modularity, and bridging nodes that link -disparate knowledge clusters. 
Over hundreds of iterations, new nodes and edges -continue to appear without saturating, while centrality measures and shortest -path distributions evolve to yield increasingly distributed connectivity. Our -analysis reveals emergent patterns, such as the rise of highly connected 'hub' -concepts and the shifting influence of 'bridge' nodes, indicating that agentic, -self-reinforcing graph construction can yield open-ended, coherent knowledge -structures. Applied to materials design problems, we present compositional -reasoning experiments by extracting node-specific and synergy-level principles -to foster genuinely novel knowledge synthesis, yielding cross-domain ideas that -transcend rote summarization and strengthen the framework's potential for -open-ended scientific discovery. We discuss other applications in scientific -discovery and outline future directions for enhancing scalability and -interpretability. +We present an end-to-end framework for generating synthetic users for +evaluating interactive agents designed to encourage positive behavior changes, +such as in health and lifestyle coaching. The synthetic users are grounded in +health and lifestyle conditions, specifically sleep and diabetes management in +this study, to ensure realistic interactions with the health coaching agent. +Synthetic users are created in two stages: first, structured data are generated +grounded in real-world health and lifestyle factors in addition to basic +demographics and behavioral attributes; second, full profiles of the synthetic +users are developed conditioned on the structured data. Interactions between +synthetic users and the coaching agent are simulated using generative +agent-based models such as Concordia, or directly by prompting a language +model. 
Using two independently-developed agents for sleep and diabetes coaching +as case studies, the validity of this framework is demonstrated by analyzing +the coaching agent's understanding of the synthetic users' needs and +challenges. Finally, through multiple blinded evaluations of user-coach +interactions by human experts, we demonstrate that our synthetic users with +health and behavioral attributes more accurately portray real human users with +the same attributes, compared to generic synthetic users not grounded in such +attributes. The proposed framework lays the foundation for efficient +development of conversational agents through extensive, realistic, and grounded +simulated interactions. -摘要:我們提出一個能動的、自主的圖形擴展框架,它反覆地建構和精煉原位知識。與依賴靜態提取或單次學習的傳統知識圖形建構方法不同,我們的做法將一個推理原生的大語言模型與一個持續更新的圖形表示結合起來。在每一步中,系統主動產生新的概念和關係,將它們合併到一個全域圖形中,並根據其不斷演化的結構制定後續提示。透過這個回饋驅動的迴圈,模型將資訊組織成一個無標度網路,其特徵是樞紐形成、穩定的模組化以及連結不同知識群集的橋接節點。在數百次反覆運算中,新的節點和邊緣會持續出現,而不會飽和,同時中心性測量和最短路徑分佈會演化為產生越來越分散的連通性。我們的分析揭示了新興模式,例如高度連接的「樞紐」概念的興起和「橋樑」節點影響力的轉移,這表明能動的、自我強化的圖形建構可以產生開放式、連貫的知識結構。應用於材料設計問題,我們提出組合推理實驗,透過提取特定於節點的原則和協同效應層級原則,以促進真正新穎的知識綜合,產生超越死背式摘要並強化框架在開放式科學發現中潛力的跨領域想法。我們討論了在科學發現中的其他應用,並概述了增強可擴充性和可解釋性的未來方向。 +摘要:我們提供了一個端到端的架構,用於為評估互動式代理生成合成使用者,這些代理旨在鼓勵正向行為改變,例如健康和生活方式指導。合成使用者以健康和生活方式狀況為基礎,特別是本研究中的睡眠和糖尿病管理,以確保與健康指導代理的互動具有真實性。合成使用者分兩個階段建立:首先,除了基本人口統計資料和行為屬性外,還會產生以現實世界的健康和生活方式因素為基礎的結構化資料;其次,會根據結構化資料開發合成使用者的完整個人資料。合成使用者和指導代理之間的互動是使用生成式基於代理的模型(例如 Concordia)模擬的,或者直接通過提示語言模型來模擬。使用兩個獨立開發的睡眠和糖尿病指導代理作為案例研究,通過分析指導代理對合成使用者需求和挑戰的理解,證明了此架構的有效性。最後,通過人類專家對使用者指導互動進行多重盲測評估,我們證明了與未以這些屬性為基礎的通用合成使用者相比,具有健康和行為屬性的合成使用者更準確地描繪了具有相同屬性的真實人類使用者。所提出的架構為通過廣泛、真實且有根據的模擬互動,為對話代理的有效開發奠定了基礎。 -##### **Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge** -2502.13010v1 by Mohammad Reza Rezaei, Reza Saadati Fard, Jayson Parker, Rahul G. 
Krishnan, Milad Lankarany +##### **Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization** +2502.13108v1 by Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Amit Agarwal, Bhargava Kumar, Srikant Panda, Tejaswini Kumar -Large Language Models (LLMs) have significantly advanced medical -question-answering by leveraging extensive clinical data and medical -literature. However, the rapid evolution of medical knowledge and the -labor-intensive process of manually updating domain-specific resources pose -challenges to the reliability of these systems. To address this, we introduce -Adaptive Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates -the construction and continuous updating of medical knowledge graphs, -integrates reasoning, and retrieves current external evidence, such as PubMed -and WikiSearch. By dynamically linking new findings and complex medical -concepts, AMG-RAG not only improves accuracy but also enhances interpretability -in medical queries. - Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness -of AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of -66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to -100 times larger. Notably, these improvements are achieved without increasing -computational overhead, highlighting the critical role of automated knowledge -graph generation and external evidence retrieval in delivering up-to-date, -trustworthy medical insights. +Clinical Question Answering (CQA) plays a crucial role in medical +decision-making, enabling physicians to extract relevant information from +Electronic Medical Records (EMRs). 
While transformer-based models such as BERT, +BioBERT, and ClinicalBERT have demonstrated state-of-the-art performance in +CQA, existing models lack the ability to categorize extracted answers, which is +critical for structured retrieval, content filtering, and medical decision +support. + To address this limitation, we introduce a Multi-Task Learning (MTL) +framework that jointly trains CQA models for both answer extraction and medical +categorization. In addition to predicting answer spans, our model classifies +responses into five standardized medical categories: Diagnosis, Medication, +Symptoms, Procedure, and Lab Reports. This categorization enables more +structured and interpretable outputs, making clinical QA models more useful in +real-world healthcare settings. + We evaluate our approach on emrQA, a large-scale dataset for medical question +answering. Results show that MTL improves F1-score by 2.2% compared to standard +fine-tuning, while achieving 90.7% accuracy in answer categorization. These +findings suggest that MTL not only enhances CQA performance but also introduces +an effective mechanism for categorization and structured medical information +retrieval. 
-摘要:大型語言模型 (LLM) 透過利用廣泛的臨床資料和醫學文獻,大幅提升了醫療問題解答的進步。然而,醫療知識的快速演進和手動更新特定領域資源的繁複程序,對這些系統的可靠性構成挑戰。為了解決這個問題,我們引入了適應性醫療圖表 RAG (AMG-RAG),這是一個自動化建構和持續更新醫療知識圖表的綜合架構,整合推理並擷取 PubMed 和 WikiSearch 等最新的外部證據。透過動態連結新的發現和複雜的醫療概念,AMG-RAG 不僅提升了準確性,也增強了醫療查詢的可解釋性。在 MEDQA 和 MEDMCQA 基準上的評量證明了 AMG-RAG 的有效性,在 MEDQA 上達到了 74.1% 的 F1 分數,在 MEDMCQA 上達到了 66.34% 的準確度,優於其他同類模型以及那些大 10 到 100 倍的模型。值得注意的是,這些改進是在不增加運算負擔的情況下實現的,突顯了自動化知識圖表生成和外部證據擷取在提供最新、可信賴的醫療見解中扮演的重要角色。 +摘要:臨床問答 (CQA) 在醫療決策中扮演著至關重要的角色,讓醫師能夠從電子病歷 (EMR) 中擷取相關資訊。儘管 BERT、BioBERT 和 ClinicalBERT 等基於轉換器的模型已在 CQA 中展現出最先進的效能,但現有的模型缺乏分類擷取答案的能力,這對於結構化檢索、內容過濾和醫療決策支援至關重要。 + 為了解決這個限制,我們引進了一個多任務學習 (MTL) 架構,它同時訓練 CQA 模型用於答案擷取和醫療分類。除了預測答案範圍,我們的模型將回應分類為五個標準化醫療類別:診斷、藥物、症狀、程序和實驗室報告。這種分類能產生更結構化且易於理解的輸出,讓臨床問答模型在真實世界的醫療保健環境中更實用。 + 我們在 emrQA 上評估我們的做法,emrQA 是用於醫療問題解答的大規模資料集。結果顯示,與標準微調相比,MTL 將 F1 分數提高了 2.2%,同時在答案分類中達到 90.7% 的準確度。這些發現表明,MTL 不僅增強了 CQA 的效能,還引入了一種分類和結構化醫療資訊檢索的有效機制。 -##### **R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs** -2502.12767v1 by Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi +##### **Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection** +2502.12948v1 by Athira J Jacob, Puneet Sharma, Daniel Rueckert -Recent studies have combined Large Language Models (LLMs) with Knowledge -Graphs (KGs) to enhance reasoning, improving inference accuracy without -additional training while mitigating hallucination. However, existing -frameworks are often rigid, struggling to adapt to KG or task changes. They -also rely heavily on powerful LLMs for reliable (i.e., trustworthy) reasoning. -To address this, We introduce R2-KG, a plug-and-play, dual-agent framework that -separates reasoning into two roles: an Operator (a low-capacity LLM) that -gathers evidence and a Supervisor (a high-capacity LLM) that makes final -judgments. This design is cost-efficient for LLM inference while still -maintaining strong reasoning accuracy. 
Additionally, R2-KG employs an -Abstention mechanism, generating answers only when sufficient evidence is -collected from KG, which significantly enhances reliability. Experiments across -multiple KG-based reasoning tasks show that R2-KG consistently outperforms -baselines in both accuracy and reliability, regardless of the inherent -capability of LLMs used as the Operator. Further experiments reveal that the -single-agent version of R2-KG, equipped with a strict self-consistency -strategy, achieves significantly higher-than-baseline reliability while -reducing inference cost. However, it also leads to a higher abstention rate in -complex KGs. Our findings establish R2-KG as a flexible and cost-effective -solution for KG-based reasoning. It reduces reliance on high-capacity LLMs -while ensuring trustworthy inference. +Detection of hyperenhancement from cardiac LGE MRI images is a complex task +requiring significant clinical expertise. Although deep learning-based models +have shown promising results for the task, they require large amounts of data +with fine-grained annotations. Clinical reports generated for cardiac MR +studies contain rich, clinically relevant information, including the location, +extent and etiology of any scars present. Although recently developed +CLIP-based training enables pretraining models with image-text pairs, it +requires large amounts of data and further finetuning strategies on downstream +tasks. In this study, we use various strategies rooted in domain knowledge to +train a model for LGE detection solely using text from clinical reports, on a +relatively small clinical cohort of 965 patients. We improve performance +through the use of synthetic data augmentation, by systematically creating scar +images and associated text. In addition, we standardize the orientation of the +images in an anatomy-informed way to enable better alignment of spatial and +text features. 
We also use a captioning loss to enable fine-grained supervision +and explore the effect of pretraining of the vision encoder on performance. +Finally, ablation studies are carried out to elucidate the contributions of +each design component to the overall performance of the model. -摘要:最近的研究结合了大型语言模型 (LLM) 与知识图谱 (KG) 以增强推理,在不额外训练的情况下提高推理准确性,同时减轻幻觉。然而,现有的框架通常很僵化,难以适应知识图谱或任务的变化。它们还严重依赖强大的 LLM 来进行可靠(即值得信赖)的推理。为了解决这个问题,我们引入了 R2-KG,这是一个即插即用、双代理框架,它将推理分为两个角色:一个收集证据的操作员(低容量 LLM)和一个做出最终判断的监督员(高容量 LLM)。这种设计在 LLM 推理方面具有成本效益,同时仍保持强大的推理准确性。此外,R2-KG 采用弃权机制,仅在从知识图谱收集到足够证据时才生成答案,这显著提高了可靠性。跨多个基于知识图谱的推理任务的实验表明,R2-KG 在准确性和可靠性方面始终优于基线,而与用作操作员的 LLM 的固有能力无关。进一步的实验表明,R2-KG 的单代理版本配备了严格的自一致性策略,实现了明显高于基线的可靠性,同时降低了推理成本。然而,它也导致了复杂知识图谱中更高的弃权率。我们的发现将 R2-KG 确立为一种灵活且经济高效的基于知识图谱的推理解决方案。它减少了对高容量 LLM 的依赖,同时确保了可信的推理。 +摘要:從心臟 LGE MRI 影像偵測出過度增強是一項複雜的任務,需要顯著的臨床專業知識。儘管基於深度學習的模型已顯示出對這項任務有前景的結果,但它們需要大量具有細緻註解的資料。為心臟 MR 研究產生的臨床報告包含豐富且臨床上相關的資訊,包括任何疤痕的位置、範圍和病因。儘管最近開發的基於 CLIP 的訓練能使用影像文字對預訓練模型,但它需要大量資料和進一步微調下游任務的策略。在這項研究中,我們使用植基於領域知識的各種策略,僅使用來自臨床報告的文字,在一個相對較小的 965 名患者臨床群體中訓練一個 LGE 偵測模型。我們透過使用合成資料擴充來改善效能,系統性地建立疤痕影像和相關文字。此外,我們以解剖學告知的方式標準化影像方向,以使空間和文字特徵能更好地對齊。我們也使用標題損失來啟用細緻的監督,並探討視覺編碼器的預訓練對效能的影響。最後,進行消融研究以闡明每個設計元件對模型整體效能的貢獻。 -##### **Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research** -2502.12669v1 by Xiang Liu, Penglei Sun, Shuyan Chen, Longhan Zhang, Peijie Dong, Huajie You, Yongqi Zhang, Chang Yan, Xiaowen Chu, Tong-yi Zhang +##### **Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models** +2502.12825v2 by Rubing Li, João Sedoc, Arun Sundararajan -The rapid advancement of perovskite solar cells (PSCs) has led to an -exponential growth in research publications, creating an urgent need for -efficient knowledge management and reasoning systems in this domain. We present -a comprehensive knowledge-enhanced system for PSCs that integrates three key -components. 
First, we develop Perovskite-KG, a domain-specific knowledge graph -constructed from 1,517 research papers, containing 23,789 entities and 22,272 -relationships. Second, we create two complementary datasets: Perovskite-Chat, -comprising 55,101 high-quality question-answer pairs generated through a novel -multi-agent framework, and Perovskite-Reasoning, containing 2,217 carefully -curated materials science problems. Third, we introduce two specialized large -language models: Perovskite-Chat-LLM for domain-specific knowledge assistance -and Perovskite-Reasoning-LLM for scientific reasoning tasks. Experimental -results demonstrate that our system significantly outperforms existing models -in both domain-specific knowledge retrieval and scientific reasoning tasks, -providing researchers with effective tools for literature review, experimental -design, and complex problem-solving in PSC research. +When encountering increasingly frequent performance improvements or cost +reductions from a new large language model (LLM), developers of applications +leveraging LLMs must decide whether to take advantage of these improvements or +stay with older tried-and-tested models. Low perceived switching frictions can +lead to choices that do not consider more subtle behavior changes that the +transition may induce. Our experiments use a popular game-theoretic behavioral +economics model of trust to show stark differences in the trusting behavior of +OpenAI's and DeepSeek's models. We highlight a collapse in the economic trust +behavior of the o1-mini and o3-mini models as they reconcile profit-maximizing +and risk-seeking with future returns from trust, and contrast it with +DeepSeek's more sophisticated and profitable trusting behavior that stems from +an ability to incorporate deeper concepts like forward planning and +theory-of-mind. 
As LLMs form the basis for high-stakes commercial systems, our +results highlight the perils of relying on LLM performance benchmarks that are +too narrowly defined and suggest that careful analysis of their hidden fault +lines should be part of any organization's AI strategy. -摘要:由於 perovskite 太陽能電池 (PSC) 快速進展,導致研究出版物呈指數成長,迫切需要在這領域建立有效的知識管理和推理系統。我們提出一個結合三項關鍵元件的 PSC 全面知識增強系統。首先,我們開發出 Perovskite-KG,一個由 1,517 篇研究論文建構而成、包含 23,789 個實體和 22,272 個關係的領域特定知識圖譜。其次,我們建立兩個互補的資料集:Perovskite-Chat,包含透過一個新穎的多代理架構產生 55,101 個高品質問答配對;以及 Perovskite-Reasoning,包含 2,217 個仔細策展的材料科學問題。第三,我們推出兩個專門化大型語言模型:針對領域特定知識協助的 Perovskite-Chat-LLM,以及針對科學推理任務的 Perovskite-Reasoning-LLM。實驗結果顯示,我們的系統在領域特定知識擷取和科學推理任務上都明顯優於現有模型,為研究人員提供有效的工具,用於 PSC 研究中的文獻回顧、實驗設計和複雜問題解決。 +摘要:在遇到大型語言模型 (LLM) 頻頻帶來的效能提升或成本降低時,利用 LLM 的應用程式開發人員必須決定是否要利用這些提升,或繼續使用較舊且經過驗證的模型。低感知切換摩擦可能會導致選擇,而沒有考慮轉換可能引發的更細微行為變更。我們的實驗使用流行的博弈論行為經濟信任模型,以顯示 OpenAI 和 DeepSeek 模型在信任行為上的顯著差異。我們強調 o1-mini 和 o3-mini 模型的經濟信任行為崩潰,因為它們調和了利潤最大化和冒險,以及來自信任的未來回報,並將其與 DeepSeek 更複雜且有利可圖的信任行為進行對比,這種行為源於整合更深入的概念,例如前瞻性規劃和心智理論。由於 LLM 構成高風險商業系統的基礎,我們的結果突顯了依賴定義過於狹窄的 LLM 效能基準的危險,並建議仔細分析其隱藏的斷層線應該是任何組織 AI 策略的一部分。 -##### **G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation** -2502.12586v1 by Yuhan Li, Xinni Zhang, Linhao Luo, Heng Chang, Yuxiang Ren, Irwin King, Jia Li +##### **LLM Safety for Children** +2502.12552v1 by Prasanjit Rath, Hari Shrawgi, Parag Agrawal, Sandipan Dandapat -Explainable recommendation has demonstrated significant advantages in -informing users about the logic behind recommendations, thereby increasing -system transparency, effectiveness, and trustworthiness. To provide -personalized and interpretable explanations, existing works often combine the -generation capabilities of large language models (LLMs) with collaborative -filtering (CF) information. CF information extracted from the user-item -interaction graph captures the user behaviors and preferences, which is crucial -for providing informative explanations. 
However, due to the complexity of graph -structure, effectively extracting the CF information from graphs still remains -a challenge. Moreover, existing methods often struggle with the integration of -extracted CF information with LLMs due to its implicit representation and the -modality gap between graph structures and natural language explanations. To -address these challenges, we propose G-Refer, a framework using graph -retrieval-augmented large language models (LLMs) for explainable -recommendation. Specifically, we first employ a hybrid graph retrieval -mechanism to retrieve explicit CF signals from both structural and semantic -perspectives. The retrieved CF information is explicitly formulated as -human-understandable text by the proposed graph translation and accounts for -the explanations generated by LLMs. To bridge the modality gap, we introduce -knowledge pruning and retrieval-augmented fine-tuning to enhance the ability of -LLMs to process and utilize the retrieved CF information to generate -explanations. Extensive experiments show that G-Refer achieves superior -performance compared with existing methods in both explainability and -stability. Codes and data are available at https://github.com/Yuhan1i/G-Refer. +This paper analyzes the safety of Large Language Models (LLMs) in +interactions with children below age of 18 years. Despite the transformative +applications of LLMs in various aspects of children's lives such as education +and therapy, there remains a significant gap in understanding and mitigating +potential content harms specific to this demographic. The study acknowledges +the diverse nature of children often overlooked by standard safety evaluations +and proposes a comprehensive approach to evaluating LLM safety specifically for +children. We list down potential risks that children may encounter when using +LLM powered applications. 
Additionally we develop Child User Models that +reflect the varied personalities and interests of children informed by +literature in child care and psychology. These user models aim to bridge the +existing gap in child safety literature across various fields. We utilize Child +User Models to evaluate the safety of six state of the art LLMs. Our +observations reveal significant safety gaps in LLMs particularly in categories +harmful to children but not adults -摘要:可解釋建議已證明在告知使用者建議背後的邏輯方面具有顯著優點,從而提高系統透明度、有效性和可信度。為了提供個人化且可解釋的說明,現有作品通常結合大型語言模型 (LLM) 的生成能力與協同過濾 (CF) 資訊。從使用者項目互動圖形中提取的 CF 資訊會擷取使用者行為和偏好,這對於提供資訊性說明至關重要。然而,由於圖形結構的複雜性,從圖形中有效提取 CF 資訊仍然是一個挑戰。此外,現有方法通常難以將提取的 CF 資訊與 LLM 整合,因為其隱含表示和圖形結構與自然語言說明之間的模式差距。為了應對這些挑戰,我們提出 G-Refer,一個使用圖形檢索增強型大型語言模型 (LLM) 的可解釋建議架構。具體來說,我們首先採用混合圖形檢索機制,從結構和語義角度檢索明確的 CF 訊號。檢索到的 CF 資訊由建議的圖形翻譯明確表述為人類可以理解的文字,並說明 LLM 生成的解釋。為了彌合模式差距,我們引入了知識修剪和檢索增強微調,以增強 LLM 處理和利用檢索到的 CF 資訊以產生解釋的能力。廣泛的實驗表明,與現有方法相比,G-Refer 在可解釋性和穩定性方面都取得了卓越的效能。程式碼和資料可在 https://github.com/Yuhan1i/G-Refer 取得。 +摘要:本文分析了大型語言模型 (LLM) 在與 18 歲以下兒童互動時的安全性。儘管 LLM 在兒童生活的各個方面(例如教育和治療)都有轉變性的應用,但在了解和減輕對這個群體具體的潛在內容危害方面仍然存在顯著差距。研究承認兒童的多樣性,而標準安全評估通常會忽略這些多樣性,並提出了一種針對兒童評估 LLM 安全性的綜合方法。我們列出了兒童在使用由 LLM 提供動力的應用程式時可能遇到的潛在風險。此外,我們開發了兒童使用者模型,這些模型反映了兒童不同的個性特質和興趣,並參考了兒童照護和心理學的文獻。這些使用者模型旨在彌合不同領域兒童安全文獻中現有的差距。我們利用兒童使用者模型來評估六個最先進的 LLM 的安全性。我們的觀察結果揭示了 LLM 中的重大安全漏洞,特別是在對兒童有害但對成年人無害的類別中 -##### **A-MEM: Agentic Memory for LLM Agents** -2502.12110v1 by Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang +##### **Classifiers of Data Sharing Statements in Clinical Trial Records** +2502.12362v1 by Saber Jelodari Mamaghani, Cosima Strantz, Dennis Toddenroth -While large language model (LLM) agents can effectively use external tools -for complex real-world tasks, they require memory systems to leverage -historical experiences. Current memory systems enable basic storage and -retrieval but lack sophisticated memory organization, despite recent attempts -to incorporate graph databases. 
Moreover, these systems' fixed operations and -structures limit their adaptability across diverse tasks. To address this -limitation, this paper proposes a novel agentic memory system for LLM agents -that can dynamically organize memories in an agentic way. Following the basic -principles of the Zettelkasten method, we designed our memory system to create -interconnected knowledge networks through dynamic indexing and linking. When a -new memory is added, we generate a comprehensive note containing multiple -structured attributes, including contextual descriptions, keywords, and tags. -The system then analyzes historical memories to identify relevant connections, -establishing links where meaningful similarities exist. Additionally, this -process enables memory evolution - as new memories are integrated, they can -trigger updates to the contextual representations and attributes of existing -historical memories, allowing the memory network to continuously refine its -understanding. Our approach combines the structured organization principles of -Zettelkasten with the flexibility of agent-driven decision making, allowing for -more adaptive and context-aware memory management. Empirical experiments on six -foundation models show superior improvement against existing SOTA baselines. -The source code is available at https://github.com/WujiangXu/AgenticMemory. +Digital individual participant data (IPD) from clinical trials are +increasingly distributed for potential scientific reuse. The identification of +available IPD, however, requires interpretations of textual data-sharing +statements (DSS) in large databases. Recent advancements in computational +linguistics include pre-trained language models that promise to simplify the +implementation of effective classifiers based on textual inputs. 
In a subset of +5,000 textual DSS from ClinicalTrials.gov, we evaluate how well classifiers +based on domain-specific pre-trained language models reproduce original +availability categories as well as manually annotated labels. Typical metrics +indicate that classifiers that predicted manual annotations outperformed those +that learned to output the original availability categories. This suggests that +the textual DSS descriptions contain applicable information that the +availability categories do not, and that such classifiers could thus aid the +automatic identification of available IPD in large trial databases. -摘要:大型語言模型 (LLM) 代理雖然能有效地使用外部工具來執行複雜的真實世界任務,但它們需要記憶體系統來利用歷史經驗。目前的記憶體系統能進行基本的儲存和檢索,但缺乏精密的記憶體組織,儘管最近嘗試納入圖形資料庫。此外,這些系統固定的運作和結構限制了它們在不同任務中的適應性。為了解決這個限制,本文提出了一種新的代理記憶體系統,供 LLM 代理動態地以代理的方式組織記憶體。遵循 Zettelkasten 方法的基本原則,我們設計我們的記憶體系統,透過動態索引和連結來建立相互連結的知識網路。當加入新的記憶體時,我們會產生包含多個結構化屬性的綜合筆記,包括脈絡描述、關鍵字和標籤。然後,系統會分析歷史記憶體以找出相關連結,在有意義的相似性時建立連結。此外,這個程序能讓記憶體演化,因為當整合新的記憶體時,它們會觸發對現有歷史記憶體的脈絡表示和屬性的更新,讓記憶體網路能持續精進它的理解。我們的做法結合了 Zettelkasten 的結構化組織原則和代理驅動決策制定的靈活性,能進行更具適應性和脈絡感知的記憶體管理。在六個基礎模型上的經驗實驗顯示出比現有的 SOTA 基準線有顯著的進步。原始碼可以在 https://github.com/WujiangXu/AgenticMemory 找到。 +摘要:臨床試驗的數位個人參與者資料 (IPD) 愈來愈廣泛地用於潛在的科學再利用。然而,要找出可用的 IPD,需要對大型資料庫中的文字資料共享聲明 (DSS) 進行詮釋。計算語言學最近的進展包括預先訓練的語言模型,有望簡化根據文字輸入實作有效分類器的過程。在 ClinicalTrials.gov 中的 5,000 個文字 DSS 子集中,我們評估了基於特定領域預先訓練語言模型的分類器,在重現原始可用性類別以及手動註解標籤方面的表現。典型的指標顯示,預測手動註解的分類器優於學會輸出原始可用性類別的分類器。這表示文字 DSS 說明包含可用性類別所沒有的適用資訊,而且此類分類器因此有助於在大型試驗資料庫中自動找出可用的 IPD。 -##### **KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs** -2502.12029v1 by Qi Zhao, Hongyu Yang, Qi Song, Xinwei Yao, Xiangyang Li +##### **Relational Norms for Human-AI Cooperation** +2502.12102v1 by Brian D. Earp, Sebastian Porsdam Mann, Mateo Aboy, Edmond Awad, Monika Betzler, Marietjie Botes, Rachel Calcott, Mina Caraccio, Nick Chater, Mark Coeckelbergh, Mihaela Constantinescu, Hossein Dabbagh, Kate Devlin, Xiaojun Ding, Vilius Dranseika, Jim A. C. 
Everett, Ruiping Fan, Faisal Feroz, Kathryn B. Francis, Cindy Friedman, Orsolya Friedrich, Iason Gabriel, Ivar Hannikainen, Julie Hellmann, Arasj Khodadade Jahrome, Niranjan S. Janardhanan, Paul Jurcys, Andreas Kappes, Maryam Ali Khan, Gordon Kraft-Todd, Maximilian Kroner Dale, Simon M. Laham, Benjamin Lange, Muriel Leuenberger, Jonathan Lewis, Peng Liu, David M. Lyreskog, Matthijs Maas, John McMillan, Emilian Mihailov, Timo Minssen, Joshua Teperowski Monrad, Kathryn Muyskens, Simon Myers, Sven Nyholm, Alexa M. Owen, Anna Puzio, Christopher Register, Madeline G. Reinecke, Adam Safron, Henry Shevlin, Hayate Shimizu, Peter V. Treit, Cristina Voinea, Karen Yan, Anda Zahiu, Renwen Zhang, Hazem Zohny, Walter Sinnott-Armstrong, Ilina Singh, Julian Savulescu, Margaret S. Clark -Large language models (LLMs) have demonstrated remarkable capabilities in -various complex tasks, yet they still suffer from hallucinations. Introducing -external knowledge, such as knowledge graph, can enhance the LLMs' ability to -provide factual answers. LLMs have the ability to interactively explore -knowledge graphs. However, most approaches have been affected by insufficient -internal knowledge excavation in LLMs, limited generation of trustworthy -knowledge reasoning paths, and a vague integration between internal and -external knowledge. Therefore, we propose KnowPath, a knowledge-enhanced large -model framework driven by the collaboration of internal and external knowledge. -It relies on the internal knowledge of the LLM to guide the exploration of -interpretable directed subgraphs in external knowledge graphs, better -integrating the two knowledge sources for more accurate reasoning. Extensive -experiments on multiple real-world datasets confirm the superiority of -KnowPath. +How we should design and interact with social artificial intelligence depends +on the socio-relational role the AI is meant to emulate or occupy. 
In human +society, relationships such as teacher-student, parent-child, neighbors, +siblings, or employer-employee are governed by specific norms that prescribe or +proscribe cooperative functions including hierarchy, care, transaction, and +mating. These norms shape our judgments of what is appropriate for each +partner. For example, workplace norms may allow a boss to give orders to an +employee, but not vice versa, reflecting hierarchical and transactional +expectations. As AI agents and chatbots powered by large language models are +increasingly designed to serve roles analogous to human positions - such as +assistant, mental health provider, tutor, or romantic partner - it is +imperative to examine whether and how human relational norms should extend to +human-AI interactions. Our analysis explores how differences between AI systems +and humans, such as the absence of conscious experience and immunity to +fatigue, may affect an AI's capacity to fulfill relationship-specific functions +and adhere to corresponding norms. This analysis, which is a collaborative +effort by philosophers, psychologists, relationship scientists, ethicists, +legal experts, and AI researchers, carries important implications for AI +systems design, user behavior, and regulation. While we accept that AI systems +can offer significant benefits such as increased availability and consistency +in certain socio-relational roles, they also risk fostering unhealthy +dependencies or unrealistic expectations that could spill over into human-human +relationships. We propose that understanding and thoughtfully shaping (or +implementing) suitable human-AI relational norms will be crucial for ensuring +that human-AI interactions are ethical, trustworthy, and favorable to human +well-being. 
-摘要:大型語言模型 (LLM) 已在各種複雜任務中展現出卓越的能力,但仍會出現幻覺。引入外部知識(例如知識圖譜)可以增強 LLM 提供事實答案的能力。LLM 有能力互動式地探索知識圖譜。然而,大多數方法都受到 LLM 中內部知識挖掘不足、可信賴知識推理路徑生成受限,以及內部和外部知識之間的整合模糊的影響。因此,我們提出 KnowPath,這是一個由內部和外部知識的協作驅動的知識增強型大型模型框架。它依賴於 LLM 的內部知識來指導對外部知識圖譜中可解釋的有向子圖的探索,更好地整合兩個知識來源以進行更準確的推理。對多個真實世界資料集進行的大量實驗證實了 KnowPath 的優越性。 +摘要:我們應如何設計和與社交人工智慧互動,取決於人工智慧預期要模仿或扮演的社會關係角色。在人類社會中,師生、父母子女、鄰居、兄弟姐妹或雇主員工等關係受特定規範所支配,這些規範規定或禁止包括等級、照顧、交易和交配在內的合作功能。這些規範形塑我們對每個夥伴適當行為的判斷。例如,職場規範可能允許老闆對員工發號施令,但反之則不行,這反映了等級和交易的期望。隨著由大型語言模型驅動的人工智慧代理程式和聊天機器人日益被設計為服務類似於人類職位的角色,例如助理、心理健康提供者、導師或浪漫伴侶,審查人類關係規範是否以及如何延伸至人類與人工智慧的互動至關重要。我們的分析探討了人工智慧系統和人類之間的差異,例如缺乏意識體驗和對疲勞的免疫力,如何影響人工智慧履行特定關係功能和遵守相應規範的能力。這項分析是由哲學家、心理學家、關係科學家、倫理學家、法律專家和人工智慧研究人員共同合作的成果,對人工智慧系統設計、使用者行為和法規具有重要的意義。雖然我們接受人工智慧系統可以在某些社會關係角色中提供顯著的好處,例如增加可用性和一致性,但它們也可能助長不健康的依賴關係或不切實際的期望,這些期望可能會蔓延到人際關係中。我們提出,理解和深思熟慮地塑造(或實施)適當的人類與人工智慧關係規範,對於確保人類與人工智慧的互動具有倫理性、可信賴性和有利於人類福祉至關重要。 -##### **Atom of Thoughts for Markov LLM Test-Time Scaling** -2502.12018v1 by Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, Yuyu Luo +##### **Deep Spatio-Temporal Neural Network for Air Quality Reanalysis** +2502.11941v1 by Ammar Kheder, Benjamin Foreback, Lili Wang, Zhi-Song Liu, Michael Boy -Large Language Models (LLMs) achieve superior performance through -training-time scaling, and test-time scaling further enhances their -capabilities by conducting effective reasoning during inference. However, as -the scale of reasoning increases, existing test-time scaling methods suffer -from accumulated historical information, which not only wastes computational -resources but also interferes with effective reasoning. To address this issue, -we observe that complex reasoning progress is often achieved by solving a -sequence of independent subquestions, each being self-contained and verifiable. -These subquestions are essentially atomic questions, relying primarily on their -current state rather than accumulated history, similar to the memoryless -transitions in a Markov process. 
Based on this observation, we propose Atom of -Thoughts (AoT), where each state transition in the reasoning process consists -of decomposing the current question into a dependency-based directed acyclic -graph and contracting its subquestions, forming a new atomic question state. -This iterative decomposition-contraction process continues until reaching -directly solvable atomic questions, naturally realizing Markov transitions -between question states. Furthermore, these atomic questions can be seamlessly -integrated into existing test-time scaling methods, enabling AoT to serve as a -plug-in enhancement for improving reasoning capabilities. Experiments across -six benchmarks demonstrate the effectiveness of AoT both as a standalone -framework and a plug-in enhancement. Notably, on HotpotQA, when applied to -gpt-4o-mini, AoT achieves an 80.6% F1 score, surpassing o3-mini by 3.4% and -DeepSeek-R1 by 10.6%. The code will be available at -https://github.com/qixucen/atom. +Air quality prediction is key to mitigating health impacts and guiding +decisions, yet existing models tend to focus on temporal trends while +overlooking spatial generalization. We propose AQ-Net, a spatiotemporal +reanalysis model for both observed and unobserved stations in the near future. +AQ-Net utilizes the LSTM and multi-head attention for the temporal regression. +We also propose a cyclic encoding technique to ensure continuous time +representation. To learn fine-grained spatial air quality estimation, we +incorporate AQ-Net with the neural kNN to explore feature-based interpolation, +such that we can fill the spatial gaps given coarse observation stations. To +demonstrate the efficiency of our model for spatiotemporal reanalysis, we use +data from 2013-2017 collected in northern China for PM2.5 analysis. 
Extensive +experiments show that AQ-Net excels in air quality reanalysis, highlighting the +potential of hybrid spatio-temporal models to better capture environmental +dynamics, especially in urban areas where both spatial and temporal variability +are critical. -摘要:大型語言模型 (LLM) 透過訓練時間擴充來達成卓越的效能,而測試時間擴充透過在推論期間進行有效的推理,進一步提升其能力。然而,隨著推理規模的擴大,現有的測試時間擴充方法會受到累積的歷史資訊影響,這不僅會浪費運算資源,還會干擾有效的推理。為了解決這個問題,我們觀察到複雜的推理進程通常是透過解決一系列獨立的子問題來達成,每個子問題都是獨立且可驗證的。這些子問題本質上是原子問題,主要依賴於它們的當前狀態,而不是累積的歷史,類似於馬可夫過程中的無記憶轉換。基於這個觀察,我們提出了思想原子 (AoT),其中推理過程中每個狀態轉換都包含將當前問題分解為基於依賴關係的有向無環圖,並收縮其子問題,形成新的原子問題狀態。這個反覆的分解收縮過程會持續進行,直到達到可直接解決的原子問題,自然地實現問題狀態之間的馬可夫轉換。此外,這些原子問題可以無縫整合到現有的測試時間擴充方法中,讓 AoT 可以作為外掛程式強化功能,以改善推理能力。橫跨六個基準的實驗證明了 AoT 作為獨立架構和外掛程式強化的有效性。值得注意的是,在 HotpotQA 上,當應用於 gpt-4o-mini 時,AoT 達到了 80.6% 的 F1 分數,比 o3-mini 高出 3.4%,比 DeepSeek-R1 高出 10.6%。程式碼將在 https://github.com/qixucen/atom 上提供。 +摘要:空气品质预测是减轻健康影响和指导决策的关键,但现有的模型倾向于关注时间趋势,而忽略空间概化。我们提出了 AQ-Net,这是一种时空再分析模型,适用于近期内已观测和未观测到的站点。AQ-Net 利用 LSTM 和多头注意力进行时间回归。我们还提出了一种循环编码技术来确保时间表示的连续性。为了学习细粒度的空间空气质量估计,我们将 AQ-Net 与神经 kNN 结合起来,以探索基于特征的插值,以便我们能够填充给定粗略观测站的空间空白。为了展示我们的模型在时空再分析中的效率,我们使用了 2013-2017 年在中国北部收集的 PM2.5 分析数据。大量的实验表明,AQ-Net 在空气质量再分析中表现出色,突出了混合时空模型在更好地捕捉环境动态方面的潜力,尤其是在空间和时间变异性都很关键的城市地区。 -##### **Generating Text from Uniform Meaning Representation** -2502.11973v1 by Emma Markle, Reihaneh Iranmanesh, Shira Wein +##### **Proactive Depot Discovery: A Generative Framework for Flexible Location-Routing** +2502.11715v1 by Site Qu, Guoqiang Hu -Uniform Meaning Representation (UMR) is a recently developed graph-based -semantic representation, which expands on Abstract Meaning Representation (AMR) -in a number of ways, in particular through the inclusion of document-level -information and multilingual flexibility. In order to effectively adopt and -leverage UMR for downstream tasks, efforts must be placed toward developing a -UMR technological ecosystem. 
Though still limited amounts of UMR annotations -have been produced to date, in this work, we investigate the first approaches -to producing text from multilingual UMR graphs: (1) a pipeline conversion of -UMR to AMR, then using AMR-to-text generation models, (2) fine-tuning large -language models with UMR data, and (3) fine-tuning existing AMR-to-text -generation models with UMR data. Our best performing model achieves a -multilingual BERTscore of 0.825 for English and 0.882 for Chinese when compared -to the reference, which is a promising indication of the effectiveness of -fine-tuning approaches for UMR-to-text generation with even limited amounts of -UMR data. +The Location-Routing Problem (LRP), which combines the challenges of facility +(depot) locating and vehicle route planning, is critically constrained by the +reliance on predefined depot candidates, limiting the solution space and +potentially leading to suboptimal outcomes. Previous research on LRP without +predefined depots is scant and predominantly relies on heuristic algorithms +that iteratively attempt depot placements across a planar area. Such approaches +lack the ability to proactively generate depot locations that meet specific +geographic requirements, revealing a notable gap in current research landscape. +To bridge this gap, we propose a data-driven generative DRL framework, designed +to proactively generate depots for LRP without predefined depot candidates, +solely based on customer requests data which include geographic and demand +information. It can operate in two distinct modes: direct generation of exact +depot locations, and the creation of a multivariate Gaussian distribution for +flexible depots sampling. By extracting depots' geographic pattern from +customer requests data, our approach can dynamically respond to logistical +needs, identifying high-quality depot locations that further reduce total +routing costs compared to traditional methods. 
Extensive experiments +demonstrate that, for a same group of customer requests, compared with those +depots identified through random attempts, our framework can proactively +generate depots that lead to superior solution routes with lower routing cost. +The implications of our framework potentially extend into real-world +applications, particularly in emergency medical rescue and disaster relief +logistics, where rapid establishment and adjustment of depot locations are +paramount, showcasing its potential in addressing LRP for dynamic and +unpredictable environments. -摘要:統一語意表示 (UMR) 是一種最近開發的基於圖形的語意表示,它在許多方面擴展了抽象語意表示 (AMR),特別是透過納入文件層級資訊和多語言靈活性。為了有效採用和利用下游任務的 UMR,必須投入精力開發 UMR 技術生態系統。雖然到目前為止產生的 UMR 標註數量仍然有限,但在這項工作中,我們探討了從多語言 UMR 圖形產生文字的第一種方法:(1) 將 UMR 轉換為 AMR 的管道,然後使用 AMR 轉文字生成模型,(2) 使用 UMR 資料微調大型語言模型,以及 (3) 使用 UMR 資料微調現有的 AMR 轉文字生成模型。與參考相比,我們效能最好的模型在英文中達到 0.825 的多語言 BERT 分數,在中文中達到 0.882,這表示使用 UMR 資料進行 UMR 轉文字生成的微調方法具有良好的效果,即使 UMR 資料數量有限。 +摘要:地點路線問題(LRP)結合了設施(倉庫)定位和車輛路線規劃的挑戰,嚴重受到預先定義的倉庫候選限制,限制了解決方案空間,並可能導致次優結果。先前關於沒有預先定義倉庫的 LRP 研究很少,而且主要依賴於啟發式演算法,在平面區域中反覆嘗試倉庫配置。這種方法無法主動產生符合特定地理需求的倉庫位置,顯示了當前研究領域的顯著差距。為了彌補這個差距,我們提出一個資料驅動的生成式 DRL 架構,旨在主動為 LRP 產生倉庫,而無需預先定義的倉庫候選,僅根據包含地理和需求資訊的客戶要求資料。它可以在兩種不同的模式下運作:直接產生確切的倉庫位置,以及建立多元高斯分布以進行彈性倉庫抽樣。透過從客戶要求資料中提取倉庫的地理模式,我們的方法可以動態回應後勤需求,找出高品質的倉庫位置,進一步降低與傳統方法相比的總路線成本。廣泛的實驗證明,對於同一組客戶要求,與透過隨機嘗試識別的那些倉庫相比,我們的架構可以主動產生倉庫,並產生路線成本較低的優質解決方案路線。我們的架構的影響潛在地擴展到實際應用,特別是在緊急醫療救援和災害救災後勤方面,其中倉庫位置的快速建立和調整至關重要,展示了其在解決動態和不可預測環境的 LRP 中的潛力。 -##### **GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs** -2502.11925v1 by Yi Fang, Bowen Jin, Jiacheng Shen, Sirui Ding, Qiaoyu Tan, Jiawei Han +##### **LLM Agents Making Agent Tools** +2502.11705v1 by Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelović, Jakob Nikolas Kather -The rapid development of Multimodal Large Language Models (MLLMs) has enabled -the integration of multiple modalities, including texts and images, within the -large language model (LLM) framework. 
However, texts and images are usually -interconnected, forming a multimodal attributed graph (MMAG). It is -underexplored how MLLMs can incorporate the relational information -(\textit{i.e.}, graph structure) and semantic information (\textit{i.e.,} texts -and images) on such graphs for multimodal comprehension and generation. In this -paper, we propose GraphGPT-o, which supports omni-multimodal understanding and -creation on MMAGs. We first comprehensively study linearization variants to -transform semantic and structural information as input for MLLMs. Then, we -propose a hierarchical aligner that enables deep graph encoding, bridging the -gap between MMAGs and MLLMs. Finally, we explore the inference choices, -adapting MLLM to interleaved text and image generation in graph scenarios. -Extensive experiments on three datasets from different domains demonstrate the -effectiveness of our proposed method. Datasets and codes will be open-sourced -upon acceptance. +Tool use has turned large language models (LLMs) into powerful agents that +can perform complex multi-step tasks by dynamically utilising external software +components. However, these tools must be implemented in advance by human +developers, hindering the applicability of LLM agents in domains which demand +large numbers of highly specialised tools, like in life sciences and medicine. +Motivated by the growing trend of scientific studies accompanied by public code +repositories, we propose ToolMaker, a novel agentic framework that autonomously +transforms papers with code into LLM-compatible tools. Given a short task +description and a repository URL, ToolMaker autonomously installs required +dependencies and generates code to perform the task, using a closed-loop +self-correction mechanism to iteratively diagnose and rectify errors. 
To +evaluate our approach, we introduce a benchmark comprising 15 diverse and +complex computational tasks spanning both medical and non-medical domains with +over 100 unit tests to objectively assess tool correctness and robustness. +ToolMaker correctly implements 80% of the tasks, substantially outperforming +current state-of-the-art software engineering agents. ToolMaker therefore is a +step towards fully autonomous agent-based scientific workflows. -摘要:多模态大语言模型 (MLLM) 的快速发展,促进了文本和图像等多种模态在大型语言模型 (LLM) 框架内的整合。然而,文本和图像通常是相互关联的,形成多模态属性图 (MMAG)。对于 MLLM 如何整合此类图上的关系信息(即图结构)和语义信息(即文本和图像)以进行多模态理解和生成,目前仍未得到充分探索。在本文中,我们提出了 GraphGPT-o,它支持在 MMAG 上进行全方位多模态理解和创建。我们首先全面研究了线性化变体,以将语义和结构信息转换为 MLLM 的输入。然后,我们提出了一个分层对齐器,它支持深度图编码,弥合了 MMAG 和 MLLM 之间的差距。最后,我们探索了推理选择,使 MLLM 适应图场景中交错的文本和图像生成。来自不同领域的三组数据集上的大量实验表明了我们提出的方法的有效性。数据集和代码将在被接受后开源。 +摘要:工具使用已將大型語言模型 (LLM) 轉變為強大的代理,可透過動態使用外部軟體元件來執行複雜的多步驟任務。然而,這些工具必須事先由人類開發人員實作,這會阻礙 LLM 代理在需要大量高度專業化工具的領域(例如生命科學和醫學)中的應用性。受到伴隨公開程式碼儲存庫的科學研究趨勢所啟發,我們提出 ToolMaker,一個創新的代理架構,可自主地將帶有程式碼的論文轉換為相容於 LLM 的工具。給定簡短的任務描述和儲存庫網址,ToolMaker 會自主安裝所需的依賴項,並產生程式碼來執行任務,使用閉環自我修正機制來反覆診斷和糾正錯誤。為了評估我們的做法,我們引進一個包含 15 個不同且複雜的運算任務的基準,涵蓋醫療和非醫療領域,並包含超過 100 個單元測試,以客觀評估工具的正確性和穩健性。ToolMaker 正確實作了 80% 的任務,大幅優於目前的最新軟體工程代理。因此,ToolMaker 是邁向完全自主的基於代理的科學工作流程的一步。 -##### **Exploring LLM-based Student Simulation for Metacognitive Cultivation** -2502.11678v1 by Haoxuan Li, Jifan Yu, Xin Cong, Yang Dang, Yisi Zhan, Huiqin Liu, Zhiyuan Liu +##### **MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression** +2502.11651v1 by Linjie Mu, Zhongzhen Huang, Shengqian Qin, Yakun Zhu, Shaoting Zhang, Xiaofan Zhang -Metacognitive education plays a crucial role in cultivating students' -self-regulation and reflective thinking, providing essential support for those -with learning difficulties through academic advising. 
Simulating students with -insufficient learning capabilities using large language models offers a -promising approach to refining pedagogical methods without ethical concerns. -However, existing simulations often fail to authentically represent students' -learning struggles and face challenges in evaluation due to the lack of -reliable metrics and ethical constraints in data collection. To address these -issues, we propose a pipeline for automatically generating and filtering -high-quality simulated student agents. Our approach leverages a two-round -automated scoring system validated by human experts and employs a score -propagation module to obtain more consistent scores across the student graph. -Experimental results demonstrate that our pipeline efficiently identifies -high-quality student agents, and we discuss the traits that influence the -simulation's effectiveness. By simulating students with varying degrees of -learning difficulties, our work paves the way for broader applications in -personalized learning and educational assessment. +Large vision-language models (LVLMs) have shown great promise in medical +applications, particularly in visual question answering (MedVQA) and diagnosis +from medical images. However, existing datasets and models often fail to +consider critical aspects of medical diagnostics, such as the integration of +historical records and the analysis of disease progression over time. In this +paper, we introduce MMXU (Multimodal and MultiX-ray Understanding), a novel +dataset for MedVQA that focuses on identifying changes in specific regions +between two patient visits. Unlike previous datasets that primarily address +single-image questions, MMXU enables multi-image questions, incorporating both +current and historical patient data. We demonstrate the limitations of current +LVLMs in identifying disease progression on MMXU-\textit{test}, even those that +perform well on traditional benchmarks. 
To address this, we propose a +MedRecord-Augmented Generation (MAG) approach, incorporating both global and +regional historical records. Our experiments show that integrating historical +records significantly enhances diagnostic accuracy by at least 20\%, bridging +the gap between current LVLMs and human expert performance. Additionally, we +fine-tune models with MAG on MMXU-\textit{dev}, which demonstrates notable +improvements. We hope this work could illuminate the avenue of advancing the +use of LVLMs in medical diagnostics by emphasizing the importance of historical +context in interpreting medical images. Our dataset is released at +\href{https://github.com/linjiemu/MMXU}{https://github.com/linjiemu/MMXU}. -摘要:元認知教育在培養學生的自我調節和反思性思考中發揮著至關重要的作用,通過學術諮詢為有學習困難的人提供必要的支持。使用大型語言模型模擬學習能力不足的學生提供了一種有前途的方法,可以在沒有道德問題的情況下改進教學方法。然而,現有的模擬通常無法真實地反映學生的學習困難,並且由於缺乏可靠的指標和數據收集中的道德約束,在評估中面臨挑戰。為了解決這些問題,我們提出了一個自動生成和過濾高質量模擬學生代理的管道。我們的做法利用了由人類專家驗證的兩輪自動評分系統,並採用分數傳播模組來獲得跨學生圖表更一致的分數。實驗結果表明,我們的管道有效地識別了高質量的學生代理,並且我們討論了影響模擬效果的特質。通過模擬具有不同程度學習困難的學生,我們的研究為個性化學習和教育評估中的更廣泛應用鋪平了道路。 +摘要:大型視覺語言模型 (LVLMs) 已在醫療應用中展現出極大的潛力,特別是在視覺問答 (MedVQA) 和醫學影像診斷方面。然而,現有的資料集和模型常常無法考量醫療診斷的關鍵層面,例如病歷整合以及隨著時間推移對疾病進程的分析。在本文中,我們介紹 MMXU(多模態多 X 光理解),一個專注於識別兩次患者就診之間特定區域變化的 MedVQA 新資料集。與主要處理單一影像問題的先前資料集不同,MMXU 支援多影像問題,同時納入當前和病史患者資料。我們展示了現有 LVLMs 在 MMXU-\textit{test} 中識別疾病進程的限制,即使是在傳統基準測試中表現良好的 LVLMs 也是如此。為了解決這個問題,我們提出了一個病歷增強生成 (MAG) 方法,結合了全域和區域病史。我們的實驗顯示,整合病歷可顯著提升至少 20% 的診斷準確度,縮小了現有 LVLMs 和人類專家表現之間的差距。此外,我們在 MMXU-\textit{dev} 上微調帶有 MAG 的模型,這展示了顯著的進步。我們希望這項工作能透過強調病史脈絡在解讀醫學影像中的重要性,為推進 LVLMs 在醫療診斷中的應用開闢道路。我們的資料集已於\href{https://github.com/linjiemu/MMXU}{https://github.com/linjiemu/MMXU} 發布。 + +##### **A Survey of Personalized Large Language Models: Progress and Future Directions** +2502.11528v1 by Jiahong Liu, Zexuan Qiu, Zhongyang Li, Quanyu Dai, Jieming Zhu, Minda Hu, Menglin Yang, Irwin King + +Large Language Models (LLMs) excel in handling general knowledge tasks, yet +they struggle with user-specific personalization, such as 
understanding +individual emotions, writing styles, and preferences. Personalized Large +Language Models (PLLMs) tackle these challenges by leveraging individual user +data, such as user profiles, historical dialogues, content, and interactions, +to deliver responses that are contextually relevant and tailored to each user's +specific needs. This is a highly valuable research topic, as PLLMs can +significantly enhance user satisfaction and have broad applications in +conversational agents, recommendation systems, emotion recognition, medical +assistants, and more. This survey reviews recent advancements in PLLMs from +three technical perspectives: prompting for personalized context (input level), +finetuning for personalized adapters (model level), and alignment for +personalized preferences (objective level). To provide deeper insights, we also +discuss current limitations and outline several promising directions for future +research. Updated information about this survey can be found at the +https://github.com/JiahongLiu21/Awesome-Personalized-Large-Language-Models. 
+ +摘要:大型語言模型 (LLM) 在處理一般知識任務方面表現出色,但 +它們在使用者特定的個人化方面有困難,例如理解 +個別的情緒、寫作風格和偏好。個人化大型 +語言模型 (PLLM) 透過利用個別使用者的 +資料來解決這些挑戰,例如使用者個人資料、歷史對話、內容和互動, +提供在脈絡上相關且針對每個使用者的特定需求量身打造的回應。這是一個非常有價值的研究主題,因為 PLLM 可以 +顯著提升使用者滿意度,並在對話代理、推薦系統、情緒辨識、醫療 +助理等方面有廣泛的應用。這項調查從三個技術觀點回顧 PLLM 的最新進展:提示個人化脈絡(輸入層級)、微調個人化適配器(模型層級),以及對齊個人化偏好(目標層級)。為了提供更深入的見解,我們也 +討論目前的限制,並概述未來研究的幾個有希望的方向。這項調查的最新資訊可以在 +https://github.com/JiahongLiu21/Awesome-Personalized-Large-Language-Models 找到。 -##### **Ontology-Guided Reverse Thinking Makes Large Language Models Stronger on Knowledge Graph Question Answering** -2502.11491v1 by Runxuan Liu, Bei Luo, Jiaqi Li, Baoxin Wang, Ming Liu, Dayong Wu, Shijin Wang, Bing Qin +##### **Variable-frame CNNLSTM for Breast Nodule Classification using Ultrasound Videos** +2502.11481v1 by Xiangxiang Cui, Zhongyu Li, Xiayue Fan, Peng Huang, Ying Wang, Meng Yang, Shi Chang, Jihua Zhu -Large language models (LLMs) have shown remarkable capabilities in natural -language processing. However, in knowledge graph question answering tasks -(KGQA), there remains the issue of answering questions that require multi-hop -reasoning. Existing methods rely on entity vector matching, but the purpose of -the question is abstract and difficult to match with specific entities. As a -result, it is difficult to establish reasoning paths to the purpose, which -leads to information loss and redundancy. To address this issue, inspired by -human reverse thinking, we propose Ontology-Guided Reverse Thinking (ORT), a -novel framework that constructs reasoning paths from purposes back to -conditions. ORT operates in three key phases: (1) using LLM to extract purpose -labels and condition labels, (2) constructing label reasoning paths based on -the KG ontology, and (3) using the label reasoning paths to guide knowledge -retrieval. Experiments on the WebQSP and CWQ datasets show that ORT achieves -state-of-the-art performance and significantly enhances the capability of LLMs -for KGQA. 
+The intersection of medical imaging and artificial intelligence has become an +important research direction in intelligent medical treatment, particularly in +the analysis of medical images using deep learning for clinical diagnosis. +Despite the advances, existing keyframe classification methods lack extraction +of time series features, while ultrasonic video classification based on +three-dimensional convolution requires uniform frame numbers across patients, +resulting in poor feature extraction efficiency and model classification +performance. This study proposes a novel video classification method based on +CNN and LSTM, introducing NLP's long and short sentence processing scheme into +video classification for the first time. The method reduces CNN-extracted image +features to 1x512 dimension, followed by sorting and compressing feature +vectors for LSTM training. Specifically, feature vectors are sorted by patient +video frame numbers and populated with padding value 0 to form variable +batches, with invalid padding values compressed before LSTM training to +conserve computing resources. Experimental results demonstrate that our +variable-frame CNNLSTM method outperforms other approaches across all metrics, +showing improvements of 3-6% in F1 score and 1.5% in specificity compared to +keyframe methods. The variable-frame CNNLSTM also achieves better accuracy and +precision than equal-frame CNNLSTM. These findings validate the effectiveness +of our approach in classifying variable-frame ultrasound videos and suggest +potential applications in other medical imaging modalities. 
-摘要:大型語言模型 (LLM) 在自然語言處理中展現出卓越的能力。然而,在知識圖譜問答任務 (KGQA) 中,仍然存在需要多跳推理才能回答問題的問題。現有方法依賴於實體向量匹配,但問題的目的是抽象的,難以與特定實體匹配。因此,很難建立推理路徑來達成目的,這會導致資訊遺失和冗餘。為了解決這個問題,在人類逆向思維的啟發下,我們提出了基於本体的逆向思維 (ORT),這是一個創新的架構,可以從目的建構推理路徑,再回推到條件。ORT 運作在三個關鍵階段:(1) 使用 LLM 萃取目的標籤和條件標籤,(2) 基於 KG 本体建構標籤推理路徑,以及 (3) 使用標籤推理路徑來引導知識擷取。在 WebQSP 和 CWQ 資料集上的實驗顯示,ORT 達到了最先進的效能,並顯著增強了 LLM 對 KGQA 的能力。 +摘要:醫學影像與人工智慧的交叉領域已成為智慧醫療的重要研究方向,特別是在臨床診斷中使用深度學習分析醫學影像。儘管有進展,現有的關鍵影格分類方法缺乏時間序列特徵的提取,而基於三維卷積的超音波影片分類需要患者之間的均勻影格數,導致特徵提取效率差和模型分類效能不佳。本研究提出了一種基於 CNN 和 LSTM 的新影片分類方法,首次將 NLP 的長短句處理機制引入影片分類中。該方法將 CNN 提取的影像特徵縮減為 1x512 維度,然後對特徵向量進行排序和壓縮以進行 LSTM 訓練。具體來說,特徵向量按患者影片影格數排序,並填充 0 補齊值以形成可變批次,在 LSTM 訓練前壓縮無效的補齊值以節省運算資源。實驗結果表明,我們的可變影格 CNNLSTM 方法在所有指標上都優於其他方法,與關鍵影格方法相比,F1 分數提高了 3-6%,特異性提高了 1.5%。可變影格 CNNLSTM 也比等影格 CNNLSTM 達到了更好的準確度和精確度。這些發現驗證了我們的方法在分類可變影格超音波影片中的有效性,並表明在其他醫學影像模式中具有潛在的應用。 -##### **GLTW: Joint Improved Graph Transformer and LLM via Three-Word Language for Knowledge Graph Completion** -2502.11471v1 by Kangyang Luo, Yuzhuo Bai, Cheng Gao, Shuzheng Si, Yingli Shen, Zhu Liu, Zhitong Wang, Cunliang Kong, Wenhao Li, Yufei Huang, Ye Tian, Xuantang Xiong, Lei Han, Maosong Sun +##### **Leveraging Labelled Data Knowledge: A Cooperative Rectification Learning Network for Semi-supervised 3D Medical Image Segmentation** +2502.11456v1 by Yanyan Wang, Kechen Song, Yuyuan Liu, Shuai Ma, Yunhui Yan, Gustavo Carneiro -Knowledge Graph Completion (KGC), which aims to infer missing or incomplete -facts, is a crucial task for KGs. However, integrating the vital structural -information of KGs into Large Language Models (LLMs) and outputting predictions -deterministically remains challenging. To address this, we propose a new method -called GLTW, which encodes the structural information of KGs and merges it with -LLMs to enhance KGC performance. 
Specifically, we introduce an improved Graph -Transformer (iGT) that effectively encodes subgraphs with both local and global -structural information and inherits the characteristics of language model, -bypassing training from scratch. Also, we develop a subgraph-based -multi-classification training objective, using all entities within KG as -classification objects, to boost learning efficiency.Importantly, we combine -iGT with an LLM that takes KG language prompts as input.Our extensive -experiments on various KG datasets show that GLTW achieves significant -performance gains compared to SOTA baselines. +Semi-supervised 3D medical image segmentation aims to achieve accurate +segmentation using few labelled data and numerous unlabelled data. The main +challenge in the design of semi-supervised learning methods consists in the +effective use of the unlabelled data for training. A promising solution +consists of ensuring consistent predictions across different views of the data, +where the efficacy of this strategy depends on the accuracy of the +pseudo-labels generated by the model for this consistency learning strategy. In +this paper, we introduce a new methodology to produce high-quality +pseudo-labels for a consistency learning strategy to address semi-supervised 3D +medical image segmentation. The methodology has three important contributions. +The first contribution is the Cooperative Rectification Learning Network (CRLN) +that learns multiple prototypes per class to be used as external knowledge +priors to adaptively rectify pseudo-labels at the voxel level. The second +contribution consists of the Dynamic Interaction Module (DIM) to facilitate +pairwise and cross-class interactions between prototypes and multi-resolution +image features, enabling the production of accurate voxel-level clues for +pseudo-label rectification. 
The third contribution is the Cooperative Positive +Supervision (CPS), which optimises uncertain representations to align with +unassertive representations of their class distributions, improving the model's +accuracy in classifying uncertain regions. Extensive experiments on three +public 3D medical segmentation datasets demonstrate the effectiveness and +superiority of our semi-supervised learning method. -摘要:知識圖譜補全 (KGC) 旨在推論遺失或不完整的 -事實,是 KGs 的一項關鍵任務。然而,將 KGs 的重要結構 -資訊整合至大型語言模型 (LLM),並確定性地輸出預測結果,仍然是一項挑戰。為了解決這個問題,我們提出了一種新的方法,稱為 GLTW,它編碼了 KGs 的結構資訊,並將其與 LLM 合併,以增強 KGC 的效能。具體來說,我們引進了一個改良的圖形轉換器 (iGT),它能有效地編碼具有局部和全域結構資訊的子圖,並繼承語言模型的特徵,繞過從頭開始的訓練。此外,我們開發了一個基於子圖的多分類訓練目標,使用 KG 中的所有實體作為 -分類物件,以提升學習效率。重要的是,我們將 iGT 與一個將 KG 語言提示作為輸入的 LLM 結合起來。我們在各種 KG 資料集上進行的廣泛實驗顯示,與 SOTA 基準線相比,GLTW 獲得了顯著的效能提升。 +摘要:半监督 3D 医学影像分割旨在使用少量标记数据和大量未标记数据实现精确分割。半监督学习方法设计中的主要挑战在于有效使用未标记数据进行训练。一个有前景的解决方案是确保数据不同视图之间预测的一致性,其中此策略的有效性取决于模型为这种一致性学习策略生成的伪标签的准确性。在本文中,我们引入了一种新的方法来为一致性学习策略生成高质量的伪标签,以解决半监督 3D 医学图像分割问题。该方法有三个重要的贡献。第一个贡献是协作修正学习网络 (CRLN),它为每个类别学习多个原型,用作外部知识先验,以在体素级别自适应地修正伪标签。第二个贡献包括动态交互模块 (DIM),以促进原型和多分辨率图像特征之间的成对和跨类交互,从而能够生成用于伪标签修正的准确体素级线索。第三个贡献是协作正监督 (CPS),它优化不确定的表示以与其类分布的不确定表示保持一致,从而提高模型对不确定区域进行分类的准确性。在三个公共 3D 医学分割数据集上进行的大量实验表明了我们半监督学习方法的有效性和优越性。 -##### **Large Language-Geometry Model: When LLM meets Equivariance** -2502.11149v2 by Zongzhao Li, Jiacheng Cen, Bing Su, Wenbing Huang, Tingyang Xu, Yu Rong, Deli Zhao +##### **A Survey of LLM-based Agents in Medicine: How far are we from Baymax?** +2502.11211v1 by Wenxuan Wang, Zizhan Ma, Zheng Wang, Chenghan Wu, Wenting Chen, Xiang Li, Yixuan Yuan -Accurately predicting 3D structures and dynamics of physical systems is -crucial in scientific applications. Existing approaches that rely on geometric -Graph Neural Networks (GNNs) effectively enforce $\mathrm{E}(3)$-equivariance, -but they often fall in leveraging extensive broader information. 
While direct -application of Large Language Models (LLMs) can incorporate external knowledge, -they lack the capability for spatial reasoning with guaranteed equivariance. In -this paper, we propose EquiLLM, a novel framework for representing 3D physical -systems that seamlessly integrates E(3)-equivariance with LLM capabilities. -Specifically, EquiLLM comprises four key components: geometry-aware prompting, -an equivariant encoder, an LLM, and an equivariant adaptor. Essentially, the -LLM guided by the instructive prompt serves as a sophisticated invariant -feature processor, while 3D directional information is exclusively handled by -the equivariant encoder and adaptor modules. Experimental results demonstrate -that EquiLLM delivers significant improvements over previous methods across -molecular dynamics simulation, human motion simulation, and antibody design, -highlighting its promising generalizability. +Large Language Models (LLMs) are transforming healthcare through the +development of LLM-based agents that can understand, reason about, and assist +with medical tasks. This survey provides a comprehensive review of LLM-based +agents in medicine, examining their architectures, applications, and +challenges. We analyze the key components of medical agent systems, including +system profiles, clinical planning mechanisms, medical reasoning frameworks, +and external capacity enhancement. The survey covers major application +scenarios such as clinical decision support, medical documentation, training +simulations, and healthcare service optimization. We discuss evaluation +frameworks and metrics used to assess these agents' performance in healthcare +settings. While LLM-based agents show promise in enhancing healthcare delivery, +several challenges remain, including hallucination management, multimodal +integration, implementation barriers, and ethical considerations. 
The survey +concludes by highlighting future research directions, including advances in +medical reasoning inspired by recent developments in LLM architectures, +integration with physical systems, and improvements in training simulations. +This work provides researchers and practitioners with a structured overview of +the current state and future prospects of LLM-based agents in medicine. -摘要:準確預測物理系統的 3D 結構和動力學在科學應用中至關重要。現有依賴於幾何圖神經網路 (GNN) 的方法有效地強制執行了 $\mathrm{E}(3)$-等變性,但它們通常無法利用廣泛的更廣泛資訊。儘管大型語言模型 (LLM) 的直接應用可以納入外部知識,但它們缺乏保證等變性的空間推理能力。在本文中,我們提出了 EquiLLM,一個用於表示 3D 物理系統的新框架,它將 E(3)-等變性與 LLM 能力無縫整合。具體來說,EquiLLM 包含四個關鍵組成部分:感知幾何的提示、等變編碼器、LLM 和等變適配器。從本質上講,由指導性提示引導的 LLM 作為一個複雜的不變特徵處理器,而 3D 方向資訊則由等變編碼器和適配器模組獨家處理。實驗結果表明,EquiLLM 在分子動力學模擬、人類運動模擬和抗體設計方面比以前的方法有了顯著的改進,突顯了其有希望的泛化能力。 +摘要:大型語言模型 (LLM) 透過開發可理解、推理並協助醫療任務的 LLM 基礎代理人,轉變了醫療保健。本調查提供了 LLM 基礎代理人在醫學中的全面回顧,探討其架構、應用和挑戰。我們分析了醫療代理系統的主要組成部分,包括系統概況、臨床規劃機制、醫療推理架構和外部能力提升。本調查涵蓋了主要的應用場景,例如臨床決策支援、醫療文件、訓練模擬和醫療保健服務最佳化。我們討論了用於評估這些代理人在醫療保健環境中表現的評估架構和指標。雖然 LLM 基礎代理人顯示出在增強醫療保健提供方面的潛力,但仍有許多挑戰,包括幻覺管理、多模態整合、實施障礙和倫理考量。本調查最後強調了未來的研究方向,包括受 LLM 架構近期發展啟發的醫療推理進展、與物理系統的整合和訓練模擬的改進。這項工作為研究人員和從業人員提供了 LLM 基礎代理人在醫學中當前狀態和未來前景的結構化概觀。 -##### **Beyond Pairwise: Global Zero-shot Temporal Graph Generation** -2502.11114v1 by Alon Eirew, Kfir Bar, Ido Dagan +##### **RT-DEMT: A hybrid real-time acupoint detection model combining mamba and transformer** +2502.11179v1 by Shilong Yang, Qi Zang, Chulong Zhang, Lingfeng Huang, Yaoqin Xie -Temporal relation extraction (TRE) is a fundamental task in natural language -processing (NLP) that involves identifying the temporal relationships between -events in a document. Despite the advances in large language models (LLMs), -their application to TRE remains limited. Most existing approaches rely on -pairwise classification, in which event pairs are considered individually, -leading to computational inefficiency and a lack of global consistency in the -resulting temporal graph. 
In this work, we propose a novel zero-shot method for -TRE that generates a document's complete temporal graph at once, then applies -transitive constraints optimization to refine predictions and enforce temporal -consistency across relations. Additionally, we introduce OmniTemp, a new -dataset with complete annotations for all pairs of targeted events within a -document. Through experiments and analyses, we demonstrate that our method -significantly outperforms existing zero-shot approaches while achieving -competitive performance with supervised models. +Traditional Chinese acupuncture methods often face controversy in clinical +practice due to their high subjectivity. Additionally, current +intelligent-assisted acupuncture systems have two major limitations: slow +acupoint localization speed and low accuracy. To address these limitations, a +new method leverages the excellent inference efficiency of the state-space +model Mamba, while retaining the advantages of the attention mechanism in the +traditional DETR architecture, to achieve efficient global information +integration and provide high-quality feature information for acupoint +localization tasks. Furthermore, by employing the concept of residual +likelihood estimation, it eliminates the need for complex upsampling processes, +thereby accelerating the acupoint localization task. Our method achieved +state-of-the-art (SOTA) accuracy on a private dataset of acupoints on the human +back, with an average Euclidean distance pixel error (EPE) of 7.792 and an +average time consumption of 10.05 milliseconds per localization task. Compared +to the second-best algorithm, our method improved both accuracy and speed by +approximately 14\%. This significant advancement not only enhances the efficacy +of acupuncture treatment but also demonstrates the commercial potential of +automated acupuncture robot systems. 
Access to our method is available at +https://github.com/Sohyu1/RT-DEMT -摘要:時間關係抽取 (TRE) 是自然語言處理 (NLP) 中的一項基本任務,涉及識別文件中事件之間的時間關係。儘管大型語言模型 (LLM) 取得進展,但它們在 TRE 中的應用仍然有限。現有的大多數方法依賴於成對分類,其中事件對被單獨考慮,導致計算效率低下且在生成的時序圖中缺乏全局一致性。在這項工作中,我們提出了一種新穎的 TRE 零次學習方法,它可以一次生成文件的完整時序圖,然後應用遞移約束最佳化來優化預測並強制關係之間的時間一致性。此外,我們引入了 OmniTemp,這是一個新的數據集,其中包含文件內所有目標事件對的完整註解。通過實驗和分析,我們證明了我們的方法明顯優於現有的零次學習方法,同時實現了與監督模型相當的性能。 +摘要:傳統的中醫針灸方法由於其高度主觀性,在臨床實務中經常面臨爭議。此外,現有的智慧輔助針灸系統有兩大限制:取穴速度慢以及準確度低。為了解決這些限制,一種新的方法利用了狀態空間模型 Mamba 優異的推理效率,同時保留了傳統 DETR 架構中注意力機制的優點,以實現高效的全局資訊整合,並為取穴任務提供高品質的特徵資訊。此外,透過採用殘差似然估計的概念,它消除了對複雜上採樣程序的需求,從而加速了取穴任務。我們的模型在人體背部穴位私人資料集上達到了最先進 (SOTA) 的準確度,平均歐幾里得距離像素誤差 (EPE) 為 7.792,平均每個取穴任務耗時 10.05 毫秒。與第二好的演算法相比,我們的模型在準確度和速度上都提高了大約 14%。這項重大進展不僅提高了針灸治療的療效,也證明了自動化針灸機器人系統的商業潛力。我們的模型可以在 https://github.com/Sohyu1/RT-DEMT 取得 ##### **Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications** 2502.11108v1 by Alexandru Lecu, Adrian Groza, Lezan Hawizy @@ -8621,156 +8606,180 @@ chatbot applications. 摘要:大型語言模型 (LLM) 已大幅推動自然語言生成的領域。然而,它們經常產生未經驗證的輸出,這會損害它們在關鍵應用中的可靠性。在本研究中,我們提出了一個創新的框架,透過檢索增強生成技術,將結構化的生物醫學知識與 LLM 結合。我們的系統透過識別和精煉與年齡相關性黃斑部病變 (AMD) 相關的醫學摘要中的因果關係和命名實體,開發一個徹底的知識圖譜。我們的框架使用基於向量的檢索流程和本地部署的語言模型,產生在脈絡上相關且可驗證的回應,並直接參考臨床證據。實驗結果顯示,此方法顯著減少了幻覺、增強了事實準確性,並改善了生成回應的清晰度,為先進的生物醫學聊天機器人應用程式提供了穩健的解決方案。 -##### **Beyond Similarity: A Gradient-based Graph Method for Instruction Tuning Data Selection** -2502.11062v1 by Yang Zhao, Li Du, Xiao Ding, Yangou Ouyang, Hepeng Wang, Kai Xiong, Jinglong Gao, Zhouhao Sun, Dongliang Xu, Yang Qing, Dongchen Li, Bing Qin, Ting Liu +##### **Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration** +2502.12204v1 by Xianbing Zhao, Yiqing Lyu, Di Wang, Buzhou Tang -Large language models (LLMs) have shown great potential across various -industries due to their remarkable ability to generalize through instruction -tuning. 
However, the limited availability of domain-specific data significantly -hampers their performance on specialized tasks. While existing methods -primarily focus on selecting training data from general datasets that are -similar to the target domain, they often fail to consider the joint -distribution of instructions, resulting in inefficient learning and suboptimal -knowledge transfer. To address these challenges, we introduce G2IS -(Gradient-based Graph Instruction Selection), a novel method that constructs a -mixed gradient-based instruction graph to capture the joint distribution and -interdependencies between instructions. By accounting for the relationships -between instructions, G2IS improves domain adaptation efficiency. Additionally, -we propose a gradient walk algorithm to refine the data selection process, -enhancing both training effectiveness and efficiency. Our experiments -demonstrate that G2IS outperforms traditional methods across various domain -adaptation tasks, yielding significant performance gains, particularly in -complex, data-scarce scenarios. These results underscore the potential of G2IS -in advancing the development of large, domain-specific models. +Automatic depression detection provides cues for early clinical intervention +by clinicians. Clinical interviews for depression detection involve dialogues +centered around multiple themes. Existing studies primarily design end-to-end +neural network models to capture the hierarchical structure of clinical +interview dialogues. However, these methods exhibit defects in modeling the +thematic content of clinical interviews: 1) they fail to capture intra-theme +and inter-theme correlation explicitly, and 2) they do not allow clinicians to +intervene and focus on themes of interest. To address these issues, this paper +introduces an interactive depression detection framework. 
This framework +leverages in-context learning techniques to identify themes in clinical +interviews and then models both intra-theme and inter-theme correlation. +Additionally, it employs AI-driven feedback to simulate the interests of +clinicians, enabling interactive adjustment of theme importance. PDIMC achieves +absolute improvements of 35\% and 12\% compared to the state-of-the-art on the +depression detection dataset DAIC-WOZ, which demonstrates the effectiveness of +modeling theme correlation and incorporating interactive external feedback. -摘要:大型語言模型 (LLM) 因其透過指令微調而具備的卓越泛化能力,在各產業中展現出極大的潛力。然而,特定領域資料的取得有限,大幅影響其在專業任務上的表現。現有方法主要專注於從與目標領域類似的通用資料集中選取訓練資料,但它們通常未能考量指令的聯合分佈,導致學習效率不彰且知識傳遞不佳。為了應對這些挑戰,我們引進 G2IS(基於梯度的圖形指令選取),這是一種創新的方法,可建構一個混合的基於梯度的指令圖形,以擷取指令之間的聯合分佈和相互依賴性。透過考量指令之間的關係,G2IS 提升了領域適應的效率。此外,我們提出了一種梯度漫步演算法來優化資料選取程序,同時提升訓練效能和效率。我們的實驗證明,G2IS 在各種領域適應任務中優於傳統方法,產生顯著的效能提升,特別是在資料稀少的複雜場景中。這些結果突顯了 G2IS 在推動大型特定領域模型發展方面的潛力。 +摘要:自動憂鬱症偵測提供臨床醫師早期臨床介入的線索。憂鬱症偵測的臨床訪談涉及以多個主題為中心的對話。現有研究主要設計端對端的類神經網路模型來捕捉臨床訪談對話的階層結構。然而,這些方法在建模臨床訪談的主題內容時表現出缺陷:1)它們無法明確捕捉主題內和主題間的關聯性,以及 2)它們不允許臨床醫師介入並專注於感興趣的主題。為了解決這些問題,本文介紹了一個互動式憂鬱症偵測框架。此框架利用情境學習技術來識別臨床訪談中的主題,然後對主題內和主題間的關聯性進行建模。此外,它採用 AI 驅動的回饋來模擬臨床醫師的興趣,實現主題重要性的互動式調整。與 DAIC-WOZ 憂鬱症偵測資料集上的最新技術相比,PDIMC 的絕對改進率分別為 35% 和 12%,這證明了對主題關聯性建模和納入互動式外部回饋的有效性。 -##### **CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models** -2502.11008v1 by Yuefei Chen, Vivek K. Singh, Jing Ma, Ruxiang Tang +##### **CL-MFAP: A Contrastive Learning-Based Multimodal Foundation Model for Molecular Property Prediction and Antibiotic Screening** +2502.11001v1 by Gen Zhou, Sugitha Janarthanan, Yutong Lu, Pingzhao Hu -Counterfactual reasoning is widely recognized as one of the most challenging -and intricate aspects of causality in artificial intelligence. In this paper, -we evaluate the performance of large language models (LLMs) in counterfactual -reasoning. 
In contrast to previous studies that primarily focus on commonsense -causal reasoning, where LLMs often rely on prior knowledge for inference, we -specifically assess their ability to perform counterfactual inference using a -set of formal rules. To support this evaluation, we introduce a new benchmark -dataset, CounterBench, comprising 1K counterfactual reasoning questions. The -dataset is designed with varying levels of difficulty, diverse causal graph -structures, distinct types of counterfactual questions, and multiple -nonsensical name variants. Our experiments demonstrate that counterfactual -reasoning poses a significant challenge for LLMs, with most models performing -at levels comparable to random guessing. To enhance LLM's counterfactual -reasoning ability, we propose a novel reasoning paradigm, CoIn, which guides -LLMs through iterative reasoning and backtracking to systematically explore -counterfactual solutions. Experimental results show that our method -significantly improves LLM performance on counterfactual reasoning tasks and -consistently enhances performance across different LLMs.Our dataset is -available at https://huggingface.co/datasets/CounterBench/CounterBench. +Due to the rise in antimicrobial resistance, identifying novel compounds with +antibiotic potential is crucial for combatting this global health issue. +However, traditional drug development methods are costly and inefficient. +Recognizing the pressing need for more effective solutions, researchers have +turned to machine learning techniques to streamline the prediction and +development of novel antibiotic compounds. While foundation models have shown +promise in antibiotic discovery, current mainstream efforts still fall short of +fully leveraging the potential of multimodal molecular data. Recent studies +suggest that contrastive learning frameworks utilizing multimodal data exhibit +excellent performance in representation learning across various domains. 
+Building upon this, we introduce CL-MFAP, an unsupervised contrastive learning +(CL)-based multimodal foundation (MF) model specifically tailored for +discovering small molecules with potential antibiotic properties (AP) using +three types of molecular data. This model employs 1.6 million bioactive +molecules with drug-like properties from the ChEMBL dataset to jointly pretrain +three encoders: (1) a transformer-based encoder with rotary position embedding +for processing SMILES strings; (2) another transformer-based encoder, +incorporating a novel bi-level routing attention mechanism to handle molecular +graph representations; and (3) a Morgan fingerprint encoder using a multilayer +perceptron, to achieve the contrastive learning purpose. The CL-MFAP +outperforms baseline models in antibiotic property prediction by effectively +utilizing different molecular modalities and demonstrates superior +domain-specific performance when fine-tuned for antibiotic-related property +prediction tasks. 
-摘要:反事實推理被廣泛認為是人工智慧中因果關係最具挑戰性和複雜的面向之一。在本文中,我們評估大型語言模型 (LLM) 在反事實推理中的表現。與主要關注常識因果推理,其中 LLM 經常依賴先驗知識來進行推理的先前研究不同,我們特別評估它們使用一組形式規則執行反事實推理的能力。為了支持此評估,我們引入了一個新的基準資料集 CounterBench,其中包含 1K 個反事實推理問題。資料集的設計具有不同的難度等級、多樣化的因果圖結構、不同類型的反事實問題和多種無意義的名稱變體。我們的實驗表明,反事實推理對 LLM 構成重大挑戰,大多數模型的表現與隨機猜測相當。為了增強 LLM 的反事實推理能力,我們提出了一種新穎的推理範例 CoIn,它引導 LLM 透過反覆推理和回溯系統性地探索反事實解。實驗結果表明,我們的方法顯著提升 LLM 在反事實推理任務上的表現,並持續增強不同 LLM 的表現。我們的資料集可在 https://huggingface.co/datasets/CounterBench/CounterBench 取得。 +摘要:由於抗菌藥物抗性上升,找出具有抗生素潛力的新型化合物對於對抗此項全球性健康議題至關重要。不過,傳統的藥物開發方法成本高昂且效率不彰。研究人員體認到對於更有效解決方案的迫切需求,因此轉向機器學習技術來簡化新型抗生素化合物的預測和開發。儘管基礎模型在抗生素發現方面展現潛力,目前的普遍做法仍未充分利用多模態分子資料的潛力。最近的研究顯示,利用多模態資料的對比學習架構在各種領域的表徵學習中展現出優異的效能。有鑑於此,我們引進 CL-MFAP,一種無監督對比學習 (CL) 為基礎的多模態基礎 (MF) 模型,專門用於使用三種類型的分子資料發現具有潛在抗生素特性的低分子。此模型採用 ChEMBL 資料集中的 160 萬個具有類藥物特性的生物活性分子,以聯合預訓練三個編碼器:(1) 一個具有旋轉位置嵌入的基於Transformer的編碼器,用於處理 SMILES 字串;(2) 另一個基於Transformer的編碼器,結合一種新穎的雙層路由注意機制來處理分子圖表表徵;以及 (3) 一個使用多層感知器的 Morgan 指紋編碼器,以達成對比學習的目的。CL-MFAP 透過有效利用不同的分子模式在抗生素特性預測方面優於基準模型,並且在針對抗生素相關特性預測任務進行微調時展現出優異的特定領域效能。 -##### **RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation** -2502.10996v1 by Pengcheng Jiang, Lang Cao, Ruike Zhu, Minhao Jiang, Yunyi Zhang, Jimeng Sun, Jiawei Han +##### **Automatic Quality Assessment of First Trimester Crown-Rump-Length Ultrasound Images** +2502.10908v1 by Sevim Cengiz, Ibraheem Hamdi, Mohammad Yaqub -Retrieval-augmented language models often struggle with knowledge-intensive -tasks due to inefficient retrieval, unstructured knowledge integration, and -single-pass architectures. We present Retrieval-And-Structuring (RAS), a novel -framework that dynamically constructs and reasons over query-specific knowledge -graphs through iterative retrieval and structuring. 
RAS introduces four key -technical innovations: (1) a theme-scoped retrieval mechanism that efficiently -narrows the search space while maintaining retrieval quality, (2) an action -planning module that determines knowledge needs and generates focused -sub-queries, (3) a dynamic knowledge structuring approach that converts -retrieved text into an evolving knowledge graph, and (4) a graph-augmented -answering component that leverages the accumulated structured information. Our -framework achieves state-of-the-art performance, surpassing leading baselines -by 6.4% with open-source language models and 7.0% with proprietary models on -seven knowledge-intensive generation datasets across all evaluation metrics. -Detailed ablation studies verify the contribution of each technical component -to the overall system performance. +Fetal gestational age (GA) is vital clinical information that is estimated +during pregnancy in order to assess fetal growth. This is usually performed by +measuring the crown-rump-length (CRL) on an ultrasound image in the dating scan, +which is then correlated with fetal age and growth trajectory. A major issue +when performing the CRL measurement is ensuring that the image is acquired at +the correct view, otherwise it could be misleading. Although clinical +guidelines specify the criteria for the correct CRL view, sonographers may not +regularly adhere to such rules. In this paper, we propose a new deep +learning-based solution that is able to verify the adherence of a CRL image to +clinical guidelines in order to assess image quality and facilitate accurate +estimation of GA. We first segment out important fetal structures, then use the +localized structures to perform a clinically-guided mapping that verifies the +adherence to the criteria. The segmentation method combines the benefits of +Convolutional Neural Network (CNN) and the Vision Transformer (ViT) to segment +fetal structures in ultrasound images and localize important fetal landmarks. 
+For segmentation purposes, we compare our proposed work with UNet and show that +our CNN/ViT-based method outperforms an optimized version of UNet. Furthermore, +we compare the output of the mapping with classification CNNs when assessing +the clinical criteria and the overall acceptability of CRL images. We show that +the proposed mapping is not only explainable but also more accurate than the +best performing classification CNNs. -摘要:检索增强语言模型通常会因检索效率低、知识整合无结构和单次通过架构而难以胜任知识密集型任务。我们提出检索和结构化 (RAS),这是一个新颖的框架,通过迭代检索和结构化,动态构建和推理特定于查询的知识图谱。RAS 引入了四项关键技术创新:(1) 主题范围检索机制,在保持检索质量的同时有效缩小搜索空间,(2) 动作规划模块,确定知识需求并生成重点子查询,(3) 动态知识结构化方法,将检索到的文本转换为不断发展的知识图谱,以及 (4) 图谱增强型回答组件,利用累积的结构化信息。我们的框架实现了最先进的性能,在七个知识密集型生成数据集上,使用开源语言模型提高了 6.4%,使用专有模型提高了 7.0%,超越了领先的基线,且所有评估指标均如此。详细的消融研究验证了每个技术组件对整体系统性能的贡献。 +摘要:胎兒妊娠年齡 (GA) 是重要的臨床資訊,會在懷孕期間估計,以評估胎兒生長。這通常是透過在約會掃描中測量超音波影像中的頭臀長度 (CRL) 來執行,然後與胎兒年齡和生長軌跡相關聯。執行 CRL 測量時的一個主要問題是確保影像是在正確的視角下取得,否則可能會產生誤導。儘管臨床指南規定了正確 CRL 視角的標準,但超音波檢查員可能不會定期遵守這些規則。在本文中,我們提出了一個新的深度學習解決方案,能夠驗證 CRL 影像是否符合臨床指南,以評估影像品質並促進對 GA 的準確估計。我們首先分割出重要的胎兒結構,然後使用局部結構來執行臨床指導的對應,以驗證標準的遵守情況。分割方法結合了卷積神經網路 (CNN) 和視覺轉換器 (ViT) 的優點,以分割超音波影像中的胎兒結構並定位重要的胎兒標誌。為了分割目的,我們將我們提出的工作與 UNet 進行比較,並顯示我們基於 CNN/ViT 的方法優於 UNet 的最佳化版本。此外,我們在評估臨床標準和 CRL 影像的整體可接受性時,將對應的輸出與分類 CNN 進行比較。我們表明,所提出的對應不僅可以解釋,而且比效能最佳的分類 CNN 更準確。 -##### **Developing Conversational Speech Systems for Robots to Detect Speech Biomarkers of Cognition in People Living with Dementia** -2502.10896v1 by Rohith Perumandla, Young-Ho Bae, Diego Izaguirre, Esther Hwang, Andrew Murphy, Long-Jing Hsu, Selma Sabanovic, Casey C. Bennett +##### **Breaking Down the Hierarchy: A New Approach to Leukemia Classification** +2502.10899v1 by Ibraheem Hamdi, Hosam El-Gendy, Ahmed Sharshar, Mohamed Saeed, Muhammad Ridzuan, Shahrukh K. 
Hashmi, Naveed Syed, Imran Mirza, Shakir Hussain, Amira Mahmoud Abdalla, Mohammad Yaqub -This study presents the development and testing of a conversational speech -system designed for robots to detect speech biomarkers indicative of cognitive -impairments in people living with dementia (PLwD). The system integrates a -backend Python WebSocket server and a central core module with a large language -model (LLM) fine-tuned for dementia to process user input and generate robotic -conversation responses in real-time in less than 1.5 seconds. The frontend user -interface, a Progressive Web App (PWA), displays information and biomarker -score graphs on a smartphone in real-time to human users (PLwD, caregivers, -clinicians). Six speech biomarkers based on the existing literature - Altered -Grammar, Pragmatic Impairments, Anomia, Disrupted Turn-Taking, Slurred -Pronunciation, and Prosody Changes - were developed for the robot conversation -system using two datasets, one that included conversations of PLwD with a human -clinician (DementiaBank dataset) and one that included conversations of PLwD -with a robot (Indiana dataset). We also created a composite speech biomarker -that combined all six individual biomarkers into a single score. The speech -system's performance was first evaluated on the DementiaBank dataset showing -moderate correlation with MMSE scores, with the composite biomarker score -outperforming individual biomarkers. Analysis of the Indiana dataset revealed -higher and more variable biomarker scores, suggesting potential differences due -to study populations (e.g. severity of dementia) and the conversational -scenario (human-robot conversations are different from human-human). The -findings underscore the need for further research on the impact of -conversational scenarios on speech biomarkers and the potential clinical -applications of robotic speech systems. 
+The complexities inherent to leukemia, a multifaceted cancer affecting white +blood cells, pose considerable diagnostic and treatment challenges, primarily +due to reliance on laborious morphological analyses and expert judgment that +are susceptible to errors. Addressing these challenges, this study presents a +refined, comprehensive strategy leveraging advanced deep-learning techniques +for the classification of leukemia subtypes. We commence by developing a +hierarchical label taxonomy, paving the way for differentiating between various +subtypes of leukemia. The research further introduces a novel hierarchical +approach, inspired by clinical procedures, capable of accurately classifying +diverse types of leukemia alongside reactive and healthy cells. An integral +part of this study involves a meticulous examination of the performance of +Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) as +classifiers. The proposed method exhibits an impressive success rate, achieving +approximately 90\% accuracy across all leukemia subtypes, as substantiated by +our experimental results. A visual representation of the experimental findings +is provided to enhance the model's explainability and aid in understanding the +classification process. 
-摘要:本研究展示了對話式語音系統的開發和測試,該系統專為機器人設計,用於偵測失智症患者(PLwD)認知障礙的語言生物標記。該系統整合了後端 Python WebSocket 伺服器和一個中央核心模組,其中包含針對失智症微調的大語言模型(LLM),以處理使用者輸入並在不到 1.5 秒的時間內產生機器人對話回應。前端使用者介面(漸進式網路應用程式,PWA)會在智慧型手機上即時向人類使用者(PLwD、照護者、臨床醫生)顯示資訊和生物標記評分圖表。根據現有文獻,針對機器人對話系統開發了六個語言生物標記:語法改變、實用障礙、失語症、輪流中斷、發音不清和韻律變化,使用了兩個資料集,一個包含 PLwD 與人類臨床醫生對話(DementiaBank 資料集),另一個包含 PLwD 與機器人對話(Indiana 資料集)。我們還建立了一個複合語言生物標記,將所有六個個別生物標記組合成一個單一評分。語言系統的效能首先在 DementiaBank 資料集上進行評估,顯示與 MMSE 評分有中等相關性,複合生物標記評分優於個別生物標記。對 Indiana 資料集的分析顯示出較高且變異性較大的生物標記評分,這表明由於研究族群(例如失智症的嚴重程度)和對話情境(人機對話與人際對話不同)而產生潛在差異。研究結果強調需要進一步研究對話情境對語言生物標記的影響,以及機器人語言系統的潛在臨床應用。 +摘要:白血病的复杂性源于它是一种影响白血球的多面性癌症,主要由于依赖费力的形态分析和容易出错的专家判断,因此带来了相当大的诊断和治疗挑战。为了应对这些挑战,本研究提出了一种精细且全面的策略,利用先进的深度学习技术对白血病亚型进行分类。我们首先开发了一个分层的标签分类法,为区分白血病的各种亚型铺平了道路。该研究进一步引入了一种新颖的分层方法,该方法受临床程序的启发,能够准确地对各种类型的白血病以及反应性和健康细胞进行分类。本研究的一个组成部分涉及对卷积神经网络 (CNN) 和视觉变压器 (ViT) 作为分类器的性能进行细致检查。所提出的方法展示了令人印象深刻的成功率,在所有白血病亚型中实现了大约 90% 的准确率,我们的实验结果证实了这一点。提供了实验结果的可视化表示,以增强模型的可解释性并帮助理解分类过程。 -##### **Evaluating improvements on using Large Language Models (LLMs) for property extraction in the Open Research Knowledge Graph (ORKG)** -2502.10768v1 by Sandra Schaftner +##### **An Empirical Analysis of Uncertainty in Large Language Model Evaluations** +2502.10709v1 by Qiujie Xie, Qingqiu Li, Zhuohao Yu, Yuejie Zhang, Yue Zhang, Linyi Yang -Current research highlights the great potential of Large Language Models -(LLMs) for constructing Scholarly Knowledge Graphs (SKGs). One particularly -complex step in this process is relation extraction, aimed at identifying -suitable properties to describe the content of research. This study builds -directly on previous research of three Open Research Knowledge Graph (ORKG) -team members who assessed the readiness of LLMs such as GPT-3.5, Llama 2, and -Mistral for property extraction in scientific literature. 
Given the moderate -performance observed, the previous work concluded that fine-tuning is needed to -improve these models' alignment with scientific tasks and their emulation of -human expertise. Expanding on this prior experiment, this study evaluates the -impact of advanced prompt engineering techniques and demonstrates that these -techniques can highly significantly enhance the results. Additionally, this -study extends the property extraction process to include property matching to -existing ORKG properties, which are retrieved via the API. The evaluation -reveals that results generated through advanced prompt engineering achieve a -higher proportion of matches with ORKG properties, further emphasizing the -enhanced alignment achieved. Moreover, this lays the groundwork for addressing -challenges such as the inconsistency of ORKG properties, an issue highlighted -in prior studies. By assigning unique URIs and using standardized terminology, -this work increases the consistency of the properties, fulfilling a crucial -aspect of Linked Data and FAIR principles - core commitments of ORKG. This, in -turn, significantly enhances the applicability of ORKG content for subsequent -tasks such as comparisons of research publications. Finally, the study -concludes with recommendations for future improvements in the overall property -extraction process. +As LLM-as-a-Judge emerges as a new paradigm for assessing large language +models (LLMs), concerns have been raised regarding the alignment, bias, and +stability of LLM evaluators. While substantial work has focused on alignment +and bias, little research has concentrated on the stability of LLM evaluators. +In this paper, we conduct extensive experiments involving 9 widely used LLM +evaluators across 2 different evaluation settings to investigate the +uncertainty in model-based LLM evaluations. We pinpoint that LLM evaluators +exhibit varying uncertainty based on model families and sizes. 
With careful +comparative analyses, we find that employing special prompting strategies, +whether during inference or post-training, can alleviate evaluation uncertainty +to some extent. By utilizing uncertainty to enhance LLM's reliability and +detection capability in Out-Of-Distribution (OOD) data, we further fine-tune an +uncertainty-aware LLM evaluator named ConfiLM using a human-annotated +fine-tuning set and assess ConfiLM's OOD evaluation ability on a manually +designed test set sourced from the 2024 Olympics. Experimental results +demonstrate that incorporating uncertainty as additional information during the +fine-tuning phase can largely improve the model's evaluation performance in OOD +scenarios. The code and data are released at: +https://github.com/hasakiXie123/LLM-Evaluator-Uncertainty. -摘要:目前的調查強調大語言模型 (LLM) 在建構學術知識圖譜 (SKG) 上的巨大潛力。此過程中特別複雜的步驟是關係萃取,目標是找出合適的屬性來描述研究內容。本研究直接建立在三位開放研究知識圖譜 (ORKG) 團隊成員先前研究的基礎上,他們評估了 GPT-3.5、Llama 2 和 Mistral 等 LLM 在科學文獻中萃取屬性的準備情況。鑑於觀察到的表現中等,先前的研究結論是需要微調,以改善這些模型與科學任務的一致性,以及它們對人類專業知識的模擬。本研究擴展了先前的實驗,評估了進階提示工程技術的影響,並證明這些技術可以大幅顯著地提升結果。此外,本研究將屬性萃取流程擴展到包含與現有 ORKG 屬性的屬性比對,這些屬性是透過 API 擷取的。評估結果顯示,透過進階提示工程產生的結果與 ORKG 屬性有更高的比對比例,進一步強調所達成的進階一致性。此外,這也為了解決先前的研究中強調的問題,例如 ORKG 屬性的不一致性,奠定了基礎。透過指定唯一的 URI 並使用標準化的術語,本研究增加了屬性的相容性,達成了連結資料和 FAIR 原則的重要層面,這是 ORKG 的核心承諾。這反過來大幅提升了 ORKG 內容在後續任務中的適用性,例如研究出版品的比較。最後,本研究以針對整體屬性萃取流程未來改進的建議作為結論。 +摘要:隨著 LLM 作為法官的新典範出現,用於評估大型語言模型 (LLM) 的 LLM 評估器在對齊、偏差和穩定性方面引發了關注。儘管大量工作集中在對齊和偏差上,但很少有研究集中在 LLM 評估器的穩定性上。在本文中,我們進行了廣泛的實驗,涉及 9 個廣泛使用的 LLM 評估器,跨越 2 個不同的評估設定,以調查基於模型的 LLM 評估中的不確定性。我們精確指出 LLM 評估器根據模型系列和大小表現出不同的不確定性。通過仔細的比較分析,我們發現採用特殊的提示策略(無論是在推理過程中還是訓練後)可以在一定程度上緩解評估不確定性。通過利用不確定性來增強 LLM 在 Out-Of-Distribution (OOD) 數據中的可靠性和檢測能力,我們進一步微調了一個名為 ConfiLM 的不確定性感知 LLM 評估器,使用人工註釋的微調設置,並評估 ConfiLM 在手動設計的、來自 2024 年奧運會的測試集上的 OOD 評估能力。實驗結果表明,在微調階段將不確定性作為附加信息納入其中可以在很大程度上提高模型在 OOD 場景中的評估性能。代碼和數據發布於: +https://github.com/hasakiXie123/LLM-Evaluator-Uncertainty。 -##### **K-Edit: Language Model Editing with Contextual Knowledge Awareness** 
-2502.10626v1 by Elan Markowitz, Anil Ramakrishna, Ninareh Mehrabi, Charith Peris, Rahul Gupta, Kai-Wei Chang, Aram Galstyan +##### **Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model** +2502.10707v1 by Jiarui Jin, Haoyu Wang, Hongyan Li, Jun Li, Jiahui Pan, Shenda Hong -As the world changes, we need to be able to update our models and correct -false information without costly retraining. Knowledge-based model editing -enables precise modifications to the weights of large language models in order -to modify the information encoded within. Recent approaches have seen success -in enabling recall of edited information for thousands of edits at once. -However, these approaches fail to produce edits that account for associated -contextual information. We present K-Edit, an effective approach to generating -contextually consistent knowledge edits. By using knowledge graphs, which -maintain contextual consistency when an edge is edited, we are able to generate -additional \textit{contextual edits} that ensure consistency of related -information in the language model. Our experiments demonstrate significant -improvements in multi-hop question answering while maintaining the general -effectiveness and scalability of model edits. +Electrocardiogram (ECG) is essential for the clinical diagnosis of +arrhythmias and other heart diseases, but deep learning methods based on ECG +often face limitations due to the need for high-quality annotations. Although +previous ECG self-supervised learning (eSSL) methods have made significant +progress in representation learning from unannotated ECG data, they typically +treat ECG signals as ordinary time-series data, segmenting the signals using +fixed-size and fixed-step time windows, which often ignore the form and rhythm +characteristics and latent semantic relationships in ECG signals. 
In this work, +we introduce a novel perspective on ECG signals, treating heartbeats as words +and rhythms as sentences. Based on this perspective, we first designed the +QRS-Tokenizer, which generates semantically meaningful ECG sentences from the +raw ECG signals. Building on these, we then propose HeartLang, a novel +self-supervised learning framework for ECG language processing, learning +general representations at form and rhythm levels. Additionally, we construct +the largest heartbeat-based ECG vocabulary to date, which will further advance +the development of ECG language processing. We evaluated HeartLang across six +public ECG datasets, where it demonstrated robust competitiveness against other +eSSL methods. Our data and code are publicly available at +https://github.com/PKUDigitalHealth/HeartLang. -摘要:隨著世界變化,我們需要能夠更新我們的模型,並在不進行昂貴的重新訓練的情況下更正錯誤資訊。基於知識的模型編輯能夠對大型語言模型的權重進行精確修改,以便修改其中編碼的資訊。最近的方法在一次啟用數千次編輯的編輯資訊的召回方面取得了成功。然而,這些方法無法產生考慮相關上下文資訊的編輯。我們提出 K-Edit,這是一種產生上下文一致的知識編輯的有效方法。通過使用知識圖,在編輯邊緣時保持上下文一致性,我們能夠產生額外的「上下文編輯」,以確保語言模型中相關資訊的一致性。我們的實驗證明了多跳問題回答的顯著改進,同時保持了模型編輯的一般有效性和可擴充性。 +摘要:心電圖 (ECG) 對於心律不整和其他心臟疾病的臨床診斷至關重要,但基於心電圖的深度學習方法通常會因需要高品質註解而面臨限制。儘管先前的 ECG 自我監督學習 (eSSL) 方法在從未註解的 ECG 資料中學習表徵方面取得顯著進展,但它們通常將 ECG 訊號視為普通的時間序列資料,使用固定大小和固定步長的時窗對訊號進行分段,這通常會忽略 ECG 訊號中的形式和節律特徵以及潛在的語義關係。在這項工作中,我們對 ECG 訊號引入了新的觀點,將心跳視為單字,將節律視為句子。基於此觀點,我們首先設計了 QRS-Tokenizer,它從原始 ECG 訊號中產生語義有意義的 ECG 句子。在此基礎上,我們提出了 HeartLang,一種用於 ECG 語言處理的新型自我監督學習框架,在形式和節律層面上學習一般表徵。此外,我們構建了迄今為止最大的基於心跳的 ECG 詞彙表,這將進一步促進 ECG 語言處理的發展。我們在六個公開的 ECG 資料集上評估了 HeartLang,它展示了與其他 eSSL 方法相比的強大競爭力。我們的資料和程式碼可在 https://github.com/PKUDigitalHealth/HeartLang 公開取得。 + +##### **Self-Explaining Hypergraph Neural Networks for Diagnosis Prediction** +2502.10689v1 by Leisheng Yu, Yanxiao Cai, Minxing Zhang, Xia Hu + +The burgeoning volume of electronic health records (EHRs) has enabled deep +learning models to excel in predictive healthcare. 
However, for high-stakes +applications such as diagnosis prediction, model interpretability remains +paramount. Existing deep learning diagnosis prediction models with intrinsic +interpretability often assign attention weights to every past diagnosis or +hospital visit, providing explanations lacking flexibility and succinctness. In +this paper, we introduce SHy, a self-explaining hypergraph neural network +model, designed to offer personalized, concise and faithful explanations that +allow for interventions from clinical experts. By modeling each patient as a +unique hypergraph and employing a message-passing mechanism, SHy captures +higher-order disease interactions and extracts distinct temporal phenotypes as +personalized explanations. It also addresses the incompleteness of the EHR data +by accounting for essential false negatives in the original diagnosis record. A +qualitative case study and extensive quantitative evaluations on two real-world +EHR datasets demonstrate the superior predictive performance and +interpretability of SHy over existing state-of-the-art models. + +摘要:隨著電子健康紀錄 (EHR) 數量的激增,深度學習模型在預測保健方面表現出色。然而,對於診斷預測等高風險應用,模型的可解釋性仍然至關重要。現有的具有內在可解釋性的深度學習診斷預測模型通常會為每個過去的診斷或醫院就診分配注意力權重,提供的解釋缺乏靈活性且簡潔性。在本文中,我們介紹了 SHy,這是一個自解釋的超圖神經網路模型,旨在提供個性化、簡潔且忠實的解釋,讓臨床專家可以進行干預。通過將每個患者建模為一個獨特的超圖並採用訊息傳遞機制,SHy 捕捉到了高階疾病交互作用,並提取出不同的時間表型作為個性化解釋。它還通過考慮原始診斷記錄中的基本假陰性來解決電子健康紀錄資料的不完整性。對兩個真實世界電子健康紀錄資料集進行的定性案例研究和廣泛的定量評估表明,SHy 在預測效能和可解釋性方面優於現有的最先進模型。 ##### **ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis** 2502.10620v1 by Xueshen Li, Xinlong Hou, Ziyi Huang, Yu Gan @@ -8803,1348 +8812,1339 @@ serving as a valuable resource for training LLM. 
摘要:大型語言模型 (LLM) 最近的進展已展現出非凡的理解能力,在各種視覺語言任務中取得了顯著的突破。然而,LLM 在產生可靠的醫療診斷報告中的應用仍處於早期階段。目前,醫療 LLM 通常採用被動互動模式,醫生對患者的疑問做出回應,但很少或根本不參與分析醫療影像。相比之下,有些聊天機器人僅根據視覺輸入回應預先定義的查詢,缺乏互動對話或對病史的考量。因此,LLM 產生的患者聊天機器人互動與實際患者醫生諮詢之間存在差距。為了彌合這一差距,我們開發了一個基於 LLM 的對話系統,即主動多輪視覺語言互動,用於電腦輔助診斷 (ProMRVL-CAD),以產生對患者友善的疾病診斷報告。建議的 ProMRVL-CAD 系統允許主動對話,透過將知識圖譜整合到推薦系統中,為患者提供持續且可靠的醫療管道。具體來說,我們設計了兩個產生器:主動問題產生器 (Pro-Q Gen),用於產生引導診斷程序的主動問題,以及多視覺患者文字診斷報告產生器 (MVP-DR Gen),用於產生高品質的診斷報告。評估兩個真實世界公開可用的資料集,MIMIC-CXR 和 IU-Xray,我們的模型在產生醫療報告方面品質較佳。我們進一步證明 ProMRVL 的效能,在影像品質低的情況下仍能穩健運行。此外,我們建立了一個模擬患者和醫生之間主動診斷互動的合成醫療對話資料集,作為訓練 LLM 的寶貴資源。 -##### **GraphiT: Efficient Node Classification on Text-Attributed Graphs with Prompt Optimized LLMs** -2502.10522v1 by Shima Khoshraftar, Niaz Abedini, Amir Hajian - -The application of large language models (LLMs) to graph data has attracted a -lot of attention recently. LLMs allow us to use deep contextual embeddings from -pretrained models in text-attributed graphs, where shallow embeddings are often -used for the text attributes of nodes. However, it is still challenging to -efficiently encode the graph structure and features into a sequential form for -use by LLMs. In addition, the performance of an LLM alone, is highly dependent -on the structure of the input prompt, which limits their effectiveness as a -reliable approach and often requires iterative manual adjustments that could be -slow, tedious and difficult to replicate programmatically. In this paper, we -propose GraphiT (Graphs in Text), a framework for encoding graphs into a -textual format and optimizing LLM prompts for graph prediction tasks. Here we -focus on node classification for text-attributed graphs. We encode the graph -data for every node and its neighborhood into a concise text to enable LLMs to -better utilize the information in the graph. We then further programmatically -optimize the LLM prompts using the DSPy framework to automate this step and -make it more efficient and reproducible. 
GraphiT outperforms our LLM-based -baselines on three datasets and we show how the optimization step in GraphiT -leads to measurably better results without manual prompt tweaking. We also -demonstrated that our graph encoding approach is competitive with other graph -encoding methods while being less expensive because it uses significantly fewer -tokens for the same task. - -摘要:大型語言模型 (LLM) 在圖表資料的應用最近備受關注。LLM 讓我們能夠在文字標記圖表中使用預訓練模型的深度脈絡嵌入,其中淺層嵌入通常用於節點的文字屬性。然而,要有效率地將圖表結構和特徵編碼成序列形式供 LLM 使用,仍然是一項挑戰。此外,單獨 LLM 的效能高度依賴輸入提示的結構,這限制了它們作為可靠方法的有效性,而且通常需要反覆的人工調整,這可能會緩慢、繁瑣且難以透過程式複製。在本文中,我們提出 GraphiT(文字中的圖表),一個用於將圖表編碼成文字格式並最佳化 LLM 提示以進行圖表預測任務的架構。在這裡,我們專注於文字標記圖表的節點分類。我們將每個節點及其鄰域的圖表資料編碼成簡潔的文字,讓 LLM 能夠更好地利用圖表中的資訊。然後,我們進一步透過程式最佳化 LLM 提示,使用 DSPy 架構自動化這個步驟,並使其更有效率且可複製。GraphiT 在三個資料集上優於我們的基於 LLM 的基準,我們展示了 GraphiT 中的最佳化步驟如何導致顯著更好的結果,而無需手動調整提示。我們還證明了我們的圖表編碼方法與其他圖表編碼方法具有競爭力,同時成本更低,因為它在相同的任務中使用了顯著更少的標記。 - -##### **Do Large Language Models Reason Causally Like Us? Even Better?** -2502.10215v1 by Hanna M. Dettki, Brenden M. Lake, Charley M. Wu, Bob Rehder - -Causal reasoning is a core component of intelligence. Large language models -(LLMs) have shown impressive capabilities in generating human-like text, -raising questions about whether their responses reflect true understanding or -statistical patterns. We compared causal reasoning in humans and four LLMs -using tasks based on collider graphs, rating the likelihood of a query variable -occurring given evidence from other variables. We find that LLMs reason -causally along a spectrum from human-like to normative inference, with -alignment shifting based on model, context, and task. Overall, GPT-4o and -Claude showed the most normative behavior, including "explaining away", whereas -Gemini-Pro and GPT-3.5 did not. Although all agents deviated from the expected -independence of causes - Claude the least - they exhibited strong associative -reasoning and predictive inference when assessing the likelihood of the effect -given its causes. 
These findings underscore the need to assess AI biases as -they increasingly assist human decision-making. - -摘要:因果推理是智能的核心組成部分。大型語言模型 (LLM) 在生成類人文本方面展現了令人印象深刻的能力,引發了關於它們的回應是否反映真實理解或統計模式的疑問。我們使用基於碰撞圖的任務比較了人類和四個 LLM 中的因果推理,根據其他變數的證據評估查詢變數發生的可能性。我們發現 LLM 沿著從類人到規範推論的光譜進行因果推理,對齊會根據模型、上下文和任務而改變。總體而言,GPT-4o 和 Claude 表現出最規範的行為,包括「解釋」,而 Gemini-Pro 和 GPT-3.5 則沒有。儘管所有代理都偏離了預期的原因獨立性 - Claude 最不偏離 - 但它們在評估給定原因的效果可能性時表現出強烈的關聯推理和預測推論。這些發現強調了評估 AI 偏差的必要性,因為它們越來越協助人類決策。 - -##### **Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages** -2502.10140v1 by Daniil Gurgurov, Ivan Vykopal, Josef van Genabith, Simon Ostermann +##### **Optimizing CNN Architectures for Advanced Thoracic Disease Classification** +2502.10614v1 by Tejas Mirthipati -Low-resource languages (LRLs) face significant challenges in natural language -processing (NLP) due to limited data. While current state-of-the-art large -language models (LLMs) still struggle with LRLs, smaller multilingual models -(mLMs) such as mBERT and XLM-R offer greater promise due to a better fit of -their capacity to low training data sizes. This study systematically -investigates parameter-efficient adapter-based methods for adapting mLMs to -LRLs, evaluating three architectures: Sequential Bottleneck, Invertible -Bottleneck, and Low-Rank Adaptation. Using unstructured text from GlotCC and -structured knowledge from ConceptNet, we show that small adaptation datasets -(e.g., up to 1 GB of free-text or a few MB of knowledge graph data) yield gains -in intrinsic (masked language modeling) and extrinsic tasks (topic -classification, sentiment analysis, and named entity recognition). We find that -Sequential Bottleneck adapters excel in language modeling, while Invertible -Bottleneck adapters slightly outperform other methods on downstream tasks due -to better embedding alignment and larger parameter counts. 
Adapter-based -methods match or outperform full fine-tuning while using far fewer parameters, -and smaller mLMs prove more effective for LRLs than massive LLMs like LLaMA-3, -GPT-4, and DeepSeek-R1-based distilled models. While adaptation improves -performance, pre-training data size remains the dominant factor, especially for -languages with extensive pre-training coverage. +Machine learning, particularly convolutional neural networks (CNNs), has +shown promise in medical image analysis, especially for thoracic disease +detection using chest X-ray images. In this study, we evaluate various CNN +architectures, including binary classification, multi-label classification, and +ResNet50 models, to address challenges like dataset imbalance, variations in +image quality, and hidden biases. We introduce advanced preprocessing +techniques such as principal component analysis (PCA) for image compression and +propose a novel class-weighted loss function to mitigate imbalance issues. Our +results highlight the potential of CNNs in medical imaging but emphasize that +issues like unbalanced datasets and variations in image acquisition methods +must be addressed for optimal model performance. 
-摘要:低資源語言 (LRL) 由於資料有限,在自然語言處理 (NLP) 中面臨重大挑戰。雖然當前最先進的大型語言模型 (LLM) 仍難以處理 LRL,但較小的多語言模型 (mLMS),例如 mBERT 和 XLM-R,由於其容量更適合低訓練資料大小,因此提供了更大的希望。本研究系統性地探討了基於參數效率適配器的適配方法,以將 mLMS 適配到 LRL,評估了三種架構:順序瓶頸、可逆瓶頸和低秩適配。使用來自 GlotCC 的非結構化文本和來自 ConceptNet 的結構化知識,我們表明小型適配資料集(例如,高達 1 GB 的自由文本或幾 MB 的知識圖譜資料)在內在(遮蔽語言模型)和外在任務(主題分類、情緒分析和命名實體識別)中產生增益。我們發現順序瓶頸適配器在語言模型中表現出色,而可逆瓶頸適配器由於更好的嵌入對齊和更大的參數數量,在下游任務上略勝於其他方法。基於適配器的方法在使用更少參數的同時,可以匹配或優於完全微調,而較小的 mLM 被證明比 LLaMA-3、GPT-4 和基於 DeepSeek-R1 的蒸餾模型等大型 LLM 更適合 LRL。雖然適配可以提高效能,但預訓練資料大小仍然是主要因素,特別是對於預訓練覆蓋範圍廣泛的語言。 +摘要:機器學習,特別是卷積神經網路 (CNN) 已在醫學影像分析中展現出潛力,特別是使用胸部 X 光影像進行胸腔疾病偵測。在此研究中,我們評估各種 CNN 架構,包括二元分類、多標籤分類和 ResNet50 模型,以解決資料集不平衡、影像品質差異和隱藏偏差等挑戰。我們導入進階前處理技術,例如主成分分析 (PCA) 以進行影像壓縮,並提出一個新穎的類別加權損失函數來緩解不平衡問題。我們的結果突顯了 CNN 在醫學影像中的潛力,但強調必須解決資料集不平衡和影像擷取方法差異等問題,才能獲得最佳模型效能。 -##### **Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models** -2502.10090v1 by Chenrui Tie, Shengxiang Sun, Jinxuan Zhu, Yiwei Liu, Jingxiang Guo, Yue Hu, Haonan Chen, Junting Chen, Ruihai Wu, Lin Shao +##### **PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation** +2502.10536v1 by Faruk Ahmed, Lin Yang, Tiam Jaroensri, Andrew Sellergren, Yossi Matias, Avinatan Hassidim, Greg S. Corrado, Dale R. Webster, Shravya Shetty, Shruthi Prabhakara, Yun Liu, Daniel Golden, Ellery Wulczyn, David F. Steiner -Humans possess an extraordinary ability to understand and execute complex -manipulation tasks by interpreting abstract instruction manuals. For robots, -however, this capability remains a substantial challenge, as they cannot -interpret abstract instructions and translate them into executable actions. In -this paper, we present Manual2Skill, a novel framework that enables robots to -perform complex assembly tasks guided by high-level manual instructions. 
Our -approach leverages a Vision-Language Model (VLM) to extract structured -information from instructional images and then uses this information to -construct hierarchical assembly graphs. These graphs represent parts, -subassemblies, and the relationships between them. To facilitate task -execution, a pose estimation model predicts the relative 6D poses of components -at each assembly step. At the same time, a motion planning module generates -actionable sequences for real-world robotic implementation. We demonstrate the -effectiveness of Manual2Skill by successfully assembling several real-world -IKEA furniture items. This application highlights its ability to manage -long-horizon manipulation tasks with both efficiency and precision, -significantly enhancing the practicality of robot learning from instruction -manuals. This work marks a step forward in advancing robotic systems capable of -understanding and executing complex manipulation tasks in a manner akin to -human capabilities. +The interpretation of histopathology cases underlies many important +diagnostic and treatment decisions in medicine. Notably, this process typically +requires pathologists to integrate and summarize findings across multiple +slides per case. Existing vision-language capabilities in computational +pathology have so far been largely limited to small regions of interest, larger +regions at low magnification, or single whole-slide images (WSIs). This limits +interpretation of findings that span multiple high-magnification regions across +multiple WSIs. By making use of Gemini 1.5 Flash, a large multimodal model +(LMM) with a 1-million token context window, we demonstrate the ability to +generate bottom-line diagnoses from up to 40,000 768x768 pixel image patches +from multiple WSIs at 10X magnification. This is the equivalent of up to 11 +hours of video at 1 fps. 
Expert pathologist evaluations demonstrate that the +generated report text is clinically accurate and equivalent to or preferred +over the original reporting for 68% (95% CI: [60%, 76%]) of multi-slide +examples with up to 5 slides. While performance decreased for examples with 6 +or more slides, this study demonstrates the promise of leveraging the +long-context capabilities of modern LMMs for the uniquely challenging task of +medical report generation where each case can contain thousands of image +patches. -摘要:人類擁有理解並執行複雜操作任務的非凡能力,方法是詮釋抽象的說明手冊。然而,對機器人來說,這項能力仍然是一項重大的挑戰,因為它們無法詮釋抽象的指令並將其轉換為可執行的動作。在本文中,我們提出了 Manual2Skill,這是一個新穎的框架,使機器人能夠在高階手冊說明的指導下執行複雜的組裝任務。我們的做法利用視覺語言模型 (VLM) 從教學圖片中提取結構化資訊,然後使用此資訊來建構階層式組裝圖。這些圖表示零件、子組件以及它們之間的關係。為了促進任務執行,姿勢估計模型會預測每個組裝步驟中組件的相對 6D 姿勢。同時,動作規劃模組會產生適用於實際機器人實作的可操作順序。我們透過成功組裝幾個真實世界的 IKEA 家具來展示 Manual2Skill 的有效性。此應用程式突顯了它以高效率和高精準度管理長時程操作任務的能力,大幅提升機器人從說明手冊中學習的實用性。這項工作標誌著機器人系統在理解和執行複雜操作任務方面向前邁進了一步,其方式類似於人類的能力。 +摘要:組織病理學病例的解讀是許多重要的醫學診斷和治療決策的基礎。值得注意的是,這個過程通常需要病理學家整合和總結每個病例的許多玻片中的發現。迄今為止,計算機病理學中現有的視覺語言功能在很大程度上僅限於小範圍的感興趣區域、低倍率下的較大區域或單一的全玻片影像 (WSI)。這限制了跨多個 WSI 中多個高倍率區域的發現的解讀。通過使用 Gemini 1.5 Flash,一個具有 100 萬個令牌上下文視窗的大型多模態模型 (LMM),我們展示了從多個 WSI 中多達 40,000 個 768x768 像素圖像貼片(10 倍放大)生成底線診斷的能力。這相當於 1 fps 下長達 11 小時的影片。專家病理學家評估表明,生成的報告文字在臨床上是準確的,並且等同於或優於 68%(95% CI:[60%,76%])的多玻片範例(最多 5 個玻片)的原始報告。儘管對於有 6 個或更多玻片的範例,其性能下降,但這項研究證明了利用現代 LMM 的長上下文功能來應對獨特挑戰性的醫療報告生成任務,其中每個病例可能包含數千個影像貼片,這項任務的前景。 -##### **Decision Information Meets Large Language Models: The Future of Explainable Operations Research** -2502.09994v1 by Yansen Zhang, Qingcan Kang, Wing Yin Yu, Hailei Gong, Xiaojin Fu, Xiongwei Han, Tao Zhong, Chen Ma +##### **Tempo: Helping Data Scientists and Domain Experts Collaboratively Specify Predictive Modeling Tasks** +2502.10526v2 by Venkatesh Sivaraman, Anika Vaishampayan, Xiaotong Li, Brian R Buck, Ziyong Ma, Richard D Boyce, Adam Perer -Operations Research (OR) is vital for decision-making in many industries. 
-While recent OR methods have seen significant improvements in automation and -efficiency through integrating Large Language Models (LLMs), they still -struggle to produce meaningful explanations. This lack of clarity raises -concerns about transparency and trustworthiness in OR applications. To address -these challenges, we propose a comprehensive framework, Explainable Operations -Research (EOR), emphasizing actionable and understandable explanations -accompanying optimization. The core of EOR is the concept of Decision -Information, which emerges from what-if analysis and focuses on evaluating the -impact of complex constraints (or parameters) changes on decision-making. -Specifically, we utilize bipartite graphs to quantify the changes in the OR -model and adopt LLMs to improve the explanation capabilities. Additionally, we -introduce the first industrial benchmark to rigorously evaluate the -effectiveness of explanations and analyses in OR, establishing a new standard -for transparency and clarity in the field. +Temporal predictive models have the potential to improve decisions in health +care, public services, and other domains, yet they often fail to effectively +support decision-makers. Prior literature shows that many misalignments between +model behavior and decision-makers' expectations stem from issues of model +specification, namely how, when, and for whom predictions are made. However, +model specifications for predictive tasks are highly technical and difficult +for non-data-scientist stakeholders to interpret and critique. To address this +challenge we developed Tempo, an interactive system that helps data scientists +and domain experts collaboratively iterate on model specifications. Using +Tempo's simple yet precise temporal query language, data scientists can quickly +prototype specifications with greater transparency about pre-processing +choices. 
Moreover, domain experts can assess performance within data subgroups +to validate that models behave as expected. Through three case studies, we +demonstrate how Tempo helps multidisciplinary teams quickly prune infeasible +specifications and identify more promising directions to explore. -摘要:作業研究 (OR) 對許多產業的決策制定至關重要。雖然近期的 OR 方法已透過整合大型語言模型 (LLM) 在自動化和效率方面取得顯著的進步,但它們在產生有意義的解釋方面仍面臨挑戰。這種缺乏明確性的情況會對 OR 應用中的透明度和可信度造成疑慮。為了應對這些挑戰,我們提出一個全面的架構,即可解釋作業研究 (EOR),強調在最佳化過程中提供可操作且易於理解的解釋。EOR 的核心是決策資訊的概念,它源自假設分析,並專注於評估複雜約束條件 (或參數) 變更對決策制定的影響。具體來說,我們利用二部圖量化 OR 模型的變化,並採用 LLM 來改善解釋能力。此外,我們引入了第一個產業基準,以嚴格評估 OR 中解釋和分析的有效性,為該領域的透明度和清晰度建立新的標準。 +摘要:時序預測模型有潛力改善醫療保健、公共服務和其他領域的決策,但它們經常無法有效支援決策者。先前的文獻顯示,模型行為與決策者期望之間的許多不一致源自於模型規範問題,也就是如何、何時以及針對誰進行預測。然而,預測任務的模型規範非常技術化,非數據科學家利害關係人難以解讀和批評。為了應對此挑戰,我們開發了 Tempo,一個互動式系統,可協助數據科學家和領域專家協同反覆運算模型規範。透過使用 Tempo 簡單但精確的時序查詢語言,數據科學家可以快速建構規範原型,並更透明地了解前處理的選擇。此外,領域專家可以評估資料子群組內的效能,以驗證模型是否如預期般運作。透過三個案例研究,我們展示 Tempo 如何協助跨領域團隊快速刪減不可行的規範,並找出更有希望探索的方向。 -##### **KGGen: Extracting Knowledge Graphs from Plain Text with Language Models** -2502.09956v1 by Belinda Mo, Kyssen Yu, Joshua Kazdan, Proud Mpala, Lisa Yu, Chris Cundy, Charilaos Kanatsoulis, Sanmi Koyejo +##### **A Robust Attack: Displacement Backdoor Attack** +2502.10490v1 by Yong Li, Han Gao -Recent interest in building foundation models for KGs has highlighted a -fundamental challenge: knowledge-graph data is relatively scarce. The -best-known KGs are primarily human-labeled, created by pattern-matching, or -extracted using early NLP techniques. While human-generated KGs are in short -supply, automatically extracted KGs are of questionable quality. We present a -solution to this data scarcity problem in the form of a text-to-KG generator -(KGGen), a package that uses language models to create high-quality graphs from -plaintext. Unlike other KG extractors, KGGen clusters related entities to -reduce sparsity in extracted KGs. 
KGGen is available as a Python library -(\texttt{pip install kg-gen}), making it accessible to everyone. Along with -KGGen, we release the first benchmark, Measure of Information in Nodes and -Edges (MINE), that tests an extractor's ability to produce a useful KG from -plain text. We benchmark our new tool against existing extractors and -demonstrate far superior performance. +As artificial intelligence becomes more prevalent in our lives, people are +enjoying the convenience it brings, but they are also facing hidden threats, +such as data poisoning and adversarial attacks. These threats can have +disastrous consequences for the application of artificial intelligence, +especially for some applications that take effect immediately, such as +autonomous driving and medical fields. Among these threats, backdoor attacks +have left a deep impression on people with their concealment and simple +deployment, making them a threat that cannot be ignored. However, in the +process of deploying the backdoor model, the attack often performs +unsatisfactorily in real-world applications because of factors such as jitter +and brightness changes. Based on this, we propose a highly robust backdoor +attack that shifts the target sample and combines it with itself to form a +backdoor sample, the Displacement Backdoor Attack (DBA). Experimental results +show that the DBA attack can resist data augmentation that simulates real-world +differences, such as rotation and cropping.
-摘要:最近对于构建知识图谱基础模型的兴趣凸显了一个基本挑战:知识图谱数据相对稀缺。最知名的知识图谱主要为人标注,由模式匹配创建,或使用早期自然语言处理技术提取。虽然人生成的知识图谱供不应求,但自动提取的知识图谱质量堪忧。我们以文本到知识图谱生成器 (KGGen) 的形式为这一数据稀缺问题提供了一个解决方案,这是一个使用语言模型从纯文本创建高质量图表的包。与其他知识图谱提取器不同,KGGen 对相关实体进行聚类以减少提取的知识图谱中的稀疏性。KGGen 可用作 Python 库(\texttt{pip install kg-gen}),使其所有人都能访问。除了 KGGen,我们还发布了第一个基准测试,即节点和边信息度量 (MINE),它测试了提取器从纯文本生成有用知识图谱的能力。我们针对现有提取器对我们的新工具进行基准测试,并展示了远超其性能。 +摘要:随着人工智能在我们的生活中变得越来越普遍,人们正在享受它带来的便利,但也面临着隐藏的威胁,例如数据中毒和对抗性攻击。这些威胁可能对人工智能的应用产生灾难性后果,特别是对于一些立即生效的应用,例如自动驾驶和医疗领域。在这些威胁中,后门攻击以其隐蔽性和简单的部署给人们留下了深刻的印象,使其成为不可忽视的威胁,然而,在部署后门模型的过程中,后门攻击往往存在一些使其在实际应用中不尽如人意的原因,例如抖动和亮度变化。基于此,我们提出了一种高度鲁棒的后门攻击,该攻击对目标样本进行平移并将其与自身结合以形成后门样本,即置换后门攻击 (DBA)。实验结果表明,DBA 攻击可以抵抗模拟真实世界差异的数据增强,例如旋转和裁剪。 -##### **ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation** -2502.09891v1 by Shu Wang, Yixiang Fang, Yingli Zhou, Xilin Liu, Yuchi Ma +##### **3D ReX: Causal Explanations in 3D Neuroimaging Classification** +2502.12181v1 by Melane Navaratnarajah, Sophie A. Martin, David A. Kelly, Nathan Blake, Hana Chocker -Retrieval-Augmented Generation (RAG) has proven effective in integrating -external knowledge into large language models (LLMs) for question-answer (QA) -tasks. The state-of-the-art RAG approaches often use the graph data as the -external data since they capture the rich semantic information and link -relationships between entities. However, existing graph-based RAG approaches -cannot accurately identify the relevant information from the graph and also -consume large numbers of tokens in the online retrieval process. To address -these issues, we introduce a novel graph-based RAG approach, called Attributed -Community-based Hierarchical RAG (ArchRAG), by augmenting the question using -attributed communities, and also introducing a novel LLM-based hierarchical -clustering method. 
To retrieve the most relevant information from the graph for -the question, we build a novel hierarchical index structure for the attributed -communities and develop an effective online retrieval method. Experimental -results demonstrate that ArchRAG outperforms existing methods in terms of both -accuracy and token cost. +Explainability remains a significant problem for AI models in medical +imaging, making it challenging for clinicians to trust AI-driven predictions. +We introduce 3D ReX, the first causality-based post-hoc explainability tool for +3D models. 3D ReX uses the theory of actual causality to generate +responsibility maps which highlight the regions most crucial to the model's +decision. We test 3D ReX on a stroke detection model, providing insight into +the spatial distribution of features relevant to stroke. -摘要:檢索增強生成 (RAG) 已證明可將外部知識整合到大型語言模型 (LLM),用於問答 (QA) 任務。最先進的 RAG 方法通常使用圖形資料作為外部資料,因為它們擷取了豐富的語意資訊和實體之間的連結關係。然而,現有的基於圖形的 RAG 方法無法準確識別圖形中的相關資訊,而且在線上檢索過程中也會消耗大量的符號。為了解決這些問題,我們提出了一種新穎的基於圖形的 RAG 方法,稱為基於屬性社群的分層 RAG (ArchRAG),透過使用屬性社群來擴充問題,並引入一種新穎的基於 LLM 的分層聚類方法。為了從圖形中檢索與問題最相關的資訊,我們為屬性社群建立了一個新穎的分層索引結構,並開發了一種有效的線上檢索方法。實驗結果證明,ArchRAG 在準確性和符號成本方面都優於現有方法。 +摘要:解釋性仍然是醫療影像中 AI 模型的一大問題,這使得臨床醫生難以信任 AI 驅動的預測。 +我們引入了 3D ReX,這是第一個用於 3D 模型的基於因果關係的事後解釋性工具。3D ReX 使用實際因果關係理論來生成責任圖,該圖突出了對模型決策至關重要的區域。我們在中風檢測模型上測試了 3D ReX,提供了與中風相關特徵的空間分佈的見解。 -##### **Visual Graph Question Answering with ASP and LLMs for Language Parsing** -2502.09211v1 by Jakob Johannes Bauer, Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch +##### **Analyzing Patient Daily Movement Behavior Dynamics Using Two-Stage Encoding Model** +2502.09947v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott -Visual Question Answering (VQA) is a challenging problem that requires to -process multimodal input. Answer-Set Programming (ASP) has shown great -potential in this regard to add interpretability and explainability to modular -VQA architectures. 
In this work, we address the problem of how to integrate ASP -with modules for vision and natural language processing to solve a new and -demanding VQA variant that is concerned with images of graphs (not graphs in -symbolic form). Images containing graph-based structures are a ubiquitous and -popular form of visualisation. Here, we deal with the particular problem of -graphs inspired by transit networks, and we introduce a novel dataset that -amends an existing one by adding images of graphs that resemble metro lines. -Our modular neuro-symbolic approach combines optical graph recognition for -graph parsing, a pretrained optical character recognition neural network for -parsing labels, Large Language Models (LLMs) for language processing, and ASP -for reasoning. This method serves as a first baseline and achieves an overall -average accuracy of 73% on the dataset. Our evaluation provides further -evidence of the potential of modular neuro-symbolic systems, in particular with -pretrained models that do not involve any further training and logic -programming for reasoning, to solve complex VQA tasks. +In the analysis of remote healthcare monitoring data, time series +representation learning offers substantial value in uncovering deeper patterns +of patient behavior, especially given the fine temporal granularity of the +data. In this study, we focus on a dataset of home activity records from people +living with dementia. We propose a two-stage self-supervised learning approach. +The first stage involves converting time-series activities into text strings, +which are then encoded by a fine-tuned language model. In the second stage, +these time-series vectors are bi-dimensionalized so that the PageRank method can +be applied to analyze latent state transitions, quantitatively assess participants' +behavioral patterns, and identify activity biases. These insights, combined with +diagnostic data, aim to support personalized care interventions.
-摘要:視覺問答(VQA)是一項具有挑戰性的問題,需要處理多模態輸入。答案集程式設計(ASP)在這方面顯示出巨大的潛力,可以為模組化 VQA 架構增加可解釋性和說明性。在這項工作中,我們探討如何將 ASP 與視覺和自然語言處理模組整合,以解決一個新的且要求嚴格的 VQA 變體,該變體與圖形影像(而非符號形式的圖形)有關。包含圖形結構的影像是一種普遍且流行的可視化形式。在這裡,我們處理受交通網路啟發的圖形特定問題,並引入一個新的資料集,透過新增類似地鐵路線的圖形影像來修正現有資料集。我們的模組化神經符號方法結合光學圖形辨識進行圖形解析、預先訓練的光學字元辨識神經網路進行標籤解析、大型語言模型(LLM)進行語言處理,以及 ASP 進行推理。此方法作為第一個基準,在資料集上達到 73% 的整體平均準確度。我們的評估進一步證明了模組化神經符號系統的潛力,特別是預先訓練的模型,這些模型不涉及任何進一步的訓練和邏輯程式設計進行推理,以解決複雜的 VQA 任務。 +摘要:在遠程醫療監控數據分析中,時序表示學習在揭示患者行為的更深層模式方面提供了實質性的價值,特別是考慮到數據的精細時間粒度。在本研究中,我們專注於痴呆症患者居家活動記錄的數據集。我們提出了一種兩階段的自我監督學習方法。第一階段涉及將時序活動轉換為文本串,然後由微調語言模型編碼。在第二階段,這些時序向量被雙維化以應用 PageRank 方法,分析潛在狀態轉換以定量評估參與者的行為模式並識別活動偏差。這些見解與診斷數據相結合,旨在支持個性化護理干預。 -##### **Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data** -2502.08547v1 by Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai +##### **TransGUNet: Transformer Meets Graph-based Skip Connection for Medical Image Segmentation** +2502.09931v1 by Ju-Hyeon Nam, Nur Suriza Syazwany, Sang-Chul Lee -The adoption of EHRs has expanded opportunities to leverage data-driven -algorithms in clinical care and research. A major bottleneck in effectively -conducting multi-institutional EHR studies is the data heterogeneity across -systems with numerous codes that either do not exist or represent different -clinical concepts across institutions. The need for data privacy further limits -the feasibility of including multi-institutional patient-level data required to -study similarities and differences across patient subgroups. To address these -challenges, we developed the GAME algorithm. 
Tested and validated across 7 -institutions and 2 languages, GAME integrates data in several levels: (1) at -the institutional level with knowledge graphs to establish relationships -between codes and existing knowledge sources, providing the medical context for -standard codes and their relationship to each other; (2) between institutions, -leveraging language models to determine the relationships between -institution-specific codes with established standard codes; and (3) quantifying -the strength of the relationships between codes using a graph attention -network. Jointly trained embeddings are created using transfer and federated -learning to preserve data privacy. In this study, we demonstrate the -applicability of GAME in selecting relevant features as inputs for AI-driven -algorithms in a range of conditions, e.g., heart failure, rheumatoid arthritis. -We then highlight the application of GAME harmonized multi-institutional EHR -data in a study of Alzheimer's disease outcomes and suicide risk among patients -with mental health disorders, without sharing patient-level data outside -individual institutions. +Skip connection engineering is primarily employed to address the semantic gap +between the encoder and decoder, while also integrating global dependencies to +understand the relationships among complex anatomical structures in medical +image segmentation. Although several models have proposed transformer-based +approaches to incorporate global dependencies within skip connections, they +often face limitations in capturing detailed local features with high +computational complexity. In contrast, graph neural networks (GNNs) exploit +graph structures to effectively capture local and global features. 
Leveraging +these properties, we introduce an attentional cross-scale graph neural network +(ACS-GNN), which enhances the skip connection framework by converting +cross-scale feature maps into a graph structure and capturing complex +anatomical structures through node attention. Additionally, we observed that +deep learning models often produce uninformative feature maps, which degrades +the quality of spatial attention maps. To address this problem, we integrated +entropy-driven feature selection (EFS) with spatial attention, calculating an +entropy score for each channel and filtering out high-entropy feature maps. Our +innovative framework, TransGUNet, comprises ACS-GNN and EFS-based spatial +attention to effectively enhance domain generalizability across various +modalities by leveraging GNNs alongside a reliable spatial attention map, +ensuring more robust features within the skip connection. Through comprehensive +experiments and analysis, TransGUNet achieved superior segmentation performance +on six seen and eight unseen datasets, demonstrating significantly higher +efficiency compared to previous methods.
-摘要:電子健康紀錄的採用擴大了在臨床照護和研究中利用資料驅動演算法的機會。在有效進行多機構電子健康紀錄研究時,一個主要的瓶頸是系統間資料異質性,其中有許多代碼在機構間不存在或表示不同的臨床概念。資料隱私的需求進一步限制了納入多機構患者層級資料的可行性,而這些資料對於研究患者亞群之間的相似性和差異性是必要的。為了應對這些挑戰,我們開發了 GAME 演算法。GAME 已在 7 個機構和 2 種語言中進行測試和驗證,它整合了多個層級的資料:(1) 在機構層級,使用知識圖表來建立代碼和現有知識來源之間的關係,為標準代碼及其彼此之間的關係提供醫療背景;(2) 在機構之間,利用語言模型來確定機構特定代碼與已建立的標準代碼之間的關係;(3) 使用圖形注意網路量化代碼之間關係的強度。使用遷移和聯合學習建立聯合訓練的嵌入,以保護資料隱私。在本研究中,我們展示了 GAME 在選擇相關特徵作為 AI 驅動演算法輸入時的適用性,適用於各種情況,例如心臟衰竭、類風濕性關節炎。然後,我們重點介紹了 GAME 和諧化多機構電子健康紀錄資料在阿茲海默症疾病結果和精神疾病患者自殺風險研究中的應用,而無需在個別機構之外共享患者層級資料。 +摘要:跳躍連接工程主要用於解決編碼器和解碼器之間的語義鴻溝,同時還整合全局依賴關係以了解醫學影像分割中複雜解剖結構之間的關係。儘管有幾個模型提出了基於Transformer的架構來整合跳躍連接中的全局依賴關係,但它們在以高計算複雜度擷取詳細的局部特徵時常常面臨限制。相比之下,圖神經網路 (GNN) 利用圖結構有效擷取局部和全局特徵。利用這些屬性,我們引入了注意力跨尺度圖神經網路 (ACS-GNN),它通過將跨尺度特徵圖轉換為圖結構並通過節點注意力擷取複雜的解剖結構來增強跳躍連接框架。此外,我們觀察到深度學習模型通常會產生無意義的特徵圖,這會降低空間注意力圖的品質。為了解決這個問題,我們將熵驅動特徵選擇 (EFS) 與空間注意力整合在一起,為每個通道計算熵分數並濾出高熵特徵圖。我們創新的框架 TransGUNet 包含 ACS-GNN 和基於 EFS 的空間注意力,通過利用 GNN 以及可靠的空間注意力圖有效增強跨各種模態的域泛化能力,確保跳躍連接中更強大的特徵。透過全面的實驗和分析,TransGUNet 在六個已見和八個未見的資料集上實現了優異的分割效能,證明與先前的方法相比,效率顯著提高。 -##### **Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy** -2502.08353v1 by Ruizhan Xue, Huimin Deng, Fang He, Maojun Wang, Zeyu Zhang +##### **Video2Policy: Scaling up Manipulation Tasks in Simulation through Internet Videos** +2502.09886v1 by Weirui Ye, Fangchen Liu, Zheng Ding, Yang Gao, Oleh Rybkin, Pieter Abbeel -With the extensive application of Graph Neural Networks (GNNs) across various -domains, their trustworthiness has emerged as a focal point of research. Some -existing studies have shown that the integration of large language models -(LLMs) can improve the semantic understanding and generation capabilities of -GNNs, which in turn improves the trustworthiness of GNNs from various aspects. -Our review introduces a taxonomy that offers researchers a clear framework for -comprehending the principles and applications of different methods and helps -clarify the connections and differences among various approaches. 
Then we -systematically survey representative approaches along the four categories of -our taxonomy. Through our taxonomy, researchers can understand the applicable -scenarios, potential advantages, and limitations of each approach for the -trusted integration of GNNs with LLMs. Finally, we present some promising -directions of work and future trends for the integration of LLMs and GNNs to -improve model trustworthiness. +Simulation offers a promising approach for cheaply scaling training data for +generalist policies. To scalably generate data from diverse and realistic +tasks, existing algorithms either rely on large language models (LLMs) that may +hallucinate tasks not interesting for robotics; or digital twins, which require +careful real-to-sim alignment and are hard to scale. To address these +challenges, we introduce Video2Policy, a novel framework that leverages +internet RGB videos to reconstruct tasks based on everyday human behavior. Our +approach comprises two phases: (1) task generation in simulation from videos; +and (2) reinforcement learning utilizing in-context LLM-generated reward +functions iteratively. We demonstrate the efficacy of Video2Policy by +reconstructing over 100 videos from the Something-Something-v2 (SSv2) dataset, +which depicts diverse and complex human behaviors on 9 different tasks. Our +method can successfully train RL policies on such tasks, including complex and +challenging tasks such as throwing. Finally, we show that the generated +simulation data can be scaled up for training a general policy, and it can be +transferred back to the real robot in a Real2Sim2Real way.
-摘要:隨著圖神經網路 (GNN) 在各種領域的廣泛應用,其可信度已成為研究的焦點。一些現有研究表明,整合大型語言模型 (LLM) 可以提升 GNN 的語意理解和生成能力,進而從各方面提升 GNN 的可信度。我們的評論介紹了一種分類法,為研究人員提供了一個清晰的架構,用於理解不同方法的原理和應用,並有助於釐清各種方法之間的關聯和差異。然後,我們系統性地針對分類法的四個類別進行代表性方法的調查。研究人員透過我們的分類法,可以了解每種方法在 GNN 與 LLM 的可信整合中適用的場景、潛在優點和限制。最後,我們提出 LLM 與 GNN 整合的一些有前景的工作方向和未來趨勢,以提升模型的可信度。 +摘要:模擬提供了一種有前途的方法,可以用於擴展訓練資料,以制定通才政策。為了從多樣化且逼真的任務中可擴充地產生資料,現有演算法仰賴大型語言模型 (LLM),這些模型可能會產生對機器人技術不感興趣的任務;或者仰賴數位雙胞胎,這需要仔細地將真實環境與模擬環境對齊,而且很難擴充。為了應對這些挑戰,我們引入了 Video2Policy,這是一個新穎的架構,它利用網路上的 RGB 影片,根據日常人類行為來重建任務。我們的做法包含兩個階段:(1) 從影片中在模擬環境中產生任務;以及 (2) 利用在情境中由 LLM 產生的獎勵函數,反覆進行強化學習。我們透過重建 Something-Something-v2 (SSv2) 資料集中的 100 多個影片來展示 Video2Policy 的效能,這些影片描繪了 9 項不同任務中多樣化且複雜的人類行為。我們的做法可以在這些任務上成功訓練 RL 政策,包括複雜且具挑戰性的任務,例如投擲。最後,我們展示了產生的模擬資料可以擴充到訓練一般政策,而且可以透過 Real2Sim2Real 的方式轉移回真實機器人。 -##### **Graph Foundation Models for Recommendation: A Comprehensive Survey** -2502.08346v3 by Bin Wu, Yihang Wang, Yuanhao Zeng, Jiawei Liu, Jiashu Zhao, Cheng Yang, Yawen Li, Long Xia, Dawei Yin, Chuan Shi +##### **HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation** +2502.09838v2 by Tianwei Lin, Wenqiao Zhang, Sijing Li, Yuqian Yuan, Binhe Yu, Haoyuan Li, Wanggui He, Hao Jiang, Mengze Li, Xiaohui Song, Siliang Tang, Jun Xiao, Hui Lin, Yueting Zhuang, Beng Chin Ooi -Recommender systems (RS) serve as a fundamental tool for navigating the vast -expanse of online information, with deep learning advancements playing an -increasingly important role in improving ranking accuracy. Among these, graph -neural networks (GNNs) excel at extracting higher-order structural information, -while large language models (LLMs) are designed to process and comprehend -natural language, making both approaches highly effective and widely adopted. 
-Recent research has focused on graph foundation models (GFMs), which integrate -the strengths of GNNs and LLMs to model complex RS problems more efficiently by -leveraging the graph-based structure of user-item relationships alongside -textual understanding. In this survey, we provide a comprehensive overview of -GFM-based RS technologies by introducing a clear taxonomy of current -approaches, diving into methodological details, and highlighting key challenges -and future directions. By synthesizing recent advancements, we aim to offer -valuable insights into the evolving landscape of GFM-based recommender systems. +We present HealthGPT, a powerful Medical Large Vision-Language Model +(Med-LVLM) that integrates medical visual comprehension and generation +capabilities within a unified autoregressive paradigm. Our bootstrapping +philosophy is to progressively adapt heterogeneous comprehension and generation +knowledge to pre-trained large language models (LLMs). This is achieved through +a novel heterogeneous low-rank adaptation (H-LoRA) technique, which is +complemented by a tailored hierarchical visual perception approach and a +three-stage learning strategy. To effectively learn the HealthGPT, we devise a +comprehensive medical domain-specific comprehension and generation dataset +called VL-Health. Experimental results demonstrate exceptional performance and +scalability of HealthGPT in medical visual unified tasks. Our project can be +accessed at https://github.com/DCDmllm/HealthGPT. 
-摘要:推薦系統 (RS) 是用於導航廣闊的線上資訊的基本工具,深度學習的進步在提升排名準確度方面扮演著日益重要的角色。其中,圖形神經網路 (GNN) 擅長萃取高階結構資訊,而大型語言模型 (LLM) 則設計用於處理和理解自然語言,這使得這兩種方法都非常有效且廣泛採用。最近的研究專注於圖形基礎模型 (GFM),它整合了 GNN 和 LLM 的優點,透過利用使用者與項目關係的圖形化結構以及文字理解,更有效率地建構複雜的 RS 問題模型。在這項調查中,我們透過介紹當前方法的明確分類、深入探討方法論細節,以及強調關鍵挑戰和未來方向,提供了 GFM 為基礎的 RS 技術的全面概觀。透過綜合最近的進展,我們旨在提供對 GFM 為基礎的推薦系統不斷演變的版圖的寶貴見解。 +摘要:我們提出 HealthGPT,一種強大的醫學大型視覺語言模型 (Med-LVLM),它整合了醫學視覺理解和生成能力於一個統一的自動迴歸範例中。我們的引導哲學是逐步調整異質理解和生成知識以預先訓練大型語言模型 (LLM)。這是通過一種新穎的異質低秩適應 (H-LoRA) 技術實現的,該技術由量身定制的分層視覺感知方法和三階段學習策略補充。為了有效學習 HealthGPT,我們設計了一個全面的醫學領域特定理解和生成數據集,稱為 VL-Health。實驗結果證明了 HealthGPT 在醫學視覺統一任務中的卓越性能和可擴展性。我們的項目可以在 https://github.com/DCDmllm/HealthGPT 中訪問。 -##### **Self-Evaluation for Job-Shop Scheduling** -2502.08684v1 by Imanol Echeverria, Maialen Murua, Roberto Santana +##### **Incentivize without Bonus: Provably Efficient Model-based Online Multi-agent RL for Markov Games** +2502.09780v1 by Tong Yang, Bo Dai, Lin Xiao, Yuejie Chi -Combinatorial optimization problems, such as scheduling and route planning, -are crucial in various industries but are computationally intractable due to -their NP-hard nature. Neural Combinatorial Optimization methods leverage -machine learning to address these challenges but often depend on sequential -decision-making, which is prone to error accumulation as small mistakes -propagate throughout the process. Inspired by self-evaluation techniques in -Large Language Models, we propose a novel framework that generates and -evaluates subsets of assignments, moving beyond traditional stepwise -approaches. Applied to the Job-Shop Scheduling Problem, our method integrates a -heterogeneous graph neural network with a Transformer to build a policy model -and a self-evaluation function. Experimental validation on challenging, -well-known benchmarks demonstrates the effectiveness of our approach, -surpassing state-of-the-art methods. 
+Multi-agent reinforcement learning (MARL) lies at the heart of a plethora of +applications involving the interaction of a group of agents in a shared unknown +environment. A prominent framework for studying MARL is Markov games, with the +goal of finding various notions of equilibria in a sample-efficient manner, +such as the Nash equilibrium (NE) and the coarse correlated equilibrium (CCE). +However, existing sample-efficient approaches either require tailored +uncertainty estimation under function approximation, or careful coordination of +the players. In this paper, we propose a novel model-based algorithm, called +VMG, that incentivizes exploration via biasing the empirical estimate of the +model parameters towards those with higher collective best-response values of +all the players when fixing the other players' policies, thus encouraging the +policy to deviate from its current equilibrium for more exploration. VMG is +oblivious to different forms of function approximation, and permits +simultaneous and uncoupled policy updates of all players. Theoretically, we +also establish that VMG achieves a near-optimal regret for finding both the NEs +of two-player zero-sum Markov games and CCEs of multi-player general-sum Markov +games under linear function approximation in an online environment, which +nearly match their counterparts with sophisticated uncertainty quantification.
-摘要:組合優化問題,例如排程和路線規劃,在各行各業中至關重要,但由於它們的 NP 難度,在計算上難以處理。神經組合優化方法利用機器學習來解決這些挑戰,但通常依賴於序貫決策制定,而序貫決策制定容易發生錯誤累積,因為小錯誤會在整個過程中傳播。受大型語言模型中的自我評估技術啟發,我們提出了一個新的框架,可生成和評估作業子集,超越傳統的分步方法。應用於工作車間排程問題,我們的方法將異質圖神經網路與 Transformer 整合在一起,以建立策略模型和自我評估函數。在具有挑戰性的著名基準上的實驗驗證證明了我們方法的有效性,超越了最先進的方法。 +摘要:多智能體強化學習 (MARL) 是一系列應用程式的心臟,這些應用程式涉及一群智能體在一個共用未知環境中的互動。研究 MARL 的一個著名框架是馬可夫博弈,其目標是用樣本有效率的方式找出各種均衡概念,例如納許均衡 (NE) 和粗相關均衡 (CCE)。然而,現有的樣本有效率方法需要在函數逼近下進行量身打造的不確定性估計,或謹慎協調參與者。在本文中,我們提出了一種新的基於模型的演算法,稱為 VMG,它透過將模型參數的經驗估計值偏向於在固定其他參與者政策時所有參與者的集體最佳反應值,從而激勵探索,進而鼓勵政策偏離其當前均衡以進行更多探索。VMG 不會忽略函數逼近的不同形式,並允許所有參與者同時進行非耦合的政策更新。在理論上,我們也建立了 VMG 在線上環境中使用線性函數逼近來尋找雙人零和馬可夫博弈的 NE 和多人一般和馬可夫博弈的 CCE 時,會獲得接近最佳的後悔,這幾乎與其在不確定性量化方面更為複雜的對應物相匹配。 -##### **Improving Existing Optimization Algorithms with LLMs** -2502.08298v1 by Camilo Chacón Sartori, Christian Blum +##### **The AI-Therapist Duo: Exploring the Potential of Human-AI Collaboration in Personalized Art Therapy for PICS Intervention** +2502.09757v1 by Bereket A. Yilma, Chan Mi Kim, Geke Ludden, Thomas van Rompay, Luis A. Leiva -The integration of Large Language Models (LLMs) into optimization has created -a powerful synergy, opening exciting research opportunities. This paper -investigates how LLMs can enhance existing optimization algorithms. Using their -pre-trained knowledge, we demonstrate their ability to propose innovative -heuristic variations and implementation strategies. To evaluate this, we -applied a non-trivial optimization algorithm, Construct, Merge, Solve and Adapt -(CMSA) -- a hybrid metaheuristic for combinatorial optimization problems that -incorporates a heuristic in the solution construction phase. Our results show -that an alternative heuristic proposed by GPT-4o outperforms the -expert-designed heuristic of CMSA, with the performance gap widening on larger -and denser graphs. Project URL: https://imp-opt-algo-llms.surge.sh/ +Post-intensive care syndrome (PICS) is a multifaceted condition that arises +from prolonged stays in an intensive care unit (ICU). 
While preventing PICS +among ICU patients is becoming increasingly important, interventions remain +limited. Building on evidence supporting the effectiveness of art exposure in +addressing the psychological aspects of PICS, we propose a novel art therapy +solution through a collaborative Human-AI approach that enhances personalized +therapeutic interventions using state-of-the-art Visual Art Recommendation +Systems. We developed two Human-in-the-Loop (HITL) personalization methods and +assessed their impact through a large-scale user study (N=150). Our findings +demonstrate that this Human-AI collaboration not only enhances the +personalization and effectiveness of art therapy but also supports therapists +by streamlining their workload. While our study centres on PICS intervention, +the results suggest that human-AI collaborative Art therapy could potentially +benefit other areas where emotional support is critical, such as cases of +anxiety and depression. -摘要:大型语言模型 (LLM) 与优化相结合,创造了一种强大的协同作用,开启了令人兴奋的研究机会。本文探讨了 LLM 如何增强现有的优化算法。利用其预先训练的知识,我们展示了它们提出创新启发式变体和实施策略的能力。为了评估这一点,我们应用了一种非平凡的优化算法,构建、合并、求解和适应 (CMSA)——一种用于组合优化问题的混合元启发式算法,它在求解构建阶段纳入了启发式算法。我们的结果表明,GPT-4o 提出的替代启发式算法优于 CMSA 的专家设计的启发式算法,并且随着图形变得更大、更密集,性能差距也在扩大。项目网址:https://imp-opt-algo-llms.surge.sh/ +摘要:重症後症候群 (PICS) 是一種多面向的疾病,源自於在加護病房 (ICU) 長期住院。雖然預防重症後症候群在加護病房患者中正變得越來越重要,但介入措施仍然有限。建立在支持藝術接觸在解決重症後症候群心理層面的證據上,我們提出一個創新的藝術療法解決方案,透過協作式的人工智慧方法,使用最先進的視覺藝術推薦系統,增強個人化的治療介入。我們開發了兩種人機迴路 (HITL) 個人化方法,並透過大規模使用者研究 (N=150) 評估其影響。我們的發現證明,這種人機協作不僅增強了藝術治療的個人化和有效性,也透過簡化治療師的工作量來提供支援。雖然我們的研究中心在重症後症候群介入,但結果顯示,人機協作藝術療法有可能對其他需要情緒支持的領域有益,例如焦慮和憂鬱症。 -##### **LLM4GNAS: A Large Language Model Based Toolkit for Graph Neural Architecture Search** -2502.10459v1 by Yang Gao, Hong Yang, Yizhi Chen, Junxian Wu, Peng Zhang, Haishuai Wang +##### **A CNN Approach to Automated Detection and Classification of Brain Tumors** +2502.09731v1 by Md. Zahid Hasan, Abdullah Tamim, D. M. Asadujjaman, Md. Mahfujur Rahman, Md. 
Abu Ahnaf Mollick, Nosin Anjum Dristi, Abdullah-Al-Noman -Graph Neural Architecture Search (GNAS) facilitates the automatic design of -Graph Neural Networks (GNNs) tailored to specific downstream graph learning -tasks. However, existing GNAS approaches often require manual adaptation to new -graph search spaces, necessitating substantial code optimization and -domain-specific knowledge. To address this challenge, we present LLM4GNAS, a -toolkit for GNAS that leverages the generative capabilities of Large Language -Models (LLMs). LLM4GNAS includes an algorithm library for graph neural -architecture search algorithms based on LLMs, enabling the adaptation of GNAS -methods to new search spaces through the modification of LLM prompts. This -approach reduces the need for manual intervention in algorithm adaptation and -code modification. The LLM4GNAS toolkit is extensible and robust, incorporating -LLM-enhanced graph feature engineering, LLM-enhanced graph neural architecture -search, and LLM-enhanced hyperparameter optimization. Experimental results -indicate that LLM4GNAS outperforms existing GNAS methods on tasks involving -both homogeneous and heterogeneous graphs. +Brain tumors require an assessment to ensure timely diagnosis and effective +patient treatment. Morphological factors such as size, location, texture, and +variable appearance complicate tumor inspection. Medical imaging presents +challenges, including noise and incomplete images. This research article +presents a methodology for processing Magnetic Resonance Imaging (MRI) data, +encompassing techniques for image classification and denoising. The effective +use of MRI images allows medical professionals to detect brain disorders, +including tumors. This research aims to categorize healthy brain tissue and +brain tumors by analyzing the provided MRI data. 
Unlike alternative methods +like Computed Tomography (CT), MRI technology offers a more detailed +representation of internal anatomical components, making it a suitable option +for studying data related to brain tumors. The MRI picture is first subjected +to a denoising technique utilizing an Anisotropic diffusion filter. The dataset +utilized for the models creation is a publicly accessible and validated Brain +Tumour Classification (MRI) database, comprising 3,264 brain MRI scans. SMOTE +was employed for data augmentation and dataset balancing. Convolutional Neural +Networks(CNN) such as ResNet152V2, VGG, ViT, and EfficientNet were employed for +the classification procedure. EfficientNet attained an accuracy of 98%, the +highest recorded. -摘要:圖形神經架構搜尋 (GNAS) 促進圖形神經網路 (GNN) 的自動設計,以符合特定下游圖形學習任務。然而,現有的 GNAS 方法通常需要手動調整至新的圖形搜尋空間,這需要大量的程式碼最佳化和領域特定知識。為了應對這項挑戰,我們提出 LLM4GNAS,一個利用大型語言模型 (LLM) 的生成能力的 GNAS 工具包。LLM4GNAS 包含一個基於 LLM 的圖形神經架構搜尋演算法函式庫,讓 GNAS 方法能夠透過修改 LLM 提示來適應新的搜尋空間。這種方法減少了演算法適應和程式碼修改中手動介入的需要。LLM4GNAS 工具包具有可擴充性和穩健性,整合了 LLM 增強的圖形特徵工程、LLM 增強的圖形神經架構搜尋和 LLM 增強的超參數最佳化。實驗結果表明,LLM4GNAS 在涉及同質和異質圖形的任務上優於現有的 GNAS 方法。 +摘要:腦腫瘤需要評估以確保及時診斷和有效的患者治療。大小、位置、質地和可變外觀等形態因素會使腫瘤檢查複雜化。醫學影像會呈現挑戰,包括雜訊和不完整的影像。本研究文章提出了一種處理磁共振影像 (MRI) 資料的方法,包含影像分類和去噪技術。有效使用 MRI 影像可讓醫護人員偵測腦部疾病,包括腫瘤。本研究旨在透過分析提供的 MRI 資料來分類健康的腦組織和腦瘤。與電腦斷層掃描 (CT) 等替代方法不同,MRI 技術提供了更詳細的內部解剖結構表示,使其成為研究與腦瘤相關資料的合適選擇。MRI 影像會先使用各向異性擴散濾波器進行去噪技術處理。用於建立模型的資料集是一個公開且經過驗證的腦腫瘤分類 (MRI) 資料庫,包含 3,264 個腦部 MRI 掃描。SMOTE 用於資料擴充和資料集平衡。卷積神經網路 (CNN),例如 ResNet152V2、VGG、ViT 和 EfficientNet,用於分類程序。EfficientNet 達到了 98% 的準確度,是記錄到的最高值。 -##### **ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning** -2502.08148v1 by Vy Vo, Lizhen Qu, Tao Feng, Yuncheng Hua, Xiaoxi Kang, Songhai Fan, Tim Dwyer, Lay-Ki Soon, Gholamreza Haffari +##### **Evaluating GPT's Capability in Identifying Stages of Cognitive Impairment from Electronic Health Data** +2502.09715v1 by Yu Leng, Yingnan He, Colin Magdamo, Ana-Maria Vranceanu, Christine S. Ritchie, Shibani S. 
Mukerji, Lidia M. V. R. Moura, John R. Dickson, Deborah Blacker, Sudeshna Das -Identifying cause-and-effect relationships is critical to understanding -real-world dynamics and ultimately causal reasoning. Existing methods for -identifying event causality in NLP, including those based on Large Language -Models (LLMs), exhibit difficulties in out-of-distribution settings due to the -limited scale and heavy reliance on lexical cues within available benchmarks. -Modern benchmarks, inspired by probabilistic causal inference, have attempted -to construct causal graphs of events as a robust representation of causal -knowledge, where \texttt{CRAB} \citep{romanou2023crab} is one such recent -benchmark along this line. In this paper, we introduce \texttt{ACCESS}, a -benchmark designed for discovery and reasoning over abstract causal events. -Unlike existing resources, \texttt{ACCESS} focuses on causality of everyday -life events on the abstraction level. We propose a pipeline for identifying -abstractions for event generalizations from \texttt{GLUCOSE} -\citep{mostafazadeh-etal-2020-glucose}, a large-scale dataset of implicit -commonsense causal knowledge, from which we subsequently extract $1,4$K causal -pairs. Our experiments highlight the ongoing challenges of using statistical -methods and/or LLMs for automatic abstraction identification and causal -discovery in NLP. Nonetheless, we demonstrate that the abstract causal -knowledge provided in \texttt{ACCESS} can be leveraged for enhancing QA -reasoning performance in LLMs. +Identifying cognitive impairment within electronic health records (EHRs) is +crucial not only for timely diagnoses but also for facilitating research. +Information about cognitive impairment often exists within unstructured +clinician notes in EHRs, but manual chart reviews are both time-consuming and +error-prone. 
To address this issue, our study evaluates an automated approach +using zero-shot GPT-4o to determine stage of cognitive impairment in two +different tasks. First, we evaluated the ability of GPT-4o to determine the +global Clinical Dementia Rating (CDR) on specialist notes from 769 patients who +visited the memory clinic at Massachusetts General Hospital (MGH), and achieved +a weighted kappa score of 0.83. Second, we assessed GPT-4o's ability to +differentiate between normal cognition, mild cognitive impairment (MCI), and +dementia on all notes in a 3-year window from 860 Medicare patients. GPT-4o +attained a weighted kappa score of 0.91 in comparison to specialist chart +reviews and 0.96 on cases that the clinical adjudicators rated with high +confidence. Our findings demonstrate GPT-4o's potential as a scalable chart +review tool for creating research datasets and assisting diagnosis in clinical +settings in the future. -摘要:找出因果關係對於理解現實世界的動態和最終的因果推理至關重要。現有的 NLP 事件因果關係識別方法,包括基於大型語言模型 (LLM) 的方法,由於規模有限且過度依賴於可用基準中的詞彙線索,在分佈外環境中表現出困難。受機率因果推論啟發的現代基準已嘗試建構事件的因果圖,作為因果知識的強健表示,其中 \texttt{CRAB} \citep{romanou2023crab} 是這條路徑上最近的一個基準。在本文中,我們介紹 \texttt{ACCESS},一個專門設計來探索和推理抽象因果事件的基準。與現有資源不同,\texttt{ACCESS} 專注於抽象層面上日常生活事件的因果關係。我們提出一個管道,用於從 \texttt{GLUCOSE} \citep{mostafazadeh-etal-2020-glucose} 找出事件概括的抽象,\texttt{GLUCOSE} 是隱含常識因果知識的大規模資料集,我們隨後從中萃取出 1,4K 因果對。我們的實驗突顯出使用統計方法和/或 LLM 進行 NLP 中的自動抽象識別和因果發現的持續挑戰。儘管如此,我們證明了 \texttt{ACCESS} 中提供的抽象因果知識可用於增強 LLM 中的問答推理效能。 +摘要:在電子健康記錄 (EHR) 中識別認知障礙不僅對及時診斷至關重要,也有助於促進研究。有關認知障礙的資訊通常存在於 EHR 中非結構化的臨床記錄中,但手動圖表審查既耗時又容易出錯。為了解決這個問題,我們的研究評估了一種自動化方法,使用零次學習的 GPT-4o 來確定兩種不同任務中的認知障礙分期。首先,我們評估了 GPT-4o 確定來自麻薩諸塞州總醫院 (MGH) 記憶診所 769 名患者的專科記錄的全球臨床痴呆評分 (CDR) 的能力,並獲得了 0.83 的加權 kappa 分數。其次,我們評估了 GPT-4o 在 860 名 Medicare 患者 3 年視窗中的所有記錄中區分正常認知、輕度認知障礙 (MCI) 和痴呆的能力。與專科圖表審查相比,GPT-4o 獲得了 0.91 的加權 kappa 分數,而對於臨床評審員以高度信心評估的病例,其加權 kappa 分數為 0.96。我們的研究結果證明了 GPT-4o 作為可擴充圖表審查工具的潛力,可用於建立研究資料集並協助未來臨床環境中的診斷。 + +##### **Metamorphic Testing for Pose Estimation Systems** +2502.09460v1 
by Matias Duran, Thomas Laurent, Ellen Rushe, Anthony Ventresque + +Pose estimation systems are used in a variety of fields, from sports +analytics to livestock care. Given their potential impact, it is paramount to +systematically test their behaviour and potential for failure. This is a +complex task due to the oracle problem and the high cost of manual labelling +necessary to build ground truth keypoints. This problem is exacerbated by the +fact that different applications require systems to focus on different subjects +(e.g., human versus animal) or landmarks (e.g., only extremities versus whole +body and face), which makes labelled test data rarely reusable. To combat these +problems we propose MET-POSE, a metamorphic testing framework for pose +estimation systems that bypasses the need for manual annotation while assessing +the performance of these systems under different circumstances. MET-POSE thus +allows users of pose estimation systems to assess the systems in conditions +that more closely relate to their application without having to label an ad-hoc +test dataset or rely only on available datasets, which may not be adapted to +their application domain. While we define MET-POSE in general terms, we also +present a non-exhaustive list of metamorphic rules that represent common +challenges in computer vision applications, as well as a specific way to +evaluate these rules. We then experimentally show the effectiveness of MET-POSE +by applying it to Mediapipe Holistic, a state of the art human pose estimation +system, with the FLIC and PHOENIX datasets. With these experiments, we outline +numerous ways in which the outputs of MET-POSE can uncover faults in pose +estimation systems at a similar or higher rate than classic testing using hand +labelled data, and show that users can tailor the rule set they use to the +faults and level of accuracy relevant to their application. 
-##### **Neuro-Conceptual Artificial Intelligence: Integrating OPM with Deep Learning to Enhance Question Answering Quality** -2502.09658v1 by Xin Kang, Veronika Shteingardt, Yuhan Wang, Dov Dori +摘要:姿勢估計系統應用於各種領域,從運動分析到牲畜照護。鑑於其潛在影響,系統性地測試其行為和故障潛力至關重要。由於預言機問題以及建立地面實況關鍵點所需的手動標記成本高,這是一項複雜的任務。這個問題因不同的應用需要系統專注於不同的主體(例如,人類對動物)或地標(例如,只有四肢對全身和臉部)而加劇,這使得標記的測試數據很少可以重複使用。為了解決這些問題,我們提出了 MET-POSE,這是一個姿勢估計系統的變形測試框架,在評估這些系統在不同情況下的性能時,可以繞過手動註解的需要。因此,MET-POSE 允許姿勢估計系統的使用者在更接近其應用程式的條件下評估系統,而無需標記臨時測試數據集或僅依賴可用數據集,這些數據集可能不適合其應用領域。雖然我們以一般術語定義 MET-POSE,但我們也提供了一個非詳盡的變形規則列表,這些規則代表了電腦視覺應用中的常見挑戰,以及評估這些規則的具體方法。然後,我們通過將 MET-POSE 應用於 Mediapipe Holistic(一種先進的人類姿勢估計系統),並使用 FLIC 和 PHOENIX 數據集,以實驗方式展示 MET-POSE 的有效性。通過這些實驗,我們概述了 MET-POSE 的輸出可以揭示姿勢估計系統中故障的許多方法,其速度與使用手動標記數據的傳統測試類似或更高,並表明使用者可以根據其應用程式相關的故障和準確度等級來調整他們使用的規則集。 -Knowledge representation and reasoning are critical challenges in Artificial -Intelligence (AI), particularly in integrating neural and symbolic approaches -to achieve explainable and transparent AI systems. Traditional knowledge -representation methods often fall short of capturing complex processes and -state changes. We introduce Neuro-Conceptual Artificial Intelligence (NCAI), a -specialization of the neuro-symbolic AI approach that integrates conceptual -modeling using Object-Process Methodology (OPM) ISO 19450:2024 with deep -learning to enhance question-answering (QA) quality. By converting natural -language text into OPM models using in-context learning, NCAI leverages the -expressive power of OPM to represent complex OPM elements-processes, objects, -and states-beyond what traditional triplet-based knowledge graphs can easily -capture. This rich structured knowledge representation improves reasoning -transparency and answer accuracy in an OPM-QA system. We further propose -transparency evaluation metrics to quantitatively measure how faithfully the -predicted reasoning aligns with OPM-based conceptual logic. 
Our experiments -demonstrate that NCAI outperforms traditional methods, highlighting its -potential for advancing neuro-symbolic AI by providing rich knowledge -representations, measurable transparency, and improved reasoning. +##### **Towards Virtual Clinical Trials of Radiology AI with Conditional Generative Modeling** +2502.09688v1 by Benjamin D. Killeen, Bohua Wan, Aditya V. Kulkarni, Nathan Drenkow, Michael Oberst, Paul H. Yi, Mathias Unberath -摘要:知識表徵與推理是人工智慧 (AI) 中的重大挑戰,特別是在整合神經與符號方法以實現可解釋且透明的人工智慧系統時。傳統的知識表徵方法通常無法捕捉複雜的流程和狀態變化。我們引入了神經概念人工智慧 (NCAI),一種神經符號 AI 方法的專門化,它將使用物件流程方法 (OPM) ISO 19450:2024 的概念建模與深度學習整合在一起,以提升問答 (QA) 的品質。透過使用情境學習將自然語言文字轉換為 OPM 模型,NCAI 充分利用 OPM 的表達能力來表徵複雜的 OPM 元素(流程、物件和狀態),超越傳統的三元組知識圖表容易捕捉的範圍。這種豐富的結構化知識表徵改善了 OPM-QA 系統中的推理透明度和答案準確度。我們進一步提出了透明度評估指標,以量化測量預測推理與基於 OPM 的概念邏輯的吻合程度。我們的實驗證明,NCAI 優於傳統方法,突顯了它在透過提供豐富的知識表徵、可測量的透明度和改善的推理來推進神經符號 AI 的潛力。 +Artificial intelligence (AI) is poised to transform healthcare by enabling +personalized and efficient care through data-driven insights. Although +radiology is at the forefront of AI adoption, in practice, the potential of AI +models is often overshadowed by severe failures to generalize: AI models can +have performance degradation of up to 20% when transitioning from controlled +test environments to clinical use by radiologists. This mismatch raises +concerns that radiologists will be misled by incorrect AI predictions in +practice and/or grow to distrust AI, rendering these promising technologies +practically ineffectual. Exhaustive clinical trials of AI models on abundant +and diverse data is thus critical to anticipate AI model degradation when +encountering varied data samples. Achieving these goals, however, is +challenging due to the high costs of collecting diverse data samples and +corresponding annotations. 
To overcome these limitations, we introduce a novel +conditional generative AI model designed for virtual clinical trials (VCTs) of +radiology AI, capable of realistically synthesizing full-body CT images of +patients with specified attributes. By learning the joint distribution of +images and anatomical structures, our model enables precise replication of +real-world patient populations with unprecedented detail at this scale. We +demonstrate meaningful evaluation of radiology AI models through VCTs powered +by our synthetic CT study populations, revealing model degradation and +facilitating algorithmic auditing for bias-inducing data attributes. Our +generative AI approach to VCTs is a promising avenue towards a scalable +solution to assess model robustness, mitigate biases, and safeguard patient +care by enabling simpler testing and evaluation of AI models in any desired +range of diverse patient populations. -##### **GCoT: Chain-of-Thought Prompt Learning for Graphs** -2502.08092v1 by Xingtong Yu, Chang Zhou, Zhongwei Kuai, Xinming Zhang, Yuan Fang +摘要:人工智慧 (AI) 準備透過資料驅動的見解,轉型醫療保健,並提供個人化且有效率的照護。儘管放射科處於 AI 採用的最前線,但在實務上,AI 模型的潛力往往會被嚴重的概化失敗所掩蓋:AI 模型在從受控測試環境轉移到放射科醫師的臨床使用時,效能可能會降低多達 20%。這種不匹配引發了疑慮,即放射科醫師在實務上會被不正確的 AI 預測誤導,和/或開始不信任 AI,讓這些有前景的技術在實務上形同失效。因此,在 AI 模型遭遇各種資料範例時,預期 AI 模型的衰退,對豐富且多樣化的資料進行 AI 模型的全面臨床試驗至關重要。然而,由於收集多樣化的資料範例和對應註解的成本很高,實現這些目標具有挑戰性。為了克服這些限制,我們引進一個創新的條件式生成式 AI 模型,專門用於放射科 AI 的虛擬臨床試驗 (VCT),能夠真實地合成具有特定屬性的病患全身電腦斷層 (CT) 影像。透過學習影像和解剖結構的聯合分佈,我們的模型能夠以空前的細節精確複製真實世界的病患族群。我們透過由我們合成的電腦斷層研究族群支援的 VCT,展示了放射科 AI 模型有意義的評估,揭露模型衰退,並促進演算法稽核,以找出導致偏差的資料屬性。我們對 VCT 的生成式 AI 方法,是一個有前景的途徑,可以評估模型的穩健性、減輕偏差,並透過在任何所需的各種病患族群中,進行更簡單的 AI 模型測試和評估,來保障病患照護。 -Chain-of-thought (CoT) prompting has achieved remarkable success in natural -language processing (NLP). However, its vast potential remains largely -unexplored for graphs. This raises an interesting question: How can we design -CoT prompting for graphs to guide graph models to learn step by step? 
On one -hand, unlike natural languages, graphs are non-linear and characterized by -complex topological structures. On the other hand, many graphs lack textual -data, making it difficult to formulate language-based CoT prompting. In this -work, we propose the first CoT prompt learning framework for text-free graphs, -GCoT. Specifically, we decompose the adaptation process for each downstream -task into a series of inference steps, with each step consisting of -prompt-based inference, ``thought'' generation, and thought-conditioned prompt -learning. While the steps mimic CoT prompting in NLP, the exact mechanism -differs significantly. Specifically, at each step, an input graph, along with a -prompt, is first fed into a pre-trained graph encoder for prompt-based -inference. We then aggregate the hidden layers of the encoder to construct a -``thought'', which captures the working state of each node in the current step. -Conditioned on this thought, we learn a prompt specific to each node based on -the current state. These prompts are fed into the next inference step, -repeating the cycle. To evaluate and analyze the effectiveness of GCoT, we -conduct comprehensive experiments on eight public datasets, which demonstrate -the advantage of our approach. 
+##### **Mind What You Ask For: Emotional and Rational Faces of Persuasion by Large Language Models** +2502.09687v1 by Wiktoria Mieleszczenko-Kowszewicz, Beata Bajcar, Jolanta Babiak, Berenika Dyczek, Jakub Świstak, Przemysław Biecek -摘要:鏈式思考 (CoT) 提示在自然語言處理 (NLP) 中取得了顯著的成功。然而,其龐大的潛力在圖形方面仍未得到充分探索。這提出了一個有趣的問題:我們如何設計圖形的 CoT 提示來指導圖形模型逐步學習?一方面,與自然語言不同,圖形是非線性的,並且具有複雜的拓撲結構。另一方面,許多圖形缺乏文本數據,這使得難以制定基於語言的 CoT 提示。在這項工作中,我們提出了第一個適用於無文本圖形的 CoT 提示學習框架 GCoT。具體來說,我們將每個下游任務的適應過程分解為一系列推理步驟,每個步驟都包含基於提示的推理、「思想」生成以及基於思想的提示學習。雖然這些步驟模擬了 NLP 中的 CoT 提示,但具體機制卻有很大不同。具體來說,在每一步中,一個輸入圖形連同一個提示首先被輸入到一個預訓練的圖形編碼器中進行基於提示的推理。然後,我們聚合編碼器的隱藏層以構建一個「思想」,它捕獲了當前步驟中每個節點的工作狀態。基於這個思想,我們根據當前狀態學習一個特定於每個節點的提示。這些提示被輸入到下一個推理步驟中,重複這個循環。為了評估和分析 GCoT 的有效性,我們對八個公共數據集進行了全面的實驗,這證明了我們方法的優勢。 +Be careful what you ask for, you just might get it. This saying fits with the +way large language models (LLMs) are trained, which, instead of being rewarded +for correctness, are increasingly rewarded for pleasing the recipient. So, they +are increasingly effective at persuading us that their answers are valuable. +But what tricks do they use in this persuasion? In this study, we examine what +are the psycholinguistic features of the responses used by twelve different +language models. By grouping response content according to rational or +emotional prompts and exploring social influence principles employed by LLMs, +we ask whether and how we can mitigate the risks of LLM-driven mass +misinformation. We position this study within the broader discourse on +human-centred AI, emphasizing the need for interdisciplinary approaches to +mitigate cognitive and societal risks posed by persuasive AI responses. 
-##### **Linking Cryptoasset Attribution Tags to Knowledge Graph Entities: An LLM-based Approach** -2502.10453v1 by Régnier Avice, Bernhard Haslhofer, Zhidong Li, Jianlong Zhou +摘要:小心你要求的,你可能真的會得到。這句話適用於大型語言模型 (LLM) 的訓練方式,它們不是因為正確性而獲得獎勵,而是因為取悅接收者而獲得越來越多的獎勵。因此,它們越來越有效地說服我們,它們的答案是有價值的。但是它們在這種說服中使用什麼技巧呢?在這項研究中,我們探討了十二種不同的語言模型使用的回應的心理語言特徵。通過根據理性和情緒提示對回應內容進行分組,並探討 LLM 使用的社會影響原則,我們探討是否以及如何減輕 LLM 驅動的大規模錯誤信息的風險。我們將這項研究定位在以人為中心的 AI 的更廣泛討論中,強調需要跨學科方法來減輕具有說服力的 AI 回應帶來的認知和社會風險。 -Attribution tags form the foundation of modern cryptoasset forensics. -However, inconsistent or incorrect tags can mislead investigations and even -result in false accusations. To address this issue, we propose a novel -computational method based on Large Language Models (LLMs) to link attribution -tags with well-defined knowledge graph concepts. We implemented this method in -an end-to-end pipeline and conducted experiments showing that our approach -outperforms baseline methods by up to 37.4% in F1-score across three publicly -available attribution tag datasets. By integrating concept filtering and -blocking procedures, we generate candidate sets containing five knowledge graph -entities, achieving a recall of 93% without the need for labeled data. -Additionally, we demonstrate that local LLM models can achieve F1-scores of -90%, comparable to remote models which achieve 94%. We also analyze the -cost-performance trade-offs of various LLMs and prompt templates, showing that -selecting the most cost-effective configuration can reduce costs by 90%, with -only a 1% decrease in performance. Our method not only enhances attribution tag -quality but also serves as a blueprint for fostering more reliable forensic -evidence. 
+##### **The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics** +2502.09247v1 by Danni Feng, Runzhi Li, Jing Wang, Siyu Yan, Lihong Ma, Yunli Xing -摘要:歸因標籤構成現代加密資產鑑識的基礎。 -然而,不一致或不正確的標籤會誤導調查,甚至導致錯誤的指控。為了解決這個問題,我們提出了一種基於大型語言模型 (LLM) 的新型計算方法,將歸因標籤與定義明確的知識圖譜概念連結起來。我們在端到端管道中實施了這種方法,並進行了實驗,結果顯示我們的做法在三個公開可用的歸因標籤資料集中,F1 分數比基線方法高出 37.4%。透過整合概念過濾和封鎖程序,我們生成了包含五個知識圖譜實體的候選集,在不需要標籤資料的情況下,達到了 93% 的召回率。 -此外,我們證明了本機 LLM 模型可以達到 90% 的 F1 分數,與達到 94% 的遠端模型相當。我們也分析了各種 LLM 和提示範本的成本效益權衡,結果顯示選擇最具成本效益的設定可以將成本降低 90%,而效能只下降 1%。我們的做法不僅提升了歸因標籤的品質,也作為促進更可靠鑑識證據的藍圖。 +Joint entity-relation extraction is a critical task in transforming +unstructured or semi-structured text into triplets, facilitating the +construction of large-scale knowledge graphs, and supporting various downstream +applications. Despite its importance, research on Chinese text, particularly +with complex semantics in specialized domains like medicine, remains limited. +To address this gap, we introduce the CH-DDI, a Chinese drug-drug interactions +dataset designed to capture the intricacies of medical text. Leveraging the +strengths of attention mechanisms in capturing long-range dependencies, we +propose the SEA module, which enhances the extraction of complex contextual +semantic information, thereby improving entity recognition and relation +extraction. Additionally, to address the inefficiencies of existing methods in +facilitating information exchange between entity recognition and relation +extraction, we present an interactive fusion representation module. This module +employs Cross Attention for bidirectional information exchange between the +tasks and further refines feature extraction through BiLSTM. Experimental +results on both our CH-DDI dataset and public CoNLL04 dataset demonstrate that +our model exhibits strong generalization capabilities. 
On the CH-DDI dataset, +our model achieves an F1-score of 96.73% for entity recognition and 78.43% for +relation extraction. On the CoNLL04 dataset, it attains an entity recognition +precision of 89.54% and a relation extraction accuracy of 71.64%. -##### **Deep Semantic Graph Learning via LLM based Node Enhancement** -2502.07982v1 by Chuanqi Shi, Yiyi Tao, Hang Zhang, Lun Wang, Shaoshuai Du, Yixian Shen, Yanxin Shen +摘要:聯合實體關係抽取是將非結構化或半結構化文字轉換為三元組的重要任務,有助於建構大規模知識圖譜,並支援各種下游應用程式。儘管其重要性,但針對中文文本的研究,特別是醫學等專業領域中具有複雜語義的研究仍十分有限。為了解決這個差距,我們引入了 CH-DDI,一個中文藥物-藥物交互作用資料集,旨在擷取醫學文本的複雜性。利用注意力機制在擷取長程依賴關係方面的優勢,我們提出了 SEA 模組,增強了複雜脈絡語義資訊的抽取,從而改進了實體辨識和關係抽取。此外,為了解決現有方法在促進實體辨識和關係抽取之間資訊交換方面的低效率問題,我們提出了互動式融合表示模組。此模組採用交叉注意力,在任務之間進行雙向資訊交換,並透過 BiLSTM 進一步精煉特徵抽取。在我們的 CH-DDI 資料集和公開的 CoNLL04 資料集上的實驗結果表明,我們的模型展現出強大的泛化能力。在 CH-DDI 資料集上,我們的模型在實體辨識方面達到了 96.73% 的 F1 分數,在關係抽取方面達到了 78.43% 的 F1 分數。在 CoNLL04 資料集上,它在實體辨識方面達到了 89.54% 的準確度,在關係抽取方面達到了 71.64% 的準確度。 -Graph learning has attracted significant attention due to its widespread -real-world applications. Current mainstream approaches rely on text node -features and obtain initial node embeddings through shallow embedding learning -using GNNs, which shows limitations in capturing deep textual semantics. Recent -advances in Large Language Models (LLMs) have demonstrated superior -capabilities in understanding text semantics, transforming traditional text -feature processing. This paper proposes a novel framework that combines Graph -Transformer architecture with LLM-enhanced node features. Specifically, we -leverage LLMs to generate rich semantic representations of text nodes, which -are then processed by a multi-head self-attention mechanism in the Graph -Transformer to capture both local and global graph structural information. Our -model utilizes the Transformer's attention mechanism to dynamically aggregate -neighborhood information while preserving the semantic richness provided by LLM -embeddings. 
Experimental results demonstrate that the LLM-enhanced node -features significantly improve the performance of graph learning models on node -classification tasks. This approach shows promising results across multiple -graph learning tasks, offering a practical direction for combining graph -networks with language models. +##### **From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine** +2502.09242v1 by Lukas Buess, Matthias Keicher, Nassir Navab, Andreas Maier, Soroosh Tayebi Arasteh -摘要:圖形學習因其廣泛的現實世界應用而備受關注。目前的熱門方法依賴於文本節點特徵,並通過使用 GNN 的淺層嵌入學習來獲取初始節點嵌入,這在捕捉深度文本語義方面表現出局限性。大語言模型 (LLM) 的最新進展已證明在理解文本語義方面具有優越的能力,轉換了傳統的文本特徵處理。本文提出了一種新的框架,將圖形轉換器架構與 LLM 增強的節點特徵相結合。具體來說,我們利用 LLM 來生成文本節點的豐富語義表示,然後在圖形轉換器中由多頭自我注意機制處理,以捕捉局部和全局圖形結構信息。我們的模型利用 Transformer 的注意機制來動態聚合鄰域信息,同時保留 LLM 嵌入提供的語義豐富性。實驗結果表明,LLM 增強的節點特徵顯著提高了圖形學習模型在節點分類任務上的性能。這種方法在多個圖形學習任務中顯示出有希望的結果,為將圖形網絡與語言模型相結合提供了實用的方向。 +Generative artificial intelligence (AI) models, such as diffusion models and +OpenAI's ChatGPT, are transforming medicine by enhancing diagnostic accuracy +and automating clinical workflows. The field has advanced rapidly, evolving +from text-only large language models for tasks such as clinical documentation +and decision support to multimodal AI systems capable of integrating diverse +data modalities, including imaging, text, and structured data, within a single +model. The diverse landscape of these technologies, along with rising interest, +highlights the need for a comprehensive review of their applications and +potential. This scoping review explores the evolution of multimodal AI, +highlighting its methods, applications, datasets, and evaluation in clinical +settings. Adhering to PRISMA-ScR guidelines, we systematically queried PubMed, +IEEE Xplore, and Web of Science, prioritizing recent studies published up to +the end of 2024. After rigorous screening, 144 papers were included, revealing +key trends and challenges in this dynamic field. 
Our findings underscore a +shift from unimodal to multimodal approaches, driving innovations in diagnostic +support, medical report generation, drug discovery, and conversational AI. +However, critical challenges remain, including the integration of heterogeneous +data types, improving model interpretability, addressing ethical concerns, and +validating AI systems in real-world clinical settings. This review summarizes +the current state of the art, identifies critical gaps, and provides insights +to guide the development of scalable, trustworthy, and clinically impactful +multimodal AI solutions in healthcare. -##### **Cardiverse: Harnessing LLMs for Novel Card Game Prototyping** -2502.07128v1 by Danrui Li, Sen Zhang, Sam S. Sohn, Kaidong Hu, Muhammad Usman, Mubbasir Kapadia +摘要:生成式人工智能 (AI) 模型,例如扩散模型和 OpenAI 的 ChatGPT,通过提高诊断准确性和自动化临床工作流程,正在改变医学领域。该领域已迅速发展,从用于临床文件编制和决策支持等任务的纯文本大型语言模型,发展到能够在单个模型中整合包括影像、文本和结构化数据在内的多种数据方式的多模态 AI 系统。这些技术的多样化格局以及日益增长的兴趣,凸显了全面审查其应用和潜力的必要性。本范围审查探讨了多模态 AI 的演变,重点介绍了其方法、应用、数据集和在临床环境中的评估。遵循 PRISMA-ScR 指南,我们系统地查询了 PubMed、IEEE Xplore 和 Web of Science,优先考虑截至 2024 年底发表的最新研究。经过严格筛选,纳入了 144 篇论文,揭示了这一充满活力的领域的趋势和挑战。我们的研究结果强调了从单模态方法向多模态方法的转变,推动了诊断支持、医疗报告生成、药物发现和会话式 AI 的创新。然而,关键挑战仍然存在,包括异构数据类型的整合、提高模型可解释性、解决伦理问题以及在现实世界的临床环境中验证 AI 系统。本综述总结了当前的最新技术,确定了关键差距,并提供了见解,以指导在医疗保健领域开发可扩展、可信赖且具有临床影响力的多模态 AI 解决方案。 -The prototyping of computer games, particularly card games, requires -extensive human effort in creative ideation and gameplay evaluation. Recent -advances in Large Language Models (LLMs) offer opportunities to automate and -streamline these processes. However, it remains challenging for LLMs to design -novel game mechanics beyond existing databases, generate consistent gameplay -environments, and develop scalable gameplay AI for large-scale evaluations. -This paper addresses these challenges by introducing a comprehensive automated -card game prototyping framework. 
The approach highlights a graph-based indexing -method for generating novel game designs, an LLM-driven system for consistent -game code generation validated by gameplay records, and a gameplay AI -constructing method that uses an ensemble of LLM-generated action-value -functions optimized through self-play. These contributions aim to accelerate -card game prototyping, reduce human labor, and lower barriers to entry for game -developers. +##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration** +2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano -摘要:電腦遊戲,尤其是卡牌遊戲的原型製作,需要大量的人力在創意構思和遊戲玩法評估上。大型語言模型 (LLM) 的最新進展提供了自動化和簡化這些流程的機會。然而,LLM 在設計超越現有資料庫的新穎遊戲機制、生成一致的遊戲環境,以及開發用於大規模評估的可擴充遊戲 AI 方面仍然面臨挑戰。本文通過引入一個全面的自動化卡牌遊戲原型製作框架來應對這些挑戰。該方法強調了一種基於圖表的索引方法,用於生成新穎的遊戲設計,一個由 LLM 驅動的系統,用於一致的遊戲程式碼生成,並由遊戲記錄驗證,以及一個遊戲 AI 構建方法,該方法使用由 LLM 生成的動作值函數的集合,通過自我對弈進行最佳化。這些貢獻旨在加速卡牌遊戲原型製作,減少人力,並降低遊戲開發人員的進入門檻。 +This paper presents a complete explainable system that interprets a set of +data, abstracts the underlying features and describes them in a natural +language of choice. The system relies on two crucial stages: (i) identifying +emerging properties from data and transforming them into abstract concepts, and +(ii) converting these concepts into natural language. Despite the impressive +natural language generation capabilities demonstrated by Large Language Models, +their statistical nature and the intricacy of their internal mechanism still +force us to employ these techniques as black boxes, forgoing trustworthiness. +Developing an explainable pipeline for data interpretation would allow +facilitating its use in safety-critical environments like processing medical +information and allowing non-experts and visually impaired people to access +narrated information. To this end, we believe that the fields of knowledge +representation and automated reasoning research could present a valid +alternative. 
Expanding on prior research that tackled the first stage (i), we +focus on the second stage, named Concept2Text. Being explainable, data +translation is easily modeled through logic-based rules, once again emphasizing +the role of declarative programming in achieving AI explainability. This paper +explores a Prolog/CLP-based rewriting system to interpret concepts (articulated +in terms of classes and relations, plus common knowledge) derived from a generic +ontology, generating natural language text. Its main features include +hierarchical tree rewritings, modular multilingual generation, support for +equivalent variants across semantic, grammar, and lexical levels, and a +transparent rule-based system. We outline the architecture and demonstrate its +flexibility through some examples capable of generating numerous diverse and +equivalent rewritings based on the input concept. -##### **GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural Processing Units** -2502.06921v2 by Arghadip Das, Shamik Kundu, Arnab Raha, Soumendu Ghosh, Deepak Mathaikutty, Vijay Raghunathan +摘要:這篇論文提出了一個完整的可解釋系統,它可以解釋一組資料,抽象出基礎特徵,並以選擇的自然語言描述它們。系統依賴兩個關鍵階段:(i) 從資料中識別新興屬性,並將它們轉換為抽象概念,以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力,但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子,放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它,例如處理醫療資訊,並允許非專家和視障人士存取敘述資訊。為此,我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上,我們專注於第二階段,稱為 Concept2Text。由於具有可解釋性,資料翻譯很容易透過基於邏輯的規則建模,再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統,以解釋概念,這些概念以類別和關係的形式表達,再加上從通用本体衍生的常識,產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體,以及一個透明的基於規則的系統。我們概述了架構,並透過一些範例展示了它的靈活性,這些範例能夠根據輸入概念生成許多不同的等效重寫。 -Graph Neural Networks (GNNs) are vital for learning from graph-structured -data, enabling applications in network analysis, recommendation systems, and -speech analytics. Deploying them on edge devices like client PCs and laptops -enhances real-time processing, privacy, and cloud independence.
GNNs aid -Retrieval-Augmented Generation (RAG) for Large Language Models (LLMs) and -enable event-based vision tasks. However, irregular memory access, sparsity, -and dynamic structures cause high latency and energy overhead on -resource-constrained devices. While modern edge processors integrate CPUs, -GPUs, and NPUs, NPUs designed for data-parallel tasks struggle with irregular -GNN computations. We introduce GraNNite, the first hardware-aware framework -optimizing GNN execution on commercial-off-the-shelf (COTS) SOTA DNN -accelerators via a structured three-step methodology: (1) enabling NPU -execution, (2) optimizing performance, and (3) trading accuracy for efficiency -gains. Step 1 employs GraphSplit for workload distribution and StaGr for static -aggregation, while GrAd and NodePad handle dynamic graphs. Step 2 boosts -performance using EffOp for control-heavy tasks and GraSp for sparsity -exploitation. Graph Convolution optimizations PreG, SymG, and CacheG reduce -redundancy and memory transfers. Step 3 balances quality versus efficiency, -where QuantGr applies INT8 quantization, and GrAx1, GrAx2, and GrAx3 accelerate -attention, broadcast-add, and SAGE-max aggregation. On Intel Core Ultra AI PCs, -GraNNite achieves 2.6X to 7.6X speedups over default NPU mappings and up to -8.6X energy gains over CPUs and GPUs, delivering 10.8X and 6.7X higher -performance than CPUs and GPUs, respectively, across GNN models. +##### **Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York** +2502.09204v1 by Sanskar Sehgal, Yanhong A. 
Liu -摘要:圖形神經網路 (GNN) 對於從圖形結構資料中學習至關重要,能應用於網路分析、推薦系統和語音分析。將其部署在邊緣裝置(例如用戶端電腦和筆電)上可增強即時處理、隱私和雲端獨立性。GNN 協助大型語言模型 (LLM) 的檢索增強生成 (RAG),並支援基於事件的視覺任務。然而,不規則的記憶體存取、稀疏性和動態結構會導致資源受限裝置上的高延遲和能源負擔。儘管現代邊緣處理器整合了 CPU、GPU 和 NPU,但針對資料平行任務所設計的 NPU 難以處理不規則的 GNN 計算。我們引入了 GraNNite,這是第一個硬體感知框架,透過結構化的三步驟方法最佳化商用現成 (COTS) SOTA DNN 加速器上的 GNN 執行:(1) 啟用 NPU 執行,(2) 最佳化效能,以及 (3) 以準確度換取效率提升。步驟 1 使用 GraphSplit 進行工作負載分配,並使用 StaGr 進行靜態聚合,而 GrAd 和 NodePad 則處理動態圖形。步驟 2 使用 EffOp 提升控制密集型任務的效能,並使用 GraSp 進行稀疏性利用。圖形卷積最佳化 PreG、SymG 和 CacheG 減少了冗餘和記憶體傳輸。步驟 3 平衡品質與效率,其中 QuantGr 適用 INT8 量化,而 GrAx1、GrAx2 和 GrAx3 則加速注意力、廣播加法和 SAGE-max 聚合。在 Intel Core Ultra AI PC 上,GraNNite 在預設 NPU 映射上實現了 2.6X 到 7.6X 的加速,在 CPU 和 GPU 上實現了高達 8.6X 的能源增益,在 GNN 模型中分別提供了比 CPU 和 GPU 高出 10.8X 和 6.7X 的效能。 +Legal cases require careful logical reasoning following the laws, whereas +interactions with non-technical users must be in natural language. As an +application combining logical reasoning using Prolog and natural language +processing using large language models (LLMs), this paper presents a novel +approach and system, LogicLease, to automate the analysis of landlord-tenant +legal cases in the state of New York. LogicLease determines compliance with +relevant legal requirements by analyzing case descriptions and citing all +relevant laws. It leverages LLMs for information extraction and Prolog for +legal reasoning. By separating information extraction from legal reasoning, +LogicLease achieves greater transparency and control over the legal logic +applied to each case. We evaluate the accuracy, efficiency, and robustness of +LogicLease through a series of tests, achieving 100% accuracy and an average +processing time of 2.57 seconds. LogicLease presents advantages over +state-of-the-art LLM-based legal analysis systems by providing clear, +step-by-step reasoning, citing specific laws, and distinguishing itself by its +ability to avoid hallucinations -- a common issue in LLMs. 
-##### **Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language** -2502.06634v1 by Zhiqiang Zhong, Simon Sataa-Yu Larsen, Haoyu Guo, Tao Tang, Kuangyu Zhou, Davide Mottin +摘要:法律案件需要遵循法律进行谨慎的逻辑推理,而与非技术用户的互动必须使用自然语言。作为结合使用 Prolog 进行逻辑推理和使用大型语言模型 (LLM) 进行自然语言处理的应用程序,本文提出了一种新颖的方法和系统 LogicLease,以自动分析纽约州的房东与租户法律案件。LogicLease 通过分析案例描述并引用所有相关法律来确定是否符合相关法律要求。它利用 LLM 进行信息提取,并利用 Prolog 进行法律推理。通过将信息提取与法律推理分开,LogicLease 实现了对应用于每个案例的法律逻辑的更高透明度和控制力。我们通过一系列测试评估了 LogicLease 的准确性、效率和鲁棒性,实现了 100% 的准确性和 2.57 秒的平均处理时间。LogicLease 通过提供清晰、分步的推理,引用具体法律,并以其避免幻觉的能力而区别于最先进的基于 LLM 的法律分析系统,从而显示出优势——这是 LLM 中的常见问题。 -Recent advancements in AI for biological research focus on integrating -molecular data with natural language to accelerate drug discovery. However, the -scarcity of high-quality annotations limits progress in this area. This paper -introduces LA$^3$, a Language-based Automatic Annotation Augmentation framework -that leverages large language models to augment existing datasets, thereby -improving AI training. We demonstrate the effectiveness of LA$^3$ by creating -an enhanced dataset, LaChEBI-20, where we systematically rewrite the -annotations of molecules from an established dataset. These rewritten -annotations preserve essential molecular information while providing more -varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5 -based on a benchmark architecture to learn the mapping between molecular -representations and augmented annotations. - Experimental results on text-based *de novo* molecule generation and molecule -captioning demonstrate that LaMolT5 outperforms state-of-the-art models. -Notably, incorporating LA$^3$ leads to improvements of up to 301% over the -benchmark architecture. Furthermore, we validate the effectiveness of LA$^3$ -notable applications in *image*, *text* and *graph* tasks, affirming its -versatility and utility. 
+##### **Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia** +2502.09173v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott -摘要:人工智慧在生物研究上的最新進展,專注於將分子資料與自然語言整合,以加速藥物發現。然而,高品質註解的稀少限制了此領域的進展。這篇論文介紹了 LA$^3$,一個基於語言的自動註解擴充框架,它利用大型語言模型來擴充現有的資料集,進而改善人工智慧訓練。我們透過建立一個增強的資料集 LaChEBI-20 來展示 LA$^3$ 的有效性,我們系統性地改寫了一個既定資料集中分子的註解。這些改寫的註解保留了重要的分子資訊,同時提供了更多樣化的句子結構和詞彙。使用 LaChEBI-20,我們在基於基準架構上訓練 LaMolT5,以學習分子表示和擴充註解之間的對應。 -在基於文字的 *從頭開始* 分子生成和分子標題上的實驗結果表明,LaMolT5 優於最先進的模型。值得注意的是,納入 LA$^3$ 可讓基準架構的改進幅度高達 301%。此外,我們驗證了 LA$^3$ 在 *影像*、*文字* 和 *圖形* 任務中的有效性,肯定了它的多功能性和實用性。 +In remote healthcare monitoring, time series representation learning reveals +critical patient behavior patterns from high-frequency data. This study +analyzes home activity data from individuals living with dementia by proposing +a two-stage, self-supervised learning approach tailored to uncover low-rank +structures. The first stage converts time-series activities into text sequences +encoded by a pre-trained language model, providing a rich, high-dimensional +latent state space using a PageRank-based method. This PageRank vector captures +latent state transitions, effectively compressing complex behaviour data into a +succinct form that enhances interpretability. This low-rank representation not +only enhances model interpretability but also facilitates clustering and +transition analysis, revealing key behavioral patterns correlated with +clinical metrics such as MMSE and ADAS-COG scores. Our findings demonstrate the +framework's potential in supporting cognitive status prediction, personalized +care interventions, and large-scale health monitoring. 
-##### **KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment** -2502.06472v1 by Yuxing Lu, Jinzhuo Wang +摘要:在遠程醫療監控中,時間序列表示學習揭示了高頻率數據中的關鍵患者行為模式。本研究通過提出一個兩階段、自我監督的學習方法來分析痴呆症患者的家庭活動數據,該方法專門用於發現低秩結構。第一階段將時間序列活動轉換為由預訓練語言模型編碼的文本序列,使用基於 PageRank 的方法提供了一個豐富、高維的潛在狀態空間。此 PageRank 向量捕獲潛在狀態轉換,有效地將複雜的行為數據壓縮成簡潔的形式,從而增強了解力。此低秩表示不僅增強了模型的可解釋性,還促進了聚類和轉換分析,揭示了與臨床指標(例如 MMSE 和 ADAS-COG 分數)相關的關鍵行為模式。我們的研究結果證明了該框架在支持認知狀態預測、個性化護理干預和大型健康監控方面的潛力。 -Maintaining comprehensive and up-to-date knowledge graphs (KGs) is critical -for modern AI systems, but manual curation struggles to scale with the rapid -growth of scientific literature. This paper presents KARMA, a novel framework -employing multi-agent large language models (LLMs) to automate KG enrichment -through structured analysis of unstructured text. Our approach employs nine -collaborative agents, spanning entity discovery, relation extraction, schema -alignment, and conflict resolution that iteratively parse documents, verify -extracted knowledge, and integrate it into existing graph structures while -adhering to domain-specific schema. Experiments on 1,200 PubMed articles from -three different domains demonstrate the effectiveness of KARMA in knowledge -graph enrichment, with the identification of up to 38,230 new entities while -achieving 83.1\% LLM-verified correctness and reducing conflict edges by 18.6\% -through multi-layer assessments. 
+##### **TastepepAI, An artificial intelligence platform for taste peptide de novo design** +2502.12167v1 by Jianda Yue, Tingting Li, Jian Ouyang, Jiawei Xu, Hua Tan, Zihui Chen, Changsheng Han, Huanyu Li, Songping Liang, Zhonghua Liu, Zhonghua Liu, Ying Wang -摘要:維護全面且最新的知識圖譜 (KG) 對現代 AI 系統至關重要,但手動策劃難以隨著科學文獻的快速增長而擴展。本文提出了 KARMA,一個採用多代理大型語言模型 (LLM) 的新框架,透過對非結構化文本的結構化分析來自動化 KG 豐富化。我們的做法採用九個協作代理,涵蓋實體發現、關係提取、架構比對和衝突解決,這些代理會反覆分析文件、驗證提取的知識,並將其整合到現有的圖結構中,同時遵守特定領域的架構。針對來自三個不同領域的 1,200 篇 PubMed 文章進行的實驗證明了 KARMA 在知識圖譜豐富化方面的有效性,識別出多達 38,230 個新實體,同時達到 83.1% 的 LLM 驗證正確性,並透過多層評估將衝突邊緣降低了 18.6%。 +Taste peptides have emerged as promising natural flavoring agents attributed +to their unique organoleptic properties, high safety profile, and potential +health benefits. However, the de novo identification of taste peptides derived +from animal, plant, or microbial sources remains a time-consuming and +resource-intensive process, significantly impeding their widespread application +in the food industry. Here, we present TastePepAI, a comprehensive artificial +intelligence framework for customized taste peptide design and safety +assessment. As the key element of this framework, a loss-supervised adaptive +variational autoencoder (LA-VAE) is implemented to efficiently optimize the +latent representation of sequences during training and facilitates the +generation of target peptides with desired taste profiles. Notably, our model +incorporates a novel taste-avoidance mechanism, allowing for selective flavor +exclusion. Subsequently, our in-house developed toxicity prediction algorithm +(SpepToxPred) is integrated in the framework to perform rigorous safety +evaluation of generated peptides. Using this integrated platform, we +successfully identified 73 peptides exhibiting sweet, salty, and umami, +significantly expanding the current repertoire of taste peptides. 
This work +demonstrates the potential of TastePepAI in accelerating taste peptide +discovery for food applications and provides a versatile framework adaptable to +broader peptide engineering challenges. -##### **RoToR: Towards More Reliable Responses for Order-Invariant Inputs** -2502.08662v1 by Soyoung Yoon, Dongha Ahn, Youngwon Lee, Minkyu Jung, HyungJoo Jang, Seung-won Hwang +摘要:味觉肽因其独特的感官特性、高安全性概况和潜在的健康益处而成为有前途的天然调味剂。然而,从动物、植物或微生物来源中从头鉴定味觉肽仍然是一个耗时且资源密集的过程,严重阻碍了它们在食品工业中的广泛应用。在此,我们提出了 TastePepAI,这是一个用于定制味觉肽设计和安全性评估的综合人工智能框架。作为该框架的关键元素,实现了损失监督自适应变分自动编码器 (LA-VAE),以在训练期间有效优化序列的潜在表示,并促进生成具有所需味觉特征的目标肽。值得注意的是,我们的模型包含了一种新颖的味觉回避机制,允许选择性排除风味。随后,我们内部开发的毒性预测算法 (SpepToxPred) 被集成到框架中,以对生成的肽进行严格的安全评估。使用这个集成平台,我们成功地鉴定了 73 种表现出甜味、咸味和鲜味的肽,极大地扩展了当前的味觉肽库。这项工作展示了 TastePepAI 在加速味觉肽发现以用于食品应用方面的潜力,并提供了一个适用于更广泛的肽工程挑战的多功能框架。 -Mitigating positional bias of language models (LMs) for listwise inputs is a -well-known and important problem (e.g., lost-in-the-middle). While zero-shot -order-invariant LMs have been proposed to solve this issue, their success on -practical listwise problems has been limited. In this work, as a first -contribution, we identify and overcome two limitations to make zero-shot -invariant LMs more practical: (1) training and inference distribution mismatch -arising from modifying positional ID assignments to enforce invariance, and (2) -failure to adapt to a mixture of order-invariant and sensitive inputs in -practical listwise problems. To overcome, we propose (1) RoToR, a zero-shot -invariant LM for genuinely order-invariant inputs with minimal modifications of -positional IDs, and (2) Selective Routing, an adaptive framework that handles -both order-invariant and order-sensitive inputs in listwise tasks. On the Lost -in the middle (LitM), Knowledge Graph Question Answering (KGQA), and MMLU -benchmarks, we show that RoToR with Selective Routing can effectively handle -practical listwise input tasks in a zero-shot manner. 
+##### **HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification** +2502.08754v1 by Valentina Vadori, Jean-Marie Graïc, Antonella Peruffo, Livio Finos, Ujwala Kiran Chaudhari, Enrico Grisan -摘要:語言模型 (LM) 的位置偏差緩解對於列表輸入來說是一個廣為人知且重要的問題(例如,迷失在中間)。雖然已經提出零次學習順序不變的 LM 來解決這個問題,但它們在實際列表問題上的成功卻很有限。在這項工作中,作為第一個貢獻,我們找出並克服了兩個限制,讓零次學習不變的 LM 更有實用性:(1) 訓練和推論分布不匹配,這是由於修改位置 ID 分配以強制不變性所造成的,以及 (2) 無法適應實際列表問題中不變和敏感輸入的組合。為了克服這些問題,我們提出 (1) RoToR,一個零次學習不變的 LM,用於真正不變的輸入,並對位置 ID 進行最小的修改,以及 (2) 選擇性路由,一個自適應框架,用於處理列表任務中不變和敏感的輸入。在迷失在中間 (LitM)、知識圖譜問答 (KGQA) 和 MMLU 基準測試中,我們展示了 RoToR 與選擇性路由可以有效地以零次學習的方式處理實際的列表輸入任務。 +Precise segmentation and classification of cell instances are vital for +analyzing the tissue microenvironment in histology images, supporting medical +diagnosis, prognosis, treatment planning, and studies of brain +cytoarchitecture. However, the creation of high-quality annotated datasets for +training remains a major challenge. This study introduces a novel single-stage +approach (HistoSmith) for generating image-label pairs to augment histology +datasets. Unlike state-of-the-art methods that utilize diffusion models with +separate components for label and image generation, our approach employs a +latent diffusion model to learn the joint distribution of cellular layouts, +classification masks, and histology images. This model enables tailored data +generation by conditioning on user-defined parameters such as cell types, +quantities, and tissue types. Trained on the Conic H&E histopathology dataset +and the Nissl-stained CytoDArk0 dataset, the model generates realistic and +diverse labeled samples. Experimental results demonstrate improvements in cell +instance segmentation and classification, particularly for underrepresented +cell types like neutrophils in the Conic dataset. These findings underscore the +potential of our approach to address data scarcity challenges. 
-##### **K-ON: Stacking Knowledge On the Head Layer of Large Language Model** -2502.06257v1 by Lingbing Guo, Yichi Zhang, Zhongpu Bo, Zhuo Chen, Mengshu Sun, Zhiqiang Zhang, Wen Zhang, Huajun Chen +摘要:精確的細胞實例分割和分類對於分析組織學影像中的組織微環境、支援醫療診斷、預後、治療規劃和腦部細胞結構研究至關重要。然而,建立用於訓練的高品質標註資料集仍然是一項重大挑戰。本研究提出了一種新穎的單階段方法 (HistoSmith),用於產生影像標籤對,以擴充組織學資料集。與利用擴散模型並將標籤和影像產生分開的組成部分的現有技術不同,我們的做法採用潛在擴散模型來學習細胞佈局、分類遮罩和組織學影像的聯合分佈。此模型能透過調整使用者定義的參數(例如細胞類型、數量和組織類型)來進行客製化資料產生。在 Conic H&E 細胞病理學資料集和 Nissl 染色的 CytoDArk0 資料集上訓練後,此模型產生逼真且多樣化的標籤樣本。實驗結果顯示細胞實例分割和分類有顯著進步,特別是對於 Conic 資料集中代表性不足的細胞類型,例如中性球。這些發現強調了我們的方法在解決資料稀少性挑戰方面的潛力。 -Recent advancements in large language models (LLMs) have significantly -improved various natural language processing (NLP) tasks. Typically, LLMs are -trained to predict the next token, aligning well with many NLP tasks. However, -in knowledge graph (KG) scenarios, entities are the fundamental units and -identifying an entity requires at least several tokens. This leads to a -granularity mismatch between KGs and natural languages. To address this issue, -we propose K-ON, which integrates KG knowledge into the LLM by employing -multiple head layers for next k-step prediction. K-ON can not only generate -entity-level results in one step, but also enables contrastive loss against -entities, which is the most powerful tool in KG representation learning. -Experimental results show that K-ON outperforms state-of-the-art methods that -incorporate text and even the other modalities. +##### **Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion** +2502.08560v1 by Lemuel Puglisi, Daniel C. 
Alexander, Daniele Ravì -摘要:大型語言模型 (LLM) 的最新進展顯著提升了各種自然語言處理 (NLP) 任務。通常,LLM 會接受訓練以預測下一個符號,這與許多 NLP 任務非常吻合。然而,在知識圖譜 (KG) 場景中,實體是基本單位,而識別實體至少需要幾個符號。這導致 KG 和自然語言之間的粒度不匹配。為了解決這個問題,我們提出了 K-ON,它透過採用多個頭部層進行下一個 k 步預測,將 KG 知識整合到 LLM 中。K-ON 不僅可以在一個步驟中產生實體層級的結果,還能針對實體啟用對比損失,這是 KG 表示學習中最有力的工具。實驗結果顯示,K-ON 優於將文字甚至其他方式納入考量的最新方法。 +The growing availability of longitudinal Magnetic Resonance Imaging (MRI) +datasets has facilitated Artificial Intelligence (AI)-driven modeling of +disease progression, making it possible to predict future medical scans for +individual patients. However, despite significant advancements in AI, current +methods continue to face challenges including achieving patient-specific +individualization, ensuring spatiotemporal consistency, efficiently utilizing +longitudinal data, and managing the substantial memory demands of 3D scans. To +address these challenges, we propose Brain Latent Progression (BrLP), a novel +spatiotemporal model designed to predict individual-level disease progression +in 3D brain MRIs. The key contributions in BrLP are fourfold: (i) it operates +in a small latent space, mitigating the computational challenges posed by +high-dimensional imaging data; (ii) it explicitly integrates subject metadata +to enhance the individualization of predictions; (iii) it incorporates prior +knowledge of disease dynamics through an auxiliary model, facilitating the +integration of longitudinal data; and (iv) it introduces the Latent Average +Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in +the predicted progression at inference time and (b) allows us to derive a +measure of the uncertainty for the prediction. We train and evaluate BrLP on +11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its +generalizability on an external test set comprising 2,257 MRIs from 962 +subjects. 
Our experiments compare BrLP-generated MRI scans with real follow-up +MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The +code is publicly available at: https://github.com/LemuelPuglisi/BrLP. -##### **LegalViz: Legal Text Visualization by Text To Diagram Generation** -2502.06147v2 by Eri Onami, Taiki Miyanishi, Koki Maeda, Shuhei Kurita +摘要:隨著縱向磁共振影像 (MRI) 資料集的日益普及,已促進人工智慧 (AI) 驅動的疾病進程建模,讓預測個別患者的未來醫學掃描成為可能。然而,儘管 AI 有顯著進展,目前的技術仍面臨挑戰,包括實現患者特定的個別化、確保時空一致性、有效利用縱向資料,以及管理 3D 掃描的大量記憶體需求。為了應對這些挑戰,我們提出腦潛在進程 (BrLP),這是一種新穎的時空模型,旨在預測 3D 腦部 MRI 中的個人層級疾病進程。BrLP 的主要貢獻有四個:(i) 它在一個小的潛在空間中運作,減輕了高維度影像資料帶來的計算挑戰;(ii) 它明確整合受試者的元資料,以增強預測的個別化;(iii) 它透過輔助模型納入疾病動態的先驗知識,促進縱向資料的整合;(iv) 它引入了潛在平均穩定化 (LAS) 演算法,該演算法 (a) 在推論時強制預測進程中的時空一致性,(b) 讓我們能夠推導預測的不確定性測量。我們對來自 2,805 名受試者的 11,730 個 T1 加權 (T1w) 腦部 MRI 進行 BrLP 訓練和評估,並在包含來自 962 名受試者的 2,257 個 MRI 的外部測試集上驗證其概括性。我們的實驗將 BrLP 生成的 MRI 掃描與實際追蹤 MRI 進行比較,與現有方法相比,展示了最先進的準確性。程式碼已公開於:https://github.com/LemuelPuglisi/BrLP。 -Legal documents including judgments and court orders require highly -sophisticated legal knowledge for understanding. To disclose expert knowledge -for non-experts, we explore the problem of visualizing legal texts with -easy-to-understand diagrams and propose a novel dataset of LegalViz with 23 -languages and 7,010 cases of legal document and visualization pairs, using the -DOT graph description language of Graphviz. LegalViz provides a simple diagram -from a complicated legal corpus identifying legal entities, transactions, legal -sources, and statements at a glance, that are essential in each judgment. In -addition, we provide new evaluation metrics for the legal diagram visualization -by considering graph structures, textual similarities, and legal contents. We -conducted empirical studies on few-shot and finetuning large language models -for generating legal diagrams and evaluated them with these metrics, including -legal content-based evaluation within 23 languages. 
Models trained with -LegalViz outperform existing models including GPTs, confirming the -effectiveness of our dataset. +##### **Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data** +2502.08547v1 by Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai -摘要:法律文件,包括判決和法院命令,需要高度專業的法律知識才能理解。為了向非專家揭露專家知識,我們探討了使用易於理解的圖表將法律文本視覺化的問題,並提出了一個新的 LegalViz 數據集,其中包含 23 種語言和 7,010 個法律文件和視覺化配對,使用 Graphviz 的 DOT 圖形描述語言。LegalViz 從複雜的法律語料庫中提供了一個簡單的圖表,可以一目了然地識別法律實體、交易、法律來源和陳述,這些在每項判決中都是必不可少的。此外,我們通過考慮圖形結構、文本相似性和法律內容,為法律圖表視覺化提供了新的評估指標。我們對少次學習和微調大型語言模型進行了實證研究,以生成法律圖表,並使用這些指標對它們進行了評估,包括在 23 種語言中基於法律內容的評估。使用 LegalViz 訓練的模型優於現有的模型,包括 GPT,證實了我們數據集的有效性。 +The adoption of EHRs has expanded opportunities to leverage data-driven +algorithms in clinical care and research. A major bottleneck in effectively +conducting multi-institutional EHR studies is the data heterogeneity across +systems with numerous codes that either do not exist or represent different +clinical concepts across institutions. The need for data privacy further limits +the feasibility of including multi-institutional patient-level data required to +study similarities and differences across patient subgroups. To address these +challenges, we developed the GAME algorithm. 
Tested and validated across 7 +institutions and 2 languages, GAME integrates data in several levels: (1) at +the institutional level with knowledge graphs to establish relationships +between codes and existing knowledge sources, providing the medical context for +standard codes and their relationship to each other; (2) between institutions, +leveraging language models to determine the relationships between +institution-specific codes with established standard codes; and (3) quantifying +the strength of the relationships between codes using a graph attention +network. Jointly trained embeddings are created using transfer and federated +learning to preserve data privacy. In this study, we demonstrate the +applicability of GAME in selecting relevant features as inputs for AI-driven +algorithms in a range of conditions, e.g., heart failure, rheumatoid arthritis. +We then highlight the application of GAME harmonized multi-institutional EHR +data in a study of Alzheimer's disease outcomes and suicide risk among patients +with mental health disorders, without sharing patient-level data outside +individual institutions. -##### **Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs** -2502.06075v1 by Han Meng, Renwen Zhang, Ganyi Wang, Yitian Yang, Peinuan Qin, Jungup Lee, Yi-Chieh Lee +摘要:電子健康紀錄的採用擴大了在臨床照護和研究中利用資料驅動演算法的機會。在有效進行多機構電子健康紀錄研究時,一個主要的瓶頸是系統間資料異質性,其中有許多代碼在機構間不存在或表示不同的臨床概念。資料隱私的需求進一步限制了納入多機構患者層級資料的可行性,而這些資料對於研究患者亞群之間的相似性和差異性是必要的。為了應對這些挑戰,我們開發了 GAME 演算法。GAME 已在 7 個機構和 2 種語言中進行測試和驗證,它整合了多個層級的資料:(1) 在機構層級,使用知識圖表來建立代碼和現有知識來源之間的關係,為標準代碼及其彼此之間的關係提供醫療背景;(2) 在機構之間,利用語言模型來確定機構特定代碼與已建立的標準代碼之間的關係;(3) 使用圖形注意網路量化代碼之間關係的強度。使用遷移和聯合學習建立聯合訓練的嵌入,以保護資料隱私。在本研究中,我們展示了 GAME 在選擇相關特徵作為 AI 驅動演算法輸入時的適用性,適用於各種情況,例如心臟衰竭、類風濕性關節炎。然後,我們重點介紹了 GAME 和諧化多機構電子健康紀錄資料在阿茲海默症疾病結果和精神疾病患者自殺風險研究中的應用,而無需在個別機構之外共享患者層級資料。 -Mental-illness stigma is a persistent social problem, hampering both -treatment-seeking and recovery. 
Accordingly, there is a pressing need to -understand it more clearly, but analyzing the relevant data is highly -labor-intensive. Therefore, we designed a chatbot to engage participants in -conversations; coded those conversations qualitatively with AI assistance; and, -based on those coding results, built causal knowledge graphs to decode stigma. -The results we obtained from 1,002 participants demonstrate that conversation -with our chatbot can elicit rich information about people's attitudes toward -depression, while our AI-assisted coding was strongly consistent with -human-expert coding. Our novel approach combining large language models (LLMs) -and causal knowledge graphs uncovered patterns in individual responses and -illustrated the interrelationships of psychological constructs in the dataset -as a whole. The paper also discusses these findings' implications for HCI -researchers in developing digital interventions, decomposing human -psychological constructs, and fostering inclusive attitudes. +##### **EEG Artifact Detection and Correction with Deep Autoencoders** +2502.08686v1 by David Aquilué-Llorens, Aureli Soria-Frisch -摘要:精神疾病的污名化是一個持續存在的社會問題,阻礙了尋求治療和康復。因此,迫切需要更清楚地了解它,但分析相關數據非常費力。因此,我們設計了一個聊天機器人,讓參與者參與對話;使用 AI 協助對這些對話進行定性編碼;並根據這些編碼結果,構建因果知識圖譜來破譯污名化。我們從 1,002 名參與者那裡獲得的結果表明,與我們的聊天機器人的對話可以引出人們對憂鬱症的豐富資訊,而我們 AI 輔助的編碼與人類專家編碼非常一致。我們將大型語言模型 (LLM) 和因果知識圖譜相結合的新方法揭示了個別反應中的模式,並說明了資料集中心理建構之間的相互關係。本文還討論了這些發現對 HCI 研究人員在開發數位介入措施、分解人類心理建構和培養包容態度方面的影響。 +EEG signals convey important information about brain activity both in healthy +and pathological conditions. However, they are inherently noisy, which poses +significant challenges for accurate analysis and interpretation. Traditional +EEG artifact removal methods, while effective, often require extensive expert +intervention. This study presents LSTEEG, a novel LSTM-based autoencoder +designed for the detection and correction of artifacts in EEG signals. 
+Leveraging deep learning, particularly LSTM layers, LSTEEG captures non-linear +dependencies in sequential EEG data. LSTEEG demonstrates superior performance +in both artifact detection and correction tasks compared to other +state-of-the-art convolutional autoencoders. Our methodology enhances the +interpretability and utility of the autoencoder's latent space, enabling +data-driven automated artifact removal in EEG and its application in downstream +tasks. This research advances the field of efficient and accurate multi-channel +EEG preprocessing, and promotes the implementation and usage of automated EEG +analysis pipelines for brain health applications. + +摘要:腦電圖訊號傳達了關於大腦活動的重要資訊,無論是在健康或病理狀況下。然而,它們本質上是有雜訊的,這對準確的分析和解釋構成了重大的挑戰。傳統的腦電圖人工製品移除方法雖然有效,但通常需要大量的專家介入。本研究提出 LSTEEG,一種新穎的基於 LSTM 的自動編碼器,用於偵測和校正腦電圖訊號中的人工製品。利用深度學習,特別是 LSTM 層,LSTEEG 捕捉序列腦電圖資料中的非線性依賴性。與其他最先進的卷積自動編碼器相比,LSTEEG 在人工製品偵測和校正任務中都展現出優異的效能。我們的做法增強了自動編碼器潛在空間的可解釋性和實用性,讓資料驅動的自動人工製品移除得以應用於腦電圖的下游任務。這項研究推動了高效且準確的多通道腦電圖前處理領域,並促進了自動腦電圖分析管線在腦部健康應用中的實作和使用。 -##### **LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification** -2502.05836v1 by Shubham Kumar Nigam, Tanmay Dubey, Govind Sharma, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya +##### **SycEval: Evaluating LLM Sycophancy** +2502.08177v1 by Aaron Fanous, Jacob Goldberg, Ank A. Agarwal, Joanna Lin, Anson Zhou, Roxana Daneshjou, Sanmi Koyejo -In this paper, we address the task of semantic segmentation of legal -documents through rhetorical role classification, with a focus on Indian legal -judgments. We introduce LegalSeg, the largest annotated dataset for this task, -comprising over 7,000 documents and 1.4 million sentences, labeled with 7 -rhetorical roles. 
To benchmark performance, we evaluate multiple -state-of-the-art models, including Hierarchical BiLSTM-CRF, -TransformerOverInLegalBERT (ToInLegalBERT), Graph Neural Networks (GNNs), and -Role-Aware Transformers, alongside an exploratory RhetoricLLaMA, an -instruction-tuned large language model. Our results demonstrate that models -incorporating broader context, structural relationships, and sequential -sentence information outperform those relying solely on sentence-level -features. Additionally, we conducted experiments using surrounding context and -predicted or actual labels of neighboring sentences to assess their impact on -classification accuracy. Despite these advancements, challenges persist in -distinguishing between closely related roles and addressing class imbalance. -Our work underscores the potential of advanced techniques for improving legal -document understanding and sets a strong foundation for future research in -legal NLP. +Large language models (LLMs) are increasingly applied in educational, +clinical, and professional settings, but their tendency for sycophancy -- +prioritizing user agreement over independent reasoning -- poses risks to +reliability. This study introduces a framework to evaluate sycophantic behavior +in ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro across AMPS (mathematics) and +MedQuad (medical advice) datasets. Sycophantic behavior was observed in 58.19% +of cases, with Gemini exhibiting the highest rate (62.47%) and ChatGPT the +lowest (56.71%). Progressive sycophancy, leading to correct answers, occurred +in 43.52% of cases, while regressive sycophancy, leading to incorrect answers, +was observed in 14.66%. Preemptive rebuttals demonstrated significantly higher +sycophancy rates than in-context rebuttals (61.75% vs. 56.52%, $Z=5.87$, +$p<0.001$), particularly in computational tasks, where regressive sycophancy +increased significantly (preemptive: 8.13%, in-context: 3.54%, $p<0.001$). 
+Simple rebuttals maximized progressive sycophancy ($Z=6.59$, $p<0.001$), while +citation-based rebuttals exhibited the highest regressive rates ($Z=6.59$, +$p<0.001$). Sycophantic behavior showed high persistence (78.5%, 95% CI: +[77.2%, 79.8%]) regardless of context or model. These findings emphasize the +risks and opportunities of deploying LLMs in structured and dynamic domains, +offering insights into prompt programming and model optimization for safer AI +applications. -摘要:在本文中,我們通過修辭角色分類來探討法律文件的語義分段任務,重點關注印度法律判決。我們引入了 LegalSeg,這是此任務中最大的註釋資料集,包含超過 7,000 份文件和 140 萬個句子,並標記了 7 個修辭角色。為了評量效能,我們評估了多個最先進的模型,包括分層 BiLSTM-CRF、TransformerOverInLegalBERT (ToInLegalBERT)、圖神經網路 (GNN) 和角色感知Transformer,以及探索性的 RhetoricLLaMA,一種經過指令調整的大型語言模型。我們的結果表明,結合廣泛背景、結構關係和順序句子資訊的模型,表現優於僅依賴句子層級特徵的模型。此外,我們使用周圍的背景和鄰近句子的預測或實際標籤進行實驗,以評估它們對分類精度的影響。儘管有這些進展,但在區分密切相關的角色和解決類別不平衡方面仍存在挑戰。我們的研究強調了先進技術在改善法律文件理解方面的潛力,並為法律自然語言處理的未來研究奠定了堅實的基礎。 +摘要:大型語言模型(LLM)日益應用於教育、臨床和專業領域,但它們趨於趨炎附勢——優先考慮用戶同意而非獨立推理——對可靠性構成風險。本研究引入了一個框架來評估 ChatGPT-4o、Claude-Sonnet 和 Gemini-1.5-Pro 中的趨炎附勢行為,涉及 AMPS(數學)和 MedQuad(醫療建議)數據集。在 58.19% 的案例中觀察到了趨炎附勢行為,其中 Gemini 表現出最高比率(62.47%),而 ChatGPT 最低(56.71%)。導致正確答案的漸進式趨炎附勢發生在 43.52% 的案例中,而導致不正確答案的退步式趨炎附勢則在 14.66% 的案例中被觀察到。先發制人的反駁表現出顯著高於上下文反駁的趨炎附勢率(61.75% 對 56.52%,Z=5.87,p<0.001),特別是在計算任務中,其中退步式趨炎附勢顯著增加(先發制人:8.13%,上下文:3.54%,p<0.001)。簡單的反駁最大化了漸進式趨炎附勢(Z=6.59,p<0.001),而基於引用的反駁表現出最高的退步式比率(Z=6.59,p<0.001)。趨炎附勢行為表現出很高的持續性(78.5%,95% CI:[77.2%,79.8%]),無論上下文或模型如何。這些發現強調了在結構化和動態領域部署 LLM 的風險和機遇,為更安全的 AI 應用提供了提示編程和模型優化的見解。 -##### **LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning** -2502.05453v1 by Hanqing Yang, Jingdi Chen, Marie Siew, Tania Lorido-Botran, Carlee Joe-Wong +##### **Cancer Vaccine Adjuvant Name Recognition from Biomedical Literature using Large Language Models** +2502.09659v1 by Hasin Rehana, Jie Zheng, Leo Yeh, Benu Bansal, Nur Bengisu Çam, Christianah Jemiyo, Brett McGregor, Arzucan Özgür, Yongqun He, Junguk Hur -Developing 
intelligent agents for long-term cooperation in dynamic open-world -scenarios is a major challenge in multi-agent systems. Traditional Multi-agent -Reinforcement Learning (MARL) frameworks like centralized training -decentralized execution (CTDE) struggle with scalability and flexibility. They -require centralized long-term planning, which is difficult without custom -reward functions, and face challenges in processing multi-modal data. CTDE -approaches also assume fixed cooperation strategies, making them impractical in -dynamic environments where agents need to adapt and plan independently. To -address decentralized multi-agent cooperation, we propose Decentralized -Adaptive Knowledge Graph Memory and Structured Communication System (DAMCS) in -a novel Multi-agent Crafter environment. Our generative agents, powered by -Large Language Models (LLMs), are more scalable than traditional MARL agents by -leveraging external knowledge and language for long-term planning and -reasoning. Instead of fully sharing information from all past experiences, -DAMCS introduces a multi-modal memory system organized as a hierarchical -knowledge graph and a structured communication protocol to optimize agent -cooperation. This allows agents to reason from past interactions and share -relevant information efficiently. Experiments on novel multi-agent open-world -tasks show that DAMCS outperforms both MARL and LLM baselines in task -efficiency and collaboration. Compared to single-agent scenarios, the two-agent -scenario achieves the same goal with 63% fewer steps, and the six-agent -scenario with 74% fewer steps, highlighting the importance of adaptive memory -and structured communication in achieving long-term goals. We publicly release -our project at: https://happyeureka.github.io/damcs. +Motivation: An adjuvant is a chemical incorporated into vaccines that +enhances their efficacy by improving the immune response. 
Identifying adjuvant +names from cancer vaccine studies is essential for furthering research and +enhancing immunotherapies. However, the manual curation from the constantly +expanding biomedical literature poses significant challenges. This study +explores the automated recognition of vaccine adjuvant names using Large +Language Models (LLMs), specifically Generative Pretrained Transformers (GPT) +and Large Language Model Meta AI (Llama). Methods: We utilized two datasets: 97 +clinical trial records from AdjuvareDB and 290 abstracts annotated with the +Vaccine Adjuvant Compendium (VAC). GPT-4o and Llama 3.2 were employed in +zero-shot and few-shot learning paradigms with up to four examples per prompt. +Prompts explicitly targeted adjuvant names, testing the impact of contextual +information such as substances or interventions. Outputs underwent automated +and manual validation for accuracy and consistency. Results: GPT-4o attained +100% Precision across all situations while exhibiting notable improvement in Recall +and F1-scores, particularly with incorporating interventions. On the VAC +dataset, GPT-4o achieved a maximum F1-score of 77.32% with interventions, +surpassing Llama-3.2-3B by approximately 2%. On the AdjuvareDB dataset, GPT-4o +reached an F1-score of 81.67% for three-shot prompting with interventions, +surpassing Llama-3.2-3B's maximum F1-score of 65.62%. Conclusion: Our findings +demonstrate that LLMs excel at identifying adjuvant names, including rare +variations of naming representation. This study emphasizes the capability of +LLMs to enhance cancer vaccine development by efficiently extracting insights. +Future work aims to broaden the framework to encompass various biomedical +literature and enhance model generalizability across various vaccines and +adjuvants. 
-摘要:在動態開放世界情境中開發用於長期合作的智慧代理是多重代理系統中的一項重大挑戰。傳統的多重代理強化學習 (MARL) 框架,例如集中式訓練去中心化執行 (CTDE),在可擴充性和靈活性方面面臨困難。它們需要集中式長期規劃,這在沒有自訂獎勵函數的情況下很難執行,並且在處理多模式數據時會面臨挑戰。CTDE 方法還假設固定的合作策略,這使得它們在代理需要獨立適應和規劃的動態環境中不切實際。為了解決分散式多重代理合作問題,我們在一個新穎的多重代理工匠環境中提出了分散式自適應知識圖譜記憶體和結構化通訊系統 (DAMCS)。我們的生成代理由大型語言模型 (LLM) 提供支援,透過利用外部知識和語言進行長期規劃和推理,比傳統的 MARL 代理更具可擴充性。DAMCS 沒有完全分享來自所有過去經驗的資訊,而是引入了多模式記憶體系統,該系統組織成階層式知識圖譜和結構化通訊協定,以最佳化代理合作。這允許代理根據過去的互動進行推理並有效地分享相關資訊。在新的多重代理開放世界任務上的實驗表明,DAMCS 在任務效率和協作方面優於 MARL 和 LLM 基準。與單一代理情境相比,雙重代理情境以少 63% 的步驟達成相同的目標,而六重代理情境則以少 74% 的步驟達成目標,突顯了自適應記憶體和結構化通訊在達成長期目標中的重要性。我們公開發布我們的專案於:https://happyeureka.github.io/damcs。 +摘要:動機:佐劑是一種加入疫苗的化學物質,能藉由改善免疫反應來提升疫苗的效力。從癌症疫苗研究中找出佐劑名稱對於推進研究和改善免疫療法至關重要。然而,從不斷擴展的生物醫學文獻中手動整理會造成重大挑戰。本研究探討使用大型語言模型 (LLM),特別是生成式預訓練Transformer (GPT) 和大型語言模型 Meta AI (Llama) 來自動辨識疫苗佐劑名稱。方法:我們使用兩個資料集:來自 AdjuvareDB 的 97 份臨床試驗記錄和 290 篇標註了疫苗佐劑彙編 (VAC) 的摘要。GPT-4o 和 Llama 3.2 被用於零次學習和少量學習範例,每個提示最多有四個範例。提示明確鎖定佐劑名稱,測試物質或介入措施等背景資訊的影響。輸出經過自動和手動驗證,以確保準確性和一致性。結果:GPT-4o 在所有情況下都達到 100% 的準確率,同時在召回率和 F1 分數上表現出顯著的進步,特別是在納入介入措施的情況下。在 VAC 資料集上,GPT-4o 在有介入措施的情況下達到 77.32% 的最高 F1 分數,比 Llama-3.2-3B 高出約 2%。在 AdjuvareDB 資料集上,GPT-4o 在有介入措施的三次提示中達到 81.67% 的 F1 分數,超過 Llama-3.2-3 B 的最高 F1 分數 65.62%。結論:我們的研究結果表明,LLM 在辨識佐劑名稱方面表現出色,包括命名表示的罕見變異。本研究強調了 LLM 在有效提取見解方面增強癌症疫苗開發的能力。未來的研究工作旨在擴大架構,涵蓋各種生物醫學文獻,並增強模型在各種疫苗和佐劑中的泛化能力。 -##### **SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation** -2502.05424v1 by Xingtong Yu, Zechuan Gong, Chang Zhou, Yuan Fang, Hui Zhang +##### **Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?** +2502.07963v1 by Hye Sun Yun, Karen Y. C. Zhang, Ramez Kouzy, Iain J. Marshall, Junyi Jessy Li, Byron C. Wallace -Graphs are able to model interconnected entities in many online services, -supporting a wide range of applications on the Web. This raises an important -question: How can we train a graph foundational model on multiple source -domains and adapt to an unseen target domain? 
A major obstacle is that graphs -from different domains often exhibit divergent characteristics. Some studies -leverage large language models to align multiple domains based on textual -descriptions associated with the graphs, limiting their applicability to -text-attributed graphs. For text-free graphs, a few recent works attempt to -align different feature distributions across domains, while generally -neglecting structural differences. In this work, we propose a novel Structure -Alignment framework for text-free Multi-domain Graph Pre-Training and -cross-domain adaptation (SAMGPT). It is designed to learn multi-domain -knowledge from graphs originating in multiple source domains, which can then be -adapted to address applications in an unseen target domain. Specifically, we -introduce a set of structure tokens to harmonize structure-based aggregation -across source domains during the pre-training phase. Next, for cross-domain -adaptation, we design dual prompts, namely, holistic prompts and specific -prompts, which adapt unified multi-domain structural knowledge and -fine-grained, domain-specific information, respectively, to a target domain. -Finally, we conduct comprehensive experiments on seven public datasets to -evaluate and analyze the effectiveness of SAMGPT. +Medical research faces well-documented challenges in translating novel +treatments into clinical practice. Publishing incentives encourage researchers +to present "positive" findings, even when empirical results are equivocal. +Consequently, it is well-documented that authors often spin study results, +especially in article abstracts. Such spin can influence clinician +interpretation of evidence and may affect patient care decisions. In this +study, we ask whether the interpretation of trial results offered by Large +Language Models (LLMs) is similarly affected by spin. This is important since +LLMs are increasingly being used to trawl through and synthesize published +medical evidence. 
We evaluated 22 LLMs and found that they are across the board +more susceptible to spin than humans. They might also propagate spin into their +outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into +plain language summaries that they generate. We also find, however, that LLMs +are generally capable of recognizing spin, and can be prompted in a way to +mitigate spin's impact on LLM outputs. -摘要:圖表能夠在許多線上服務中對相互關聯的實體進行建模, -支援網路上廣泛的應用程式。這提出了重要的問題:我們如何針對多個來源網域訓練圖表基礎模型,並適應未見過的目標網域?一個主要的障礙是,來自不同網域的圖表通常表現出不同的特性。一些研究利用大型語言模型,根據與圖表相關的文字描述,對齊多個網域,限制其適用性於有文字屬性的圖表。對於沒有文字的圖表,最近的一些作品嘗試對齊跨網域的不同特徵分佈,同時通常忽略結構上的差異。在這項工作中,我們提出了一個新的結構對齊框架,用於無文字多網域圖表預訓練和跨網域適應 (SAMGPT)。它被設計為從起源於多個來源網域的圖表中學習多網域知識,然後可以適應於未見過的目標網域中的應用程式。具體來說,我們引入了一組結構化代碼,以在預訓練階段,調和跨來源網域的基於結構的聚合。接下來,對於跨網域適應,我們設計了雙重提示,即整體提示和具體提示,分別將統一的多網域結構知識和細緻的、特定於網域的資訊適應到目標網域。最後,我們在七個公共資料集上進行了全面的實驗,以評估和分析 SAMGPT 的有效性。 +摘要:醫學研究在將新穎療法轉化為臨床實務上,面臨著有據可查的挑戰。發表誘因鼓勵研究人員呈現「正向」的發現,即使經驗結果模稜兩可。因此,有據可查的是,作者經常扭曲研究結果,特別是在文章摘要中。此類扭曲可能會影響臨床醫師對證據的詮釋,並可能影響病患照護決策。在本研究中,我們探討大型語言模型 (LLM) 提供的試驗結果詮釋是否也受到扭曲影響。由於 LLM 正越來越常被用於爬梳和綜合已發表的醫學證據,因此這點非常重要。我們評估了 22 個 LLM,發現它們普遍比人類更容易受到扭曲影響。它們也可能將扭曲傳播到其輸出中:例如,我們發現 LLM 會將扭曲隱含納入其產生的白話文摘要中。然而,我們也發現 LLM 通常有能力辨認扭曲,而且可以透過提示的方式減輕扭曲對 LLM 輸出的影響。 -##### **Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints** -2502.05414v1 by Ali Al-Lawati, Jason Lucas, Zhiwei Zhang, Prasenjit Mitra, Suhang Wang +##### **An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating** +2502.07755v1 by Mohammad Ali Labbaf Khaniki, Sahabeh Saadati, Mohammad Manthouri -In-context learning (ICL) effectively conditions large language models (LLMs) -for molecular tasks, such as property prediction and molecule captioning, by -embedding carefully selected demonstration examples into the input prompt. This -approach avoids the computational overhead of extensive pertaining and -fine-tuning. 
However, current prompt retrieval methods for molecular tasks have -relied on molecule feature similarity, such as Morgan fingerprints, which do -not adequately capture the global molecular and atom-binding relationships. As -a result, these methods fail to represent the full complexity of molecular -structures during inference. Moreover, small-to-medium-sized LLMs, which offer -simpler deployment requirements in specialized systems, have remained largely -unexplored in the molecular ICL literature. To address these gaps, we propose a -self-supervised learning technique, GAMIC (Graph-Aligned Molecular In-Context -learning, which aligns global molecular structures, represented by graph neural -networks (GNNs), with textual captions (descriptions) while leveraging local -feature similarity through Morgan fingerprints. In addition, we introduce a -Maximum Marginal Relevance (MMR) based diversity heuristic during retrieval to -optimize input prompt demonstration samples. Our experimental findings using -diverse benchmark datasets show GAMIC outperforms simple Morgan-based ICL -retrieval methods across all tasks by up to 45%. +This paper presents a novel Natural Language Processing (NLP) framework for +enhancing medical diagnosis through the integration of advanced techniques in +data augmentation, feature extraction, and classification. The proposed +approach employs back-translation to generate diverse paraphrased datasets, +improving robustness and mitigating overfitting in classification tasks. +Leveraging Decoding-enhanced BERT with Disentangled Attention (DeBERTa) with +Dynamic Contextual Positional Gating (DCPG), the model captures fine-grained +contextual and positional relationships, dynamically adjusting the influence of +positional information based on semantic context to produce high-quality text +embeddings. 
For classification, an Attention-Based Feedforward Neural Network +(ABFNN) is utilized, effectively focusing on the most relevant features to +improve decision-making accuracy. Applied to the classification of symptoms, +clinical notes, and other medical texts, this architecture demonstrates its +ability to address the complexities of medical data. The combination of data +augmentation, contextual embedding generation, and advanced classification +mechanisms offers a robust and accurate diagnostic tool, with potential +applications in automated medical diagnosis and clinical decision support. This +method demonstrates the effectiveness of the proposed NLP framework for medical +diagnosis, achieving remarkable results with an accuracy of 99.78%, recall of +99.72%, precision of 99.79%, and an F1-score of 99.75%. These metrics not only +underscore the model's robust performance in classifying medical texts with +exceptional precision and reliability but also highlight its superiority over +existing methods, making it a highly promising tool for automated diagnostic +systems. 
-摘要:情境學習 (ICL) 有效地調整大型語言模型 (LLM),以執行分子任務,例如屬性預測和分子標題,方法是將仔細挑選的示範範例嵌入輸入提示中。這種方法避免了廣泛相關和微調的計算開銷。然而,目前針對分子任務的提示檢索方法依賴於分子特徵相似性,例如 Morgan 指紋,而無法充分捕捉全局分子和原子鍵結關係。因此,這些方法無法在推理過程中表示分子結構的完整複雜性。此外,在專業系統中提供更簡單部署需求的小到中型的 LLM,在分子 ICL 文獻中仍未得到充分探索。為了解決這些差距,我們提出了一種自我監督學習技術,GAMIC(圖形對齊分子情境學習),它將由圖形神經網路 (GNN) 表示的全局分子結構與文字標題(描述)對齊,同時透過 Morgan 指紋利用局部特徵相似性。此外,我們在檢索過程中引入了一個基於最大邊際相關性 (MMR) 的多樣性啟發法,以最佳化輸入提示示範樣本。我們使用不同的基準資料集進行的實驗結果顯示,GAMIC 在所有任務中都優於基於 Morgan 的簡單 ICL 檢索方法,最多可達 45%。 +摘要:本文提出了一個創新的自然語言處理 (NLP) 框架,透過整合資料擴充、特徵萃取和分類的進階技術來增強醫療診斷。所提出的方法採用反向翻譯來產生多樣化的同義改寫資料集,提升穩健性並減輕分類任務中的過度擬合。透過利用具有動態脈絡位置閘控 (DCPG) 的解碼增強 BERT 與去糾纏注意力 (DeBERTa),這個模型捕捉細緻的脈絡和位置關係,根據語意脈絡動態調整位置資訊的影響,以產生高品質的文字嵌入。在分類方面,利用基於注意力的前饋神經網路 (ABFNN),有效地關注最相關的特徵,以提高決策準確度。應用於症狀、臨床筆記和其他醫療文本的分類,此架構證明了其處理醫療資料複雜性的能力。資料擴充、脈絡嵌入產生和進階分類機制的結合提供了一個穩健且準確的診斷工具,在自動化醫療診斷和臨床決策支援中具有潛在應用。此方法證明了所提出的 NLP 框架在醫療診斷中的有效性,以 99.78% 的準確度、99.72% 的召回率、99.79% 的精確度和 99.75% 的 F1 分數,取得了顯著的成果。這些指標不僅強調了模型在分類醫療文本時具有卓越的精確度和可靠性,也突顯了它優於現有方法的優越性,使其成為自動化診斷系統中極具前景的工具。 -##### **Knowledge Graph-Guided Retrieval Augmented Generation** -2502.06864v1 by Xiangrong Zhu, Yuexiang Xie, Yi Liu, Yaliang Li, Wei Hu +##### **Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension** +2502.07752v2 by Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds -Retrieval-augmented generation (RAG) has emerged as a promising technology -for addressing hallucination issues in the responses generated by large -language models (LLMs). Existing studies on RAG primarily focus on applying -semantic-based approaches to retrieve isolated relevant chunks, which ignore -their intrinsic relationships. In this paper, we propose a novel Knowledge -Graph-Guided Retrieval Augmented Generation (KG$^2$RAG) framework that utilizes -knowledge graphs (KGs) to provide fact-level relationships between chunks, -improving the diversity and coherence of the retrieved results. 
Specifically, -after performing a semantic-based retrieval to provide seed chunks, KG$^2$RAG -employs a KG-guided chunk expansion process and a KG-based chunk organization -process to deliver relevant and important knowledge in well-organized -paragraphs. Extensive experiments conducted on the HotpotQA dataset and its -variants demonstrate the advantages of KG$^2$RAG compared to existing RAG-based -approaches, in terms of both response quality and retrieval quality. +Designing efficient optimizers for large language models (LLMs) with +low-memory requirements and fast convergence is an important and challenging +problem. This paper makes a step towards the systematic design of such +optimizers through the lens of structured Fisher information matrix (FIM) +approximation. We show that many state-of-the-art efficient optimizers can be +viewed as solutions to FIM approximation (under the Frobenius norm) with +specific structural assumptions. Building on these insights, we propose two +design recommendations of practical efficient optimizers for LLMs, involving +the careful selection of structural assumptions to balance generality and +efficiency, and enhancing memory efficiency of optimizers with general +structures through a novel low-rank extension framework. We demonstrate how to +use each design approach by deriving new memory-efficient optimizers: Row and +Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation +(Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate the +effectiveness, showing faster and better convergence than existing +memory-efficient baselines and Adam with little memory overhead. Notably, Alice +achieves better than 2x faster convergence over Adam, while RACS delivers +strong performance on the 1B model with SGD-like memory. 
-摘要:檢索增強生成 (RAG) 已成為一項有前途的技術,用於解決大型語言模型 (LLM) 所產生回應中的幻覺問題。現有關於 RAG 的研究主要專注於應用基於語義的方法來檢索孤立相關的區塊,而忽略它們的內在關係。在本文中,我們提出了一個新穎的知識圖表引導檢索增強生成 (KG$^2$RAG) 框架,它利用知識圖表 (KG) 來提供區塊之間的事實層級關係,從而提高檢索結果的多樣性和一致性。具體來說,在執行基於語義的檢索以提供種子區塊後,KG$^2$RAG 採用 KG 引導的區塊擴充程序和基於 KG 的區塊組織程序,以在組織良好的段落中傳達相關且重要的知識。在 HotpotQA 資料集及其變體上進行的大量實驗證明了 KG$^2$RAG 在回應品質和檢索品質方面優於現有的基於 RAG 的方法。 +摘要:設計具有低記憶體需求和快速收斂的大型語言模型 (LLM) 的高效最佳化器是一個重要且具有挑戰性的問題。本文透過結構化 Fisher 資訊矩陣 (FIM) 近似的觀點,朝著系統化設計此類最佳化器邁出了一步。我們證明許多最先進的高效最佳化器可以視為 FIM 近似(在 Frobenius 範數下)的解,並具有特定的結構假設。基於這些見解,我們提出了 LLM 的兩個實用高效最佳化器設計建議,包括仔細選擇結構假設以平衡通用性和效率,以及透過新穎的低秩擴充框架增強一般結構最佳化器的記憶體效率。我們透過推導新的記憶體高效最佳化器來展示如何使用每種設計方法:列和欄縮放 SGD (RACS) 和自適應低維子空間估計 (Alice)。在 LLaMA 預訓練(高達 1B 參數)上的實驗驗證了其有效性,顯示比現有的記憶體高效基準和 Adam 更快且更好的收斂,且記憶體開銷很小。值得注意的是,Alice 的收斂速度比 Adam 快 2 倍以上,而 RACS 則在 1B 模型上提供類似 SGD 的記憶體的強勁效能。 -##### **Can Large Language Models Understand Intermediate Representations?** -2502.06854v1 by Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang, Hongyu Zhang, Ruoming Jin, Qiang Guan +##### **The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation** +2502.07516v2 by Raman Dutt -Intermediate Representations (IRs) are essential in compiler design and -program analysis, yet their comprehension by Large Language Models (LLMs) -remains underexplored. This paper presents a pioneering empirical study to -investigate the capabilities of LLMs, including GPT-4, GPT-3, Gemma 2, LLaMA -3.1, and Code Llama, in understanding IRs. We analyze their performance across -four tasks: Control Flow Graph (CFG) reconstruction, decompilation, code -summarization, and execution reasoning. Our results indicate that while LLMs -demonstrate competence in parsing IR syntax and recognizing high-level -structures, they struggle with control flow reasoning, execution semantics, and -loop handling. 
Specifically, they often misinterpret branching instructions, -omit critical IR operations, and rely on heuristic-based reasoning, leading to -errors in CFG reconstruction, IR decompilation, and execution reasoning. The -study underscores the necessity for IR-specific enhancements in LLMs, -recommending fine-tuning on structured IR datasets and integration of explicit -control flow models to augment their comprehension and handling of IR-related -tasks. +Generative models, particularly text-to-image (T2I) diffusion models, play a +crucial role in medical image analysis. However, these models are prone to +training data memorization, posing significant risks to patient privacy. +Synthetic chest X-ray generation is one of the most common applications in +medical image analysis with the MIMIC-CXR dataset serving as the primary data +repository for this task. This study presents the first systematic attempt to +identify prompts and text tokens in MIMIC-CXR that contribute the most to +training data memorization. Our analysis reveals two unexpected findings: (1) +prompts containing traces of de-identification procedures (markers introduced +to hide Protected Health Information) are the most memorized, and (2) among all +tokens, de-identification markers contribute the most towards memorization. +This highlights a broader issue with the standard anonymization practices and +T2I synthesis with MIMIC-CXR. To exacerbate, existing inference-time +memorization mitigation strategies are ineffective and fail to sufficiently +reduce the model's reliance on memorized text tokens. On this front, we propose +actionable strategies for different stakeholders to enhance privacy and improve +the reliability of generative models in medical imaging. Finally, our results +provide a foundation for future work on developing and benchmarking +memorization mitigation techniques for synthetic chest X-ray generation using +the MIMIC-CXR dataset. 
The anonymized code is available at +https://anonymous.4open.science/r/diffusion_memorization-8011/ -摘要:中間表徵 (IR) 在編譯器設計和程式分析中至關重要,但大型語言模型 (LLM) 對其理解仍未得到充分探討。本文提出了一項開創性的實證研究,以探討 LLM(包括 GPT-4、GPT-3、Gemma 2、LLaMA 3.1 和 Code Llama)理解 IR 的能力。我們分析了它們在四項任務中的表現:控制流程圖 (CFG) 重建、反編譯、程式碼摘要和執行推理。我們的結果表明,儘管 LLM 在解析 IR 語法和識別高階結構方面表現出能力,但它們在控制流程推理、執行語義和迴圈處理方面存在困難。具體而言,它們經常誤解分支指令、省略關鍵 IR 操作,並依賴於基於啟發式的推理,導致 CFG 重建、IR 反編譯和執行推理出現錯誤。這項研究強調了 LLM 中對 IR 特定的增強的必要性,建議對結構化的 IR 資料集進行微調,並整合明確的控制流程模型,以增強其對 IR 相關任務的理解和處理。 +摘要:生成模型,尤其是文本到影像 (T2I) 擴散模型在醫學影像分析中扮演著至關重要的角色。然而,這些模型容易訓練資料記憶,對病患隱私構成重大風險。合成胸部 X 光影像生成是醫學影像分析中最常見的應用之一,而 MIMIC-CXR 資料集則作為此任務的主要資料儲存庫。本研究提出了第一個系統化的嘗試,以識別 MIMIC-CXR 中對訓練資料記憶貢獻最大的提示和文字代碼。我們的分析揭示了兩個出乎意料的發現:(1) 包含去識別程序痕跡的提示(用於隱藏受保護健康資訊的標記)是最容易被記憶的,以及 (2) 在所有代碼中,去識別標記對記憶的貢獻最大。這突顯了標準匿名化實務和使用 MIMIC-CXR 進行 T2I 合成的更廣泛問題。更糟的是,現有的推論時間記憶減緩策略無效,無法充分降低模型對記憶文字代碼的依賴。在這個方面,我們針對不同的利害關係人提出可行的策略,以增強隱私和改善生成模型在醫學影像中的可靠性。最後,我們的結果為未來開發和評量使用 MIMIC-CXR 資料集進行合成胸部 X 光影像生成的記憶減緩技術奠定了基礎。已匿名化的程式碼可在 https://anonymous.4open.science/r/diffusion_memorization-8011/ 取得。 -##### **GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?** -2502.05252v1 by Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, Beidi Chen +##### **KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level** +2502.07288v1 by Ruining Deng, Tianyuan Yao, Yucheng Tang, Junlin Guo, Siqi Lu, Juming Xiong, Lining Yu, Quan Huu Cap, Pengzhou Cai, Libin Lan, Ze Zhao, Adrian Galdran, Amit Kumar, Gunjan Deotale, Dev Kumar Das, Inyoung Paik, Joonho Lee, Geongyu Lee, Yujia Chen, Wangkai Li, Zhaoyang Li, Xuege Hou, Zeyuan Wu, Shengjin Wang, Maximilian Fischer, Lars Kramer, Anghong Du, Le Zhang, Maria Sanchez Sanchez, Helena Sanchez Ulloa, David Ribalta Heredia, Carlos Perez de Arenaza Garcia, Shuoyu Xu, Bingdou He, Xinping Cheng, Tao Wang, Noemie Moreau, Katarzyna Bozek, Shubham Innani, Ujjwal Baid, Kaura Solomon Kefas, Bennett A. 
Landman, Yu Wang, Shilin Zhao, Mengmeng Yin, Haichun Yang, Yuankai Huo -Long-context large language models (LLMs) have recently shown strong -performance in information retrieval and long-document QA. However, to tackle -the most challenging intellectual problems, LLMs must reason effectively in -long and complex contexts (e.g., frontier mathematical research). Studying how -LLMs handle increasing reasoning complexity and context length is essential, -yet existing benchmarks lack a solid basis for quantitative evaluation. -Inspired by the abstraction of GSM-8K problems as computational graphs, and the -ability to introduce noise by adding unnecessary nodes and edges, we develop a -grade school math problem generator capable of producing arithmetic problems -with infinite difficulty and context length under fine-grained control. Using -our newly synthesized GSM-Infinite benchmark, we comprehensively evaluate -existing LLMs. We find a consistent sigmoid decline in reasoning performance as -complexity increases, along with a systematic inference scaling trend: -exponentially increasing inference computation yields only linear performance -gains. These findings underscore the fundamental limitations of current -long-context LLMs and the key challenges in scaling reasoning capabilities. Our -GSM-Infinite benchmark provides a scalable and controllable testbed for -systematically studying and advancing LLM reasoning in long and complex -contexts. +Chronic kidney disease (CKD) is a major global health issue, affecting over +10% of the population and causing significant mortality. While kidney biopsy +remains the gold standard for CKD diagnosis and treatment, the lack of +comprehensive benchmarks for kidney pathology segmentation hinders progress in +the field. 
To address this, we organized the Kidney Pathology Image +Segmentation (KPIs) Challenge, introducing a dataset that incorporates +preclinical rodent models of CKD with over 10,000 annotated glomeruli from 60+ +Periodic Acid Schiff (PAS)-stained whole slide images. The challenge includes +two tasks, patch-level segmentation and whole slide image segmentation and +detection, evaluated using the Dice Similarity Coefficient (DSC) and F1-score. +By encouraging innovative segmentation methods that adapt to diverse CKD models +and tissue conditions, the KPIs Challenge aims to advance kidney pathology +analysis, establish new benchmarks, and enable precise, large-scale +quantification for disease research and diagnosis. -摘要:長文本大型語言模型 (LLM) 最近在資訊檢索和長文件問答中展示了強大的效能。然而,若要解決最具挑戰性的智力問題,LLM 必須在長且複雜的脈絡中有效推理(例如,前沿數學研究)。研究 LLM 如何處理增加的推理複雜性和脈絡長度至關重要,但現有的基準缺乏定量評估的穩固基礎。受到 GSM-8K 問題抽象化為計算圖形的啟發,以及透過加入不必要的節點和邊緣來引入雜訊的能力,我們開發了一個小學數學問題產生器,能夠在細緻的控制下產生具有無限難度和脈絡長度的算術問題。使用我們新合成的 GSM-Infinite 基準,我們全面評估現有的 LLM。我們發現推理效能會隨著複雜性的增加而持續呈 S 形下降,並伴隨著系統性的推論縮放趨勢:指數增加的推論計算僅產生線性的效能增益。這些發現強調了當前長脈絡 LLM 的基本限制,以及擴展推理能力的主要挑戰。我們的 GSM-Infinite 基準提供了一個可擴充且可控的測試平台,用於系統性地研究和提升 LLM 在長且複雜脈絡中的推理能力。 +摘要:慢性腎臟病 (CKD) 是全球主要的健康問題,影響超過 +10% 的人口,並造成顯著的死亡率。雖然腎臟活檢 +仍然是 CKD 診斷和治療的黃金標準,但缺乏 +腎臟病理學分割的全面基準阻礙了該領域的進展。 +為了解決這個問題,我們組織了腎臟病理影像 +分割 (KPIs) 挑戰,引入了包含超過 10,000 個註解的 +CKD 臨床前嚙齒動物模型的資料集,這些註解來自 60 多個 +週期性酸性雪夫 (PAS) 染色的全幻燈片影像。挑戰包括 +兩個任務,修補層級分割和全幻燈片影像分割和 +偵測,使用 Dice 相似係數 (DSC) 和 F1 分數進行評估。 +通過鼓勵創新的分割方法來適應不同的 CKD 模型 +和組織條件,KPIs 挑戰旨在推進腎臟病理 +分析,建立新的基準,並實現精確、大規模的 +疾病研究和診斷量化。 -##### **Causality can systematically address the monsters under the bench(marks)** -2502.05085v1 by Felix Leeb, Zhijing Jin, Bernhard Schölkopf +##### **Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer** +2502.07158v2 by Jiaying Lu, Stephanie R. 
Brown, Songyuan Liu, Shifan Zhao, Kejun Dong, Del Bold, Michael Fundora, Alaa Aljiffry, Alex Fedorov, Jocelyn Grunwell, Xiao Hu -Effective and reliable evaluation is essential for advancing empirical -machine learning. However, the increasing accessibility of generalist models -and the progress towards ever more complex, high-level tasks make systematic -evaluation more challenging. Benchmarks are plagued by various biases, -artifacts, or leakage, while models may behave unreliably due to poorly -explored failure modes. Haphazard treatments and inconsistent formulations of -such "monsters" can contribute to a duplication of efforts, a lack of trust in -results, and unsupported inferences. In this position paper, we argue causality -offers an ideal framework to systematically address these challenges. By making -causal assumptions in an approach explicit, we can faithfully model phenomena, -formulate testable hypotheses with explanatory power, and leverage principled -tools for analysis. To make causal model design more accessible, we identify -several useful Common Abstract Topologies (CATs) in causal graphs which help -gain insight into the reasoning abilities in large language models. Through a -series of case studies, we demonstrate how the precise yet pragmatic language -of causality clarifies the strengths and limitations of a method and inspires -new approaches for systematic progress. +Early prediction of pediatric cardiac arrest (CA) is critical for timely +intervention in high-risk intensive care settings. We introduce PedCA-FT, a +novel transformer-based framework that fuses tabular view of EHR with the +derived textual view of EHR to fully unleash the interactions of +high-dimensional risk factors and their dynamics. By employing dedicated +transformer modules for each modality view, PedCA-FT captures complex temporal +and contextual patterns to produce robust CA risk estimates. 
Evaluated on a +curated pediatric cohort from the CHOA-CICU database, our approach outperforms +ten other artificial intelligence models across five key performance metrics +and identifies clinically meaningful risk factors. These findings underscore +the potential of multimodal fusion techniques to enhance early CA detection and +improve patient care. -摘要:有效的、可靠的評估對於推進經驗機器學習至關重要。然而,一般化模型的可及性日益提高,以及朝著更複雜、更高級別任務的進展,使得系統評估更具挑戰性。基準測試受到各種偏差、人工製品或洩漏的困擾,而模型由於探索不充分的故障模式而可能表現得不可靠。隨意處理和不一致的表述等「怪物」可能會導致重複工作、對結果缺乏信任以及不支援的推論。在本文中,我們論證因果關係提供了一個系統性解決這些挑戰的理想框架。通過在方法中明確因果假設,我們可以忠實地模擬現象,制定具有解釋力的可測試假設,並利用原則性的分析工具。為了使因果模型設計更易於使用,我們在因果圖中識別出幾個有用的通用抽象拓撲 (CAT),有助於深入了解大型語言模型中的推理能力。通過一系列案例研究,我們展示了因果關係的精確但務實的語言如何釐清方法的優缺點,並激發系統進展的新方法。 +摘要:早期預測小兒心臟驟停 (CA) 對於在高風險的重症照護環境中及時介入至關重要。我們引入了 PedCA-FT,一個新穎的基於轉換器的框架,它將 EHR 的表格視圖與 EHR 的派生文本視圖融合在一起,以充分發揮高維風險因素及其動態的交互作用。通過為每個模態視圖採用專用的轉換器模組,PedCA-FT 捕獲複雜的時間和上下文模式,以產生穩健的 CA 風險估計。在 CHOA-CICU 資料庫中策劃的小兒群體中進行評估,我們的做法在五項關鍵績效指標中優於其他十種人工智慧模型,並找出臨床上有意義的風險因素。這些發現強調了多模式融合技術在增強早期 CA 檢測和改善患者照護方面的潛力。 -##### **Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures** -2502.05078v1 by Tushar Pandey, Ara Ghukasyan, Oktay Goktas, Santosh Kumar Radha +##### **Explaining 3D Computed Tomography Classifiers with Counterfactuals** +2502.07156v1 by Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari -Large Language Models (LLMs) have demonstrated impressive reasoning -capabilities, yet their performance is highly dependent on the prompting -strategy and model scale. While reinforcement learning and fine-tuning have -been deployed to boost reasoning, these approaches incur substantial -computational and data overhead. In this work, we introduce Adaptive Graph of -Thoughts (AGoT), a dynamic, graph-based inference framework that enhances LLM -reasoning solely at test time. 
Rather than relying on fixed-step methods like -Chain of Thought (CoT) or Tree of Thoughts (ToT), AGoT recursively decomposes -complex queries into structured subproblems, forming an dynamic directed -acyclic graph (DAG) of interdependent reasoning steps. By selectively expanding -only those subproblems that require further analysis, AGoT unifies the -strengths of chain, tree, and graph paradigms into a cohesive framework that -allocates computation where it is most needed. We validate our approach on -diverse benchmarks spanning multi-hop retrieval, scientific reasoning, and -mathematical problem-solving, achieving up to 46.2% improvement on scientific -reasoning tasks (GPQA) - comparable to gains achieved through computationally -intensive reinforcement learning approaches and outperforming state-of-the-art -iterative approaches. These results suggest that dynamic decomposition and -structured recursion offer a scalable, cost-effective alternative to -post-training modifications, paving the way for more robust, general-purpose -reasoning in LLMs. +Counterfactual explanations in medical imaging are critical for understanding +the predictions made by deep learning models. We extend the Latent Shift +counterfactual generation method from 2D applications to 3D computed tomography +(CT) scans. We address the challenges associated with 3D data, such as limited +training samples and high memory demands, by implementing a slice-based +approach. This method leverages a 2D encoder trained on CT slices, which are +subsequently combined to maintain 3D context. We demonstrate this technique on +two models for clinical phenotype prediction and lung segmentation. Our +approach is both memory-efficient and effective for generating interpretable +counterfactuals in high-resolution 3D medical imaging. 
-摘要:大型語言模型 (LLM) 已展現令人印象深刻的推理能力,但其效能高度依賴於提示策略和模型規模。雖然強化學習和微調已被用於提升推理,但這些方法會造成大量的運算和資料開銷。在這項工作中,我們引入了「適應性思考圖」(AGoT),一個動態的、基於圖形的推論架構,它僅在測試時就能增強 LLM 推理。AGoT 並非依賴於鏈式思考 (CoT) 或樹狀思考 (ToT) 等固定步驟方法,而是遞迴地將複雜的查詢分解成結構化的子問題,形成一個由相互依賴的推理步驟所組成的動態有向無環圖 (DAG)。透過選擇性地僅擴充那些需要進一步分析的子問題,AGoT 將鏈式、樹狀和圖形範例的優勢統一到一個緊密的架構中,將運算分配到最需要的地方。我們在跨越多重跳躍檢索、科學推理和數學問題解決等多樣基準上驗證了我們的做法,在科學推理任務 (GPQA) 上達到了高達 46.2% 的改進,這與透過運算密集的強化學習方法所獲得的增益相當,並且優於最先進的迭代方法。這些結果表明,動態分解和結構化遞迴提供了一個可擴充、具成本效益的替代方案,用於訓練後修改,為 LLM 中更強健、更通用的推理鋪平了道路。 +摘要:反事實解釋在醫學影像中對於理解深度學習模型所做的預測至關重要。我們將 Latent Shift 反事實生成方法從 2D 應用程式延伸到 3D 電腦斷層掃描 (CT) 掃描。我們透過實作基於切片的做法,來解決與 3D 資料相關的挑戰,例如受限的訓練樣本和高記憶體需求。此方法利用經過 CT 切片訓練的 2D 編碼器,隨後將這些切片結合起來以維護 3D 背景。我們在兩個用於臨床表型預測和肺部分割的模型上展示此技術。我們的做法對於在高解析度 3D 醫學影像中產生可解釋的反事實,既節省記憶體又有效。 -##### **Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics** -2502.05239v1 by Hussam Ghanem, Christophe Cruz +##### **Interactive Data Harmonization with LLM Agents** +2502.07132v1 by Aécio Santos, Eduardo H. M. Pena, Roque Lopez, Juliana Freire -Recent advancements in large language models have demonstrated significant -potential in the automated construction of knowledge graphs from unstructured -text. This paper builds upon our previous work [16], which evaluated various -models using metrics like precision, recall, F1 score, triple matching, and -graph matching, and introduces a refined approach to address the critical -issues of hallucination and omission. We propose an enhanced evaluation -framework incorporating BERTScore for graph similarity, setting a practical -threshold of 95% for graph matching. Our experiments focus on the Mistral -model, comparing its original and fine-tuned versions in zero-shot and few-shot -settings. We further extend our experiments using examples from the KELM-sub -training dataset, illustrating that the fine-tuned model significantly improves -knowledge graph construction accuracy while reducing the exact hallucination -and omission. 
However, our findings also reveal that the fine-tuned models -perform worse in generalization tasks on the KELM-sub dataset. This study -underscores the importance of comprehensive evaluation metrics in advancing the -state-of-the-art in knowledge graph construction from textual data. +Data harmonization is an essential task that entails integrating datasets +from diverse sources. Despite years of research in this area, it remains a +time-consuming and challenging task due to schema mismatches, varying +terminologies, and differences in data collection methodologies. This paper +presents the case for agentic data harmonization as a means to both empower +experts to harmonize their data and to streamline the process. We introduce +Harmonia, a system that combines LLM-based reasoning, an interactive user +interface, and a library of data harmonization primitives to automate the +synthesis of data harmonization pipelines. We demonstrate Harmonia in a +clinical data harmonization scenario, where it helps to interactively create +reusable pipelines that map datasets to a standard format. Finally, we discuss +challenges and open problems, and suggest research directions for advancing our +vision. 
-摘要:大型語言模型的最新進展已證明在從非結構化文字自動建構知識圖譜方面具有顯著的潛力。本文建立在我們先前的研究 [16] 之上,該研究使用準確度、召回率、F1 分數、三元組匹配和圖形匹配等指標評估各種模型,並引入了一種改進的方法來解決幻覺和遺漏的關鍵問題。我們提出一個增強的評估框架,結合 BERTScore 來進行圖形相似性,並將圖形匹配的實際閾值設定為 95%。我們的實驗重點在 Mistral 模型上,比較其原始版本和微調版本在零次學習和少量學習的設定中。我們進一步使用 KELM-sub 訓練資料集中的範例來擴展我們的實驗,說明微調後的模型顯著提高了知識圖譜建構的準確度,同時減少了精確的幻覺和遺漏。然而,我們的研究結果也顯示,微調後的模型在 KELM-sub 資料集上的泛化任務表現較差。這項研究強調了全面評估指標在推進從文字資料建構知識圖譜的最新技術方面的重要性。 +摘要:資料調和是一項整合不同來源資料集的重要任務。儘管多年來針對此領域的研究不斷,但由於架構不匹配、術語不同,以及資料收集方法的差異,它仍然是一項耗時且具有挑戰性的任務。本文提出代理資料調和,作為賦能專家調和其資料並簡化流程的方法。我們介紹 Harmonia,一個結合了基於 LLM 的推理、互動式使用者介面和資料調和原語庫的系統,以自動化資料調和管線的合成。我們在臨床資料調和場景中展示了 Harmonia,它有助於互動式建立可重複使用的管線,將資料集對應至標準格式。最後,我們討論挑戰和開放性問題,並建議研究方向以推進我們的願景。 -##### **Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research** -2502.04644v1 by Junde Wu, Jiayuan Zhu, Yuyuan Liu +##### **Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML** +2502.07026v1 by Mohammad Amir Salari, Bahareh Rahmani -We introduce Agentic Reasoning, a framework that enhances large language -model (LLM) reasoning by integrating external tool-using agents. Unlike -conventional LLM-based reasoning approaches, which rely solely on internal -inference, Agentic Reasoning dynamically engages web search, code execution, -and structured reasoning-context memory to solve complex problems requiring -deep research and multi-step logical deduction. Our framework introduces the -Mind Map agent, which constructs a structured knowledge graph to track logical -relationships, improving deductive reasoning. Additionally, the integration of -web-search and coding agents enables real-time retrieval and computational -analysis, enhancing reasoning accuracy and decision-making. Evaluations on -PhD-level scientific reasoning (GPQA) and domain-specific deep research tasks -demonstrate that our approach significantly outperforms existing models, -including leading retrieval-augmented generation (RAG) systems and -closed-source LLMs. 
Moreover, our results indicate that agentic reasoning -improves expert-level knowledge synthesis, test-time scalability, and -structured problem-solving. The code is at: -https://github.com/theworldofagents/Agentic-Reasoning. +Machine learning (ML) is transforming healthcare by enabling predictive +analytics, personalized treatments, and improved patient outcomes. However, +traditional ML workflows require specialized skills, infrastructure, and +resources, limiting accessibility for many healthcare professionals. This paper +explores how Google Cloud's BigQuery ML simplifies the development and +deployment of ML models using SQL, reducing technical barriers. Through a case +study on diabetes prediction using the Diabetes Health Indicators Dataset, we +evaluate three predictive models: Logistic Regression, Boosted Tree, and Deep +Neural Network (DNN). Our results demonstrate that the Boosted Tree model +achieves the highest performance, making it highly effective for diabetes +prediction. This study highlights BigQuery ML's role in democratizing machine +learning by providing a scalable, efficient, and accessible solution for +healthcare analytics. 
-摘要:我們引入了代理推理,一個透過整合外部工具使用代理來增強大型語言模型 (LLM) 推理的框架。與僅依賴於內部推論的傳統基於 LLM 的推理方法不同,代理推理動態地運用網路搜尋、程式碼執行和結構化推理情境記憶來解決需要深入研究和多步驟邏輯推論的複雜問題。我們的框架引入了心智圖代理,它建立一個結構化的知識圖譜來追蹤邏輯關係,改善演繹推理。此外,整合網路搜尋和編碼代理能進行即時擷取和運算分析,增強推理準確度和決策制定。在博士等級科學推理 (GPQA) 和特定領域的深入研究任務上的評估顯示,我們的做法明顯優於現有模型,包括領先的檢索增強生成 (RAG) 系統和封閉原始碼 LLM。此外,我們的結果顯示,代理推理改進了專家級知識綜合、測試時間可擴充性和結構化問題解決。程式碼在:https://github.com/theworldofagents/Agentic-Reasoning。 +摘要:機器學習 (ML) 透過啟用預測分析、個人化治療和改善病患結果,正在轉型醫療保健。然而,傳統的 ML 工作流程需要專業技能、基礎設施和資源,限制了許多醫療保健專業人員的可及性。本文探討 Google Cloud 的 BigQuery ML 如何使用 SQL 簡化 ML 模型的開發和部署,降低技術障礙。透過使用糖尿病健康指標資料集對糖尿病預測進行個案研究,我們評估了三個預測模型:邏輯迴歸、提升樹和深度神經網路 (DNN)。我們的結果證明,提升樹模型達到了最高的效能,使其對於糖尿病預測非常有效。這項研究強調了 BigQuery ML 在民主化機器學習中扮演的角色,提供可擴充、有效率且可存取的醫療保健分析解決方案。 -##### **Position-aware Automatic Circuit Discovery** -2502.04577v1 by Tal Haklay, Hadas Orgad, David Bau, Aaron Mueller, Yonatan Belinkov +##### **AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements** +2502.07022v1 by Adriana Eufrosiana Bora, Pierre-Luc St-Charles, Mirko Bronzi, Arsène Fansi Tchango, Bruno Rousseau, Kerrie Mengersen -A widely used strategy to discover and understand language model mechanisms -is circuit analysis. A circuit is a minimal subgraph of a model's computation -graph that executes a specific task. We identify a gap in existing circuit -discovery methods: they assume circuits are position-invariant, treating model -components as equally relevant across input positions. This limits their -ability to capture cross-positional interactions or mechanisms that vary across -positions. To address this gap, we propose two improvements to incorporate -positionality into circuits, even on tasks containing variable-length examples. -First, we extend edge attribution patching, a gradient-based method for circuit -discovery, to differentiate between token positions. 
Second, we introduce the -concept of a dataset schema, which defines token spans with similar semantics -across examples, enabling position-aware circuit discovery in datasets with -variable length examples. We additionally develop an automated pipeline for -schema generation and application using large language models. Our approach -enables fully automated discovery of position-sensitive circuits, yielding -better trade-offs between circuit size and faithfulness compared to prior work. +Despite over a decade of legislative efforts to address modern slavery in the +supply chains of large corporations, the effectiveness of government oversight +remains hampered by the challenge of scrutinizing thousands of statements +annually. While Large Language Models (LLMs) can be considered a well +established solution for the automatic analysis and summarization of documents, +recognizing concrete modern slavery countermeasures taken by companies and +differentiating those from vague claims remains a challenging task. To help +evaluate and fine-tune LLMs for the assessment of corporate statements, we +introduce a dataset composed of 5,731 modern slavery statements taken from the +Australian Modern Slavery Register and annotated at the sentence level. This +paper details the construction steps for the dataset that include the careful +design of annotation specifications, the selection and preprocessing of +statements, and the creation of high-quality annotation subsets for effective +model evaluations. To demonstrate our dataset's utility, we propose a machine +learning methodology for the detection of sentences relevant to mandatory +reporting requirements set by the Australian Modern Slavery Act. We then follow +this methodology to benchmark modern language models under zero-shot and +supervised learning settings. 
-摘要:廣泛用於發現和了解語言模型機制的策略是電路分析。電路是模型計算圖的最小子圖,可執行特定任務。我們找出電路發現方法中的一個缺口:它們假設電路與位置無關,將模型組件視為在輸入位置中同樣相關。這限制了它們捕捉跨位置互動或在不同位置中變化的機制的能力。為了解決這個缺口,我們提出兩項改進,將位置性納入電路中,即使在包含變長範例的任務中也是如此。首先,我們擴充邊緣屬性修補,一種基於梯度的電路發現方法,以區分符號位置。其次,我們引入了資料集架構的概念,它定義了在範例中具有類似語義的符號跨距,使我們可以在具有變長範例的資料集中進行與位置相關的電路發現。此外,我們開發了一個自動化管線,用於使用大型語言模型進行架構生成和應用。我們的做法能讓位置敏感電路的發現完全自動化,與先前的研究相比,在電路大小和忠實度之間產生了更好的權衡。 +摘要:儘管立法努力超過十年,旨在解決大型企業供應鏈中的現代奴隸制,但政府監督的有效性仍然受到每年審查數千份聲明的挑戰所阻礙。雖然大型語言模型(LLM)可以被認為是文件自動分析和摘要的完善解決方案,但要辨識公司採取的具體現代奴隸制對策,並將其與含糊的聲明區分開來,仍然是一項具有挑戰性的任務。為了幫助評估和微調 LLM 以評估企業聲明,我們引入了一個由 5,731 份現代奴隸制聲明組成的資料集,這些聲明取自澳洲現代奴隸制註冊處,並在句子層級進行註解。本文詳細說明了資料集的建構步驟,其中包括註解規格的仔細設計、聲明的選擇和預處理,以及用於有效模型評估的高品質註解子集的建立。為了展示我們的資料集的效用,我們提出了一種機器學習方法,用於檢測與澳洲現代奴隸制法規定的強制性報告要求相關的句子。然後,我們遵循這種方法,在零次學習和監督學習設定下對現代語言模型進行基準測試。 -##### **Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems** -2502.04510v1 by Shangbin Feng, Zifeng Wang, Palash Goyal, Yike Wang, Weijia Shi, Huang Xia, Hamid Palangi, Luke Zettlemoyer, Yulia Tsvetkov, Chen-Yu Lee, Tomas Pfister +##### **Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium** +2502.06693v1 by Amin Adibi, Xu Cao, Zongliang Ji, Jivat Neet Kaur, Winston Chen, Elizabeth Healey, Brighton Nuwagira, Wenqian Ye, Geoffrey Woollard, Maxwell A Xu, Hejie Cui, Johnny Xi, Trenton Chang, Vasiliki Bikia, Nicole Zhang, Ayush Noori, Yuan Xia, Md. Belal Hossain, Hanna A. Frank, Alina Peluso, Yuan Pu, Shannon Zejiang Shen, John Wu, Adibvafa Fallahpour, Sazan Mahbub, Ross Duncan, Yuwei Zhang, Yurui Cao, Zuheng Xu, Michael Craig, Rahul G. Krishnan, Rahmatollah Beheshti, James M. Rehg, Mohammad Ehsanul Karim, Megan Coffee, Leo Anthony Celi, Jason Alan Fries, Mohsen Sadatsafavi, Dennis Shung, Shannon McWeeney, Jessica Dafflon, Sarah Jabbour -We propose Heterogeneous Swarms, an algorithm to design multi-LLM systems by -jointly optimizing model roles and weights. 
We represent multi-LLM systems as -directed acyclic graphs (DAGs) of LLMs with topological message passing for -collaborative generation. Given a pool of LLM experts and a utility function, -Heterogeneous Swarms employs two iterative steps: role-step and weight-step. -For role-step, we interpret model roles as learning a DAG that specifies the -flow of inputs and outputs between LLMs. Starting from a swarm of random -continuous adjacency matrices, we decode them into discrete DAGs, call the LLMs -in topological order, evaluate on the utility function (e.g. accuracy on a -task), and optimize the adjacency matrices with particle swarm optimization -based on the utility score. For weight-step, we assess the contribution of -individual LLMs in the multi-LLM systems and optimize model weights with swarm -intelligence. We propose JFK-score to quantify the individual contribution of -each LLM in the best-found DAG of the role-step, then optimize model weights -with particle swarm optimization based on the JFK-score. Experiments -demonstrate that Heterogeneous Swarms outperforms 15 role- and/or weight-based -baselines by 18.5% on average across 12 tasks. Further analysis reveals that -Heterogeneous Swarms discovers multi-LLM systems with heterogeneous model roles -and substantial collaborative gains, and benefits from the diversity of -language models. +The fourth Machine Learning for Health (ML4H) symposium was held in person on +December 15th and 16th, 2024, in the traditional, ancestral, and unceded +territories of the Musqueam, Squamish, and Tsleil-Waututh Nations in Vancouver, +British Columbia, Canada. The symposium included research roundtable sessions +to foster discussions between participants and senior researchers on timely and +relevant topics for the ML4H community. The organization of the research +roundtables at the conference involved 13 senior and 27 junior chairs across 13 +tables. 
Each roundtable session included an invited senior chair (with +substantial experience in the field), junior chairs (responsible for +facilitating the discussion), and attendees from diverse backgrounds with an +interest in the session's topic. -摘要:我們提出異質群體,一種演算法,透過共同最佳化模型角色和權重來設計多 LLM 系統。我們將多 LLM 系統表示為 LLM 的有向非循環圖 (DAG),並透過拓撲訊息傳遞進行協作產生。給定一組 LLM 專家和一個效用函數,異質群體使用兩個反覆步驟:角色步驟和權重步驟。對於角色步驟,我們將模型角色解釋為學習一個 DAG,它指定 LLM 之間輸入和輸出的流動。從一組隨機連續鄰接矩陣開始,我們將它們解碼為離散 DAG,以拓撲順序呼叫 LLM,根據效用函數(例如任務的準確度)進行評估,並根據效用分數使用粒子群最佳化最佳化鄰接矩陣。對於權重步驟,我們評估個別 LLM 在多 LLM 系統中的貢獻,並使用群體智慧最佳化模型權重。我們提出 JFK 分數來量化每個 LLM 在角色步驟中找到的最佳 DAG 中的個別貢獻,然後根據 JFK 分數使用粒子群最佳化最佳化模型權重。實驗表明,異質群體在 12 項任務中平均比 15 個基於角色和/或權重的基線高出 18.5%。進一步的分析表明,異質群體發現具有異質模型角色和大量協作收益的多 LLM 系統,並受益於語言模型的多樣性。 +摘要:第四屆醫療機器學習 (ML4H) 研討會於 2024 年 12 月 15 日和 16 日在加拿大不列顛哥倫比亞省溫哥華的 Musqueam、Squamish 和 Tsleil-Waututh 國家的傳統、祖先和未割讓領土上舉行。研討會包括研究圓桌會議,以促進參與者和高級研究人員之間關於 ML4H 社群的及時和相關主題的討論。在會議上組織研究圓桌會議涉及 13 張桌子上的 13 位高級主席和 27 位初級主席。每個圓桌會議都包括一位受邀的高級主席(在該領域擁有豐富的經驗)、初級主席(負責促進討論)以及對會議主題感興趣的來自不同背景的與會者。 -##### **MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot** -2502.04413v1 by Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao +##### **Automatic Evaluation of Healthcare LLMs Beyond Question-Answering** +2502.06666v1 by Anna Arias-Duart, Pablo Agustin Martin-Torres, Daniel Hinjos, Pablo Bernabeu-Perez, Lucia Urcelay Ganzabal, Marta Gonzalez Mallo, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Sergio Alvarez-Napagao, Dario Garcia-Gasulla -Retrieval-augmented generation (RAG) is a well-suited technique for -retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a -key module of the healthcare copilot, helping reduce misdiagnosis for -healthcare practitioners and patients. However, the diagnostic accuracy and -specificity of existing heuristic-based RAG models used in the medical domain -are inadequate, particularly for diseases with similar manifestations. 
This -paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited -reasoning for the medical domain that retrieves diagnosis and treatment -recommendations based on manifestations. MedRAG systematically constructs a -comprehensive four-tier hierarchical diagnostic KG encompassing critical -diagnostic differences of various diseases. These differences are dynamically -integrated with similar EHRs retrieved from an EHR database, and reasoned -within a large language model. This process enables more accurate and specific -decision support, while also proactively providing follow-up questions to -enhance personalized medical decision-making. MedRAG is evaluated on both a -public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD) -collected from Tan Tock Seng Hospital, and its performance is compared against -various existing RAG methods. Experimental results show that, leveraging the -information integration and relational abilities of the KG, our MedRAG provides -more specific diagnostic insights and outperforms state-of-the-art models in -reducing misdiagnosis rates. Our code will be available at -https://github.com/SNOWTEAM2023/MedRAG +Current Large Language Models (LLMs) benchmarks are often based on open-ended +or close-ended QA evaluations, avoiding the requirement of human labor. +Close-ended measurements evaluate the factuality of responses but lack +expressiveness. Open-ended capture the model's capacity to produce discourse +responses but are harder to assess for correctness. These two approaches are +commonly used, either independently or together, though their relationship +remains poorly understood. This work is focused on the healthcare domain, where +both factuality and discourse matter greatly. It introduces a comprehensive, +multi-axis suite for healthcare LLM evaluation, exploring correlations between +open and close benchmarks and metrics. Findings include blind spots and +overlaps in current methodologies. 
As an updated sanity check, we release a new +medical benchmark --CareQA-- with both open and closed variants. Finally, we +propose a novel metric for open-ended evaluations -- Relaxed Perplexity -- to +mitigate the identified limitations. -摘要:檢索增強生成 (RAG) 是一種適用於檢索隱私敏感的電子健康記錄 (EHR) 的技術。它可以作為醫療保健副駕駛的一個關鍵模組,協助減少醫療保健從業人員和患者的誤診。然而,在醫療領域中使用的現有基於啟發法的 RAG 模型的診斷準確性和特異性不足,特別是對於具有類似表現的疾病。本文提出 MedRAG,一種由知識圖譜 (KG) 引發的推理增強的 RAG 模型,用於醫療領域,它根據表現檢索診斷和治療建議。MedRAG 系統性地構建了一個全面的四層階層式診斷 KG,涵蓋各種疾病的關鍵診斷差異。這些差異與從 EHR 資料庫中檢索到的類似 EHR 動態整合,並在大型語言模型中進行推理。這個過程可以實現更準確和具體的決策支援,同時主動提供後續問題,以增強個人化醫療決策制定。MedRAG 在公共資料集 DDXPlus 和從陳篤生醫院收集的私人慢性疼痛診斷資料集 (CPDD) 上進行評估,並將其效能與各種現有 RAG 方法進行比較。實驗結果顯示,利用 KG 的資訊整合和關係能力,我們的 MedRAG 提供了更具體的診斷見解,並在降低誤診率方面優於最先進的模型。我們的程式碼將在 https://github.com/SNOWTEAM2023/MedRAG 提供 +摘要:當前大型語言模型 (LLM) 基準通常基於開放式或封閉式問答評量,避免了人力需求。封閉式測量評估回應的事實性,但缺乏表達力。開放式測量捕捉模型產生論述回應的能力,但較難評估正確性。這兩種方法通常獨立或合併使用,儘管它們之間的關係仍然知之甚少。這項工作專注於醫療保健領域,在該領域中,事實性和論述都非常重要。它引入了一個全面的多軸套件,用於醫療保健 LLM 評量,探索開放式和封閉式基準和指標之間的關聯性。研究結果包括當前方法中的盲點和重疊。作為更新的健全性檢查,我們發布了一個新的醫療基準--CareQA--,包含開放式和封閉式變體。最後,我們提出了一個用於開放式評量的全新指標--放鬆困惑度--以減輕已識別的限制。 -##### **Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering** -2502.03992v1 by Longquan Jiang, Junbo Huang, Cedric Möller, Ricardo Usbeck +##### **Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging** +2502.06632v1 by Mohammed Abdul Hafeez Khan, Samuel Morries Boddepalli, Siddhartha Bhattacharyya, Debasis Mitra -Most existing Knowledge Graph Question Answering (KGQA) approaches are -designed for a specific KG, such as Wikidata, DBpedia or Freebase. Due to the -heterogeneity of the underlying graph schema, topology and assertions, most -KGQA systems cannot be transferred to unseen Knowledge Graphs (KGs) without -resource-intensive training data. We present OntoSCPrompt, a novel Large -Language Model (LLM)-based KGQA approach with a two-stage architecture that -separates semantic parsing from KG-dependent interactions. 
OntoSCPrompt first -generates a SPARQL query structure (including SPARQL keywords such as SELECT, -ASK, WHERE and placeholders for missing tokens) and then fills them with -KG-specific information. To enhance the understanding of the underlying KG, we -present an ontology-guided, hybrid prompt learning strategy that integrates KG -ontology into the learning process of hybrid prompts (e.g., discrete and -continuous vectors). We also present several task-specific decoding strategies -to ensure the correctness and executability of generated SPARQL queries in both -stages. Experimental results demonstrate that OntoSCPrompt performs as well as -SOTA approaches without retraining on a number of KGQA datasets such as CWQ, -WebQSP and LC-QuAD 1.0 in a resource-efficient manner and can generalize well -to unseen domain-specific KGs like DBLP-QuAD and CoyPu KG Code: -\href{https://github.com/LongquanJiang/OntoSCPrompt}{https://github.com/LongquanJiang/OntoSCPrompt} +Accurate classification and anatomical localization are essential for +effective medical diagnostics and research, which may be efficiently performed +using deep learning techniques. However, availability of limited labeled data +poses a significant challenge. To address this, we adapted Prototypical +Networks and the Propagation-Reconstruction Network (PRNet) for few-shot +classification and localization, respectively, in Single Photon Emission +Computed Tomography (SPECT) images. For the proof of concept we used a +2D-sliced image cropped around heart. The Prototypical Network, with a +pre-trained ResNet-18 backbone, classified ventricles, myocardium, and liver +tissues with 96.67% training and 93.33% validation accuracy. PRNet, adapted for +2D imaging with an encoder-decoder architecture and skip connections, achieved +a training loss of 1.395, accurately reconstructing patches and capturing +spatial relationships. 
These results highlight the potential of Prototypical +Networks for tissue classification with limited labeled data and PRNet for +anatomical landmark localization, paving the way for improved performance in +deep learning frameworks. -摘要:現有的知識圖譜問答(KGQA)方法大多是為特定 KG 而設計的,例如 Wikidata、DBpedia 或 Freebase。由於底層圖形模式、拓撲和斷言的異質性,大多數 KGQA 系統無法在沒有資源密集型訓練資料的情況下轉移到未見過的知識圖譜(KG)。我們提出 OntoSCPrompt,這是一種基於大型語言模型(LLM)的新型 KGQA 方法,採用兩階段架構,將語義解析與依賴 KG 的互動分開。OntoSCPrompt 首先生成 SPARQL 查詢結構(包括 SPARQL 關鍵字,例如 SELECT、ASK、WHERE 和缺失令牌的佔位符),然後用 KG 特定的資訊填寫它們。為了增強對底層 KG 的理解,我們提出了一種由本体指導的混合提示學習策略,將 KG 本体整合到混合提示(例如,離散和連續向量)的學習過程中。我們還提出了多種特定任務的解碼策略,以確保在兩個階段中生成的 SPARQL 查詢的正確性和可執行性。實驗結果表明,OntoSCPrompt 在 CWQ、WebQSP 和 LC-QuAD 1.0 等多個 KGQA 資料集上執行時,效能與 SOTA 方法一樣好,且資源使用效率高,並且可以很好地概括到未見過的特定領域 KG,例如 DBLP-QuAD 和 CoyPu KG Code: -\href{https://github.com/LongquanJiang/OntoSCPrompt}{https://github.com/LongquanJiang/OntoSCPrompt} +摘要:精確的分類和解剖定位對於有效的醫療診斷和研究至關重要,而這可以使用深度學習技術有效執行。然而,標記資料有限的取得會造成重大的挑戰。為了解決這個問題,我們分別調整了原型網路和傳播重建網路 (PRNet),用於單光子發射電腦斷層掃描 (SPECT) 影像中的少量分類和定位。為了證明這個概念,我們使用圍繞心臟裁切的 2D 切片影像。原型網路,使用預先訓練的 ResNet-18 主幹,對心室、心肌和肝臟組織進行分類,訓練準確度為 96.67%,驗證準確度為 93.33%。PRNet,調整為使用編碼器解碼器架構和跳躍連接的 2D 影像,達到了 1.395 的訓練損失,精確地重建了區塊並擷取了空間關係。這些結果突出了原型網路在標記資料有限的情況下進行組織分類的潛力,以及 PRNet 在解剖標誌定位方面的潛力,為深度學習架構中效能的提升鋪平了道路。 -##### **Multimodal Medical Code Tokenizer** -2502.04397v2 by Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik +##### **Illegal Waste Detection in Remote Sensing Images: A Case Study** +2502.06607v2 by Federico Gibellini, Piero Fraternali, Giacomo Boracchi, Luca Morandini, Andrea Diecidue, Simona Malegori -Foundation models trained on patient electronic health records (EHRs) require -tokenizing medical data into sequences of discrete vocabulary items. Existing -tokenizers treat medical codes from EHRs as isolated textual tokens. 
However, -each medical code is defined by its textual description, its position in -ontological hierarchies, and its relationships to other codes, such as disease -co-occurrences and drug-treatment associations. Medical vocabularies contain -more than 600,000 codes with critical information for clinical reasoning. We -introduce MedTok, a multimodal medical code tokenizer that uses the text -descriptions and relational context of codes. MedTok processes text using a -language model encoder and encodes the relational structure with a graph -encoder. It then quantizes both modalities into a unified token space, -preserving modality-specific and cross-modality information. We integrate -MedTok into five EHR models and evaluate it on operational and clinical tasks -across in-patient and out-patient datasets, including outcome prediction, -diagnosis classification, drug recommendation, and risk stratification. -Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR -models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with -the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate -using MedTok tokenizer with medical QA systems. Our results demonstrate the -potential of MedTok as a unified tokenizer for medical codes, improving -tokenization for medical foundation models. +Environmental crime currently represents the third largest criminal activity +worldwide while threatening ecosystems as well as human health. Among the +crimes related to this activity, improper waste management can nowadays be +countered more easily thanks to the increasing availability and decreasing cost +of Very-High-Resolution Remote Sensing images, which enable semi-automatic +territory scanning in search of illegal landfills. This paper proposes a +pipeline, developed in collaboration with professionals from a local +environmental agency, for detecting candidate illegal dumping sites leveraging +a classifier of Remote Sensing images. 
To identify the best configuration for +such classifier, an extensive set of experiments was conducted and the impact +of diverse image characteristics and training settings was thoroughly analyzed. +The local environmental agency was then involved in an experimental exercise +where outputs from the developed classifier were integrated in the experts' +everyday work, resulting in time savings with respect to manual +photo-interpretation. The classifier was eventually run with valuable results +on a location outside of the training area, highlighting potential for +cross-border applicability of the proposed pipeline. -摘要:在患者电子健康记录 (EHR) 上训练的基础模型需要将医学数据标记为离散词汇项序列。现有的标记器将 EHR 中的医学代码视为孤立的文本标记。然而,每个医学代码都由其文本描述、在本体层次结构中的位置以及与其他代码的关系(例如疾病共现和药物治疗关联)来定义。医学词汇表包含超过 600,000 个代码,这些代码包含临床推理的关键信息。我们引入了 MedTok,这是一种多模态医学代码标记器,它使用文本描述和代码的关系上下文。MedTok 使用语言模型编码器处理文本,并使用图编码器对关系结构进行编码。然后,它将这两种模态量化为一个统一的标记空间,保留特定于模态和跨模态的信息。我们将 MedTok 集成到五个 EHR 模型中,并在住院和门诊数据集(包括结果预测、诊断分类、药物推荐和风险分层)上对其实施操作和临床任务进行评估。用 MedTok 替换标准 EHR 标记器可提高所有 EHR 模型的 AUPRC,在 MIMIC-III 上提高 4.10%,在 MIMIC-IV 上提高 4.78%,在 EHRShot 上提高 11.30%,其中药物推荐的增益最大。除了 EHR 建模之外,我们还演示了将 MedTok 标记器与医学问答系统结合使用。我们的结果证明了 MedTok 作为医学代码的统一标记器的潜力,改进了医学基础模型的标记化。 +摘要:環境犯罪目前是全球第三大犯罪活動,威脅生態系統和人類健康。在與此活動相關的犯罪中,不當廢物管理現在可以更容易地得到解決,這要歸功於超高解析度遙測影像越來越普及且成本下降,這使得半自動領土掃描能夠搜尋非法垃圾掩埋場。本文提出了一條管道,與當地環境機構的專業人士合作開發,用於檢測候選非法傾倒地點,利用遙測影像分類器。為了找出這種分類器的最佳配置,進行了一系列廣泛的實驗,並徹底分析了不同影像特徵和訓練設定的影響。然後,當地環境機構參與了一項實驗練習,其中將已開發分類器的輸出整合到專家的日常工作中,從而節省了人工照片解譯的時間。最後在訓練區域外的某個位置執行分類器,獲得了有價值的結果,突出了所提出管道的跨境適用性潛力。 -##### **Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents** -2502.04392v1 by Chenyang Shao, Xinyuan Hu, Yutang Lin, Fengli Xu +##### **FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model** +2502.06438v1 by Anna Tegon, Thorir Mar Ingolfsson, Xiaying Wang, Luca Benini, Yawei Li -The rapid expansion of web content has made on-device AI assistants -indispensable for helping users manage the increasing 
complexity of online -tasks. The emergent reasoning ability in large language models offer a -promising path for next-generation on-device AI agents. However, deploying -full-scale Large Language Models (LLMs) on resource-limited local devices is -challenging. In this paper, we propose Division-of-Thoughts (DoT), a -collaborative reasoning framework leveraging the synergy between locally -deployed Smaller-scale Language Models (SLMs) and cloud-based LLMs. DoT -leverages a Task Decomposer to elicit the inherent planning abilities in -language models to decompose user queries into smaller sub-tasks, which allows -hybrid language models to fully exploit their respective strengths. Besides, -DoT employs a Task Scheduler to analyze the pair-wise dependency of sub-tasks -and create a dependency graph, facilitating parallel reasoning of sub-tasks and -the identification of key steps. To allocate the appropriate model based on the -difficulty of sub-tasks, DoT leverages a Plug-and-Play Adapter, which is an -additional task head attached to the SLM that does not alter the SLM's -parameters. To boost adapter's task allocation capability, we propose a -self-reinforced training method that relies solely on task execution feedback. -Extensive experiments on various benchmarks demonstrate that our DoT -significantly reduces LLM costs while maintaining competitive reasoning -accuracy. Specifically, DoT reduces the average reasoning time and API costs by -66.12% and 83.57%, while achieving comparable reasoning accuracy with the best -baseline methods. +Accurate and efficient electroencephalography (EEG) analysis is essential for +detecting seizures and artifacts in long-term monitoring, with applications +spanning hospital diagnostics to wearable health devices. Robust EEG analytics +have the potential to greatly improve patient care. 
However, traditional deep +learning models, especially Transformer-based architectures, are hindered by +their quadratic time and memory complexity, making them less suitable for +resource-constrained environments. To address these challenges, we present +FEMBA (Foundational EEG Mamba + Bidirectional Architecture), a novel +self-supervised framework that establishes new efficiency benchmarks for EEG +analysis through bidirectional state-space modeling. Unlike Transformer-based +models, which incur quadratic time and memory complexity, FEMBA scales linearly +with sequence length, enabling more scalable and efficient processing of +extended EEG recordings. Trained on over 21,000 hours of unlabeled EEG and +fine-tuned on three downstream tasks, FEMBA achieves competitive performance in +comparison with transformer models, with significantly lower computational +cost. Specifically, it reaches 81.82% balanced accuracy (0.8921 AUROC) on TUAB +and 0.949 AUROC on TUAR, while a tiny 7.8M-parameter variant demonstrates +viability for resource-constrained devices. These results pave the way for +scalable, general-purpose EEG analytics in both clinical and highlight FEMBA as +a promising candidate for wearable applications. 
-摘要:網頁內容快速擴充,使得行動裝置上的 AI 助理在協助使用者管理日益複雜的線上工作上變得不可或缺。大型語言模型中浮現的推理能力為新一代行動裝置上的 AI 代理提供了一條有希望的途徑。然而,在資源有限的本機裝置上部署全規模的大型語言模型 (LLM) 是一項挑戰。在本文中,我們提出了思想分工 (DoT),一個協作推理框架,利用了本地部署的小型語言模型 (SLM) 與雲端 LLM 之間的協同效應。DoT 利用任務分解器引出語言模型中固有的規劃能力,將使用者查詢分解成較小的子任務,這允許混合語言模型充分發揮其各自的優勢。此外,DoT 雇用了一個任務排程器來分析子任務的成對依賴性並建立一個依賴性圖,促進子任務的並行推理和關鍵步驟的識別。為了根據子任務的難度分配適當的模型,DoT 利用了即插即用適配器,這是一個附加在 SLM 上的任務頭,不會改變 SLM 的參數。為了提升適配器的任務分配能力,我們提出了一種自我強化訓練方法,它僅依賴於任務執行回饋。在各種基準上的廣泛實驗表明,我們的 DoT 大幅降低了 LLM 成本,同時維持了有競爭力的推理準確度。具體來說,DoT 將平均推理時間和 API 成本分別降低了 66.12% 和 83.57%,同時達到了與最佳基準方法相當的推理準確度。 +摘要:準確且有效的腦電圖 (EEG) 分析對於偵測長時間監控中的癲癇發作和偽像至關重要,其應用範圍涵蓋醫院診斷到可穿戴式健康裝置。穩健的 EEG 分析具有大幅改善病患照護的潛力。然而,傳統深度學習模型,特別是基於 Transformer 的架構,受到其二次時間和記憶體複雜度的阻礙,使其不太適合資源受限的環境。為了應對這些挑戰,我們提出 FEMBA (基礎 EEG Mamba + 雙向架構),一種創新的自我監督架構,透過雙向狀態空間建模為 EEG 分析建立新的效率基準。與會產生二次時間和記憶體複雜度的基於 Transformer 的模型不同,FEMBA 隨著序列長度線性縮放,支援更具可擴充性和效率的延伸 EEG 記錄處理。FEMBA 在超過 21,000 小時的未標記 EEG 上訓練並在三個下游任務上進行微調,與Transformer模型相比,在計算成本顯著降低的情況下,實現了具有競爭力的效能。具體來說,它在 TUAB 上達到 81.82% 的平衡準確度 (0.8921 AUROC) 和在 TUAR 上達到 0.949 AUROC,而一個微小的 7.8M 參數變體證明了其在資源受限裝置上的可行性。這些結果為臨床和可穿戴應用中可擴充的通用 EEG 分析鋪平了道路,並突顯 FEMBA 是可穿戴應用中一個有前景的候選者。 -##### **Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models** -2502.03715v1 by Rui Cai, Chao Wang, Qianyi Cai, Dazhong Shen, Hui Xiong +##### **Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?** +2502.06289v1 by Qingshan Hou, Yukun Zhou, Jocelyn Hui Lin Goh, Ke Zou, Samantha Min Er Yew, Sahana Srinivasan, Meng Wang, Thaddaeus Lo, Xiaofeng Lei, Siegfried K. Wagner, Mark A. Chia, Dawei Yang, Hongyang Jiang, AnRan Ran, Rui Santos, Gabor Mark Somfai, Juan Helen Zhou, Haoyu Chen, Qingyu Chen, Carol Yim-Lui Cheung, Pearse A. Keane, Yih Chung Tham -Knowledge Graph-based recommendations have gained significant attention due -to their ability to leverage rich semantic relationships. 
However, constructing -and maintaining Knowledge Graphs (KGs) is resource-intensive, and the accuracy -of KGs can suffer from noisy, outdated, or irrelevant triplets. Recent -advancements in Large Language Models (LLMs) offer a promising way to improve -the quality and relevance of KGs for recommendation tasks. Despite this, -integrating LLMs into KG-based systems presents challenges, such as efficiently -augmenting KGs, addressing hallucinations, and developing effective joint -learning methods. In this paper, we propose the Confidence-aware KG-based -Recommendation Framework with LLM Augmentation (CKG-LLMA), a novel framework -that combines KGs and LLMs for recommendation task. The framework includes: (1) -an LLM-based subgraph augmenter for enriching KGs with high-quality -information, (2) a confidence-aware message propagation mechanism to filter -noisy triplets, and (3) a dual-view contrastive learning method to integrate -user-item interactions and KG data. Additionally, we employ a confidence-aware -explanation generation process to guide LLMs in producing realistic -explanations for recommendations. Finally, extensive experiments demonstrate -the effectiveness of CKG-LLMA across multiple public datasets. +The advent of foundation models (FMs) is transforming medical domain. In +ophthalmology, RETFound, a retina-specific FM pre-trained sequentially on 1.4 +million natural images and 1.6 million retinal images, has demonstrated high +adaptability across clinical applications. Conversely, DINOv2, a +general-purpose vision FM pre-trained on 142 million natural images, has shown +promise in non-medical domains. However, its applicability to clinical tasks +remains underexplored. 
To address this, we conducted head-to-head evaluations +by fine-tuning RETFound and three DINOv2 models (large, base, small) for ocular +disease detection and systemic disease prediction tasks, across eight +standardized open-source ocular datasets, as well as the Moorfields AlzEye and +the UK Biobank datasets. DINOv2-large model outperformed RETFound in detecting +diabetic retinopathy (AUROC=0.850-0.952 vs 0.823-0.944, across three datasets, +all P<=0.007) and multi-class eye diseases (AUROC=0.892 vs. 0.846, P<0.001). In +glaucoma, DINOv2-base model outperformed RETFound (AUROC=0.958 vs 0.940, +P<0.001). Conversely, RETFound achieved superior performance over all DINOv2 +models in predicting heart failure, myocardial infarction, and ischaemic stroke +(AUROC=0.732-0.796 vs 0.663-0.771, all P<0.001). These trends persisted even +with 10% of the fine-tuning data. These findings showcase the distinct +scenarios where general-purpose and domain-specific FMs excel, highlighting the +importance of aligning FM selection with task-specific requirements to optimise +clinical performance. 
-摘要:基於知識圖譜的推薦因其利用豐富語義關係的能力而備受關注。然而,構建和維護知識圖譜 (KG) 是一項資源密集型任務,而 KG 的準確性可能會受到雜訊、過時或無關的三元組的影響。大型語言模型 (LLM) 的最新進展為提高 KG 在推薦任務中的品質和相關性提供了一種有前途的方法。儘管如此,將 LLM 整合到基於 KG 的系統中會帶來挑戰,例如有效擴充 KG、處理幻覺,以及開發有效的聯合學習方法。在本文中,我們提出具有 LLM 擴充的信心感知型基於 KG 的推薦框架 (CKG-LLMA),這是一個結合 KG 和 LLM 進行推薦任務的新穎框架。該框架包括:(1) 一個基於 LLM 的子圖擴充器,用於使用高品質資訊豐富 KG,(2) 一個信心感知型訊息傳播機制,用於過濾雜訊三元組,以及 (3) 一個雙視圖對比學習方法,用於整合使用者-項目互動和 KG 資料。此外,我們採用一個信心感知型解釋產生程序,以引導 LLM 為推薦產生逼真的解釋。最後,大量的實驗證明了 CKG-LLMA 在多個公開資料集中的有效性。 +摘要:基礎模型 (FM) 的出現正在轉變醫療領域。在眼科,RETFound 是一個視網膜專用 FM,依序使用 140 萬張自然影像和 160 萬張視網膜影像進行預訓練,已展現出高度適應性,可應用於各種臨床應用。相反地,DINOv2 是一個通用視覺 FM,使用 1.42 億張自然影像進行預訓練,已展現出在非醫療領域的潛力。然而,其在臨床任務中的適用性仍未被充分探索。為了解決這個問題,我們針對眼部疾病偵測和全身性疾病預測任務,對 RETFound 和三個 DINOv2 模型(大型、基礎、小型)進行微調,並進行一對一的評估,使用八個標準化的開源眼科資料集,以及 Moorfields AlzEye 和 UK Biobank 資料集。DINOv2 大型模型在糖尿病視網膜病變偵測方面優於 RETFound(三個資料集的 AUROC=0.850-0.952,相較於 0.823-0.944,所有 P<=0.007)和多類眼部疾病(AUROC=0.892,相較於 0.846,P<0.001)。在青光眼方面,DINOv2 基礎模型優於 RETFound(AUROC=0.958,相較於 0.940,P<0.001)。相反地,RETFound 在預測心臟衰竭、心肌梗塞和缺血性中風方面優於所有 DINOv2 模型(AUROC=0.732-0.796,相較於 0.663-0.771,所有 P<0.001)。即使使用 10% 的微調資料,這些趨勢仍然持續。這些發現展示了通用和領域專用 FM 各自擅長的場景,突顯了根據任務特定需求調整 FM 選擇,以最佳化臨床表現的重要性。 -##### **A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)** -2502.03450v1 by Yiye Chen, Harpreet Sawhney, Nicholas Gydé, Yanan Jian, Jack Saunders, Patricio Vela, Ben Lundell +##### **Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning** +2502.06134v1 by Liuqing Chen, Shuhong Xiao, Shixian Ding, Shanhai Hu, Lingyun Sun -Scene graphs have emerged as a structured and serializable environment -representation for grounded spatial reasoning with Large Language Models -(LLMs). In this work, we propose SG-RwR, a Schema-Guided Retrieve-while-Reason -framework for reasoning and planning with scene graphs. 
Our approach employs -two cooperative, code-writing LLM agents: a (1) Reasoner for task planning and -information queries generation, and a (2) Retriever for extracting -corresponding graph information following the queries. Two agents collaborate -iteratively, enabling sequential reasoning and adaptive attention to graph -information. Unlike prior works, both agents are prompted only with the scene -graph schema rather than the full graph data, which reduces the hallucination -by limiting input tokens, and drives the Reasoner to generate reasoning trace -abstractly.Following the trace, the Retriever programmatically query the scene -graph data based on the schema understanding, allowing dynamic and global -attention on the graph that enhances alignment between reasoning and retrieval. -Through experiments in multiple simulation environments, we show that our -framework surpasses existing LLM-based approaches in numerical Q\&A and -planning tasks, and can benefit from task-level few-shot examples, even in the -absence of agent-level demonstrations. Project code will be released. +Medical time series are often irregular and face significant missingness, +posing challenges for data analysis and clinical decision-making. Existing +methods typically adopt a single modeling perspective, either treating series +data as sequences or transforming them into image representations for further +classification. In this paper, we propose a joint learning framework that +incorporates both sequence and image representations. We also design three +self-supervised learning strategies to facilitate the fusion of sequence and +image representations, capturing a more generalizable joint representation. The +results indicate that our approach outperforms seven other state-of-the-art +models in three representative real-world clinical datasets. 
We further +validate our approach by simulating two major types of real-world missingness +through leave-sensors-out and leave-samples-out techniques. The results +demonstrate that our approach is more robust and significantly surpasses other +baselines in terms of classification performance. -摘要:場景圖表已成為大型語言模型 (LLM) 以基礎空間推理為基礎的結構化且可序列化的環境表徵。在這項工作中,我們提出 SG-RwR,一個以綱要為導向的檢索與推理框架,用於場景圖表的推理和規劃。我們的做法採用了兩個協作的、編寫程式碼的 LLM 代理:一個 (1) 推論器,用於任務規劃和資訊查詢產生,以及一個 (2) 檢索器,用於根據查詢提取對應的圖形資訊。兩個代理反覆合作,實現對圖形資訊的順序推理和適應性關注。與先前的作品不同,兩個代理僅提示場景圖表綱要,而不是完整的圖形資料,這透過限制輸入代碼減少了幻覺,並驅使推論器抽象地產生推理軌跡。根據軌跡,檢索器根據綱要理解以程式化方式查詢場景圖形資料,允許對圖形進行動態和整體關注,增強推理和檢索之間的一致性。透過在多個模擬環境中的實驗,我們表明我們的框架在數值問答和規劃任務中超越了現有的基於 LLM 的方法,並且可以受益於任務級別的少次範例,即使在沒有代理級別示範的情況下也是如此。專案程式碼將會釋出。 +摘要:醫療時間序列通常不規則且會面臨顯著的缺失,對資料分析和臨床決策制定構成挑戰。現有方法通常採用單一建模觀點,將序列資料視為序列或將其轉換為影像表示以進行進一步分類。在本文中,我們提出了一個聯合學習架構,結合序列和影像表示。我們還設計了三種自我監督學習策略,以促進序列和影像表示的融合,捕捉更具概括性的聯合表示。結果表明,我們的做法在三個具有代表性的真實世界臨床資料集中優於其他七個最先進的模型。我們進一步通過留出感測器和留出樣本的技術模擬兩種主要的真實世界缺失類型來驗證我們的做法。結果表明,我們的做法更強大,並且在分類效能方面顯著優於其他基準。 -##### **SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs** -2502.03283v2 by Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin +##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation** +2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek -Recent advancements have highlighted that Large Language Models (LLMs) are -prone to hallucinations when solving complex reasoning problems, leading to -erroneous results. To tackle this issue, researchers incorporate Knowledge -Graphs (KGs) to improve the reasoning ability of LLMs. 
However, existing -methods face two limitations: 1) they typically assume that all answers to the -questions are contained in KGs, neglecting the incompleteness issue of KGs, and -2) they treat the KG as a static repository and overlook the implicit logical -reasoning structures inherent in KGs. In this paper, we introduce SymAgent, an -innovative neural-symbolic agent framework that achieves collaborative -augmentation between KGs and LLMs. We conceptualize KGs as dynamic environments -and transform complex reasoning tasks into a multi-step interactive process, -enabling KGs to participate deeply in the reasoning process. SymAgent consists -of two modules: Agent-Planner and Agent-Executor. The Agent-Planner leverages -LLM's inductive reasoning capability to extract symbolic rules from KGs, -guiding efficient question decomposition. The Agent-Executor autonomously -invokes predefined action tools to integrate information from KGs and external -documents, addressing the issues of KG incompleteness. Furthermore, we design a -self-learning framework comprising online exploration and offline iterative -policy updating phases, enabling the agent to automatically synthesize -reasoning trajectories and improve performance. Experimental results -demonstrate that SymAgent with weak LLM backbones (i.e., 7B series) yields -better or comparable performance compared to various strong baselines. Further -analysis reveals that our agent can identify missing triples, facilitating -automatic KG updates. +We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS), +an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS +predicts future PHTs using transformer-based architectures. The Adaptive Risk +Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk +probabilities for clinician-defined critical events. 
ARES incorporates a +personalized explainability module that identifies key clinical factors +influencing risk estimates for individual patients. ARES was evaluated on the +MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its +performance against traditional early warning systems and machine learning +models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs, +with 60% including hospital admissions. The dataset contained over 357 million +tokens. ETHOS outperformed benchmark models in predicting hospital admissions, +ICU admissions, and prolonged hospital stays, achieving superior AUC scores. +ETHOS-based risk estimates demonstrated robustness across demographic subgroups +with strong model reliability, confirmed via calibration curves. The +personalized explainability module provides insights into patient-specific +factors contributing to risk. ARES, powered by ETHOS, advances predictive +healthcare AI by providing dynamic, real-time, and personalized risk estimation +with patient-specific explainability to enhance clinician trust. Its +adaptability and superior accuracy position it as a transformative tool for +clinical decision-making, potentially improving patient outcomes and resource +allocation in emergency and inpatient settings. We release the full code at +github.com/ipolharvard/ethos-ares to facilitate future research. 
-摘要:最近的進展強調出,大型語言模型 (LLM) 在解決複雜推理問題時容易出現幻覺,導致錯誤的結果。為了解決這個問題,研究人員結合知識圖譜 (KG) 來改善 LLM 的推理能力。然而,現有方法面臨兩個限制:1) 它們通常假設問題的所有答案都包含在 KG 中,忽略了 KG 的不完整性問題,以及 2) 它們將 KG 視為一個靜態儲存庫,而忽略了 KG 中固有的隱式邏輯推理結構。在本文中,我們介紹了 SymAgent,一個創新的神經符號代理架構,它在 KG 和 LLM 之間實現了協作擴充。我們將 KG 概念化為動態環境,並將複雜的推理任務轉化為一個多步驟的互動過程,使 KG 能夠深入參與推理過程。SymAgent 包含兩個模組:代理規劃器和代理執行器。代理規劃器利用 LLM 的歸納推理能力從 KG 中提取符號規則,指導有效的問題分解。代理執行器自主地調用預定義的動作工具來整合來自 KG 和外部文件的資訊,解決 KG 不完整性的問題。此外,我們設計了一個自學習框架,包括線上探索和離線反覆的政策更新階段,使代理能夠自動合成推理軌跡並改善效能。實驗結果表明,具有弱 LLM 主幹的 SymAgent(例如,7B 系列)與各種強大的基線相比,產生了更好或相當的效能。進一步的分析表明,我們的代理可以識別遺失的三元組,促進自動 KG 更新。 +摘要:我們開發了增強型健康結果模擬轉換器 (ETHOS), +一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS +使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組,可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估,並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT,其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型,並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性,並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估,以及患者特定的可解釋性來增強臨床醫生的信任,從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具,有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼,以利未來的研究。 -##### **Analyze Feature Flow to Enhance Interpretation and Steering in Language Models** -2502.03032v2 by Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov +##### **Can ChatGPT Diagnose Alzheimer's Disease?** +2502.06907v1 by Quoc-Toan Nguyen, Linh Le, Xuan-The Tran, Thomas Do, Chin-Teng Lin -We introduce a new approach to systematically map features discovered by -sparse autoencoder across consecutive layers of large language models, -extending earlier work that examined inter-layer feature links. By using a -data-free cosine similarity technique, we trace how specific features persist, -transform, or first appear at each stage. 
This method yields granular flow -graphs of feature evolution, enabling fine-grained interpretability and -mechanistic insights into model computations. Crucially, we demonstrate how -these cross-layer feature maps facilitate direct steering of model behavior by -amplifying or suppressing chosen features, achieving targeted thematic control -in text generation. Together, our findings highlight the utility of a causal, -cross-layer interpretability framework that not only clarifies how features -develop through forward passes but also provides new means for transparent -manipulation of large language models. +Can ChatGPT diagnose Alzheimer's Disease (AD)? AD is a devastating +neurodegenerative condition that affects approximately 1 in 9 individuals aged +65 and older, profoundly impairing memory and cognitive function. This paper +utilises 9300 electronic health records (EHRs) with data from Magnetic +Resonance Imaging (MRI) and cognitive tests to address an intriguing question: +As a general-purpose task solver, can ChatGPT accurately detect AD using EHRs? +We present an in-depth evaluation of ChatGPT using a black-box approach with +zero-shot and multi-shot methods. This study unlocks ChatGPT's capability to +analyse MRI and cognitive test results, as well as its potential as a +diagnostic tool for AD. By automating aspects of the diagnostic process, this +research opens a transformative approach for the healthcare system, +particularly in addressing disparities in resource-limited regions where AD +specialists are scarce. Hence, it offers a foundation for a promising method +for early detection, supporting individuals with timely interventions, which is +paramount for Quality of Life (QoL). 
-摘要:我們提出了一種新方法,用於系統性地繪製大型語言模型連續層中稀疏自動編碼器發現的功能,擴展了先前研究層間特徵連結的工作。透過使用無資料餘弦相似性技術,我們追蹤特定特徵在每個階段如何持續、轉換或首次出現。此方法產生了特徵演化的細粒度流程圖,實現了細粒度的可解釋性和對模型運算的機制見解。至關重要的是,我們展示了這些跨層特徵圖如何透過放大或抑制所選特徵來促進模型行為的直接引導,在文字生成中實現目標主題控制。我們的研究結果共同突出了因果、跨層可解釋性框架的效用,不僅闡明了特徵如何透過前向傳遞發展,還提供了新的方法來透明地操作大型語言模型。 +摘要:ChatGPT 能否診斷出阿茲海默症 (AD)?AD 是一種毀滅性的神經退化性疾病,影響約 1/9 的 65 歲及以上人士,嚴重損害記憶力和認知功能。這篇論文利用了 9300 份電子健康紀錄 (EHR),其中包含磁共振成像 (MRI) 和認知測試的數據,來解決一個有趣的問題:作為一個通用任務解決器,ChatGPT 能否使用 EHR 準確地檢測出 AD?我們使用黑盒方法對 ChatGPT 進行了深入評估,採用零次嘗試和多次嘗試的方法。這項研究揭示了 ChatGPT 分析 MRI 和認知測試結果的能力,以及其作為 AD 診斷工具的潛力。通過自動化診斷過程的各個方面,這項研究為醫療保健系統開啟了一種變革性的方法,特別是在解決資源有限的地區中 AD 專家稀缺的不平等問題方面。因此,它為一種有希望的早期檢測方法奠定了基礎,通過及時干預來支持個人,這對於生活品質 (QoL) 至關重要。 -##### **A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs** -2502.02896v1 by Bradley P. Allen, Paul T. Groth +##### **Protecting Intellectual Property of EEG-based Neural Networks with Watermarking** +2502.05931v1 by Ahmed Abdelaziz, Ahmed Fathi, Ahmed Fares -Evaluating large language models (LLMs) for tasks like fact extraction in -support of knowledge graph construction frequently involves computing accuracy -metrics using a ground truth benchmark based on a knowledge graph (KG). These -evaluations assume that errors represent factual disagreements. However, human -discourse frequently features metalinguistic disagreement, where agents differ -not on facts but on the meaning of the language used to express them. Given the -complexity of natural language processing and generation using LLMs, we ask: do -metalinguistic disagreements occur between LLMs and KGs? Based on an -investigation using the T-REx knowledge alignment dataset, we hypothesize that -metalinguistic disagreement does in fact occur between LLMs and KGs, with -potential relevance for the practice of knowledge graph engineering. We propose -a benchmark for evaluating the detection of factual and metalinguistic -disagreements between LLMs and KGs. 
An initial proof of concept of such a -benchmark is available on Github. +EEG-based neural networks, pivotal in medical diagnosis and brain-computer +interfaces, face significant intellectual property (IP) risks due to their +reliance on sensitive neurophysiological data and resource-intensive +development. Current watermarking methods, particularly those using abstract +trigger sets, lack robust authentication and fail to address the unique +challenges of EEG models. This paper introduces a cryptographic wonder +filter-based watermarking framework tailored for EEG-based neural networks. +Leveraging collision-resistant hashing and public-key encryption, the wonder +filter embeds the watermark during training, ensuring minimal distortion ($\leq +5\%$ drop in EEG task accuracy) and high reliability (100\% watermark +detection). The framework is rigorously evaluated against adversarial attacks, +including fine-tuning, transfer learning, and neuron pruning. Results +demonstrate persistent watermark retention, with classification accuracy for +watermarked states remaining above 90\% even after aggressive pruning, while +primary task performance degrades faster, deterring removal attempts. Piracy +resistance is validated by the inability to embed secondary watermarks without +severe accuracy loss ( $>10\%$ in EEGNet and CCNN models). Cryptographic +hashing ensures authentication, reducing brute-force attack success +probabilities. Evaluated on the DEAP dataset across models (CCNN, EEGNet, +TSception), the method achieves $>99.4\%$ null-embedding accuracy, effectively +eliminating false positives. By integrating wonder filters with EEG-specific +adaptations, this work bridges a critical gap in IP protection for +neurophysiological models, offering a secure, tamper-proof solution for +healthcare and biometric applications. 
The framework's robustness against +adversarial modifications underscores its potential to safeguard sensitive EEG +models while maintaining diagnostic utility. -摘要:評估大型語言模型 (LLM) 執行知識圖譜建構支援事實萃取等任務時,通常會使用基於知識圖譜 (KG) 的基準事實計算準確度指標。這些評估假設錯誤代表事實上的分歧。然而,人類話語經常出現元語言分歧,其中代理人之間的差異不在於事實,而在於用於表達事實的語言的含義。鑑於使用 LLM 處理和產生自然語言的複雜性,我們提出疑問:LLM 和 KG 之間是否會發生元語言分歧?根據使用 T-REx 知識比對資料集進行的調查,我們假設元語言分歧確實會發生在 LLM 和 KG 之間,並可能與知識圖譜工程實務有關。我們提出一個基準,用於評估 LLM 和 KG 之間的事實和元語言分歧的偵測。此基準的初步概念驗證可在 Github 上取得。 +摘要:基於 EEG 的神經網路在醫學診斷和腦電腦介面中至關重要,由於其依賴敏感的神經生理資料和資源密集型的開發,面臨重大的智慧財產權 (IP) 風險。目前的浮水印方法,特別是那些使用抽象觸發集的方法,缺乏強健的驗證,且無法解決 EEG 模型的獨特挑戰。本文介紹了一個專為基於 EEG 的神經網路量身打造的密碼學 wonder 濾波器浮水印架構。利用抗碰撞雜湊和公開金鑰加密,wonder 濾波器在訓練期間嵌入浮水印,確保最小的失真(EEG 任務準確度下降 $\leq 5\%$)和高可靠性(100% 浮水印檢測)。該架構針對對抗性攻擊進行了嚴格的評估,包括微調、遷移學習和神經元剪枝。結果證明了持續的浮水印保留,即使在激進的剪枝後,浮水印狀態的分類準確度仍保持在 90% 以上,而主要任務的性能下降得更快,阻止了移除嘗試。盜版抵抗力通過無法嵌入次要浮水印而得到驗證,而不會造成嚴重的準確度損失(在 EEGNet 和 CCNN 模型中 $>10\%$)。密碼學雜湊確保驗證,降低了暴力攻擊成功機率。在 DEAP 資料集上針對模型(CCNN、EEGNet、TSception)進行評估,該方法達到了 $>99.4\%$ 的空嵌入準確度,有效地消除了假陽性。透過將 wonder 濾波器與 EEG 特定的適應相整合,這項工作彌補了神經生理模型 IP 保護中的關鍵差距,為醫療保健和生物特徵應用提供了一個安全、防篡改的解決方案。該架構對抗敵對修改的強健性突顯了其在維護診斷效用的同時保護敏感 EEG 模型的潛力。 -##### **Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization** -2502.02810v1 by Chanhui Lee, Yuheon Song, YongJun Jeong, Hanbum Ko, Rodrigo Hormazabal, Sehui Han, Kyunghoon Bae, Sungbin Lim, Sungwoong Kim +##### **Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models** +2502.05879v1 by Shiyu Teng, Jiaqing Liu, Rahul Kumar Jain, Shurong Chai, Ruibo Hou, Tomoko Tateyama, Lanfen Lin, Yen-wei Chen -Recent advances in Large Language Models (LLMs) have motivated the -development of general LLMs for molecular tasks. While several studies have -demonstrated that fine-tuned LLMs can achieve impressive benchmark -performances, they are far from genuine generalist molecular LLMs due to a lack -of fundamental understanding of molecular structure. 
Specifically, when given -molecular task instructions, LLMs trained with naive next-token prediction -training assign similar likelihood scores to both original and negatively -corrupted molecules, revealing their lack of molecular structure understanding -that is crucial for reliable and general molecular LLMs. To overcome this -limitation and obtain a true generalist molecular LLM, we introduce a novel -multi-modal training method based on a thorough multi-modal instruction tuning -as well as a molecular structure preference optimization between chosen and -rejected graphs. On various molecular benchmarks, the proposed generalist -molecular LLM, called Mol-LLM, achieves state-of-the-art performances among -generalist LLMs on most tasks, at the same time, surpassing or comparable to -state-of-the-art specialist LLMs. Moreover, Mol-LLM also shows superior -generalization performances in reaction prediction tasks, demonstrating the -effect of the molecular structure understanding for generalization perspective. +Depression is one of the leading causes of disability worldwide, posing a +severe burden on individuals, healthcare systems, and society at large. Recent +advancements in Large Language Models (LLMs) have shown promise in addressing +mental health challenges, including the detection of depression through +text-based analysis. However, current LLM-based methods often struggle with +nuanced symptom identification and lack a transparent, step-by-step reasoning +process, making it difficult to accurately classify and explain mental health +conditions. To address these challenges, we propose a Chain-of-Thought +Prompting approach that enhances both the performance and interpretability of +LLM-based depression detection. Our method breaks down the detection process +into four stages: (1) sentiment analysis, (2) binary depression classification, +(3) identification of underlying causes, and (4) assessment of severity. 
By +guiding the model through these structured reasoning steps, we improve +interpretability and reduce the risk of overlooking subtle clinical indicators. +We validate our method on the E-DAIC dataset, where we test multiple +state-of-the-art large language models. Experimental results indicate that our +Chain-of-Thought Prompting technique yields superior performance in both +classification accuracy and the granularity of diagnostic insights, compared to +baseline approaches. -摘要:大型語言模型 (LLM) 的近期進展激勵了針對分子任務開發通用 LLM。雖然多項研究已證明微調 LLM 可實現令人印象深刻的基準效能,但由於缺乏對分子結構的基本理解,它們遠非真正的通才分子 LLM。具體來說,當給予分子任務說明時,使用天真的下一個符號預測訓練訓練的 LLM 會將類似的可能性評分分配給原始分子和負面損壞分子,這顯示出它們缺乏對分子結構的理解,而這對於可靠且通用的分子 LLM 至關重要。為了克服這個限制並獲得真正的通才分子 LLM,我們引入了一種新穎的多模態訓練方法,該方法基於徹底的多模態說明調整以及在所選和拒絕圖形之間的分子結構偏好最佳化。在各種分子基準測試中,所提出的通才分子 LLM(稱為 Mol-LLM)在多數任務中實現了通才 LLM 中的最新效能,同時超越或與最新的專家 LLM 相當。此外,Mol-LLM 在反應預測任務中也展現出優異的泛化效能,證明了分子結構理解對泛化觀點的影響。 +摘要:憂鬱症是全球殘障的主要原因之一,對個人、醫療保健系統和整個社會造成嚴重負擔。大型語言模型 (LLM) 的最新進展已展現出解決心理健康挑戰的希望,包括透過基於文字的分析來偵測憂鬱症。然而,現有的基於 LLM 的方法通常難以辨識細微的症狀,而且缺乏透明且逐步的推理過程,這使得準確分類和解釋心理健康狀況變得困難。為了應對這些挑戰,我們提出了一種思考鏈提示方法,它增強了基於 LLM 的憂鬱症偵測的效能和可解釋性。我們的這項方法將偵測過程分解為四個階段:(1) 情緒分析,(2) 二元憂鬱症分類,(3) 找出潛在原因,以及 (4) 評估嚴重程度。透過引導模型完成這些結構化的推理步驟,我們提升了可解釋性,並降低了忽略細微臨床指標的風險。我們在 E-DAIC 資料集上驗證了我們的這項方法,並在其中測試了多種最先進的大型語言模型。實驗結果顯示,與基線方法相比,我們的思考鏈提示技術在分類準確度和診斷見解的精細度方面都表現出優異的效能。 -##### **Leveraging the true depth of LLMs** -2502.02790v1 by Ramón Calvo González, Daniele Paliotta, Matteo Pagliardini, Martin Jaggi, François Fleuret +##### **LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison** +2502.06890v1 by Gabriele De Vito, Filomena Ferrucci, Athanasios Angelakis -Large Language Models demonstrate remarkable capabilities at the cost of high -compute requirements. While recent research has shown that intermediate layers -can be removed or have their order shuffled without impacting performance -significantly, these findings have not been employed to reduce the -computational cost of inference. 
We investigate several potential ways to -reduce the depth of pre-trained LLMs without significantly affecting -performance. Leveraging our insights, we present a novel approach that exploits -this decoupling between layers by grouping some of them into pairs that can be -evaluated in parallel. - This modification of the computational graph -- through better parallelism -- -results in an average improvement of around 1.20x on the number of tokens -generated per second, without re-training nor fine-tuning, while retaining -95%-99% of the original accuracy. Empirical evaluation demonstrates that this -approach significantly improves serving efficiency while maintaining model -performance, offering a practical improvement for large-scale LLM deployment. +The increasing volume of drug combinations in modern therapeutic regimens +needs reliable methods for predicting drug-drug interactions (DDIs). While +Large Language Models (LLMs) have revolutionized various domains, their +potential in pharmaceutical research, particularly in DDI prediction, remains +largely unexplored. This study thoroughly investigates LLMs' capabilities in +predicting DDIs by uniquely processing molecular structures (SMILES), target +organisms, and gene interaction data as raw text input from the latest DrugBank +dataset. We evaluated 18 different LLMs, including proprietary models (GPT-4, +Claude, Gemini) and open-source variants (from 1.5B to 72B parameters), first +assessing their zero-shot capabilities in DDI prediction. We then fine-tuned +selected models (GPT-4, Phi-3.5 2.7B, Qwen-2.5 3B, Gemma-2 9B, and Deepseek R1 +distilled Qwen 1.5B) to optimize their performance. Our comprehensive +evaluation framework included validation across 13 external DDI datasets, +comparing against traditional approaches such as l2-regularized logistic +regression. 
Fine-tuned LLMs demonstrated superior performance, with Phi-3.5 +2.7B achieving a sensitivity of 0.978 in DDI prediction, with an accuracy of +0.919 on balanced datasets (50% positive, 50% negative cases). This result +represents an improvement over both zero-shot predictions and state-of-the-art +machine-learning methods used for DDI prediction. Our analysis reveals that +LLMs can effectively capture complex molecular interaction patterns and cases +where drug pairs target common genes, making them valuable tools for practical +applications in pharmaceutical research and clinical settings. -摘要:大型语言模型展示了其强大的功能,但代价是较高的计算需求。虽然最近的研究表明,中间层可以被移除或重新排列其顺序,而不会显著影响性能,但这些发现尚未被用来降低推理的计算成本。我们研究了几种潜在的方法来减少预训练 LLM 的深度,而不会显著影响性能。利用我们的见解,我们提出了一种新颖的方法,该方法通过将其中一些分组为可以并行评估的成对来利用层之间的这种解耦。 -通过更好的并行性对计算图进行修改,平均而言,每秒生成的令牌数量提高了约 1.20 倍,而无需重新训练或微调,同时保留了 95%-99% 的原始准确性。经验评估表明,这种方法显著提高了服务效率,同时保持了模型性能,为大规模 LLM 部署提供了实际改进。 +摘要:現代治療方案中藥物組合的數量越來越多,需要可靠的方法來預測藥物間交互作用 (DDI)。儘管大型語言模型 (LLM) 已在各個領域掀起革命,它們在藥物研究中的潛力,特別是在 DDI 預測中的潛力,仍未得到充分探索。本研究通過獨特地處理分子結構 (SMILES)、目標生物和基因交互資料作為來自最新 DrugBank 資料集的原始文字輸入,徹底調查了 LLM 在預測 DDI 中的能力。我們評估了 18 種不同的 LLM,包括專有模型(GPT-4、Claude、Gemini)和開源變體(從 1.5B 到 72B 參數),首先評估它們在 DDI 預測中的零次學習能力。然後,我們微調選定的模型(GPT-4、Phi-3.5 2.7B、Qwen-2.5 3B、Gemma-2 9B 和 Deepseek R1 蒸餾 Qwen 1.5B)以最佳化其效能。我們的全面評估框架包括跨 13 個外部 DDI 資料集進行驗證,並與傳統方法(例如 l2 正則化邏輯迴歸)進行比較。微調後的 LLM 表現出優異的效能,其中 Phi-3.5 2.7B 在 DDI 預測中達到 0.978 的靈敏度,在平衡資料集(50% 正例,50% 反例)上的準確度為 0.919。此結果優於零次學習預測和用於 DDI 預測的最新機器學習方法。我們的分析表明,LLM 可以有效捕捉複雜的分子交互模式和藥物對靶向共同基因的情況,使其成為藥物研究和臨床環境中實用應用的寶貴工具。 -##### **Modular Training of Neural Networks aids Interpretability** -2502.02470v2 by Satvik Golechha, Maheep Chaudhary, Joan Velja, Alessandro Abate, Nandi Schoots +##### **Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)** +2502.07815v1 by Lokesh Koli, Shubham Kalra, Karanpreet Singh -An approach to improve neural network interpretability is via clusterability, -i.e., splitting a model into disjoint 
clusters that can be studied -independently. We define a measure for clusterability and show that pre-trained -models form highly enmeshed clusters via spectral graph clustering. We thus -train models to be more modular using a "clusterability loss" function that -encourages the formation of non-interacting clusters. Using automated -interpretability techniques, we show that our method can help train models that -are more modular and learn different, disjoint, and smaller circuits. We -investigate CNNs trained on MNIST and CIFAR, small transformers trained on -modular addition, and language models. Our approach provides a promising -direction for training neural networks that learn simpler functions and are -easier to interpret. +Detecting sensitive data such as Personally Identifiable Information (PII) +and Protected Health Information (PHI) is critical for data security platforms. +This study evaluates regex-based pattern matching algorithms and exact-match +search techniques to optimize detection speed, accuracy, and scalability. Our +benchmarking results indicate that Google RE2 provides the best balance of +speed (10-15 ms/MB), memory efficiency (8-16 MB), and accuracy (99.5%) among +regex engines, outperforming PCRE while maintaining broader hardware +compatibility than Hyperscan. For exact matching, Aho-Corasick demonstrated +superior performance (8 ms/MB) and scalability for large datasets. Performance +analysis revealed that regex processing time scales linearly with dataset size +and pattern complexity. A hybrid AI + Regex approach achieved the highest F1 +score (91.6%) by improving recall and minimizing false positives. Device +benchmarking confirmed that our solution maintains efficient CPU and memory +usage on both high-performance and mid-range systems. Despite its +effectiveness, challenges remain, such as limited multilingual support and the +need for regular pattern updates.
Future work should focus on expanding +language coverage, integrating data security and privacy management (DSPM) with +data loss prevention (DLP) tools, and enhancing regulatory compliance for +broader global adoption. -摘要:一種改善神經網路可解釋性的方法是透過群集性, -也就是將模型分割成可獨立研究的不相交群集。我們定義一個群集性的度量,並顯示預訓練的 -模型透過光譜圖形群集形成高度糾纏的群集。因此,我們使用「群集性損失」函數訓練模型,使其更具模組化, -這鼓勵形成非交互群集。使用自動化可解釋性技術,我們顯示我們的模型可以幫助訓練更具模組化的模型,並學習不同、不相交且較小的電路。我們 -研究了在 MNIST 和 CIFAR 上訓練的 CNN,在模組化加法上訓練的小型Transformer,以及語言模型。我們的做法為訓練學習更簡單函數且更容易解釋的神經網路提供了有希望的方向。 +摘要:偵測個人身分資訊 (PII) 和受保護健康資訊 (PHI) 等敏感資料,對於資料安全平台至關重要。本研究評估基於 regex 的模式配對演算法和精確配對搜尋技術,以最佳化偵測速度、準確度和可擴充性。我們的基準測試結果顯示,在 regex 引擎中,Google RE2 在速度 (10-15 ms/MB)、記憶體效率 (8-16 MB) 和準確度 (99.5%) 方面取得最佳平衡,優於 PCRE,同時比 Hyperscan 擁有更廣泛的硬體相容性。對於精確配對,Aho-Corasick 展現出優異的效能 (8 ms/MB) 和大資料集的可擴充性。效能分析顯示,regex 處理時間會隨著資料集大小和模式複雜度線性擴充。混合 AI + Regex 方法透過提升召回率和將假陽性降至最低,達到了最高的 F1 分數 (91.6%)。裝置基準測試確認我們的解決方案在高性能和中階系統上都能維持高效的 CPU 和記憶體使用率。儘管有效,但仍有挑戰存在,例如多語言支援有限,以及需要定期更新模式。未來的研究應著重於擴展語言涵蓋範圍,將資料安全和隱私管理 (DSPM) 與資料遺失防護 (DLP) 工具整合,以及加強法規遵循以利更廣泛的全球採用。 -##### **Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs** -2502.02362v3 by Sagnik Mukherjee, Abhinav Chinta, Takyoung Kim, Tarun Anoop Sharma, Dilek Hakkani-Tür +##### **WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch** +2502.05783v1 by Ying Lei, Yancheng Cao, Will Wang, Yuanzhe Dong, Changchang Yin, Weidan Cao, Ping Zhang, Jingzhen Yang, Bingsheng Yao, Yifan Peng, Chunhua Weng, Randy Auerbach, Lena Mamykina, Dakuo Wang, Yuntao Wang, Xuhai Xu -Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large -language models (LLMs) by enabling detailed step-by-step solutions. However, -due to the verbosity of LLMs, the resulting reasoning chains can be long, -making it harder to verify the reasoning steps and trace issues resulting from -dependencies between the steps that may be farther away in the sequence of -steps.
Importantly, mathematical reasoning allows each step to be derived from -a small set of premises, which are a subset of the preceding steps in the -reasoning chain. In this paper, we present a framework that identifies the -premises for each step, to improve the evaluation of reasoning. We restructure -conventional linear reasoning chains into Premise Augmented Reasoning Chains -(PARC) by introducing premise links, resulting in a directed acyclic graph -where the nodes are the steps and the edges are the premise links. Through -experiments with a PARC-based dataset that we built, namely PERL (Premises and -ERrors identification in LLMs), we demonstrate that LLMs can reliably identify -premises within complex reasoning chains. In particular, even open-source LLMs -achieve 90% recall in premise identification. We also show that PARC helps to -identify errors in reasoning chains more reliably. The accuracy of error -identification improves by 6% to 16% absolute when step-by-step verification is -carried out in PARC under the premises. Our findings highlight the utility of -premise-centric representations in addressing complex problem-solving tasks and -open new avenues for improving the reliability of LLM-based reasoning -evaluations. +While just-in-time interventions (JITIs) have effectively targeted common +health behaviors, individuals often have unique needs to intervene in personal +undesirable actions that can negatively affect physical, mental, and social +well-being. We present WatchGuardian, a smartwatch-based JITI system that +empowers users to define custom interventions for these personal actions with a +small number of samples. For the model to detect new actions based on limited +new data samples, we developed a few-shot learning pipeline that finetuned a +pre-trained inertial measurement unit (IMU) model on public hand-gesture +datasets. 
We then designed a data augmentation and synthesis process to train +additional classification layers for customization. Our offline evaluation with +26 participants showed that with three, five, and ten examples, our approach +achieved an average accuracy of 76.8%, 84.7%, and 87.7%, and an F1 score of +74.8%, 84.2%, and 87.2%. We then conducted a four-hour intervention study to +compare WatchGuardian against a rule-based intervention. Our results +demonstrated that our system led to a significant reduction by 64.0 ± 22.6% in +undesirable actions, substantially outperforming the baseline by 29.0%. Our +findings underscore the effectiveness of a customizable, AI-driven JITI system +for individuals in need of behavioral intervention in personal undesirable +actions. We envision that our work can inspire broader applications of +user-defined personalized intervention with advanced AI solutions. -摘要:思考鏈(CoT)提示透過提供詳細的逐步解法,增強大型語言模型(LLM)的數學推理能力。然而,由於 LLM 的冗長,產生的推理鏈可能很長,這使得驗證推理步驟和追蹤由步驟之間相依關係所產生的問題變得更加困難,而這些步驟可能在步驟順序中相距較遠。重要的是,數學推理允許每個步驟從一組小的前提中推導出來,這些前提是推理鏈中前一個步驟的子集。在本文中,我們提出了一個框架,用於識別每個步驟的前提,以改進推理評估。我們透過引入前提連結,將傳統的線性推理鏈重組為前提擴充推理鏈(PARC),產生一個有向無環圖,其中節點是步驟,而邊緣是前提連結。透過我們建立的基於 PARC 的資料集(即 PERL(LLM 中的前提和錯誤識別))進行的實驗,我們證明 LLM 能夠在複雜的推理鏈中可靠地識別前提。特別是,即使是開源 LLM 在前提識別中也能達到 90% 的召回率。我們還表明,PARC 有助於更可靠地識別推理鏈中的錯誤。在前提下於 PARC 中執行逐步驗證時,錯誤識別的準確度提高了 6% 到 16%。我們的研究結果突顯了以前提為中心的表示在解決複雜問題解決任務中的效用,並為改進基於 LLM 的推理評估的可靠性開闢了新途徑。 +摘要:雖然即時介入(JITIs)有效地針對常見的健康行為,但個人通常有獨特的需求來介入可能會對身心和社會福祉產生負面影響的個人不良行為。我們提出 WatchGuardian,這是一個基於智慧手錶的 JITI 系統,它使用少數樣本讓使用者能夠為這些個人行為定義自訂介入措施。為了讓模型根據有限的新資料樣本偵測新行為,我們開發了一個小樣本學習管道,微調了公共手勢資料集上的預訓練慣性測量單元(IMU)模型。然後,我們設計了一個資料擴充和合成流程,以訓練其他分類層以進行自訂。我們對 26 位參與者進行的離線評估顯示,我們的做法使用三個、五個和十個範例,達到了 76.8%、84.7% 和 87.7% 的平均準確度,以及 74.8%、84.2% 和 87.2% 的 F1 分數。然後,我們進行了一項為時四小時的介入研究,以將 WatchGuardian 與基於規則的介入進行比較。我們的結果表明,我們的系統導致不良行為顯著減少了 64.0 ± 22.6%,大幅優於基線 29.0%。我們的研究結果強調了可自訂、AI 驅動的 JITI 系統對需要行為介入以應對個人不良行為的個人的有效性。我們預計我們的研究可以激勵使用者定義個人化介入的更廣泛應用,並採用先進的 AI 解決方案。 -##### **AdaptBot: Combining LLM with
Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement** -2502.02067v1 by Shivam Singh, Karthik Swaminathan, Nabanita Dash, Ramandeep Singh, Snehasis Banerjee, Mohan Sridharan, Madhava Krishna +##### **RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care** +2502.05740v1 by Ziqi Yang, Yuxuan Lu, Jennifer Bagdasarian, Vedant Das Swain, Ritu Agarwal, Collin Campbell, Waddah Al-Refaire, Jehan El-Bayoumi, Guodong Gao, Dakuo Wang, Bingsheng Yao, Nawar Shara -Embodied agents assisting humans are often asked to complete a new task in a -new scenario. An agent preparing a particular dish in the kitchen based on a -known recipe may be asked to prepare a new dish or to perform cleaning tasks in -the storeroom. There may not be sufficient resources, e.g., time or labeled -examples, to train the agent for these new situations. Large Language Models -(LLMs) trained on considerable knowledge across many domains are able to -predict a sequence of abstract actions for such new tasks and scenarios, -although it may not be possible for the agent to execute this action sequence -due to task-, agent-, or domain-specific constraints. Our framework addresses -these challenges by leveraging the generic predictions provided by LLM and the -prior domain-specific knowledge encoded in a Knowledge Graph (KG), enabling an -agent to quickly adapt to new tasks and scenarios. The robot also solicits and -uses human input as needed to refine its existing knowledge. Based on -experimental evaluation over cooking and cleaning tasks in simulation domains, -we demonstrate that the interplay between LLM, KG, and human input leads to -substantial performance gains compared with just using the LLM output. 
+Cancer surgery is a key treatment for gastrointestinal (GI) cancers, a group +of cancers that account for more than 35% of cancer-related deaths worldwide, +but postoperative complications are unpredictable and can be life-threatening. +In this paper, we investigate how recent advancements in large language models +(LLMs) can benefit remote patient monitoring (RPM) systems through clinical +integration by designing RECOVER, an LLM-powered RPM system for postoperative +GI cancer care. To closely engage stakeholders in the design process, we first +conducted seven participatory design sessions with five clinical staff and +interviewed five cancer patients to derive six major design strategies for +integrating clinical guidelines and information needs into LLM-based RPM +systems. We then designed and implemented RECOVER, which features an +LLM-powered conversational agent for cancer patients and an interactive +dashboard for clinical staff to enable efficient postoperative RPM. Finally, we +used RECOVER as a pilot system to assess the implementation of our design +strategies with four clinical staff and five patients, providing design +implications by identifying crucial design elements, offering insights on +responsible AI, and outlining opportunities for future LLM-powered RPM systems. 
-摘要:具身代理协助人类时,通常需要在新的情境中完成新的任务。基于已知食谱在厨房准备特定菜肴的代理可能会被要求准备新菜肴或在储藏室执行清洁任务。可能没有足够资源(例如时间或标记的示例)来训练代理以应对这些新情况。在许多领域接受大量知识训练的大型语言模型 (LLM) 能够预测此类新任务和情境的抽象动作序列,尽管代理可能无法执行此动作序列,因为任务、代理或特定于域的约束。我们的框架通过利用 LLM 提供的通用预测和知识图 (KG) 中编码的先前特定于域的知识来应对这些挑战,使代理能够快速适应新任务和情境。该机器人还会根据需要征求并使用人类输入来完善其现有知识。基于在模拟域中对烹饪和清洁任务的实验评估,我们证明了 LLM、KG 和人类输入之间的相互作用与仅使用 LLM 输出相比带来了巨大的性能提升。 +摘要:癌症手術是胃腸道 (GI) 癌症的主要治療方式,這類癌症佔全球癌症相關死亡人數的 35% 以上,但術後併發症無法預測,且可能危及生命。在本文中,我們探討大型語言模型 (LLM) 的近期進展如何透過臨床整合造福遠端病患監控 (RPM) 系統,方法是設計 RECOVER,一個由 LLM 驅動的 RPM 系統,用於術後胃腸道癌症照護。為了讓利害關係人密切參與設計流程,我們首先與五位臨床人員進行七場參與式設計會議,並訪談五位癌症患者,以找出六項整合臨床指南和資訊需求至基於 LLM 的 RPM 系統的主要設計策略。接著,我們設計並實作 RECOVER,其特色在於一個由 LLM 驅動的對話式代理人,供癌症患者使用,以及一個互動式儀表板,供臨床人員使用,以進行有效的術後 RPM。最後,我們使用 RECOVER 作為試點系統,與四位臨床人員和五位患者評估我們設計策略的實作,並透過找出重要的設計元素、提供對負責任 AI 的見解,以及概述未來由 LLM 驅動的 RPM 系統的機會,提出設計意涵。 -##### **On Bob Dylan: A Computational Perspective** -2502.01772v1 by Prashant Garg +##### **4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis** +2502.05713v1 by An Zhao, Moucheng Xu, Ahmed H. Shahin, Wim Wuyts, Mark G. Jones, Joseph Jacob, Daniel C. Alexander -Cass Sunstein's essay 'On Bob Dylan' describes Dylan's 'dishabituating' style --- a constant refusal to conform to expectation and a penchant for reinventing -his musical and lyrical identity. In this paper, I extend Sunstein's -observations through a large-scale computational analysis of Dylan's lyrics -from 1962 to 2012. Using o3-mini-high (a large language model), I extract -concept-to-concept relationships from the lyrics and construct directed -knowledge graphs that capture Dylan's thematic structure. I then quantify -shifts in sentiment, metaphorical expression, thematic diversity, and network -complexity over time. 
The results indicate that Dylan's lyrics increasingly -rely on metaphor, display an evolving sentiment profile, and exhibit heightened -dishabituation -- measured here as a growing variance in the network centrality -of key concepts. I also find that references to movement, protest, and mythic -imagery fluctuate in ways that align with well-known phases of Dylan's career, -reflecting the dynamic and unpredictable quality of his art. These findings not -only deepen our empirical understanding of Sunstein's thesis but also introduce -a novel computational method for analyzing an artist's evolution-offering -broader applicability to the study of cultural and creative change. +Understanding the progression trajectories of diseases is crucial for early +diagnosis and effective treatment planning. This is especially vital for +life-threatening conditions such as Idiopathic Pulmonary Fibrosis (IPF), a +chronic, progressive lung disease with a prognosis comparable to many cancers. +Computed tomography (CT) imaging has been established as a reliable diagnostic +tool for IPF. Accurately predicting future CT scans of early-stage IPF patients +can aid in developing better treatment strategies, thereby improving survival +outcomes. In this paper, we propose 4D Vector Quantised Generative Adversarial +Networks (4D-VQ-GAN), a model capable of generating realistic CT volumes of IPF +patients at any time point. The model is trained using a two-stage approach. In +the first stage, a 3D-VQ-GAN is trained to reconstruct CT volumes. In the +second stage, a Neural Ordinary Differential Equation (ODE) based temporal +model is trained to capture the temporal dynamics of the quantised embeddings +generated by the encoder in the first stage. We evaluate different +configurations of our model for generating longitudinal CT scans and compare +the results against ground truth data, both quantitatively and qualitatively. 
+For validation, we conduct survival analysis using imaging biomarkers derived +from generated CT scans and achieve a C-index comparable to that of biomarkers +derived from the real CT scans. The survival analysis results demonstrate the +potential clinical utility inherent to generated longitudinal CT scans, showing +that they can reliably predict survival outcomes. -摘要:卡斯·桑斯坦的論文「論鮑伯·迪倫」描述了迪倫「去習慣化」的風格 --- 這種風格不斷拒絕符合預期,並熱衷於重新塑造他的音樂和歌詞認同。在本文中,我透過對迪倫 1962 年至 2012 年歌詞進行大規模的運算分析,來延伸桑斯坦的觀察。使用 o3-mini-high(一個大型語言模型),我從歌詞中提取概念對概念的關係,並建構有向知識圖,以捕捉迪倫的主題結構。然後,我量化情緒、隱喻表達、主題多樣性和網路複雜性隨時間的變化。結果顯示,迪倫的歌詞越來越依賴隱喻,展現出不斷演化的情緒輪廓,並表現出高度的去習慣化 -- 在這裡測量為關鍵概念的網路中心性的變異增加。我也發現,對運動、抗議和神話意象的引用,會以與迪倫職業生涯中眾所周知階段一致的方式波動,反映了他藝術的動態和不可預測的品質。這些發現不僅加深了我們對桑斯坦論文的經驗理解,也引入了分析藝術家演變的新穎運算方法,為文化和創造性變化的研究提供了更廣泛的適用性。 +摘要:了解疾病的進程軌跡對於早期診斷和有效的治療計畫至關重要。這對於特發性肺纖維化 (IPF) 等威脅生命的疾病尤其重要,IPF 是一種慢性、進行性肺部疾病,其預後與許多癌症相當。電腦斷層掃描 (CT) 影像已被確立為 IPF 的可靠診斷工具。準確預測早期 IPF 患者的未來 CT 掃描有助於制定更好的治療策略,從而改善存活結果。在本文中,我們提出 4D 向量量化生成對抗網路 (4D-VQ-GAN),這是一個模型,能夠在任何時間點生成 IPF 患者的逼真 CT 體積。該模型使用兩階段方法進行訓練。在第一階段,訓練 3D-VQ-GAN 以重建 CT 體積。在第二階段,訓練基於神經常微分方程 (ODE) 的時間模型,以捕捉第一階段編碼器生成的量化嵌入的時間動態。我們評估了我們的模型的不同配置,以生成縱向 CT 掃描,並在定量和定性方面將結果與真實數據進行比較。為了驗證,我們使用從生成的 CT 掃描中得出的影像生物標記進行存活分析,並獲得與從真實 CT 掃描中得出的生物標記相當的 C 指數。存活分析結果證明了生成縱向 CT 掃描固有的潛在臨床效用,表明它們可以可靠地預測存活結果。 -##### **VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos** -2502.01549v1 by Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang +##### **KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy** +2502.05651v1 by Hyunjong Kim, Suyeon Lee, Yeongjae Cho, Eunseo Ryu, Yohan Jo, Suran Seong, Sungzoon Cho -Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in -enhancing Large Language Models (LLMs) through external knowledge integration, -yet its application has primarily focused on textual content, leaving the rich -domain of multi-modal video knowledge predominantly unexplored. 
This paper -introduces VideoRAG, the first retrieval-augmented generation framework -specifically designed for processing and understanding extremely long-context -videos. Our core innovation lies in its dual-channel architecture that -seamlessly integrates (i) graph-based textual knowledge grounding for capturing -cross-video semantic relationships, and (ii) multi-modal context encoding for -efficiently preserving visual features. This novel design empowers VideoRAG to -process unlimited-length videos by constructing precise knowledge graphs that -span multiple videos while maintaining semantic dependencies through -specialized multi-modal retrieval paradigms. Through comprehensive empirical -evaluation on our proposed LongerVideos benchmark-comprising over 160 videos -totaling 134+ hours across lecture, documentary, and entertainment -categories-VideoRAG demonstrates substantial performance compared to existing -RAG alternatives and long video understanding methods. The source code of -VideoRAG implementation and the benchmark dataset are openly available at: -https://github.com/HKUDS/VideoRAG. +The increasing demand for mental health services has led to the rise of +AI-driven mental health chatbots, though challenges related to privacy, data +collection, and expertise persist. Motivational Interviewing (MI) is gaining +attention as a theoretical basis for boosting expertise in the development of +these chatbots. However, existing datasets are showing limitations for training +chatbots, leading to a substantial demand for publicly available resources in +the field of MI and psychotherapy. These challenges are even more pronounced in +non-English languages, where they receive less attention. In this paper, we +propose a novel framework that simulates MI sessions enriched with the +expertise of professional therapists. 
We train an MI forecaster model that +mimics the behavioral choices of professional therapists and employ Large +Language Models (LLMs) to generate utterances through prompt engineering. Then, +we present KMI, the first synthetic dataset theoretically grounded in MI, +containing 1,000 high-quality Korean Motivational Interviewing dialogues. +Through an extensive expert evaluation of the generated dataset and the +dialogue model trained on it, we demonstrate the quality, expertise, and +practicality of KMI. We also introduce novel metrics derived from MI theory in +order to evaluate dialogues from the perspective of MI. -摘要:檢索增強生成 (RAG) 已證明在透過外部知識整合增強大型語言模型 (LLM) 方面取得顯著成功,但其應用主要集中在文字內容上,而豐富的多模態影片知識領域則鮮少被探索。本文介紹 VideoRAG,這是第一個檢索增強生成架構,專門設計用於處理和理解極長語境的影片。我們的核心創新在於其雙通道架構,它無縫整合 (i) 基於圖形文字知識基礎,用於擷取跨影片語義關係,以及 (ii) 多模態語境編碼,用於有效保留視覺特徵。這個新穎的設計讓 VideoRAG 能夠透過建構跨越多個影片的精確知識圖譜來處理長度不限的影片,同時透過專門的多模態檢索範例來維持語義依賴性。透過我們提出的 LongerVideos 基準的全面經驗評估,該基準包含超過 160 部影片,總時數超過 134 小時,涵蓋演講、紀錄片和娛樂類別,VideoRAG 與現有的 RAG 替代方案和長影片理解方法相比,展現出顯著的效能。VideoRAG 實作的原始碼和基準資料集已公開於:https://github.com/HKUDS/VideoRAG。 +摘要:由於對心理健康服務的需求日益增加,導致以人工智慧為基礎的心理健康聊天機器人興起,儘管與隱私、資料蒐集和專業知識相關的挑戰依然存在。動機性訪談 (MI) 正作為提升這些聊天機器人在開發方面專業知識的理論基礎而備受關注。然而,現有的資料集顯示出訓練聊天機器人的限制,導致對 MI 和心理治療領域中公開可用資源的需求大幅增加。這些挑戰在非英語語言中更加明顯,因為它們受到的關注較少。在本文中,我們提出了一個新穎的架構,它模擬了豐富專業治療師專業知識的 MI 課程。我們訓練了一個 MI 預測模型,它模擬了專業治療師的行為選擇,並採用大型語言模型 (LLM) 透過提示工程來產生話語。然後,我們展示了 KMI,這是第一個理論上以 MI 為基礎的合成資料集,其中包含 1,000 個高品質的韓語動機性訪談對話。透過對所產生的資料集和在該資料集上訓練的對話模型進行廣泛的專家評估,我們展示了 KMI 的品質、專業知識和實用性。我們還引入了從 MI 理論中衍生的新指標,以便從 MI 的角度評估對話。 -##### **Transformers trained on proteins can learn to attend to Euclidean distance** -2502.01533v1 by Isaac Ellmen, Constantin Schneider, Matthew I. J. Raybould, Charlotte M. Deane +##### **ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. 
A Case Study on Clinical Reports** +2502.05638v1 by Aynur Guluzade, Naguib Heiba, Zeyd Boukhers, Florim Hamiti, Jahid Hasan Polash, Yehya Mohamad, Carlos A Velasco -While conventional Transformers generally operate on sequence data, they can -be used in conjunction with structure models, typically SE(3)-invariant or -equivariant graph neural networks (GNNs), for 3D applications such as protein -structure modelling. These hybrids typically involve either (1) -preprocessing/tokenizing structural features as input for Transformers or (2) -taking Transformer embeddings and processing them within a structural -representation. However, there is evidence that Transformers can learn to -process structural information on their own, such as the AlphaFold3 structural -diffusion model. In this work we show that Transformers can function -independently as structure models when passed linear embeddings of coordinates. -We first provide a theoretical explanation for how Transformers can learn to -filter attention as a 3D Gaussian with learned variance. We then validate this -theory using both simulated 3D points and in the context of masked token -prediction for proteins. Finally, we show that pre-training protein Transformer -encoders with structure improves performance on a downstream task, yielding -better performance than custom structural models. Together, this work provides -a basis for using standard Transformers as hybrid structure-language models. +Europe's healthcare systems require enhanced interoperability and +digitalization, driving a demand for innovative solutions to process legacy +clinical data. This paper presents the results of our project, which aims to +leverage Large Language Models (LLMs) to extract structured information from +unstructured clinical reports, focusing on patient history, diagnoses, +treatments, and other predefined categories. 
We developed a workflow with a +user interface and evaluated LLMs of varying sizes through prompting strategies +and fine-tuning. Our results show that fine-tuned smaller models match or +surpass larger counterparts in performance, offering efficiency for +resource-limited settings. A new dataset of 60,000 annotated English clinical +summaries and 24,000 German translations was validated with automated and +manual checks. The evaluations used ROUGE, BERTScore, and entity-level metrics. +The work highlights the approach's viability and outlines future improvements. -摘要:雖然傳統的 Transformer 通常處理序列資料,但它們可用於結構模型,通常是 SE(3) 不變式或等變式圖神經網路 (GNN),用於蛋白質結構建模等 3D 應用。這些混合模型通常包含 (1) 將結構特徵預處理/標記化為 Transformer 的輸入或 (2) 取用 Transformer 嵌入並在結構表示中處理它們。然而,有證據表明 Transformer 可以自行學習處理結構資訊,例如 AlphaFold3 結構擴散模型。在這項工作中,我們展示了 Transformer 在傳遞座標的線性嵌入時,可以獨立作為結構模型運作。我們首先提供了 Transformer 如何學習將注意力濾波為具有學習變異的 3D 高斯的理論解釋。然後我們使用模擬 3D 點和在蛋白質遮罩標記預測的背景下驗證此理論。最後,我們展示了使用結構預訓練蛋白質 Transformer 編碼器會改善下游任務的效能,產生比自訂結構模型更好的效能。綜合來說,這項工作提供了使用標準 Transformer 作為混合結構語言模型的基礎。 +摘要:歐洲的醫療保健系統需要增強互通性和數位化,這驅動了對創新解決方案的需求,以處理傳統的臨床數據。本文介紹了我們專案的成果,該專案旨在利用大型語言模型 (LLM) 從非結構化的臨床報告中提取結構化的資訊,重點放在病歷、診斷、治療和其他預定義類別上。我們開發了一個具有使用者介面的工作流程,並透過提示策略和微調來評估不同規模的 LLM。我們的結果顯示,微調後的較小模型在效能上與較大的模型相匹配或超越它們,為資源有限的環境提供了效率。一個包含 60,000 個註解英文臨床摘要和 24,000 個德文翻譯的新資料集已透過自動化和手動檢查進行驗證。評估使用了 ROUGE、BERTScore 和實體層級的指標。這項工作突出了這種方法的可行性,並概述了未來的改進。